I've been debating whether the data lies on a manifold, or whether the data attributes that are task-relevant (and of our interest) lie on a manifold?
I suspect it is the latter, but I've seen Platonic Representation Hypothesis that seems to hint it is the former.
When people talk about models "just predicting the next word", this is a popularization of the fact that modern LLMs are "autoregressive" models. This actually has two components: an architectural component (the model generates words one at a time), and a loss component (it maximizes probability).
As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.
This brings us to a debate which goes back many, many years: what does it mean to predict the next word? Many researchers, including myself, have believed that if you want to predict the next word really well, you need to do a lot more. (And with this paper, we're able to see this mechanistically!)
Here's an example, which we didn't put in the paper: How does Claude answer "What do you call someone who studies the stars?" with "An astronomer"? In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards. This is a kind of very, very small scale planning – but you can see how even just a pure autoregressive model is incentivized to do it.