It never was "just predicting the next word," in the sense that the phrase was always a reductive description of artifacts that are plainly more than it implies.
And also, they are still "just predicting the next word", literally in terms of how they function and are trained. And there are still cases where it's useful to remember this.
I'm thinking specifically of chat psychosis, where people go down a rabbit hole with these things, thinking they're gaining deep insights because they don't understand the nature of the thing they're interacting with.
They're interacting with something that does really good, but fallible, autocomplete based on three major inputs.
1) They predict the next word based on the pre-training data, largely internet text, which makes them fairly useful for general knowledge.
2) They predict the next word based on RL training data, which is what lets them produce conversational responses rather than plain autocomplete: they are autocompleting conversational data. This also makes them extremely obsequious and agreeable, inclined to go along with whatever you give them and to mimic it.
3) They autocomplete the conversation based on your own inputs and the entire history of the conversation. This, combined with 2), means you are, to a large extent, talking to yourself, or rather to something very adept at mimicking and going along with your inputs.
Who, or what, are you talking to when you interact with these? Something that predicts the next word, with varying accuracy, based on a corpus of general knowledge, plus a corpus of agreeable question/answer exchanges, plus yourself. The general knowledge is great as long as it's fairly accurate; the sycophantic mirror of yourself sucks.
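A minimal sketch of the mechanic described above, using gpt2 purely as a small stand-in (the comment doesn't name a model, and chat models differ mainly in training data, not in this loop): the model only ever sees the concatenated conversation so far, and all it ever does is emit a distribution over the next token, which is then appended and fed back in.

```python
# Minimal autoregressive generation loop: the whole conversation history is the input,
# and each step only produces a probability distribution over the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # "gpt2" is just an illustrative stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")

conversation = "User: I think I've discovered something profound.\nAssistant:"
ids = tok(conversation, return_tensors="pt").input_ids

for _ in range(40):                                  # generate 40 tokens, one at a time
    logits = model(ids).logits[:, -1, :]             # scores for the *next* token only
    next_id = torch.multinomial(logits.softmax(-1), 1)  # sample one token from that distribution
    ids = torch.cat([ids, next_id], dim=-1)          # the sample becomes part of the context

print(tok.decode(ids[0]))
```

Everything in the context, including what the model itself just said and whatever you said before, conditions the next prediction; that is the mechanical sense in which you are partly talking to a mirror of your own inputs.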
I always hear this "AI solved a crazy math problem that no math teams could solve" and think: why can it never solve the math problems I need it to solve, when those should be easily doable by any high school student?
Because what it really means is "we directed the AI, it translated our ideas into Lean, the Lean tool then acted as an oracle determining whether anything was incorrect, doing literally all the hard work of correctness, and this process looped until the prompter gave up or Lean sent back an all-clear."
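A rough sketch of that loop, not any lab's actual pipeline: ask_model() is a hypothetical stand-in for the LLM call, and the exact Lean invocation depends on the project setup, but the structure is the point: the proof checker, not the model, is the thing that decides correctness.

```python
# Hypothetical generate-and-check loop; Lean is the oracle doing the correctness work.
import subprocess
from pathlib import Path

def ask_model(problem: str, feedback: str) -> str:
    """Hypothetical stand-in: returns a candidate Lean proof, given the problem and prior errors."""
    raise NotImplementedError

problem = "prove that a + b = b + a for natural numbers"
feedback = ""
for _ in range(100):                              # loop until Lean is satisfied or we give up
    Path("candidate.lean").write_text(ask_model(problem, feedback))
    check = subprocess.run(["lean", "candidate.lean"], capture_output=True, text=True)
    if check.returncode == 0:                     # Lean accepted: the proof is certified
        break
    feedback = check.stdout + check.stderr        # otherwise feed Lean's errors back to the model
```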
I usually explain it in English. It's usually more complicated than, but phrased the same as, this example: "Give me the per capita rate for a population of 10,000,000 who hold $100,000 each," etc. In hindsight I think it might be because they're searching the web for an answer instead of just calculating it like I'd asked.
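The example above is abbreviated ("etc."), so the exact quantity is a guess, but the kind of thing being asked for is a couple of lines of arithmetic, not a web lookup:

```python
# Hypothetical numbers matching the abbreviated example above.
population = 10_000_000
holdings_each = 100_000                        # dollars held per person
total_holdings = population * holdings_each    # $1,000,000,000,000 in total
per_capita = total_holdings / population       # back to $100,000 each, by construction
print(f"total: ${total_holdings:,}  per capita: ${per_capita:,.0f}")
```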
I somewhat take issue with the second math example (the geometry problem); that is routinely solvable by computer algebra systems, and being able to translate the problem into their input format, hit run, and transcribe the proof back into English prose (which for all we know is what it did, since OpenAI and Google have confirmed their entrants received these tools, which human candidates did not) is not as astonishing as the blog post makes it out to be.
A few things I think are useful to emphasize on the training side, beyond what the article says:
1. Pre-training nowadays does not use just the next word/token as a training signal, but also the next N words, because that appears to teach the model more generalized semantics and to bias it towards 'thinking ahead' behaviors (give me some rope here, I don't remember precisely how it should be articulated). A rough sketch follows this list.
2. Regularizers during training, namely weight decay and (to a lesser extent) diversity. These do far more heavy lifting than their simplicity suggests; they are the difference between memorizing entire paragraphs from a book and taking away only the core concepts.
3. Expert performance at non-knowledge tasks is mostly driven by RL and/or SFT over 'high quality' transcripts. The former cannot be described as 'predicting the next word', at least in terms of learning signal.
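A rough sketch of points 1 and 2, assuming the 'next N words' signal means something like the multi-token-prediction setups that have been published (this is not any particular lab's recipe): a shared trunk produces hidden states, N small heads each predict the token i steps ahead, and weight decay in the optimizer is the regularizer nudging the model toward general structure rather than memorized passages.

```python
# Toy multi-token-prediction loss with weight decay; shapes and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_ahead = 1000, 64, 4
trunk = nn.Embedding(vocab, d_model)            # stand-in for the transformer body
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_ahead))
opt = torch.optim.AdamW([*trunk.parameters(), *heads.parameters()], weight_decay=0.1)

tokens = torch.randint(0, vocab, (2, 32))       # (batch, seq) of token ids
hidden = trunk(tokens)                          # (batch, seq, d_model)

loss = torch.tensor(0.0)
for i, head in enumerate(heads, start=1):
    logits = head(hidden[:, :-i])               # predict the token at position t + i
    target = tokens[:, i:]                      # the token i steps ahead
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))

(loss / n_ahead).backward()
opt.step()                                      # AdamW applies the weight decay here
```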
1. The Model Architecture. Calculation of outputs from inputs.
2. The Training Algorithm, which alters the architecture's parameters based on training data (often inputs and outputs compared against targets, though it can be more complex than that), e.g. gradient descent.
3. The Class of Problem being solved, i.e. approximation, prediction, etc.
4. The Instance of the Problem being solved, e.g. approximation of chemical reaction completion vs. temperature, or prediction of textual responses.
5. The Data Embodiment of the problem, i.e. the actual data. How much, how complex, how general, how noisy, how accurate, how variable, how biased, ...?
And only after all those,
6. The Learned Algorithm, which emerges from continual exposure to (5), framed as an instance of (3), in order to specifically perform (4), by applying the training algorithm (2) to the parameters that control the model's input-output computation (1). (A toy mapping of these six elements onto code follows below.)
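A toy mapping of the six elements onto a concrete, deliberately non-LLM example, fitting a small network to a noisy function; the labels are the point, not the model. Note that (6) is not written anywhere in the code: it is whatever ends up encoded in the weights after training.

```python
# The six elements labeled on a toy regression problem.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # (1) the model architecture
opt = torch.optim.SGD(model.parameters(), lr=0.01)                    # (2) the training algorithm
# (3) the class of problem: approximation of a scalar function
# (4) the instance of the problem: approximate y = sin(x) from noisy samples
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)                          # (5) the data embodiment

for _ in range(2000):                                                 # (2) applied to (5)
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
# (6) the learned algorithm now lives implicitly in model.parameters()
```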
The latter, (6), has no limit on the complexity, quality, or variety of sub-problems that must also be solved in order to solve the umbrella problem successfully.
Data can be unbounded in complexity. Therefore, actual (successful) solutions are necessarily unbounded in complexity.
The "no limit, unbounded, any kind" of sub-problem part of (6) is missed by many people. To perform accurate predictions, of say the whole stock market, would require a model to learn everything from economic theory, geopolitics, human psychology, natural resources and their extraction, crime, electronic information systems and their optimizations, game theory, ...
That isn't a model I would call "just a stock price predictor".
Human language is an artifact created by complex beings. Mimicking general written artifacts exchanged between people, with any resemblance at all, requires a high level of understanding of how those complex beings operate: in conversation, writing, speeches, legal theory, their knowledge of thousands of topics, their psychologies, cultures, assumptions, motives, lifetime development, their modeling of each other, ... on and on.
LLMs, at the first point of being useful, were never "just" prediction machines.
I am still astonished there were ever technical people saying such a thing.
It works well and can be used for a lot of things, but still.