wavemode · 17 days ago
> the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”

You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. It reviews a single piece of writing, takes up to an hour, and costs $50. They are probably running the LLM painstakingly and repeatedly over combinations of sections of your text, letting it reason about what you've written in far more detail than you get from a plain run of a long-context model (due to the limitations of sparse attention).

It's neat. I wonder what other kinds of tasks we could improve AI performance on by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).

jjmarr · 17 days ago
I created a code review pipeline at work with a similar tradeoff and we found the cost is worth it. Time is a non-issue.

We could run Claude on our code and call it a day, but we have hundreds of style, safety, etc. rules on a very large C++ codebase with intricate behaviour (cooperative multitasking be fun).

So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness, at roughly the same order of magnitude of cost. Much better than humans, and it beats every commercial tool.

"Scaling time", on the other hand, is useless. You can just divide the problem among subagents until it finishes within a few minutes, which also increases quality due to less context/more focus.

aktau · 17 days ago
Any LLM-based code review tooling I've tried has been lackluster (most comments not too helpful). Prose review is usually better.

> So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

Sure, you could make multiple LLM invocations (different temperatures, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who summarizes the approach?

[1]: I suppose you could shard that out for as much compute as you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.
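For what it's worth, the fan-out/judge shape could be sketched like this (purely hypothetical: `run_reviewer` and `run_judge` stand in for real LLM calls, which is where all the actual work would happen):

```python
from concurrent.futures import ThreadPoolExecutor

def run_reviewer(diff: str, focus: str) -> list[str]:
    # Hypothetical: one CLI-agent invocation prompted to review `diff`
    # for a single concern (`focus`); returns its comments.
    return [f"[{focus}] comment on: {diff[:40]}"]

def run_judge(comments: list[str]) -> list[str]:
    # Hypothetical meta-LLM that filters/collates low-value comments;
    # here it just deduplicates while preserving order.
    seen: set[str] = set()
    return [c for c in comments if not (c in seen or seen.add(c))]

def review(diff: str, focuses: list[str]) -> list[str]:
    # Fan out one reviewer per focus area in parallel, then collate.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda f: run_reviewer(diff, f), focuses))
    return run_judge([c for batch in batches for c in batch])

print(review("void f() { /* ... */ }", ["style", "safety", "concurrency"]))
```

A second judging tier over (say) 10 such judges would shard out the same way.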

smallpipe · 17 days ago
> This has completely replaced human code review for anything that isn't functional correctness

Isn’t functional correctness pretty much the only thing that matters though?


Kim_Bruning · 17 days ago
> You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

I made a cursed CPU in the game 'Turing Complete', and had an older version of Claude build me an assembler for it.

Good luck finding THAT in the training data. :-P

(just to be sure, I then had it write actual programs in that new assembly language)

withinboredom · 17 days ago
But the ideas are not 'new'. A benchmark I use to tell whether an AI is overfitted is to present it with a recent paper (especially something like a Paxos variant) and have it build that. If it writes generic Paxos instead of what the paper specified, it's overfitted.

Claude 4.5: not overfitted too much -- does the right thing 6/10 times.

Claude 4.6: overfitted -- does the right thing 2/10 times.

OpenAI 5.3: overfitted -- does the right thing 3/10 times.

These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.

My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.


selridge · 17 days ago
>You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data.

This is just as stuck in a moment in time as "they only do next word prediction". What does this even mean anymore? Are we supposed to believe that a review of this paper, which didn't exist when that model (it's putatively not an "LLM", but IDK enough about it to be pushy there) was trained, is somehow in the training data? Does that even make sense? We're not in the regime of regurgitating training data (if we ever really were). We need to let go of these frames, which were barely true when they took hold. Some new shit is afoot.

wavemode · 17 days ago
Statistical models generalize. If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2).
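As a toy illustration, here's a least-squares line fit standing in for a real model:

```python
import numpy as np

# "Training data": f(x) = 5 and f(x+1) = 6, taking x as 0.
X = np.array([0.0, 1.0])
y = np.array([5.0, 6.0])

# Least-squares fit of y = w*x + b.
A = np.stack([X, np.ones_like(X)], axis=1)
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# f(x+2) was never in the training data, yet the fitted model gets it.
print(w * 2.0 + b)  # ~7.0
```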

Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since the author himself admits that he doesn't agree with all the LLM's comments.)

I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.

anon7725 · 17 days ago
“Represented in the training data” does not mean “represented as a whole in the training data”. If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space.
sasjaws · 17 days ago
A while ago I did the nanoGPT tutorial. I went through some math with pen and paper and noticed the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical.

That was a bit of a shock to me, so I wanted to share this thought. Basically I think it's not unreasonable to say LLMs are trained to predict the next book rather than a single token.

Hope this is useful to someone.

317070 · 17 days ago
As an expert in the field: this is exactly right.

LLMs are trained to do whole-book prediction; at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.

justinator · 17 days ago
where do you get these books?

honking intensifies

WHERE DO YOU GET THESE BOOKS?!

TuringTest · 17 days ago
Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?
apexalpha · 17 days ago
Are you referring to this one?: https://github.com/karpathy/build-nanogpt
sasjaws · 15 days ago
Thats the one, lots of fun and a great entrypoint for experimentation.
croon · 17 days ago
Isn't that why noise was introduced (seed rolling/temperature/high p/low p/etc)? I mean it is still deterministic given the same parameters.

But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion.

Not suggesting you did.
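For concreteness, a minimal sketch of the sampling mechanics (toy logits; seeded, hence deterministic for fixed parameters):

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    # Divide logits by T before softmax: low T sharpens toward argmax,
    # high T flattens the distribution.
    rng = np.random.default_rng(0) if rng is None else rng  # fixed seed
    z = np.asarray(logits) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = [2.0, 1.0, 0.1]
print(sample(logits, temperature=0.1))  # near-greedy: almost always index 0
print(sample(logits, temperature=5.0))  # much closer to uniform
```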

throw310822 · 17 days ago
> this might be misleadingly interpreted as an LLM having "thought out an answer"

I'm convinced that that is exactly what happens. Anthropic confirms it:

"Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so."

https://www.anthropic.com/research/tracing-thoughts-language...

sasjaws · 15 days ago
That's actually an interesting way to look at it. However, I just posted that because I often see articles expressing amazement at how training an LLM on next-token prediction can take it so far, seemingly contrasting the simplicity of the training task with the complexity of the outcome. The insight is that the training task was in fact 'predict the next book' just as much as it is 'predict the next token'. So every time I see that 'predict the next token' characterization of the training task it rubs me the wrong way. It's not wrong, but it's misleading.

I didn't mean to suggest that is how it 'thinks ahead', but I believe you can see it like that in a way. Because it has been trained to 'predict all the following tokens', it learned to guess the end of a phrase just as much as the beginning. I consider the mechanism of feeding each output token back in to be an implementation detail that distracts from what it actually learned to do.

I hope this makes sense. FYI I'm no expert in any way, just dabbling.

sputknick · 17 days ago
I'd like to explore this idea, did you make a blog post about it? is it simple enough to post in the reply?
sasjaws · 15 days ago
No blog post; my LLM expert friend told me this was kinda obvious when I shared it with him, so I didn't think it was worth it.

I can tell you how I got there: I did nanoGPT, then tried to be smart and train a model with a loss function that targets the next 2 tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training.

Sibling commenter also mentions:

> the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and then with cross-entropy loss, which optimizes for log likelihood, this becomes a summation.

Hope that helps.
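The equivalence is easy to check numerically (the per-step probabilities below are arbitrary stand-ins for a model's outputs):

```python
import numpy as np

# p[t]: the model's probability for the ground-truth token at step t.
p = np.array([0.9, 0.5, 0.8])

per_token_loss = -np.log(p)   # cross-entropy at each position
joint = p.prod()              # P(a,b,c) = P(a) * P(b|a) * P(c|a,b)

# Summed next-token losses == negative log of the whole-sequence probability.
assert np.isclose(per_token_loss.sum(), -np.log(joint))
print(per_token_loss.sum(), -np.log(joint))
```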

WithinReason · 17 days ago
Look up attention masks
krackers · 16 days ago
Unless I've misunderstood the math myself, I don't think GP's comment is quite right if taken literally, since "predict the next 2 tokens" would literally mean predicting indices t+1 and t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction, not classic LLM autoregressive training.

Instead, what GP likely means is the observation that the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and with cross-entropy loss, which optimizes for log likelihood, this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.

Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due to the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer; all of the final-layer residuals for previous tokens will encode predictions for their following index.

So attention on a block of text doesn't give you just the "next token prediction" but the simultaneous predictions for each prefix which makes training quite nice. You can just dump in a bunch of text and it's like you trained for the "next token" objective on all its prefixes. (This is convenient for training, but wasted work for inference which is what leads to KV caching).
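That single-pass, all-prefixes property is just the causal mask at work; a bare numpy sketch (random vectors standing in for trained token states):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                    # 3 tokens, d=4 states

scores = X @ X.T / np.sqrt(X.shape[1])
mask = np.triu(np.ones((3, 3), dtype=bool), k=1)
scores[mask] = -np.inf                         # token i sees only tokens <= i

w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
out = w @ X                                    # out[i]: state predicting token i+1

print(out.shape)   # one prediction state per prefix, from a single pass
print(w[0])        # position 0 can only attend to itself
```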

Many people also know by now that attention is "quadratic" in nature (the hidden state of token i attends to the states of tokens 1...i-1), but they don't fully grasp the implication: even though for forward inference you only predict the "next token", for backward training the error for token i can backpropagate to tokens 1...i-1. This is despite the causal masking, since token 1 doesn't attend to token i directly, but the hidden state of token 1 is involved in the computation of the residual stream for token i.

When it comes to the statement

>its not unreasonable to say llms are trained to predict the next book instead of single token.

You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of the ground-truth sequence, but this is not the same as maximizing the probability that the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens is not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful, since rollouts do in fact involve sampling, so you provide rewards at the "sampled sequence" level, which mirrors how you do inference.

It would be right to say that they're trained to ensure the most likely next book is assigned the highest joint probability (not just the most likely next token is assigned highest probability).
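A toy two-token example makes the greedy-vs-joint gap concrete (the probabilities are made up):

```python
# Made-up joint distribution over two-token sequences.
seq_probs = {
    ("a", "x"): 0.20, ("a", "y"): 0.21,   # P(first token "a") = 0.41
    ("b", "x"): 0.39, ("b", "y"): 0.00,   # P(first token "b") = 0.39
}

# Greedy decoding: most likely first token, then most likely continuation.
first = max("ab", key=lambda t: sum(p for (f, _), p in seq_probs.items() if f == t))
second = max("xy", key=lambda t: seq_probs[(first, t)])

# Highest joint-probability sequence.
best = max(seq_probs, key=seq_probs.get)

print((first, second), best)  # greedy ends at ('a', 'y'); joint best is ('b', 'x')
```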

pushedx · 17 days ago
Yes, most people (including myself) do not understand how modern LLMs work (especially if we consider the most recent architectural and training improvements).

There's the 3b1b video series which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the first models that we interacted with.

The novel insights that these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.

auraham · 17 days ago
I highly recommend Build a Large Language Model (From Scratch) [1] by Sebastian Raschka. It provides a clear explanation of the building blocks used in the first versions of ChatGPT (GPT-2, if I recall correctly). The output of the model is a huge vector of n elements, where n is the number of tokens in the vocabulary. We use that huge vector as a probability distribution to sample the next token given an input sequence (i.e., a prompt). Under the hood, the model has several building blocks like tokenization, skip connections, self-attention, masking, etc. The author does a great job explaining all the concepts. It is very useful for understanding how LLMs work.

[1] https://www.manning.com/books/build-a-large-language-model-f...
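That sampling loop can be sketched with a toy bigram "model" (the table and vocabulary are made up; `logits_fn` stands in for a real forward pass that would output a vocabulary-sized logit vector):

```python
import numpy as np

# Minimal autoregressive decode loop over a toy 4-token vocabulary.
vocab = ["<s>", "the", "cat", "sat"]
bigram = np.array([          # row i: logits for the token following vocab[i]
    [0.0, 3.0, 0.0, 0.0],    # <s>  -> "the"
    [0.0, 0.0, 3.0, 0.0],    # the  -> "cat"
    [0.0, 0.0, 0.0, 3.0],    # cat  -> "sat"
    [3.0, 0.0, 0.0, 0.0],    # sat  -> <s>
])

def logits_fn(context):
    return bigram[context[-1]]          # only the last token matters here

tokens = [0]                            # start with <s>
for _ in range(3):
    probs = np.exp(logits_fn(tokens))   # softmax numerator
    probs /= probs.sum()                # the vocabulary-sized distribution
    tokens.append(int(probs.argmax()))  # greedy "sampling", fed back in

print(" ".join(vocab[t] for t in tokens))  # <s> the cat sat
```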

phreeza · 17 days ago
But this is missing exactly the gap OP seems to have: going from a next-token predictor (a language model in the classical sense) to an instruction-finetuned, RLHF-ed, and "harnessed" tool.
measurablefunc · 17 days ago
What's the latest novel insight you have encountered?
brookst · 17 days ago
Not the person you asked, and "novel" is a minefield. What's the last novel anything, in the sense that you can't trace a precursor or reference?

But.. I recently had a LLM suggest an approach to negative mold-making that was novel to me. Long story, but basically isolating the gross geometry and using NURBS booleans for that, plus mesh addition/subtraction for details.

I’m sure there’s prior art out there, but that’s true for pretty much everything.

ChaitanyaSai · 17 days ago
The whole next word thing is interesting, isn't it? I like to see it through Dennett's "competence and comprehension" lens. You can predict the next word competently with shallow understanding. But you could also do it well with understanding, or comprehension of the full picture: a mental model that allows you to predict better. Are the AIs stumbling into these mental models? Seems like it. However, because these are such black boxes, we do not know how they are stringing these mental models together. Is it a random pick from 10 models built up inside the weights? Is there any system-wide cohesive understanding, whatever that means? Exploring what a model can articulate using self-reflection would be interesting. Can it point to internal cognitive dissonance because it has been fed both evolution and intelligent design, for example? Or do these exist as separate models to invoke depending on the prompt context, because all that matters is being rewarded by the current user?
grey-area · 17 days ago
Given their failure on novel logic problems, generation of meaningless text, tendency to do things like delete tests, and incompetence at simple mathematics, it seems very unlikely they have built any sort of world model. It's remarkable how competent they are given the way they work.

Predict the next word is a terrible summary of what these machines do though, they certainly do more than that, but there are significant limitations.

‘Reasoning’ etc are marketing terms and we should not trust the claims made by companies who make these models.

The Turing test had too much confidence in humans it seems.

steve1977 · 17 days ago
> Predict the next word is a terrible summary of what these machines do though, they certainly do more than that

What would that be?

Kim_Bruning · 17 days ago
So that might depend on the model, how long ago you last tested it, etc. I've seen LLMs solve novel logic problems, generate meaningful text, and retain tests just fine, and simple mathematics is a lot better on newer models.

Btw, if you read the actual paper that proposes the Turing test, Turing actually rejects the framing of "can machines think?", preferring the more practical "can you tell them apart in practice?".

shakna · 17 days ago
Probably worth remembering that ELIZA passed Turing tests, and was the definition of shallow prediction.
red75prime · 17 days ago
> there are significant limitations

Where can we read about those significant limitations?

mzhaase · 17 days ago
It has always seemed to me that LLMs may be like the language center of the brain, and that there should be a "whole damn rest of the brain" behind it to steer it.

LLMs miss very important concepts, like the concept of a fact. There is no "true", just consensus text on the internet given a certain context. Like that recent study where LLMs gave wrong info if the biography of a poor person was in the context.

steve1977 · 17 days ago
I think much along the same lines. LLMs are probably even just a part of the language center.

And of course they also miss things like embodiment, mirror neurons etc.

If an LLM makes a mistake, it will tell you it is sorry. But does it really feel sorry?

joquarky · 17 days ago
Ever practiced meditation of the form where you just witness your thoughts? It seems just like LLM generated words, both factual and confabulated nonsense.
dnautics · 17 days ago
That's unlikely, but they are an awful lot like Turing machines (KV cache ~ Turing tape), so their architecture is strongly predisposed to be able to find any algorithm, possibly including reasoning.
throw310822 · 17 days ago
> You can predict the next word competently with shallow understanding.

I don't get this. When you say "predict the next word" what you mean is "predict the word that someone who understands would write next". This cannot be done without an understanding that is as complete as that of the human whose behaviour you are trying to predict. Otherwise you'd have the paradox that understanding doesn't influence behaviour.

js8 · 17 days ago
Dennett also came to my mind reading the title, but in a different sense. When people came up with the theory of evolution, it was hard for many to conceive how we get from "subtly selecting from random changes" to "building a mechanism as complex as a human". I think Dennett offers a nice analogy with a skyscraper: how can it be built if cranes are only so tall?

In a similar way, LLMs build small abstractions, first on words (how to subtly rearrange them without changing meaning), then they start to understand logic patterns such as "if B follows from A, and we're given A, then B", and eventually they learn to reason in various ways.

It's the scale of the whole process that defies human understanding.

(Also modern LLMs are not just next word predictors anymore, there is reinforcement learning component as well.)

mekoka · 17 days ago
> Are the AIs stumbling into these mental models? Seems like it.

Since nature decided to deprive me of telepathic abilities, when I want to externalize my thoughts to share with others, I'm bound to this joke of a substitute we call language. I must either produce sounds that encode my meaning, or gesture, or write symbols, or basically find some way to convey my inner world by using bodily senses as peripherals. Those who receive my output must do the work in reverse to extract my meaning, the understanding in my message. Language is what we call a medium that carries our meaning to one another's psyche.

LLMs, as their name alludes, are trained on language, the medium, and they're LARGE. They're not trained on the meaning, like a child would be, for instance. Saying that by their sole analysis of the structure and patterns in the medium they're somehow capable of stumbling upon the encoded meaning is like saying that it's possible to become an engineer by simply mindlessly memorizing many perfectly relevant scripted lines whose meaning you haven't the foggiest.

Yes, on the surface the illusion may be complete, but can the medium somehow become interchangeable with the meaning it carries? Nothing indicates this. Everything an LLM does still very much falls within the parameters of "analyze humongous quantity of texts for patterns with massive amount of resources, then based on all that precious training, when I feed you some text, output something as if you know what you're talking about".

I think the crossover we seem to perceive is just us forgetting to reflect on the scale and significance of the resources required to get them to fool us.

halyconWays · 17 days ago
Searle's Chinese Room experiment but without knowing what's in the room, and when you try to peek in you just see a cloud of fog and are left to wonder if it's just a guy with that really big dictionary or something more intelligent.
selridge · 17 days ago
It's an octopus, perhaps: https://aclanthology.org/2020.acl-main.463.pdf

There's also this blog post: https://julianmichael.org/blog/2020/07/23/to-dissect-an-octo... (which IMO is better to read than the paper)

basch · 17 days ago
It's honestly disheartening and a bit shocking how everyone has started repeating the "predict the next syllable" criticism.

The language model predicts the next syllable by FIRST arriving in a point in space that represents UNDERSTANDING of the input language. This was true all the way back in 2017 at the time of Attention Is All You Need. Google had a beautiful explainer page of how transformers worked, which I am struggling to find. Found it. https://research.google/blog/transformer-a-novel-neural-netw...

The example was and is simple and perfect. Take the word "bank". You can tell what "bank" means by its proximity to words such as "river" or "vault". You compare "bank" to every word in the sentence to decide which "bank" it is. Rinse, repeat. A lot. You then add all the meanings together. Language models make a frequency association of every word with every other word, then sum it to create understanding of complex ideas, even if they don't understand what they're understanding and have never seen it before.
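That disambiguation step can be sketched by hand with made-up 2-d embeddings (the vectors are illustrative, not learned weights):

```python
import numpy as np

emb = {
    "bank":  np.array([1.0, 1.0]),   # ambiguous: part "river", part "money"
    "river": np.array([1.0, 0.0]),
    "vault": np.array([0.0, 1.0]),
}
sentence = ["bank", "river"]
X = np.stack([emb[w] for w in sentence])

# Compare every word to every other word (scaled dot-product attention).
scores = X @ X.T / np.sqrt(X.shape[1])
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each word becomes a weighted mix of the words around it.
contextual = weights @ X
print(contextual[0])  # "bank" pulled toward the "river" sense
```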

That all happens BEFORE "autocompleting the next syllable."

The magic part of LLMs is understanding the input. Being able to use that to make an educated guess of what comes next is really a lucky side effect. The fact that you can chain that together indefinitely with some random number generator thrown in and keep saying new things is pretty nifty, but a bit of a show stealer.

What really amazes me about transformers is that they completely ignored prescriptive linguistic trees and grammar rules and let the process decode the semantic structure fluidly and on the fly. (I know Google uses encode/decode backwards from what I am saying here.) This lets people create crazy run-on sentences that break every rule of English (or your favorite language) but are still parsable as instructions.

It is really helpful to remember that transformers origins are language translation. They are designed to take text and apply a modification to it, while keeping the meaning static. They accomplish this by first decoding meaning. The fact that they then pivoted from translation to autocomplete is a useful thing to remember when talking to them. A task a language model excels at is taking text, reducing it to meaning, and applying a template. So a good test might be "take Frankenstein, and turn it into a magic school bus episode." Frankenstein is reduced to meaning, the Magic School Bus format is the template, the meaning is output in the form of the template. This is a translation, although from English to English, represented as two completely different forms. Saying "find all the Wild Rice recipes you can, normalize their ingredients to 2 cups of broth, and create a table with ingredient ranges (min-max) for each ingredient option" is closer to a translation than it is to "autocomplete." Input -> Meaning -> Template -> Output. With my last example the template itself is also generated from its own meaning calculation.

A lot has changed since 2017, but the interpreter being the real technical achievement still holds true imho. I am more impressed with AI's ability to parse what I am saying than I am by its output (image models notwithstanding).

qsera · 17 days ago
>represents UNDERSTANDING of the input language.

It does not have an understanding, it pattern matches the "idea shape" of words in the "idea space" of training data and calculates the "idea shape" that is likely to follow considering all the "idea shape" patterns in its training data.

It mimics understanding. It feels mysterious to us because we cannot imagine the mapping of a corpus of text to this "idea space".

It is quite similar to how mysterious a computer playing a movie can appear if you are not aware of the mapping of movie to a set of pictures, pictures to pixels, and pixels to coordinates and color codes.

steve1977 · 17 days ago
From what I understand, it's more like "input is 1, 3, 5, 7" so "output is likely to be 9".

Understanding would be a bit generous of a term for that I guess, but that also depends on the definition of understanding.

joquarky · 17 days ago
Even if it gets the output wrong, it always seems to provide some output that indicates that it got the input right. This is the first thing that really surprised me about this tech.
GodelNumbering · 17 days ago
It is probably the first-time aha moment the author is talking about. But under the hood, it is probably not as magical as it appears to be.

Suppose you prompted the underlying LLM with "You are an expert reviewer in..." and a bunch of instructions, followed by the paper. The LLM knows from training that 'expert reviewer' is an important term (skipping over and oversimplifying here) and that its response should be framed as what it knows an expert reviewer would write. LLMs are good at picking up (or copying) the patterns of a response, but the underlying layer that evaluates things against a structural and logical understanding is missing. So, in corner cases, you get responses that are framed impressively but do not contain any meaningful input. This trait makes LLMs great at demos but weak at consistently finding novel interesting things.

If the above is true, the author will find after several reviews that the agent they use keeps picking up on the same/similar things (collapsed behavior that makes it good at coding-type tasks) and is blind to some other obvious things it should have picked up on. This is not a criticism; many humans are often just as collapsed in their 'reasoning'.

LLMs are good at 8 out of 10 tasks, but you don't know which 8.

Kim_Bruning · 17 days ago
In your model, explain the old trick "think step by step"
GodelNumbering · 17 days ago
It simply forces the model to adopt an output style known to be conducive to systematic thinking, without actually thinking. At no point has it thought through the thing (unless there are separate thinking tokens).
teekert · 17 days ago
I think this is a thing not often discussed here, but I too have this experience. An LLM can be fantastic if you write a 25-pager then later need to incorporate a lot of comments with sometimes conflicting arguments/viewpoints.

LLMs can be really good at "get all arguments against this", "incorporate this viewpoint into this text while making it more concise", "are these views actually contradicting, or can I write it such that they align? Consider incentives".

If you know what you're doing and understand the matter deeply (and that is very important) you'll find that the LLM is sometimes better at wording what you actually mean, especially when not writing in your native language. Of course, you study the generated text, make small changes, make it yours, make sure you feel comfortable with it etc. But man can it get you over that "how am I going to write this down"-hump.

Also: "Make an executive summary" "Make more concise", are great. Often you need to de-linkedIn the text, or tell it to "not sound like an American waiter", and "be business-casual", "adopt style of rest of doc", etc. But it works wonders.

callmeal · 17 days ago
"Predict the next word" is to a current LLM what a transistor (or gate) is to a modern CPU. I don't understand LLMs enough to expand on that comparison, but I can see how having layers above that feed the layers below to "predict the next word", and using the output to modify the input, leads to what we see today. It is turtles all the way down.
brookst · 17 days ago
It’s a good comparison. It’s about abstraction and layers. Modern LLMs aren’t just models; they’re all the infrastructure around prompting and context management and mixtures of experts.

The next-word bit may sit slightly higher than an individual transistor, possibly at the level of functional units.

ejolto · 17 days ago
There is a big difference, because I understand how those transistors produce a picture on a screen, I don’t understand how LLMs do what they do. The difference is so big that the comparison is useless.
jcul · 17 days ago
I understand how transistors work too, and how they can result in a picture on a screen. But I think most people outside the software / electronics areas don't and to them it's just magic.
echelon · 17 days ago
Humans are future predictors. Our vision systems, our mental models of our careers. People that predict the future tend to do well financially.

Now the machines are getting better than we are. It's exciting and a little bit terrifying.

We were polymers that evolved intelligence. Now the sand is becoming smart.

qsera · 17 days ago
>Now the machines are getting better than we are

Then AI companies should stop looking for investors and instead play the stock markets with all that predictive power!

modeless · 17 days ago
It's clear that in the general case "predict the next word" requires arbitrarily good understanding of everything that can be described with language. That shouldn't be mysterious. What's mysterious is how a simple training procedure with that objective can in practice achieve that understanding. But then again, does it? The base model you get after that simple training procedure is not capable of doing the things described in the article. It is only useful as a starting point for a much more complex reinforcement learning procedure that teaches the skills an agent needs to achieve goals.

RL is where the magic comes from, and RL is more than just "predict the next word". It has agents and environments and actions and rewards.