libraryofbabel · 17 days ago
This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry-picking what it gets wrong gets to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.

yosefk · 17 days ago
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.

My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.

Let's see how my predictions hold up; I have made enough to look very wrong if they don't.

Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says

WillPostForFood · 15 days ago
Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT, and got a reasonable answer, nothing like what you paraphrased in your post.

https://imgur.com/a/O9CjiJY

libraryofbabel · 17 days ago
I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.

I think my biggest complaint is that the essay points out flaws in LLMs’ world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different from, and often more frustrating than, how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.

A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the AlphaZero chess engine was trained, in fact.
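For concreteness, here is a minimal sketch of that kind of reinforcement learning with a verifiable reward (plain REINFORCE on a toy arithmetic task); it is not how any lab's actual pipeline works, just the shape of the idea:

    import math
    import random

    # Toy "policy": a distribution over two answer strategies for 2-digit
    # addition, one correct and one that drops the carry. REINFORCE shifts
    # probability mass toward whatever an automatic checker rewards.
    weights = {"correct": 0.0, "buggy": 0.0}  # logits

    def softmax(w):
        z = {k: math.exp(v) for k, v in w.items()}
        total = sum(z.values())
        return {k: v / total for k, v in z.items()}

    def answer(a, b, strategy):
        if strategy == "correct":
            return a + b
        return (a % 10 + b % 10) % 10 + 10 * (a // 10 + b // 10)  # loses the carry

    lr = 0.5
    for step in range(500):
        a, b = random.randint(10, 99), random.randint(10, 99)
        probs = softmax(weights)
        strategy = random.choices(list(probs), weights=list(probs.values()))[0]
        reward = 1.0 if answer(a, b, strategy) == a + b else 0.0  # verifiable reward
        for k in weights:  # policy-gradient update for the sampled strategy
            grad = (1.0 if k == strategy else 0.0) - probs[k]
            weights[k] += lr * reward * grad

    print(softmax(weights))  # probability mass ends up mostly on "correct"
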

I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:

* LLMs have imperfect models of the world which is conditioned by how they’re trained on next token prediction.

* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.

* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)

* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.

I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.

Anyway thanks for writing this and responding!

AyyEye · 17 days ago
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
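For reference on the "few gates" point, here is a minimal sketch of a 1-bit full adder chained into a ripple-carry adder, which is all binary addition needs; it says nothing about how an LLM represents addition internally:

    def full_adder(a, b, carry_in):
        # sum bit: two XOR gates; carry-out: two ANDs and one OR
        s = a ^ b ^ carry_in
        carry_out = (a & b) | (carry_in & (a ^ b))
        return s, carry_out

    def ripple_add(x, y, bits=8):
        carry, total = 0, 0
        for i in range(bits):
            s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
            total |= s << i
        return total

    assert ripple_add(77, 45) == 122
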
BobbyJo · 17 days ago
The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.

It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.

In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
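To make the tokens-vs-letters point concrete, a tiny sketch (assuming the tiktoken package; exact token boundaries depend on the encoding):

    import tiktoken

    # The model consumes token IDs, not characters, so "count the Bs" requires
    # recovering character-level structure it never directly sees.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("blueberry")
    print(ids)                              # a couple of subword token IDs
    print([enc.decode([i]) for i in ids])   # e.g. ['blue', 'berry']
    print("blueberry".count("b"))           # 2, trivial at the character level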

I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/

LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Two countries that are close together will be talked about together much more often than two countries far from each other.

eru · 15 days ago
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.

yosefk · 17 days ago
Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think
andyjohnson0 · 17 days ago
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Is this a real defect, or some historical thing?

I just asked GPT-5:

    How many "B"s in "blueberry"?
and it replied:

    There are 2 — the letter b appears twice in "blueberry".
I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.

Nevermark · 14 days ago
That was always a specious test.

LLMs don't ingest text a character at a time. The difficulty with analyzing individual letters just reflected that they don't directly "see" letters in their tokenized input.

A direct comparison would be asking someone how many convex Bézier curves are in the spoken word "monopoly".

Or how many red pixels are in a visible icon.

We could work out answers to both. But they won't come to us one-shot or accurately, without specific practice.

libraryofbabel · 17 days ago
> they clearly don't have any world model whatsoever

Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.

williamcotton · 15 days ago
I don’t solve math problems with my poetry writing skills:

https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...

derdi · 15 days ago
Ask a kid that doesn't know how to read and write how many Bs there are in blueberry.
joe_the_user · 15 days ago
I think both the literature on interpretability and explorations of internal representations actually reinforce the author's conclusion. Internal-representation research tends to show that nets dealing with a single "model" don't necessarily have the same representation, and don't necessarily have a single representation at all.

And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time and I don't think much has changed - except things do better on benchmarks.

And, humans too can act without a consistent world model.

armchairhacker · 17 days ago
Any suggestions from this literature?
libraryofbabel · 17 days ago
The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.
ameliaquining · 15 days ago
One thing I appreciated about this post, unlike a lot of AI-skeptic posts, is that it actually makes a concrete falsifiable prediction; specifically, "LLMs will never manage to deal with large code bases 'autonomously'". So in the future we can look back and see whether it was right.

For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.

moduspol · 15 days ago
"Deal with" and "autonomously" are doing a lot of heavy lifting there. Cursor already does a pretty good job indexing all the files in a code base in a way that lets it ask questions and get answers pretty quickly. It's just a matter of where you set the goalposts.
yosefk · 15 days ago
Cursor fails miserably for me even just trying to replace function calls with method calls consistently, like I said in the post. This I would hope is fixable. By dealing autonomously I mean "you don't need a programmer - a PM talks to an LLM and that's how the code base is maintained, and this happens a lot (rather than in one or two famous cases where it's pretty well known how they are special and different from most work)"

By "large" I mean 300K lines (strong prediction), or 10 times the context window (weaker prediction)

I don't shy away from looking stupid in the future, you've got to give me this much

jononor · 15 days ago
"LLM" as well, because coding agents are already more than just an LLM. There is very useful context management around it, and tool calling, and ability to run tests/programs, etc. Though they are LLM-based systems, they are not LLMs.
ameliaquining · 15 days ago
True, there'd be a need to operationalize these things a bit more than is done in the post to have a good advance prediction.
shinycode · 15 days ago
« Autonomously »? What happens when subtle updates, which are not bugs but change the meaning of some features, break the workflow in some other external part of a client’s system? It happens all the time, and because it’s really hard to keep the whole meaning and business rules written down and up to date, an LLM might never be able to grasp some of that meaning. Maybe if, instead of developing code and infrastructure, the whole industry shifted toward only writing impossibly precise spec sheets that make meaning and intent crystal clear, then « autonomously » might be possible to pull off.
wizzwizz4 · 15 days ago
Those spec sheets exist: they're called software.
whoknowsidont · 15 days ago
I don't think that statement is falsifiable until you define "deal with" and "large code bases."
exe34 · 15 days ago
How large? What does "deal" mean here? Autonomously - is that on its own whim, or at the behest of a user?

Deleted Comment

p1necone · 15 days ago
That feels like a statement that's far too loosely defined to be meaningful to me.

I work on codebases that you could describe as 'large', and you could describe some of the LLM driven work being done on them as 'autonomous' today.

otabdeveloper4 · 15 days ago
> LLMs will never manage to deal with large code bases 'autonomously'

Absolutely nothing about that statement is concrete or falsifiable.

Hell, you can already deal with large code bases 'autonomously' without LLMs - grep and find and sed goes a long way!

drdeca · 14 days ago
Seems falsifiable to me? If an LLM (+harness) is fully maintaining a project, updating things when dependencies update, handling bug reports, etc., in a way that is considered decent quality by consumers of the project, then that seems like it would falsify it.

Now, that’s a very high bar, and I don’t anticipate it being cleared any time soon.

But I do think if it happened, it would pretty clearly falsify the hypothesis.

Mars008 · 15 days ago
In two years there will probably be no new 'autonomous' LLMs. They will most likely be integrated into 'products', trained and designed for this. We see the beginning of it today as agents and tools.
slt2021 · 15 days ago
>LLMs will never manage to deal

time to prove hypothesis: infinity years

bootsmann · 14 days ago
The whole of modern science is based on the idea that we can never prove a theory about the world to be true, but that we can provide experiments which allow us to show that some theories are closer to the truth than others.
eru · 15 days ago
Eh, if the hypothesis remains unfalsified for longer and longer, we can have increased confidence.

Similarly, Newton's laws say that bodies always stay at rest unless acted upon by a force. Strictly speaking, if a billiard ball jumped up without cause tomorrow, that would disprove Newton. So we'd have to wait an infinite amount of time to prove Newton right.

However no one has to wait so long, and we found ways to express how Newton's ideas are _better_ than those of Aristotle without waiting an eternity.

keeda · 17 days ago
That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.
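(The kind of step in question looks roughly like the sketch below; a guess at the shape of it, not the actual code, assuming opencv-python and numpy, with "input.png" as a placeholder path: flatten a BGRA PNG onto a white background before further processing.)

    import cv2
    import numpy as np

    img = cv2.imread("input.png", cv2.IMREAD_UNCHANGED)   # BGRA if the PNG has alpha
    if img is not None and img.ndim == 3 and img.shape[2] == 4:
        bgr = img[:, :, :3].astype(np.float32)
        alpha = img[:, :, 3:4].astype(np.float32) / 255.0
        white = np.full_like(bgr, 255.0)                   # flatten onto white
        img = (alpha * bgr + (1.0 - alpha) * white).astype(np.uint8)
    if img is not None:
        cv2.imwrite("flattened.png", img)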

I even often describe the results, e.g. "this fails in X manner when the image has grainy regions", and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)

And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.

geraneum · 15 days ago
> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency

Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what exactly is in the training sets. I don’t know either. They don’t tell ;)

> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

We tend to overestimate the novelty of our own work and methods, and at the same time underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn’t mean that what you are doing specifically has been done in this exact way before; rather, the patterns being adapted and the overall approach may not be as one of a kind as they seem.

> is a purely philosophical question

It is indeed. A question we need to ask ourselves.

Uehreka · 15 days ago
> We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers.

If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).

antirez · 15 days ago
The post is based on a misconception. If you read the blog post linked at the end of this message, you'll see how a very small GPT-2-like transformer (Karpathy's nanoGPT trained at a very small size), after seeing just PGN games and nothing more, develops an 8x8 internal representation of which chess piece is where. This representation can be extracted by linear probing (and can even be altered by using the probe in reverse). LLMs are decent but not very good chess players for other reasons, not because they don't have a world model of the chess board.

https://www.lesswrong.com/posts/yzGDwpRBx6TEcdeA5/a-chess-gp...
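The probing itself is conceptually simple; a minimal sketch of the idea, where the activations are random stand-ins for a transformer's hidden states and sklearn's LogisticRegression stands in for the linear probe:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_positions, d_model = 1000, 512
    # Stand-ins: real probing uses hidden states recorded after each move token.
    activations = rng.normal(size=(n_positions, d_model))
    square_contents = rng.integers(0, 13, size=n_positions)  # 12 piece types + empty

    probe = LogisticRegression(max_iter=1000)  # one such probe per board square
    probe.fit(activations[:800], square_contents[:800])
    # ~chance accuracy on random stand-ins; high accuracy on real activations
    # is what the linked post reports.
    print(probe.score(activations[800:], square_contents[800:]))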

thecupisblue · 14 days ago
Ironically, that lesswrong article is more wrong than right.

First, chess is perfect for such modeling. The game is basically a tree of legal moves. The "world model" representation is already encoded in the dataset itself, and at a certain scale the chance of making an illegal move is minimal, as the dataset itself includes an insane number of legal moves compared to illegal moves, let alone when you are training on a chess dataset like the PGN one.

Second, the probing is quite... a subjective thing.

We are cherry-picking activations across an arbitrary number of dimensions, on a model specifically trained for chess, taking these arbitrary representations and displaying them on a 2D graph.

Well yeah, with enough dimensions and cherry-picking, we can also show how "all zebras are elephants, because all elephants are horses and look their weights overlap in so many dimensions - large four-legged animals you see on safari!" - especially if we cherry-pick it. Especially if we tune a dataset on it.

This shows nothing other than "training LLMs on a constrained move dataset makes LLM great at predicting next move in that dataset".

flender · 14 days ago
And if it knew every possible board configuration and optimal move, it could potentially play as well as possible. But if it were instead to just recognize “this looks like a chess game” and hand off to an optimized tool to determine the next move, that would seem a better use of training.
yosefk · 14 days ago
The post, or rather the part you refer to, is based on a simple experiment which I encourage you to repeat. (It is way likelier to reproduce in the short to medium run than the others.)

From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"

These things don't play at 1800 Elo, though maybe someone measured that Elo without cheating, relying instead on artifacts of how an engine told to play at a low rating does against an LLM (engines are weird when you ask them to play badly, as a rule); a good start toward a decent measurement would be to try it on Chess960. These things do lose track of the pieces within 10 moves. (As do I, absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow.")

og_kalu · 14 days ago
>These things don't play at 1800 ELO

Why are you saying 'these things'? That statement is about a specific model which did play at that level and did not lose track of the pieces. There's no cheating or weirdness.

https://github.com/adamkarvonen/chess_gpt_eval

ej88 · 17 days ago
This article is interesting but pretty shallow.

0(?): there’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model right?

1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board (or the notation) and ask it about the position. It might not give you GM-level analysis but it definitely has a model of what’s going on.

2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.

You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!

https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358

yosefk · 17 days ago
A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.

The examples are from all the major commercial American LLMs as listed in a sister comment.

You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.

tmnvdb · 15 days ago
If you train an LLM on chess, it will learn that too. You don't need to explain the rules, just feed it chess games, and it will stop making illegal moves at some point. It is a clear example of an inferred world model from training.

https://arxiv.org/abs/2501.17186

PS "Major commercial American LLM" is not very meaningful, you could be using GPT4o with that description.

codebastard · 14 days ago
In my opinion the author refers to an LLM's inability to create an inner world, a world model.

That means it does not build a mirror of a system based on its interactions.

It just outputs fragments of the world models it was built on and tries to give you a string of fragments that match the fragment of your world model that you provided through some input method.

It cannot abstract the code base fragments you share, and it cannot extend them with details using a model of the whole project.

ordu · 15 days ago
> LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction because of how far you can take them despite the approach being fundamentally incorrect.

I don't believe that it is a fundamentally incorrect approach. I believe that the human mind does something like that all the time; the difference is that our minds have some additional processes that can, for example, filter hallucinations.

Kids at a specific age range are afraid of their imagination. Their imagination can place a monster into any dark place where nothing can be seen. An adult mind can do the same easily, but the difference is kids have difficulties distinguishing imagination and perception, while adults generally manage.

I believe the ability of the human mind to see the difference between imagination/hallucination on the one hand and perception and memory on the other is not a fundamental thing stemming from the architecture of the brain but a learned skill. Moreover, people can be tricked into acquiring false memories[1]. If an LLM fell for the tricks of Elizabeth Loftus, we'd say the LLM hallucinated.

What LLMs need is to learn some tricks to detect hallucinations. Probably they will not get a 100% reliable detector, but to get to the level of humans they don't need 100% reliability.

TazeTSchnitzel · 15 days ago
I have recently lived through something called a psychotic break, which was an unimaginably horrible thing, but it did let me see from the inside what insanity does to your thinking.

And what's fascinating, coming out the other side of this, is how similar LLMs are to someone in psychosis. Someone in psychosis can have all the ability LLMs have to recognise patterns and sound like they know what they're talking about, but their brain is not working well enough to have proper self-insight, to be able to check their thoughts actually fully make sense. (And “making sense” turns out to be a sliding scale — it is not as if you just wake up one day suddenly fully rational again, there's a sliding scale of irrational thinking and you have to gradually re-process your older thoughts into more and more coherent shapes as your brain starts to work more correctly again.) I believe this isn't actually a novel insight either, many have worried about this for years!

Psychosis might be an interesting topic to read about if you want to get another angle to understand the AI models from. I won't claim that it's exactly the same thing, but I will say that most people probably have a very undeveloped idea of what mental illness actually is or how it works, and that leaves them badly prepared for interacting with a machine that has a strong resemblance to a mentally ill person who's learned to pretend to be normal.
rauljara · 14 days ago
Thank you for sharing, and sorry you had to go through that. I had a good friend go through a psychotic break and I spent a long time trying to understand what was going on in his brain. The only solid conclusion I could come to was that I could not relate to what he was going through, but that didn’t change that he was obviously suffering and needed whatever support I could offer. Thanks for giving me a little bit of insight into his brain. Hope you were/are able to find support out there.
johnisgood · 14 days ago
If we just take a panic attack, many people have no clue what it feels like, which is unfortunate, because they lack empathy for those who do experience it. My psychiatrists definitely need to experience it to understand.
mac-mc · 15 days ago
Do you have many memories of that time, around 3 to 5, and remember what your cognitive processes were?

When the child is afraid of the monster in the dark, they are not literally visually hallucinating a beast in the dark; they are worried that there could be a beast in the dark, and they are not sure that there isn't one, due to a lack of sensory information confirming the monster's absence. They are not being hyper-precise because they are 3, so they say "there is a monster under my bed"! Children have instincts to be afraid of the dark.

Similarly with imaginary friends and play, it's an instinct to practice through smaller stakes simulations. When they are emotionally attached to their imaginary friends, it's much like they are emotionally attached to their security blanket. They know that the "friend" is not perceptible.

It's much like the projected anxieties of adults or teenagers, who are worried that everyone thinks they are super lame and thus act like people do, because on the balance of no information, they choose the "safer path".

That is pretty different than the hallucinations of LLMs IMO.

bayindirh · 15 days ago
From my perspective, the fundamental problem arises from the assumption that all of the brain's functions are self-contained; however, there are feedback loops in the body which support the functions of the brain.

The simplest one is fight/flight/freeze. The brain starts the process by being afraid and hormones get released, but the next step is triggered by the nerve feedback coming from the body. If you are using beta-blockers and can't panic, the initial trigger fizzles and you return to your pre-panic state.

An LLM doesn't model a complete body. It just models language, which is only a very small part of what the brain handles, so assuming that modelling language (or even the whole brain) is going to answer all the questions we have is a flawed approach.

The latest research shows the body is a much more complicated and interconnected system than we learnt in school 30 years ago.

mft_ · 15 days ago
Sure, your points about the body aren’t wrong, but (as you say) LLMs are only modelling a small subset of a brain’s functions at the moment: applied knowledge, language/communication, and recently interpretation of visual data. There’s no need or opportunity for LLMs (as they currently exist) to do anything further. Further, just because additional inputs exist in the human body (the gut-brain axis, for example) it doesn’t mean that they are especially (or at all) relevant for knowledge/language work.
shkkmo · 15 days ago
> If LLM fell to tricks of Elizabet Loftus, we'd say LLM hallucinated.

She's strongly oversold how and when false memories can be created. She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.

She's spent a career over-generalizing data about implanting false minor memories to make money discrediting victims' traumatic memories and defend abusers.

You conflate "hallucination" with "imagination" but the former has much more in common with lieing than it does with imagining.

taneq · 15 days ago
> She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.

Did she have financial incentives? Was this a live demonstration? :P

Mikhail_Edoshin · 14 days ago
You probably know the Law of Archimedes. Many people do. But do you know it in the same way Archimedes did? No. You were told the law, then taught how to apply it. But Archimedes discovered it without any of that.

Can we repeat the feat of Archimedes? Yes, we can, but first we'd have to forget what we were told and taught.

The way we actually discover things is very different from amassing lots of hearsay. Indeed, we do have an internal part that behaves the same way an LLM does. But to get to real understanding we actually shut down that part, forget what we "know", and start from a clean slate. That part does not help us think; it helps us avoid thinking. The reason it exists is that it is useful: thinking is hard and slow, but recalling is easy and fast. But it is not thinking; it is the opposite.

ordu · 14 days ago
> But to get to the real understanding we actually shut down that part, forget what we "know", start from a clean slate.

Close, but not exactly. Starting from a clean slate is not very difficult; the trick is to reject some chosen parts of existing knowledge, or more specifically, the difficulty is choosing what to reject. Starting from a clean slate, you'll end up spending millennia regaining the knowledge you've just rejected.

So the overall process of generating knowledge is to look under the streetlight till finding something new becomes impossible or too hard, and then you start experimenting with rejecting some bits of your knowledge to rethink them. I was taught to read works of Great Masters of the past critically, trying to reproduce their path while looking for forks where you can try to go the other way. It is a little bit like starting from a clean slate, but not exactly.

otabdeveloper4 · 15 days ago
> I believe, that human mind does something like that all the time

Absolutely not. Human brains have online one-shot training. LLM weights are fixed, and fine-tuning them is a huge multi-year enterprise.

Fundamentally it's two completely different architectures.

ordu · 14 days ago
I really don't like how you reject the idea completely. People have online one-shot training, but have you tried to learn how to play the piano? To learn it you need a lot of repetition. Really a lot. You need a lot of repetition to learn how to walk, or how to do arithmetic, or how to read English. This is very similar to LLMs, isn't it? So they are not completely different architectures, are they? It is more like human brains have something on top of an "LLM" that allows them to do tricks that LLMs couldn't do.
skeledrew · 17 days ago
Agree in general with most of the points, except

> but because I know you and I get by with less.

Actually we've gotten far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and we are only really considered fully intelligent in our late teens to mid-20s.

helloplanets · 15 days ago
Don't forget the millions of years of pre-training! ;)
o_nate · 16 days ago
What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.
yosefk · 16 days ago
thank you for your kind words!