But it is extremely silly to say that "large language models are language models" is a bad faith argument.
But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.
You said this: "If this existing corpus includes useful information it can regurgitate that. It cannot, however, synthesize new facts by combining information from this corpus."
So I'm wondering if you think world models can synthesize new facts.

I don't think essentialist explanations of how LLMs work are very helpful. They don't give any meaningful account of the high-level nature of the pattern matching that LLMs are capable of, and they draw a sharp dichotomy between basic pattern matching and knowledge and reasoning, when the reality is much more complex than that.
What is least helpful is using misleading terms like this, because it makes reasoning about this more difficult. If we assume the model "knows" something, we might reasonably assume it will always act according to that knowledge. That's not true for an LLM, so it's a term that should clearly be avoided.
With all the social science research and strategy books that LLMs have read, they actually know a LOT about outcomes and dynamics in adversarial situations.
The author does have a point though that LLMs can’t learn these from their human-in-the-loop reinforcement (which is too controlled or simplified to be meaningful).
Also, I suspect the _word_ models of LLMs are not inherently the problem; they are just inefficient representations of world models.
The articles will not be mutually consistent, and what output the LLM produces will therefore depend on what article the prompt most resembles in vector space and which numbers the RNG happens to produce on any particular prompt.
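To caricature that claim with a toy sketch (my own illustration, not how a transformer actually works; the `articles` dictionary and its embeddings are made up): the answer you get depends on which "article" the prompt lands closest to in vector space, plus whatever the sampler's RNG does on that run.

```python
import numpy as np

rng = np.random.default_rng()

# Two mutually inconsistent "articles", each with a toy embedding and
# several possible phrasings of its claim. All values are invented.
articles = {
    "article_a": (np.array([0.9, 0.1]), ["fact X is true", "fact X is probably true"]),
    "article_b": (np.array([0.1, 0.9]), ["fact X is false", "fact X is disputed"]),
}

def answer(prompt_embedding):
    # Pick the article whose embedding is most similar to the prompt...
    best = max(articles, key=lambda k: float(prompt_embedding @ articles[k][0]))
    # ...then let the RNG pick among that article's phrasings.
    return rng.choice(articles[best][1])

print(answer(np.array([0.8, 0.2])))  # leans on article_a; wording varies per run
```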
Which is something companies try to address with image, video and 3D world capabilities, to add that missing context. "Video generation models as world simulators" is what OpenAI once called it.
> When people produce text, there is always a motive to do so which influences the contents of the text. This subjective component of the text is interpreted no differently from any "world model" information.
Obviously you need not only a model of the world, but also of the messenger, so you can understand how subjective information relates to the speaker and the world. Similar to what humans do.
> The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute
The argument is that gradient descent is a universal optimizer: it will always try to find weights for the neural network that produce the "best" results on your training data, within the constraints of your architecture, training time, random chance, etc. If you give it training data that is best solved by learning basic math, with a neural architecture capable of learning basic math, gradient descent will teach your model basic math. Give it enough training data that is best solved by building a world model, and a neural network capable of encoding one, and gradient descent will eventually create a world model.
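As a toy illustration of that claim (my own sketch, not from the thread): give gradient descent data whose best explanation is "add the two numbers", and plain gradient descent on a tiny linear model recovers the addition rule itself.

```python
# Train a tiny linear model on (a, b) -> a + b pairs and watch the weights
# converge to the addition rule (w ≈ [1, 1], bias ≈ 0). Pure numpy, no framework.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(1000, 2))   # pairs of numbers
y = X.sum(axis=1)                          # training target: their sum

w = rng.normal(size=2)   # model parameters, randomly initialised
b = 0.0
lr = 0.01

for step in range(2000):
    pred = X @ w + b
    err = pred - y                         # mean squared error gradient
    grad_w = 2 * X.T @ err / len(X)
    grad_b = 2 * err.mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # ≈ [1. 1.] and ≈ 0: the model has "learned addition"
```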
Of course in reality this is not simple. Gradient descent loves to "cheat" and find unexpected shortcuts that fit your training data but don't generalize. Just because it should be possible in principle doesn't mean it's easy, but it's at least a path that can be monetized along the way, and for the moment it seems to have captivated investors.
Let me illustrate the point using a different argument with the same structure: 1) The best professional chefs are excellent at cutting onions. 2) Therefore, if we train a model to cut onions using gradient descent, that model will be a very good professional chef.
2) clearly does not follow from 1)
That would be like saying studying mathematics can't lead to someone discovering new things in mathematics.
Nothing would ever be "novel" if studying existing knowledge could not lead to novel solutions.
GPT 5.2 Thinking is solving Erdős problems that had no prior solution, with a proof.
The point is that the LLM did not model the maths to do this: it made calls to a formal proof tool that did model the maths, and it was essentially working as the step function of a search algorithm, iterating until it found the zero of the function.
That's a clever use of the LLM as a component in a search algorithm, but the secret sauce here is not the LLM but the middleware that operated both the LLM and the formal proof tool.
That middleware was the search tool that a human used to find the solution.
This is not the same as a synthesis of information from the corpus of text.
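A rough sketch of the kind of middleware being described above, assuming the thread's description is accurate: the LLM only proposes candidates, the formal proof tool accepts or rejects them, and a plain loop does the searching. The `llm_propose` and `formal_verify` callables are hypothetical stand-ins, not real APIs.

```python
from typing import Callable, Optional, Tuple

def search_for_proof(
    statement: str,
    llm_propose: Callable[[str, Optional[str]], str],
    formal_verify: Callable[[str, str], Tuple[bool, str]],
    max_iterations: int = 100,
) -> Optional[str]:
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        # The LLM acts as the "step function": it guesses the next candidate
        # proof from the statement and the checker's previous feedback.
        candidate = llm_propose(statement, feedback)
        # The formal proof tool is the component that actually models the maths.
        ok, feedback = formal_verify(statement, candidate)
        if ok:
            return candidate   # a machine-checked proof
    return None                # search budget exhausted
```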
The bet OpenAI has made is that if this is the optimal final form, then given enough data and training, gradient descent will eventually build it. And I don't think that's entirely unreasonable, even if we haven't quite reached that point yet. The issues are more in how language is an imperfect description of the world. LLMs seem to be able to navigate the mistakes, contradictions and propaganda with some success, but they fail at things like spatial awareness. That's why OpenAI is pushing image models and 3D world models despite making very little money from them: they are working towards LLMs with more complete world models, unchained by language.
I'm not sure if they are on the right track, but from a theoretical standpoint I don't see an inherent fault.
First, the subjectivity of language.
1) People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has. This context is extremely important to any form of communication and is entirely missing when you train a pure language model. The subjective experience required to parse the text is missing.
2) When people produce text, there is always a motive to do so which influences the contents of the text. This subjective component of the text is interpreted no differently from any "world model" information.
A world model should be as objective as possible. Using language, the most subjective form of information, is a bad fit.
The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute.
But that small semantic note aside, if an LLM is used to trigger other tools to find new facts, then those other tools are modeling the "world" or a particular domain. Alternatively, you could say that the system as a whole, of which the LLM is a part, models the "world" or a particular domain.