This is so uncannily close to the problems we're encountering at Pioneer, trying to make human+LLM workflows work in high-stakes / high-complexity situations.
Humans are smart and make so many decisions and calculations at a subconscious/implicit level, taking lots of mental shortcuts, that when we try to automate a process by following exactly what a human does, we drag all of that implicit thinking to the surface, and it slows everything down. So we've had to be creative about how we build LLM workflows.
Language seems to be confused with logic or common sense.
We've observed this before in psychiatry (and modern journalism, but I digress), but LLMs have made it obvious: grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality. Spatial understanding? Social cues? Common-sense logic? Mathematical logic? All optional.
I'd suggest we call the LLM language fundament a "Word Model" (not a typo).
Trying to distill a world model out of the word model would be a suitable starting point for a modern remake of Plato's Cave.
I am baffled that people have to keep making this argument over and over and over. Your rationale makes total sense to me, but the debate rages on about whether LLMs are more than just words.
Articles like this only seem to confirm that any reasoning is an illusion based on probabilistic text generation. Humans are not carefully writing out all the words of their implicit reasoning, so the machine can't mimic them.
What am I missing that makes this debatable at all?
Language is the tool we use to codify a heuristic understanding of reality. The world we interact with daily is not the physical one, but an ideological one constructed out of human ideas from human minds. This is the world we live in, and the air we breathe is made partly of our ideas about oxygenation and partly of our concept of being alive.
It's not that these "human tools" for understanding "reality" are superfluous, it's just that they are second-order concepts. Spatial understandings, social cues, math, etc. Those are all constructs built WITHIN our primary linguistic ideological framing of reality.
Bingo, great reply! This is what I've been trying to explain to my wife. LLMs use fancy math and our language examples to reproduce our language, but have no thoughts or feelings.
Hi, I’m just a random internet stranger passing by who was intrigued by Plato’s Cave, as I’m not a fancy person who reads books. GPT-4o expanded on it quite well, but I’m not sure how I feel about it…
Using AI how I just did feels like cheating on an English class essay by using SparkNotes, getting a B+, and moving right on to the next homework assignment.
On one hand, I didn’t actually read Plato to learn and understand this connection, nor do I have a good authority to verify if this output is a good representation of his work in the context of your comment.
And yet, while I’m sure students could always buy or borrow reference books for common school texts, AI now makes this “SparkNotes” process effectively a commodity for almost any topic, like having a cross-domain, low-cost tutor instantly available at all times.
I like the metaphor that LLMs will do for language what calculators did for math, but I don’t really know what that means yet.
GPT output:
“““
The reference to Plato’s Cave here suggests that language models, like the shadows on the wall in Plato’s allegory, provide an imperfect and limited representation of reality. In Plato’s Cave, prisoners are chained in a way that they can only see shadows projected on the wall by objects behind them, mistaking these shadows for the whole of reality. The allegory highlights the difference between the superficial appearances (shadows) and the deeper truth (the actual objects casting the shadows).
In this analogy, large language models (LLMs) produce fluent and grammatically correct language—similar to shadows on the wall—but they do so without direct access to the true “world” beyond language. Their understanding is derived from patterns in language data (“Word Model”) rather than from real-world experiences or sensory information. As a result, the “reality” of the LLMs is limited to linguistic constructs, without spatial awareness, social context, or logic grounded in physical or mathematical truths.
The suggestion to call the LLM framework a “Word Model” underscores that LLMs are fundamentally limited to understanding language itself rather than the world the language describes. Reconstructing a true “world model” from this “word model” is as challenging as Plato’s prisoners trying to understand the real world from the shadows. This evokes the philosophical task of discerning reality from representation, making a case for a “modern remake of Plato’s Cave” where language, not shadows, limits our understanding of reality.
”””
This is a regression in the model's accuracy at certain tasks when using CoT, not its speed:
> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.
In other words, the issue they're identifying is that CoT is a less effective approach for some tasks than unmodified chat completion, not just that it slows everything down.
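To make that concrete, here is a minimal sketch of the kind of zero-shot vs. CoT comparison the paper describes; `query_model` is a hypothetical stand-in for whatever LLM API you use, and the tasks and string-matching scorer are toy assumptions, not the paper's actual benchmark:

  # Toy zero-shot vs. chain-of-thought accuracy comparison.
  def query_model(prompt: str) -> str:
      raise NotImplementedError("hypothetical: call your LLM of choice here")

  TASKS = [
      ("Order these by size, largest first: orange, blueberry, grapefruit.",
       "grapefruit, orange, blueberry"),
      # ... more (question, expected_answer) pairs
  ]
  COT_PREFIX = "Think step by step, then give your final answer.\n\n"

  def accuracy(use_cot: bool) -> float:
      correct = 0
      for question, expected in TASKS:
          prompt = (COT_PREFIX + question) if use_cot else question
          correct += expected.lower() in query_model(prompt).lower()
      return correct / len(TASKS)

  print("zero-shot:", accuracy(False), "CoT:", accuracy(True))

The paper's claim, in these terms, is that accuracy(True) can come out substantially lower than accuracy(False) on some task families.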
Yeah! That's the danger with any kind of "model", whether it's CoT, CrewAI, or other ways to outsmart it. It is betting that a programmer/operator can break a large task up in a better way than an LLM can keep attention (assuming it can fit the info in the context window).
ChatGPT's o1 model could make a lot of those programming techniques less effective, but they may still stick around, as they are more manageable and constrained.
I saw an LLM having this kind of problem when I was doing some testing a ways back. I asked it to order three fruits from largest to smallest. I think it was orange, blueberry and grapefruit. It could do that easily with a simple prompt. When the prompting included something to the effect of “think step by step”, it would try to talk through the problem and it would usually get it wrong.
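For concreteness, the two prompt variants were roughly of this shape (reconstructed from memory, not the exact wording):

  plain_prompt = ("Order these fruits from largest to smallest: "
                  "orange, blueberry, grapefruit.")
  # Same task with the step-by-step instruction that degraded it:
  cot_prompt = "Think step by step. " + plain_prompt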
How much does this align with how we learn math? We kind of instinctively learn the answers to simple math questions. We can even, at some point, develop an intuition for things like integration and differentiation. But the moment we are asked to explain why, or worse, provide a proof, things become a lot harder. Even though the initial answer may be correct.
I definitely don’t learn math by means of gradient descent.
We could say math is not learned so much as mental models of abstractions are developed. How? We dunno, but what we do know is that we don’t learn by finding the common features between all previously seen equations only to guess at them later…
The mind operates on higher and higher levels of abstraction, building on each other in a fascinating way, very often not with words but with structure and images.
Of course there are people with aphantasia, but I really fail to see how any reasoning happens at a purely linguistic level. Someone on this forum also noted that in order to reason, one needs an ontology to facilitate the reasoning process. LLMs don’t do ontologies…
And finally, not least, LLM and ML people in general seem to equate intuition to some sort of biased.random(). Well, intuition is not random, and it is hard to describe in words. So are awe and inspiration. And these ARE part of (a precondition to, fuel for) humanity’s thought process more than we like to admit.
Humans learn skills like basic mathematics by reasoning about their environment and building internal models of problems they’re trying to solve. LLMs do not reason and they cannot model their environment.
> It's not thinking
> it compressed the internet into a clever, lossy format with a nice interface and it retrieves stuff from there.
Humans do both, why can't LLMs?
> Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.
More like pulling out a deep-fried meme, looking for context, then searching google images until you find the most "original" JPG representation with the least amount of artifacts.
There is more data it can add confidently; it just has to re-think the problem with a renewed perspective and an abstracted-away, higher-level context/attention mechanism.
> Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.
Empirically speaking, I have a set of evals with an objective pass/fail result and a prompt. I'm doing codegen, so I'm using syntax linting, tests passing, etc. to determine success. With chain-of-thought included in the prompting, the evals pass at a significantly higher rate. A lot of research has been done demonstrating the same in various domains.
If chain-of-thought can't improve quality, how do you explain the empirical results which appear to contradict you?
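For what it's worth, a stripped-down version of my harness looks something like this; `generate_code` is a hypothetical stand-in for the model call, and the pass/fail check is just a syntax parse plus pytest:

  # Stripped-down codegen eval: syntax "lint" + tests as pass/fail.
  import ast, pathlib, subprocess, sys, tempfile

  def passes_eval(solution_src: str, test_src: str) -> bool:
      try:
          ast.parse(solution_src)          # lint: must be valid Python syntax
      except SyntaxError:
          return False
      with tempfile.TemporaryDirectory() as d:
          pathlib.Path(d, "solution.py").write_text(solution_src)
          pathlib.Path(d, "test_solution.py").write_text(test_src)
          result = subprocess.run(         # tests: all must pass
              [sys.executable, "-m", "pytest", "-q"],
              cwd=d, capture_output=True)
      return result.returncode == 0

  # Compare pass rates over the eval set, with and without a CoT prefix:
  #   passes_eval(generate_code(spec), tests)
  #   passes_eval(generate_code("Think step by step.\n" + spec), tests)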
I have no idea how accurate it actually is, but I've had the process used by LLMs described to me as follows: "Think of it like a form of UV mapping, applied to language constructs rather than 3D points in space; the limitations and approximations you experience are similar to those that emerge when projecting a 2D image over a 3D surface."
These kinds of reductive, thought-terminating clichés are not helpful. You are using a tautology (it doesn't think because it is retrieving data, and retrieving data is not thinking) without addressing the why (why does this preclude thinking) or the how (is it doing anything else to generate results).
It would be interesting to think about how it got it wrong. My hunch is that in the "think step by step" section it made an early and incorrect conclusion (maybe even a subtly inferred one), and since LLMs are terrible at walking back mistakes, it arrived at an internally consistent conclusion that was incorrect.
A lot of CoT to me is just slowing the LLM down and keeping it from making that premature conclusion... but it can backfire when it then accidentally makes a conclusion early on, often in a worse context than it would use without the CoT.
I always found it interesting how sorting problems can get different results when you add additional qualifiers like colors or smells or locations, etc.
Naively, I understand these to influence the probability space enough to weaken the emergent patterns we frequently overestimate.
The model has likely already seen the exact phrase from its last iteration. Adding variation generalizes the inference away from over-trained quotes.
Every model has the model before it, and its academic papers, in its training data.
Changing the qualifiers pulls the inference far away from quoting over-trained data, and back to generalization.
I am sure it has picked up on this mesa-optimization along the way, especially if I can summarize it.
Makes you wonder why it hasn't become more generally intelligent yet.
I'll rank those three fruits from largest to smallest:
1. Grapefruit
2. Orange
3. Blueberry
The grapefruit is definitely the largest of these three fruits - they're typically around 4-6 inches in diameter. Oranges are usually 2-3 inches in diameter, and blueberries are the smallest at roughly 0.5 inches in diameter.
Alternate framing: A powerful autocomplete algorithm is being used to iteratively extend an existing document based on its training set. Sometimes you get a less-desirable end-result when you intervene to change the style of the document away from question-and-answer to something less common.
Artificial brains on the verge of the singularity show another sign of approaching consciousness. The chain-of-thought process performance is exactly human, yet another proof of the arrival of AGI before 2030.
Let me give it a try... um... what about 'Star Trek' vs. a delivery service called Galaxyray? Galaxyray brings wares and hot, tasty meals galaxy-wide to recipients, even while they are 'traveling' faster than light in hyperspace?
> ..ordered by Imperium just to troll the retros!?
Sounds "less common"... huh...?! P-:
OK! OK! Let me try to explain it a bit more: the whole Universe projected as a beam, say... scalable, 100m, placed in a storage depot, a 'parralaxy'... So delivery agents are grabbing the ordered stuff and... no? Not?
As reasonable as your answer is, does that sound very 'uncommon' while 'phrasing it with many question marks'?
Not to mention that chain of thought is computationally very expensive. Surely too expensive to be served free like the previous generation of Web 2.0 products.
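Back-of-envelope, with every number below an assumption for scale rather than a real price or measurement:

  # Why CoT-style serving is pricey: reasoning tokens dominate the bill.
  price_per_1k_output_tokens = 0.01   # assumed $/1K output tokens
  direct_tokens = 50                  # assumed: short direct answer
  cot_tokens = 1500                   # assumed: reasoning trace + answer

  cost_direct = direct_tokens / 1000 * price_per_1k_output_tokens
  cost_cot = cot_tokens / 1000 * price_per_1k_output_tokens
  print(f"direct ${cost_direct:.4f} vs CoT ${cost_cot:.4f}"
        f" ({cot_tokens // direct_tokens}x per query)")

If the hidden reasoning runs an order of magnitude or two longer than the visible answer, the serving cost scales with it.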
Seems like repeated prompting can't juice AGI out of token probabilities.
In retrospect, if you can pinpoint one paper that led to the popping of the AI bubble, this would be it.
This isn't some innate ability that people have. As evidenced by how bad my kids are at catching things. :D
That said, I think this is a good example. We call it "muscle memory" in that you are good at what you have trained at. Change a parameter in it, though, and your execution will almost certainly suffer.
Or even more impressively, how you can pick up a random object and throw it with some accuracy.
Catching a ball is easy by comparison; also, my dog is better than I am at this game.
But throwing a random object not only requires an estimation of the trajectory, but also estimating the mass and aerodynamic properties in advance, to properly adjust the amount of force of the throw as well as the release point with high accuracy. Doing it with baseballs is "easy", as the parameters are all well known and pitchers spend considerable time training. But picking up an oddly shaped rock or stick you have never seen before and throwing it not completely off target a second later, now we're talking.
I recall a study which suggested that we don't really calculate the trajectory as such, but use some kind of simple visual heuristic to continually align ourselves with where the ball is going to land.
They showed that people running to catch a ball would follow an inefficient curved path as a result of this, rather than actually calculating where the ball will land and moving there in a straight line to intercept it.
https://arstechnica.com/information-technology/2024/08/man-v...
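If I remember the idea right (it is often called the gaze heuristic, or optical acceleration cancellation), a toy version fits in a few lines. All numbers below are made up, and the control rule is my paraphrase, not the study's model:

  # Toy "gaze heuristic" fielder: no trajectory math, just null the
  # optical acceleration of the ball, i.e. keep tan(elevation angle)
  # rising at a constant rate.
  g, dt = 9.81, 0.01
  vx, vy = 12.0, 14.0          # ball launch velocity (m/s)
  bx, by = 0.0, 1.5            # ball position (m)
  fx, speed = 25.0, 6.0        # fielder position (m) and running speed (m/s)
  prev_tan = prev_rate = None

  while by > 0:
      bx, by, vy = bx + vx * dt, by + vy * dt, vy - g * dt
      tan = by / max(fx - bx, 0.1)      # tan of the ball's elevation angle
      if prev_tan is not None:
          rate = (tan - prev_tan) / dt
          if prev_rate is not None:
              accel = (rate - prev_rate) / dt
              # image speeding up -> ball lands behind you: back up;
              # slowing down -> it falls short: run in.
              fx += speed * dt * (1 if accel > 0 else -1)
          prev_rate = rate
      prev_tan = tan

  print(f"ball landed at {bx:.1f} m, fielder ended near {fx:.1f} m")

The fielder should end up close to the landing point without ever solving for the trajectory, and the constant stream of corrections is exactly the kind of curved, "inefficient" path the study describes.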
You can do this while you're staring up the whole time. Your brain can predict where the ball will end up even though it's on a curved trajectory and place your hand in the right spot to catch it without guidance from your eyes in the final phase of travel. I have very little experience playing any kind of sport that involves a ball and can reliably do this.
All those years of baseball as a kid gave me a deep intuition for where the ball would go, and that game doesn’t use real gravity (the ball is too floaty).
Well, by definition, thinking is always explicit reasoning, no?
And I'd hazard a guess that a well-thought-through Fermi estimate beats lizard-brain eyeballing every time; it's just that in the in-between space the two interfere unfavourably.
My guess would be no. I have terrible face-recognition ability; I could look at a face for an hour and other people would still easily beat me in less than a second. (I am assuming a "well-thought-through Fermi Estimation" would be similar for me and others in this case.)
> Well, by definition, thinking is always explicit reasoning, no?
That doesn't feel right to me. (Heh, accidentally appropriate word choice.) There are a lot of tasks we do that are arguably "thinking" yet don't involve an internal "Oh, hey, I'm gonna solve this problem, I'm thinking right now."
For example, imagine you're at a park, and someone is feeding the ducks. Another person walks up behind them and sucker-punches them into the pond.
It should be almost a reflex [0] that you'll conclude "the puncher is bad" and "the person in the water needs help" without explicitly reasoning out. I think that task qualifies as "thinking", especially since it involves some kind of theory-of-mind about those other humans.
[0] An exception might be someone with a sociopathic disability, who would have to think more-explicitly to realize what reaction is expected of them.
This says something fascinating about information processing in both biological and AI systems. Both systems compress information: the brain creates efficient neural patterns through experience and AI develops internal representations through training. Forcing verbalization "decompresses" this efficient encoding, potentially losing subtle patterns. Hence, for a task like visual recognition, which is optimized to occur almost instantly in a parallel process, you will only degrade performance by running it in a serial chain of thought sequence.
> What are even the tasks where thinking makes humans worse?
Not really related, but athletes perform A LOT worse when they are thinking about their movements/strategies/tactics. A top performing athlete does best when they are in a flow state, where they don't think about anything and just let their body/muscle memory do the work.
Once you start thinking about micro-adjustments (e.g. I should lift my elbow higher), you start controlling your body in a conscious way, which is an order of magnitude slower and less coordinated than the automatic/subconscious way.
Also, the same happens for creativity/new ideas. If you intentionally think about something, step by step, you are unlikely to find new, innovative solutions. There is a reason the "a-ha!" moments come in the shower: your subconscious mind is working on the problem instead of being forced down a specific path.
I would guess this happens in many other areas, where channelling the thought process through a specific template hinders the ability to use all the available resources/brain power.
I can think myself into forgetting strong passwords if I try to spell each character out in my head. But then I sit at a keyboard, relax, and automatically type it perfectly.
> Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions.
Fascinating that our lizard brains are better at implicit statistical reasoning.
These kinds of things make me think LLMs are quite far from AGI.
> What are even the tasks where thinking makes humans worse?
Talking about religion and politics.