moolimon · a year ago
The main thesis here seems to be that LLMs behave like almost all other machine learning models, in that they are doing pattern matching on their input data, and short circuiting to a statistically likely result. Chain of thought reasoning is still bound by this basic property of reflexive pattern matching, except the LLM is forced to go through a process of iteratively refining the domain it does matching on.

Chain of thought is interesting, because you can combine it with reinforcement learning to get models to solve (seemingly) arbitrarily hard problems. This comes with the caveat that you need some reward model for all RL. This means you need a clear definition of success, and some way of rewarding being closer to success, to actually solve those problems.

Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

nonameiguess · a year ago
To me:

LLMs are trained, as others have mentioned, first to just learn the language at all costs. Ingest any and all strings of text generated by humans until you can learn how to generate text in a way that is indistinguishable.

As a happy side effect, this language you've now learned happens to embed quite a few statements of fact and examples of high-quality logical reasoning, but crucially, the language itself isn't a representation of reality or of good reasoning. It isn't meant to be. It's a way to store and communicate arbitrary ideas, which may be wrong or bad or both. Thus, the problem for these researchers now becomes how do we tease out and surface the parts of the model that can produce factually accurate and reasonable statements and dampen everything else?

Animal learning isn't like this. We don't require language at all to represent and reason about reality. We have multimodal sensory experience and direct interaction with the physical world, not just recorded images or writing about the world, from the beginning. Whatever it is humans do, I think we at least innately understand that language isn't truth or reason. It's just a way to encode arbitrary information.

Some way or another, we all grok that there is a hierarchy of evidence or even what evidence is and isn't in the first place. Going into the backyard to find where your dog left the ball or reading a physics textbook is fundamentally a different form of learning than reading the Odyssey or the published manifesto of a mass murderer. We're still "learning" in the sense that our brains now contain more information than they did before, but we know some of these things are representations of reality and some are not. We have access to the world beyond the shadows in the cave.

anon84873628 · a year ago
Humans can carve the world up into domains with a fixed set of rules and then do symbolic reasoning within them. LLMs can't seem to do this in a formal way at all -- they just occasionally get it right when the domain happens to be encoded in their language learning.

You can't feed an LLM a formal language grammar (e.g. SQL) and then have it only generate results with valid syntax.

It's awfully confusing to me that people think current LLMs (or multi-modal models etc) are "close" to AGI (for whatever various definitions of all those words you want to use) when they can't do real symbolic reasoning.

Though I'm not an expert and happy to be corrected...

mnky9800n · a year ago
Humans often do not have a clear definition of success and instead create a post-hoc narrative to describe whatever happened as success.
idiotsecant · a year ago
Yes, I've always thought that LLMs need the equivalent of a limbic system. This is how we solved this problem in organic computers. There is no static 'reward function'. Instead, we have a dynamic reward function computer. It decides from day to day and hour to hour what our basic objectives are. It also crucially handles emotional 'tagging' of memory. Memories that we store are proportionally more likely to be retrieved under similar emotional conditions. It helps to filter relevant memories, which is something LLMs definitely could use.

I think the equivalent of an LLM limbic system is more or less the missing piece for AGI. Now, how you'd go about making one of those I have no idea. How does one construct an emotional state space?
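
Something like this toy sketch is roughly what I have in mind for the retrieval part (the three emotion axes and the memories are entirely made up; it's just to make "emotional tagging" concrete):

  import math

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      na = math.sqrt(sum(x * x for x in a))
      nb = math.sqrt(sum(x * x for x in b))
      return dot / (na * nb) if na and nb else 0.0

  # Each memory is tagged with the emotional state it was stored under,
  # here a made-up 3-axis vector: (valence, arousal, fear).
  memories = [
      {"text": "got lost hiking at dusk", "emotion": (-0.6, 0.8, 0.9)},
      {"text": "quiet afternoon reading", "emotion": (0.7, 0.1, 0.0)},
      {"text": "near-miss while cycling", "emotion": (-0.4, 0.9, 0.8)},
  ]

  def recall(current_emotion, k=2):
      # Memories stored under a similar emotional state are ranked higher,
      # mimicking state-dependent retrieval.
      ranked = sorted(memories,
                      key=lambda m: cosine(current_emotion, m["emotion"]),
                      reverse=True)
      return [m["text"] for m in ranked[:k]]

  print(recall((-0.5, 0.85, 0.85)))  # an anxious state surfaces the two scary memories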

_heimdall · a year ago
Companies are bad about doing this on purpose. If they set out to build AGI and accomplish something novel, they just call that AI and go on fundraising from people who don't know better (or, more likely, don't care and just want to gamble with others' money).
cadamsdotcom · a year ago
Continuous RL, in a sense. There may be an undiscovered additional scaling law around models doing what you describe; continuous LLM-as-self-judge, if you will.

Provided it can be determined why a user ended the chat, which may turn out to be possible in some subset of conversations.

ben_w · a year ago
And also they sometimes write down the conclusion and work backwards, without considering that the explanation that is most likely given the conclusion isn't necessarily an explanation under which that conclusion is the most likely one — I hope I phrased that broken symmetry correctly.
ahartmetz · a year ago
I'm not following. Do you have an example?
emsign · a year ago
This!
viccis · a year ago
>Interesting to think about what structures human intelligence has that these models don't.

Kant's Critique of Pure Reason has been a very influential way of examining this kind of epistemology. He put forth the argument that our ability to reason about objects comes through our apprehension of sensory input over time, schematizing these into an understanding of the objects, and finally, through reason (by way of the categories) into synthetic a priori knowledge (conclusions grounded in reason rather than empiricism).

If we look at this question in that sense, LLMs are good at symbolic manipulation that mimics our sensibility, as well as at combining different encounters with concepts into an understanding of what those objects are relative to other sensed objects. What they lack is the transcendental reasoning that can form novel and well-grounded conclusions.

Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.
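
A crude sketch of the kind of two-stage pipeline I mean, with a hard-coded stub standing in for the LLM translation step and brute-force search standing in for a real logic engine (all the names and the mini-puzzle are made up for illustration):

  from itertools import permutations

  # The "LLM layer" turns language into a formal constraint set; the symbolic
  # layer does the actual deduction; the result is fed back out as language.
  # parse_puzzle is a fake stand-in for a real model call.
  def parse_puzzle(_text):
      people = ["alice", "bob", "carol"]
      drinks = ["tea", "coffee", "water"]
      constraints = [
          lambda a: a["alice"] != "coffee",   # "Alice doesn't drink coffee"
          lambda a: a["bob"] == "tea",        # "Bob drinks tea"
      ]
      return people, drinks, constraints

  def symbolic_solve(people, drinks, constraints):
      # Exhaustive search over assignments; a real system would call Prolog/SMT.
      for perm in permutations(drinks):
          assignment = dict(zip(people, perm))
          if all(c(assignment) for c in constraints):
              return assignment
      return None

  print(symbolic_solve(*parse_puzzle("Alice doesn't drink coffee; Bob drinks tea.")))
  # {'alice': 'water', 'bob': 'tea', 'carol': 'coffee'}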

corimaith · a year ago
>Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.

This just goes back to the problems of that AI winter, though. First-order logic isn't expressive enough to model the real world, while second-order logic doesn't have a complete proof system to truly verify all its statements, and is too complex and unwieldy for practical use. I would also imagine that the number of people working on such problems is very small; this isn't engineering so much as analytic philosophy and mathematics.

drakenot · a year ago
With DeepSeek-R1-Zero, their use of RL didn't really have reward functions that indicated progress towards the goal, afaik.

It was "correct structure, wrong answer", "correct answer", "wrong answer". This was for Math & Coding, where they could verify answers deterministically.

mountainriver · a year ago
It is a reward function, it's just a deterministic one. Reward models are often hacked, preventing real reasoning from being discovered.
huijzer · a year ago
> Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

What is also a benefit for humans, I think, is that people are typically much more selective. LLMs are trained to predict anything on the internet, so for finance, for example, that includes clickbait articles which have a lifetime of about 2 hours. Experts would probably reject any information in these articles and instead try to focus on high-quality sources only.

Similarly, a math researcher will probably have read a completely different set of sources throughout their life than, say, a lawyer.

I’m not sure it’s a fundamental difference, but current models do seem not to specialize from the start, unlike humans. And that might get in the way of learning the best representations. I know from ice hockey, for example, that you can see within 3 seconds whether someone played ice hockey from a young age or not. Same with language. People can usually hear an accent within seconds. Relatedly, I used OpenAI's text-to-speech a while back and the Dutch voice had an American accent. What this means is that even if you ask LLMs about Buffett's strategy, maybe they have a "clickbait accent" too. So with the current approach to training, the models might never reach absolute expert performance.

andai · a year ago
When I was doing some NLP stuff a few years ago, I downloaded a few blobs of Common Crawl data, i.e. the kind of thing GPT was trained on. I was sort of horrified by the subject matter and quality: spam, advertisements, flame wars, porn... and that seems to be the vast majority of internet content. (If you've talked to a model without RLHF like one of the base Llama models, you may notice the personality is... different!)

I also started wondering about the utility of spending most of the network memorizing infinite trivia (even excluding most of the content above, which is trash), when LLMs don't really excel at that anyway and need to Google it to give you a source. (Aside: I've heard some people have good luck with "hallucinate then verify" with RAG / Googling...)

i.e. what if we put those neurons to better use? Then I found the Phi-1 paper, which did exactly that. Instead of training the model on slop, they trained it on textbooks! And instead of starting with PhD level stuff, they started with kid level stuff and gradually increased the difficulty.

What will we think of next...

gf000 · a year ago
I don't know, babies hear a lot of widely generic topics from multiple people before learning to speak.

I would rather put it that humans can additionally specialize much more, but we usually have a pretty okay generic understanding/model of a thing we consider as 'known'. I would even wager that being generic enough (ergo, sufficiently abstracted) is possibly the most important "feature" humans have? (In the context of learning)

ben_w · a year ago
> For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

What we can integrate, we seem to integrate efficiently*; but compared to the quantities used to train AI, we humans may as well be literally vegetables.

* though people do argue about exactly how much input we get from vision etc.; personally I doubt vision input is important to general human intelligence, because if it was then people born blind would have intellectual development difficulties that I've never heard suggested exist — David Blunkett's success says human intelligence isn't just fine-tuning on top of a massive vision-grounded model.

Retric · a year ago
Hearing is also well into the terabytes worth of information per year. Add in touch, taste, smell, proprioception, etc and the brain gets a deluge.

The difference is we’re really focused on moving around in 3D space and more abstract work, where an LLM etc is optimized for a very narrow domain.

gf000 · a year ago
Is that a positive thing? If anything I would consider it the reverse - LLMs have the "intelligence of vegetables" because even with literally the whole of human written knowledge they can at most regurgitate it back to us with no novelty whatsoever, even though a 2-year-old with a not-even-matured brain can learn a human language from orders of magnitude less and lower-quality input, from a couple of people only.

But any Nobel Prize winner has read significantly less than a basic LLM, and we see no LLM making even a tiny scientific achievement, let alone high-impact ones.

mnky9800n · a year ago
I feel like if you take the underlying transformer and apply it to other topics, e.g., eqtransformer, nobody questions this assumption. It's only when language is in the mix that people suggest they are something more and some kind of "artificial intelligence" akin to the beginnings of Data from Star Trek or C-3PO from Star Wars.
lubujackson · a year ago
Human processing is very interesting and should likely lead to more improvements (and more understanding of human thought!)

Seems to me humans are very good at pattern matching, as a core requirement for intelligence. Not only that, we are wired to enjoy it innately - see sudoku, find Waldo, etc.

We also massively distill input information into short summaries. This is easy to see by what humans are blind to: the guy in a gorilla suit walking through a bunch of people passing a ball around, or basically any human behavior magicians use to deceive or redirect attention. We are bombarded with information constantly. This is the biggest difference between us and LLMs, as we have a lot more input data and are also constantly updating that information - with the added feature/limitation of time decay. It would be hard to navigate life without short-term memory or a clear way to distinguish things that happened 10 minutes ago from 10 months ago. We don't fully recall each memory of washing the dishes but junk the vast, vast majority of our memories, which is probably the biggest shortcut our brains have over LLMs.

Then we also, crucially, store these summaries in memory as connected vignettes. And our memory is faulty but also quite rich for how "lossy" it must be. Think of a memory involving a ball from before the age of 10 and most people can drum up several relevant memories without much effort, no matter their age.

arkh · a year ago
> Interesting to think about what structures human intelligence has that these models don't.

Pain receptors. If you want to mimic the human psyche you have to make your agent want to gather resources and reproduce. And make it painful to lack those resources.

Now, do we really have to mimic human intelligence to get intelligence? You could make the point the internet is now a living organism but does it have some intellect or is it just some human parasite / symbiote?

corimaith · a year ago
>Interesting to think about what structures human intelligence has that these models don't.

If we get into the gritty details of what gradient descent is doing, we've got a "frame", i.e. a matrix or some array of weights that contains the possible solution for a problem; then with another input of weights we're matching a probability distribution, minimizing the loss function against our training data to form our solution in the "frame". That works for something like image recognition, where the "frame" is just the matrix of pixels, or in language models where we're trying to find the next word-vector given a preceding input.

But take something like what Sir William Rowan Hamilton was doing back in 1843. He knew that complex numbers could be represented as points in a plane, and arithmetic could be performed on them, and he wanted to extend this in a similar way to points in space. With triples it is easy to define addition, but the problem was multiplication. In the end, he made an intuitive jump, a pattern recognition, when he realized that he could easily define multiplication using quadruples instead, and thus was born the quaternion that's a staple in 3D graphics today.
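
For reference, the relation he landed on for quadruples a + bi + cj + dk was

  i^2 = j^2 = k^2 = ijk = -1

which forces the multiplication to be non-commutative (ij = k, but ji = -k).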

If we want to generalize this kind of problem solving into something gradient descent can do, where do we even start? First of all, we don't even know if a solution is possible or coherent, or what "direction" we are going towards. It's not a systematic solution; it's rather one where a pattern in one branch of mathematics was recognized in another. So perhaps you might use something like Category Theory, but then how are we going to represent this in terms of numbers and convex functions, and is Category Theory even practical enough to easily do this?

mdp2021 · a year ago
> Interesting to think about what structures human intelligence has that these models don't

Chiefly?

After having thought long and hard, building further knowledge on the results of the process of having thought long and hard, and creating intellectual keys to further think long and hard better.

brazzy · a year ago
> Interesting to think about what structures human intelligence has that these models don't.

Constant direct feedback from the real world and the ability to continuously integrate it to update the model. That's probably the big one.

My pet theory is that having a body is actually an integral part of intelligence, to provide the above, as well as an anchor for a sense of self.

mdp2021 · a year ago
> having a body

You do not need sensorial feedback to do math. And you do not need full sensors to have feedback - one well-organized channel can suffice for some applications.

whattheheckheck · a year ago
Have you ever read about Helen Keller and her experience before her discovery of (being taught) language?
enugu · a year ago
This has been pursued by researchers like Rodney Brooks. See also https://en.wikipedia.org/wiki/Embodied_cognition
asah · a year ago
So quadriplegics aren't sentient?
buovjaga · a year ago
You might get a kick out of this essay by Robert Epstein from 2016: https://aeon.co/essays/your-brain-does-not-process-informati... (The empty brain - Your brain does not process information, retrieve knowledge or store memories. In short: your brain is not a computer)
imtringued · a year ago
Coming up with a reward model seems to be really easy though.

Every decidable problem can be used as a reward model. The only downside to this is that the LLM community has developed a severe disdain for making LLMs perform anything that can be verified by a classical algorithm. Only the most random data from the internet will do!

marxplank · a year ago
That would help with decidable problems but would still not be generalisable to problems with non-trivial rewards, or ones with none.
spenrose · a year ago
I argue that we should start calling them "pattern processors": https://x.com/sampenrose/status/1877200883613659360
PaulDavisThe1st · a year ago
Your post on Twitter uses slightly more words than the ones preceding it above to make the exact same point. Was there really any reason to link to it? Why not expand on your argument here?
1vuio0pswjnm7 · a year ago
"LLMs are fundamentally matching the patterns they've seen, and their abilities are constrained by mathematical boundaries. Embedding tricks and chain-of-thought prompting simply extends their ability to do more sophisticated pattern matching."
antirez · a year ago
LLMs keep showing more and more that they are the wonder of AI we awaited for decades: talking machines that every two months make progress that was anticipated to be impossible two months earlier because of <put here some limit that was actually in the prejudice of the skeptical AI community> (just stochastic parrots, no reasoning possible without symbolic representations, it's all just tokens, ...)

At the same time, part of the scientific community continues to diminish what has been accomplished and the steps that are being made. A few months ago LeCun went so far as to tell new researchers to move away from LLMs since they are a dead end: imagine the disservice he did to the surely non-zero number of folks who followed the advice, putting themselves out of the AI research that matters. (Incidentally, this skepticism from the head of Meta AI must have something to do with the fact that Meta, despite the huge effort allocated, produced the worst LLM among those of Anthropic, OpenAI, and DeepSeek -- I bet Zuckerberg is asking questions lately.)

It's very hard to explain this behavior if not by psychological denial.

[EDIT: you can't see the score of this comment, but I can: it's incredible how it goes from 3, to -2, to 1, and so forth. The community is split in two, and it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth]

ramblerman · a year ago
I get the sentiment, but I actually think some skepticism in the system is healthy.

Billions are flowing towards LLMs, and Sam Altman will overpromise to anyone who will listen that AGI is just around the corner and the days of jobs are gone, to fill his coffers.

Additionally, if we begin to use these things in real production environments where mistakes matter, knowing the exact limitations is key.

None of this takes away from the fact that these are exciting times.

dr_dshiv · a year ago
I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.

So instead of everyone having some enthusiasm and some skepticism, you get a bifurcation where whole classes of people act as the skeptics and others as the enthusiasts. I view the strong skeptics as more “in the wrong” because they often don’t use LLMs much. If you are an actual enthusiastic user, you simply can’t get good performance without a very strong dose of skepticism towards everything LLMs output.

antirez · a year ago
Yes, there is another part of the community that overhypes everything. But I can expect that from the CEO of an AI company (especially if he is Altman); from researchers, though? Also, now that reinforcement learning is starting to be applied to LLMs, the idea that they may reach superhuman expertise in certain fields in a short timeframe (a few years) may no longer be a totally crazy position. If it is possible to extend considerably the same approach seen in R1-Zero, there could be low-hanging fruit around the corner.
comeonbro · a year ago
This article is about things which aren't limitations anymore!

You are applauding it as pushback for pushback's sake, but it's an article about limitations in biplane construction, published after we'd already landed on the moon.

lowsong · a year ago
> …it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth

This is more of a salient point than you perhaps realized. In life there is no single absolute, unknowable truth. Philosophy has spent the entire span of human existence grappling with this topic. The real risk with AI is not that we build some humanity-destroying AGI, but that we build a machine that is 'convincing enough' — and the idea that such a machine would be built by people who believe in objective truth is the most worrying part.

thrance · a year ago
Depends; if you're a realist [1] (like most), then there can be such a thing as absolute truth, which you may not always be able to access.

[1] https://en.wikipedia.org/wiki/Philosophical_realism?wprov=sf...

mdp2021 · a year ago
> At the same time, part of the scientific community continues to diminish what was accomplished

Revisit the idea: part of the public is bewildered by voices that started calling "intelligence" what was and apparently still is the precise implementation of unintelligence. The fault is in some, many people - as usual.

Very recent state-of-the-art LLM models themselves declare that if the majority of their training data states that entity E is red they will say it's red, and if the majority says it's blue then they will say it's blue: that is the implementation of an artificial moron.

And in fact, very recent state-of-the-art LLM models state cretinous ideas that are child level - because "that's what they have heard" (stuck, moreover analytically, in the simplifications intrinsic in expression).

This architectural fault should be the foremost concern.

pera · a year ago
Psychological denial of what exactly? And what part of the article/preprints are you commenting on?

Every time an article exposing some limitation of the current wave of LLMs is submitted to HN, there are comments like yours, and I genuinely cannot understand the point you are trying to make: there is no such thing as a perfect technology, everything has limitations, and we can only improve the current state of the art by studying them and iterating.

rthrfrd · a year ago
I think if we referred to LLMs as AK (Artificial Knowledge) instead of AI it would be easier to have more cohesive discussions.

I don’t see how there can be a single truth when there is not even a single definition of many of the underlying terms (intelligence, AGI, etc) which this discipline supposedly defines itself by. Combine that with a lot of people with little philosophical perspective suddenly being confronted with philosophical topics and you end up with a discourse that personally I’ve mostly given up on participating in until things calm down again.

It feels like nobody remembers all the timelines for which we were supposed to have self-driving cars.

polotics · a year ago
You are, I think, badly misrepresenting what Yann LeCun said: he didn't say LLMs were a dead end, he said to do research in directions that do not require billions of dollars of investment to show results. In particular for PhDs this is sensible, and in view of recent cheaper results, prescient.
fragmede · a year ago
Sensible, with the caveat that DeepSeek R1 still took millions of dollars of compute time, so you're not training the next one on the box in your basement with a pair of 3090s (though you could certainly fine-tune a shared quantized model). You can't run the full-sized model on anything cheap, so basement researchers still need access to a decent amount of funding, which likely requires outside help.
sitkack · a year ago
It is becoming more and more important to determine for ourselves what is true and what is not. No person is right on most things, even when they are an expert in that thing. The biggest trap is to believe someone because they are passionate, because they say it with conviction. Ignore most of the out-of-band signaling, take what they are saying and then also see if you can corroborate it with another source.

There are so many people who are wrong about so many things.

I really appreciate that you are making your dev-with-AI videos; they show people different, more humanistic ways of operating with AI.

Most of what I use AI for is to understand and relearn things I only thought I knew. This I think, is the most powerful use of AI, not in the code writing or the image generation, but in understanding and synthesis.

There is that hilarious tautological statement, "it is easy if you know it".

This video https://www.youtube.com/watch?v=TPLPpz6dD3A shows how to use AI to be a personal tutor using the Socratic Method. This is what people should be using AI for, have it test you for things you think you are already good at and you will find huge gaps in your own understanding. Now go apply it to things you have no clue about.

Speaking of parrots, a large volume of the anti-AI sentiment, even here, is from people repeating half-truths they don't understand, confidently, about what AI cannot do. One would need a pretty tight formal case to prove such things.

Everyone should be playing, learning and exploring with these new tools, not shutting each other down.

antirez · a year ago
Yes, the stochastic parrots story is one of the strongest recent instances of experts in a field being blinded by their own expertise (the mental model they have of certain things) to the point of being incapable of seeing trivial evidence.
sitkack · a year ago
Another trope that stands out is that someone will take a model, run a battery of tests against it and then make general statements about what LLMs can and cannot do without understanding their architecture, the training data, and the training itself.

And then they dress it up to sound scientific, when really they are making hasty generalizations to support a preconceived bias.

guelo · a year ago
But what for? Human learning is becoming of diminishing utility as the machines improve. For example, I am now able to create computer programs and beautiful artwork without taking the time to master these skills. You could say that I can use art and programming as basic tools to accelerate my learning of bigger things, but whatever that bigger thing is AI is coming for it too. I can't imagine the progress the machines will achieve in 10 years. We'll be replaced.
kubb · a year ago
I'm sorry, but they don't "make progress that was anticipated to be impossible", especially not every two months.

They were predicted to end the software engineering profession for almost four years already. And it just doesn't happen, even though they can bang out a perfect to-do list in React in a matter of seconds.

LLMs have had incremental improvements in the quality of their responses as measured by benchmarks. The speed and cost of inference have also been improving. Despite that, there has been no major breakthrough since GPT-3.

People keep trying to make them reason, and keep failing at it.

throw310822 · a year ago
> They were predicted to end the software engineering profession for almost four years already

ChatGPT was launched on November 30 2022. Two years and two months ago. The fact that in such a short timeframe you're talking about missed predictions is absurd, but telling of the accelerated timeframe in which we're living. The fact is that currently AI and LLMs are going through a phase of explosive improvement, to the point we can expect enormous improvements in capabilities every six months or so.

sgt101 · a year ago
SE is a good example - I get a lot of help from LLM tools and I think we're learning how to use them better across realistic SDLC processes as well, but we're not replacing lots of people at the moment. On the other hand, I saw a business case from one of the big SIs (not my employer, but in a deck that was shown by the SI in a discussion) that described the need to move their Indian software dev workforce from 350k FTE to 50k FTE over the next five years.

I think that the onshore impacts will be much lower or negligible, or possibly even positive, because so much work has been offshored already, and, as is well worn in every discussion, Jevons paradox may drive up demand significantly (to be fair I believe this, as wherever I have worked we've had 3x+ demand (with business cases) for development projects and had to arbitrarily cull 2x of it at the beginning of each year). So, just like the 30 people in India that are working on my project won't do anything useful unless we feed the work to them, the LLMs won't do anything useful either. And just like we have to send lots of work back to India because it's not right, the same is true of LLMs. The difference is that I won't spend 4 hrs on a Friday afternoon on Teams discussing it.

But this is not surprising, because we've had big impacts from tools like IDEs, VMs, and compilers, which have driven seismic changes in our profession; I think that LLMs are just another one of those.

What I'm watching for is an impact in a non-tech domain like healthcare or social care. These are important domains that are overwhelmed with demand and riddled with makework, yet so far LLMs have made very little impact. At least, I am not seeing health insurance rates being cut, hospital waiting lists falling, or money and staff being redeployed from back office functions to front line functions.

Why hasn't this started?

vachina · a year ago
LLMs can hammer out existing solutions to problems, but not never-before-seen problems.
raindear · a year ago
Progress is what happens thanks to AI skeptics busy defining model limitations. The limitations set attractive bars to pass.

rbranson · a year ago
Did you read the article? Dziri and Peng are not the “skeptical AI community,” they are in fact die hard AI researchers. This is like saying people who run benchmarks to find performance problems in code are skeptics or haters.
antirez · a year ago
I read the article: it does not look like very good research. It's simple to find flaws in LLMs' reasoning/compositional capabilities by looking at problems that are at the limit of what they can do now, or just picking problems that are very far from their computational model, or submitting riddles. But there is no good analysis of the limitations, nor any inspection of how, and how much, LLMs have recently improved at exactly this kind of problem. Also, the article is full of uninformative and obvious things showing how LLMs fail at stupid tasks such as multiplication of large numbers.

But the most absurd thing is that the paper looks at computational complexity in terms of direct function composition, and there is no reason an LLM should be limited to that kind of model when emitting many tokens. Note that even when CoT is not explicit, the LLM output that starts to shape the thinking process still gives it technically unbounded layers. With CoT this is even more obvious.

Basically there is no bridge between their restricted model and an LLM.

layer8 · a year ago
I think that “part of the scientific community” actually wants to do what needs to be done: “We have to really understand what’s going on under the hood,” she said. “If we crack how they perform a task and how they reason, we can probably fix them. But if we don’t know, that’s where it’s really hard to do anything.”
nuancebydefault · a year ago
Well, there appears to be evolution in human perception of the capabilities of LLMs. An example: the 'stochastic parrots' notion seems to have mostly died out, at least in HN comments.

starchild3001 · a year ago
What a poorly informed article. It's very shallow and out of touch with LLM research. As it stands, 6-12 month old models are system 1 thinkers; everybody knows this and knew it even at the time. You need system 2 thinking (test-time compute) for more complex logical, algorithmic and reasoning tasks. We knew this when Daniel Kahneman wrote Thinking, Fast and Slow (over a decade ago) and we still know it today. So LLMs can think, but they have to be programmed to think (a la system 2, reasoning, thinking models). There's nothing inherently wrong or limited with LLMs themselves as far as we can tell.
astrange · a year ago
This is an example of "metaphor-driven development" in AI, which Phil Agre criticized a few decades ago.

System 1/System 2 isn't a real thing. It's just a metaphor Kahneman invented for a book. AI developers continually find metaphors about the brain, decide they are real, implement something which they give the same name, decide it's both real and the same thing because they have given it the same name, and then find it doesn't work.

(Another common example is "world model", something which has never had a clear meaning, and if you did define it you'd find that people don't have one and don't need one.)

tshadley · a year ago
> To understand the capabilities of LLMs, we evaluate GPT3 (text-davinci-003) [11], ChatGPT (GPT-3.5-turbo) [57] and GPT4 (gpt-4)

Oh dear, this is embarrassing. Anil Ananthaswamy, are you aware a year in AI research now is like 10 years in every other field?

geoffhill · a year ago
Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:

  solve(WaterDrinker, ZebraOwner) :-
      % H01: Five houses with positions 1..5.
      Houses = [ house(1, _, norwegian, _, _, _),  % H10: Norwegian lives in the first house.
                 house(2, blue, _, _, _, _),       % H15: Since the Norwegian lives next to the blue house,
                 house(3, _, _, milk, _, _),        %       and house1 is Norwegian, house2 must be blue.
                 house(4, _, _, _, _, _),
                 house(5, _, _, _, _, _) ],
  
      % H02: The Englishman lives in the red house.
      member(house(_, red, englishman, _, _, _), Houses),
      % H03: The Spaniard owns the dog.
      member(house(_, _, spaniard, _, dog, _), Houses),
      % H04: Coffee is drunk in the green house.
      member(house(_, green, _, coffee, _, _), Houses),
      % H05: The Ukrainian drinks tea.
      member(house(_, _, ukrainian, tea, _, _), Houses),
      % H06: The green house is immediately to the right of the ivory house.
      right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
      % H07: The Old Gold smoker owns snails.
      member(house(_, _, _, _, snails, old_gold), Houses),
      % H08: Kools are smoked in the yellow house.
      member(house(_, yellow, _, _, _, kools), Houses),
      % H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
      next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
      % H12: Kools are smoked in a house next to the house where the horse is kept.
      next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
      % H13: The Lucky Strike smoker drinks orange juice.
      member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
      % H14: The Japanese smokes Parliaments.
      member(house(_, _, japanese, _, _, parliaments), Houses),
      % (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
      
      % Finally, find out:
      % Q1: Who drinks water?
      member(house(_, _, WaterDrinker, water, _, _), Houses),
      % Q2: Who owns the zebra?
      member(house(_, _, ZebraOwner, _, zebra, _), Houses).
  
  right_of(Right, Left, Houses) :-
      nextto(Left, Right, Houses).
  
  next_to(X, Y, Houses) :-
      nextto(X, Y, Houses);
      nextto(Y, X, Houses).
Seems ok to me.

   ?- solve(WaterDrinker, ZebraOwner).
   WaterDrinker = norwegian,
   ZebraOwner = japanese .

orbital-decay · a year ago
That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.

This is all known for a long time and makes intuitive sense - you can't squeeze more computation from it than it can provide. The authors just formally proved it (which is no small deal). And Quanta is being dramatic with conclusions and headlines, as always.

[1] https://arxiv.org/abs/2412.02975

[2] https://news.ycombinator.com/item?id=42889786

antirez · a year ago
LLMs using CoT are also decoder-only, it's not a paradigm shift like people want to claim now to don't say they were wrong: it's still next token prediction, that is forced to explore more possibilities in the space it contains. And with R1-Zero we also know that LLMs can train themselves to do so.
janalsncm · a year ago
That’s a different paper than the one this article describes. The article describes this paper: https://arxiv.org/abs/2305.18654
teruakohatu · a year ago
gpt-4o, asked to produce SWI-Prolog code, gets the same result using very similar code. gpt-4-turbo can do it with slightly less nice code. gpt-3.5-turbo struggled to get the syntax correct, but I think with some better prompting it could manage.

CoT is definitely optional. Although I am sure all LLMs have seen this problem explained and solved in training data.

mycall · a year ago
This doesn't include encoder-decoder transformer fusion for machine translation, or encoder-only models like BERT used for text classification and named entity recognition.
leonidasv · a year ago
Also, notice that the original study is from 2023.
echelon · a year ago
The LLM doesn't understand it's doing this, though. It pattern matched against your "steering" in a way that generalized. And it didn't hallucinate in this particular case. That's still cherry picking, and you wouldn't trust this to turn a $500k screw.

I feel like we're at 2004 Darpa Grand Challenge level, but we're nowhere near solving all of the issues required to run this on public streets. It's impressive, but leaves an enormous amount to be desired.

I think we'll get there, but I don't think it'll be in just a few short years. The companies hyping that this accelerated timeline is just around the corner are doing so out of existential need to keep the funding flowing.

simonw · a year ago
Solving it with Prolog is neat, and a very realistic way of how LLMs with tools should be expected to handle this kind of thing.
EdwardDiego · a year ago
I would've been very surprised if Prolog to solve this wasn't something that the model had already ingested.

Early AI hype cycles, after all, are where Prolog, like Lisp, shone.

lsy · a year ago
If the LLM’s user indicates that the input can and should be translated as a logic problem, and then the user runs that definition in an external Prolog solver, what’s the LLM really doing here? Probabilistically mapping a logic problem to Prolog? That’s not quite the LLM solving the problem.
xyzzy123 · a year ago
Do you feel differently if it runs the prolog in a tool call?
baq · a year ago
But the problem is solved. Depends what you care about.
endofreach · a year ago
Psst, don't tell my clients that it's not actually me but the language's syntax I use that's solving their problem.
choeger · a year ago
So you asked an LLM to translate. It excels at translation. But ask it to solve and it will, inevitably, fail. But that's also expected.

The interesting question is: Given a C compiler and the problem, could an LLM come up with something like Prolog on its own?

charlieyu1 · a year ago
I think it could even solve it; these kinds of riddles are heavily represented in training data.
intended · a year ago
Science is not in the proving of it.

It’s in the disproving of it, and in the finding of the terms that help others understand the limits.

I don't know why it took me so long to come to that sentence. Yes, everyone can trot out their core examples that reinforce the point.

The research is motivated by these examples in the first place.

Agraillo · a year ago
Good point. LLMs can be treated as "theories" and then they definitely meet falsifiability [1], allowing researchers to keep finding "black swans" for years to come. Theories in this case can be different. But if the theory is of a logical or symbolic solver, then Wolfram's Mathematica may struggle with understanding human language as an input, but when it comes to evaluating the results, well, I think Stephen (Wolfram) can sleep soundly, at least for now.

[1] https://en.wikipedia.org/wiki/Falsifiability

est · a year ago
I'd say it's not only LLMs that struggle with these kinds of problems; 99% of humans do.
tuatoru · a year ago

    solve (make me a sandwich)
Moravec's Paradox is still a thing.

AtlasBarfed · a year ago
Can it port sed to Java? I just tried to do that in chatgippity and it failed.
mmcnl · a year ago
There's so much talk about the advancements in AI/LLMs, yet for me ChatGPT as of this date is basically just a faster search engine without cookie banners, clickbait and ads. It hallucinates a lot and it can keep very limited context. Why is there so much promise about future progress but so little actual progress?
EA-3167 · a year ago
It's the same cycle we saw with Crypto, there's so much money flying around that the motivation to "believe" is overwhelming. The hype is coming from all directions, and people are social animals that put greater weight on words that come from multiple sources. It's also a platform for people excited about the future to fantasize, and for people terrified of the future to catastrophize.
knowaveragejoe · a year ago
I have to wonder how you are using ChatGPT to get a lot of hallucinations or run into issues with limited context.
mikeknoop · a year ago
One must now ask whether research results are analyzing pure LLMs (eg. gpt-series) or LLM synthesis engines (eg. o-series, r-series). In this case, the headline is summarizing a paper originally published in 2023 and does not necessarily have bearing on new synthesis engines. In fact, evidence strongly suggests the opposite given o3's significant performance on ARC-AGI-1 which requires on-the-fly composition capability.
orbital-decay · a year ago
It's Quanta being misleading. They mention several papers but end up with this [1], which talks about decoder-only transformers, not LLMs in general, chatbots, or LLM synthesis engines, whatever that means. The paper also proves that CoT-like planning lets you squeeze more computation out of a transformer, which is... obvious? but formally proven this time. Models trained to do CoT don't have some magical on-the-fly compositional ability, they just invest more computation (which could be dozens of millions of tokens in the case of o3 solving the tasks from that benchmark).

[1] https://arxiv.org/abs/2412.02975

kebsup · a year ago
I've managed to get LLMs to fail on simple questions that require thinking graphically - in 2D or 3D.

An example would be: you have an NxM grid. How many pieces of shape XYZ can you fit on it?

However, thinking of transformer-based video game models, AI can be trained to have a good representation of 2D/3D worlds. I wonder how that can be combined so that this graphical representation is used to compute text output.

klodolph · a year ago
When one of these limitations gets spelled out in an article, it feels like six months later, somebody has a demo of a chatbot without that particular limitation.

These limitations don’t seem in any way “fundamental” to me. I’m sure there are a ton of people gluing LLMs to SAT solvers as we speak.

chefandy · a year ago
Could you give an example of something we recently solved that was considered an unsolvable problem six months beforehand? I don't have any specific examples, but it seems like most of the huge breakthrough discoveries I've seen announced end up being overstated, and for practical usage our choice of LLM-driven tools is only marginally better than it was a couple of years ago. It seems like the preponderance of practical advancement in recent times has come from tooling/interface improvements rather than from miracles generated by the models themselves. But it could be that I just don't have the right use cases.
munchler · a year ago
Take a look at the ARC Prize, which is a test for achieving "AGI" created in 2019 by François Chollet. Scroll down halfway on the home page and ponder the steep yellow line on the graph. That's what OpenAI o3 recently achieved.

[0] https://arcprize.org/

[1] https://arcprize.org/blog/oai-o3-pub-breakthrough

liamwire · a year ago
Not quite what you asked for, but it seems tangentially related and you might find it interesting: https://r0bk.github.io/killedbyllm/
gallerdude · a year ago
Completely disagree… there are a crazy number of cases that didn't work until the models scaled to a point where they magically did.

The best example I can think of is the ARC-AGI benchmark. It was designed to measure human-like intelligence through special symmetries and abstract patterns.

From GPT-2 to GPT-4 there was basically no progress, then o1 got about 20%. Now o3 has basically solved the benchmark.

intelkishan · a year ago
Performance of OpenAI o3 on the ARC-AGI challenge fits the bill; however, the model has not been released publicly.
wslh · a year ago
SMT solvers really.
levocardia · a year ago
Came here to say the same thing. Bet o3 and Claude 3.5 Opus will crush this task by the end of 2025.
xigency · a year ago
I've been slacking but yeah it's on my list.