ebiester · 4 months ago
The hard part is that all the things the author says disprove LLM intelligence are failings of humans too.

* Humans tell you how they think, but it seemingly is not how they really think.

* Humans tell you repeatedly they used a tool, but they did it another way.

* Humans tell you facts they believe to be true but are false.

* Humans often need to be verified by another human and should not be trusted.

* Humans are extraordinarily hard to align.

While I am sympathetic to the argument, and I agree that machines aligned on their own goals over a longer timeframe is still science fiction, I think this particular argument fails.

GPT o3 is a better writer than most high school students at the time of graduation. GPT o3 is a better researcher than most high school students at the time of graduation. GPT o3 is a better lots of things than any high school student at the time of graduation. It is a better coder than the vast majority of first semester computer science students.

The original Turing test has been shattered. We keep building progressively harder standards for what counts as human intelligence, and as we set each new one, we quickly achieve it.

The gap is elsewhere: look at Devin to see the limitation. Its ability to follow its own goal plans is the next frontier, and maybe we don't want to solve that problem yet. What if we just decide not to solve that particular problem and lean further into the cyborg model?

We don't need them to replace humans - we need them to integrate with humans.

13years · 4 months ago
> GPT o3 is a better writer than most high school students at the time of graduation.

All of these claims, based on benchmarks, don't hold up on real-world tasks. That is strongly supportive of the statistical model: it is capable of answering patterns it was extensively trained on, but it quickly breaks down when you step outside that distribution.

o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated. Unless you are using it on material about which you are extremely knowledgeable, you won't know.

LLMs are probability machines, which means they will mostly produce content that aligns with the common distribution of their data. They don't analyze what is correct, only what the probable completions of your text are, given common word distributions. But at an incomprehensible scale of combinatorial patterns, this creates a convincing mimic of intelligence, and it does have its uses.
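
To make that concrete, here is a toy sketch (made-up numbers, not a real model): generation is just sampling from P(next token | context), so a wrong continuation is merely a less probable one, never a flagged one.

```python
# Toy sketch of next-token generation: sampling from a conditional
# probability distribution, with no notion of truth anywhere.
import random

# Hypothetical learned distribution P(next_word | context) for one context.
next_word_probs = {
    "Paris": 0.80,   # common completion in the training data
    "Lyon": 0.15,
    "Berlin": 0.05,  # wrong, but still carries probability mass
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Pick a continuation in proportion to its probability, not its correctness."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

context = "The capital of France is"
print(context, sample_next_word(next_word_probs))
```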

But importantly, it diverges from the behaviors we would see in true intelligence in ways that make it inadequate for solving many of the kinds of tasks we are hoping to apply it to; namely, the significant unpredictable behaviors. There is just no way to know what type of query/prompt will result in operating over concepts outside the training set.

ebiester · 4 months ago
I don't dispute that these are problems, but the fact that its hallucinations are quite sophisticated suggests to me that they are errors humans could also make.

I am not saying that LLMs are better than you analyze, but rather that average humans are worse. (Well-trained humans will continue to be better alone than LLMs alone for some time. But compare an LLM to an 18-year-old.)

benlivengood · 4 months ago
> o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated. Unless you are using it on material about which you are extremely knowledgeable, you won't know.

At least 3/4 of humans identify with a religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs. Dogma is almost identical to the doubling-down on hallucinations that LLMs produce.

I think what this shows about intelligence in general is that, without grounding in physical reality, it tends to hallucinate from some statistical model of reality and confabulate further ungrounded statements unless there is a strong, active effort to ground each statement in reality. LLMs have the disadvantage of having no real-time grounding in most instantiations; Gato and related robotics projects excepted. This is not so much a problem with transformers as it is with the lack of feedback tokens in most LLMs. Pretraining on ground-truth texts can give an excellent prior probability over next tokens, and I think feedback, either in the weights (continuous fine-tuning) or as real-world feedback tokens in response to outputs, can get transformers to hallucinate less in the long run (e.g. after responding to feedback when OOD).

montroser · 4 months ago
> LLMs are probability machines.

So too are humans, it turns out.

KTibow · 4 months ago
Were you using it with search enabled?

GuB-42 · 4 months ago
Humans can be deceptive, but it is usually deliberate. We can also honestly make things up and present them as fact, but that is not common; we usually say that we don't know. And generally, lying is harder for us than telling the truth, in the sense that constructing a consistent but false narrative requires effort.

For LLMs, making stuff up is the default; one can argue that it is all they do, and it just happens to be the truth most of the time.

And AFAIK, what I would call the "real" Turing test hasn't come close to being shattered. The idea is that the interrogator and the human subject are both experts and collaborate against the computer. They can't cheat by exchanging secrets, but anything else is fair game.

I think it is important because the Turing test has already been "won" by primitive algorithms acting clueless to interrogators who were not aware of the trick. For me, this is not really a measure of computer intelligence, more like a measure of how clever the chatbot designers were at tricking unsuspecting people.

13years · 4 months ago
> we usually say that we don't know

I think this is one of the distinguishing attributes of human failures. Human failures have some degree of predictability: we know when we aren't good at something, and we then devise processes to close that gap, which can be consultations, training, process reviews, use of tools, etc.

The failures we see in LLMs are distinctly of a different nature. They often appear far more nonsensical and have more of a degree of randomness.

LLMs as a tool would be far more useful if they could indicate what they are good at, but since they cannot self-reflect on their knowledge, that is not possible. So they are equally confident in everything, regardless of its correctness.

TimTheTinker · 4 months ago
I think TA's argument fundamentally rests on two premises (quoting):

(a) If we were on the path toward intelligence, the amount of training data and power requirements would both be reducing, not increasing.

(b) [LLMs are] data bound and will always be unreliable as edge cases outside common data are infinite.

The most important observed consequences of (b) are model collapse when repeatedly fed LLM output in further training iterations; and increasing hallucination when the LLM is asked for something truly novel (i.e. arising from understanding of first principles but not already enumerated or directly implicated in its training data).

Yes, humans are capable of failing (and very often do) in the same ways: we can be extraordinarily inefficient with our thoughts and behaviors, we can fail to think critically, and we can get stuck in our own heads. But we are capable of rising above those failings through a commitment to truths (or principles, if you like) outside of ourselves, community (including thoughtful, even vulnerable conversations with other humans), self-discipline, intentionality, doing hard things, etc...

There's a reason that considering the principles at play, sitting alone with your own thoughts, mulling over a problem for a long time, talking with others and listening carefully, testing ideas, and taking thoughtful action can create incredibly valuable results. LLMs alone won't ever achieve that.

goatlover · 4 months ago
But there are places where humans do follow reasoning steps, such as arithmetic and logic. The fact that we need to add RLHF to models to make them more useful to humans is also evidence that statistical reasoning is not enough for general intelligence.

nopelynopington · 4 months ago
> * Humans tell you how they think, but it seemingly is not how they really think.

> * Humans tell you facts they believe to be true but are false.

Think and believe are key words here. I'm not trying to be spiritual, but LLMs do not think or believe a thing; they only predict the next word.

> * Humans often need to be verified by another human and should not be trusted.

You're talking about trusting another human to do that though, so you trust the human that is verifying.

Deleted Comment

Yeask · 4 months ago
How many books or pieces of software written by recently graduated students have you read or used?

And how many by LLMs?

minraws · 4 months ago
LLMs aren't as good as average humans. Most software folks like us like to believe the rest of the world is dumb, but it isn't.

My grand-dad, who only ever farmed land and had no education at all, could talk, calculate, and manage his farmland. Was he not as good as me at anything academic? Yes. But I will never be as good as him at understanding how to farm.

Most people who think LLMs are smart seem to conflate ignorance/lack of knowledge with being dumb.

This is a rather reductive take, and no I don't believe that's how human intelligence works.

Your dumb uncle at Thanksgiving, who might even have a lot of bad traits, isn't dumb, likely just ignorant. Human IQ studies and movies have distorted our perception of intelligence.

A more intelligent person doesn't necessarily need to have more or better quality knowledge.

And measuring them by academic abilities like writing and maths is the dumber/irresponsible take.

And yes please feel free to call me dumb in response.

advisedwang · 4 months ago
My understanding was that chain-of-thought is used precisely BECAUSE it doesn't reproduce the same logic that simply asking the question directly does. In "fabricating" an explanation for what it might have done if asked the question directly, it has actually produced correct reasoning. Therefore you can ask the chain-of-thought question to get a better result than asking the question directly.

I'd love to see the multiplication accuracy chart from https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator... with the output from a chain-of-thought prompt.
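
Something like this rough harness would produce that comparison. It is my own sketch, not the chart author's setup: it assumes the official `openai` Python client and a placeholder model name, so swap in whatever model/API you actually use.

```python
# Rough sketch: compare direct vs. chain-of-thought multiplication accuracy.
import random
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"   # placeholder model name, not necessarily what the chart used

DIRECT = "What is {a} * {b}? Reply with only the number."
COT = "What is {a} * {b}? Work through it step by step, then give the final number on the last line."

def last_number(text: str) -> int | None:
    """Extract the last integer in the reply, ignoring thousands separators."""
    nums = re.findall(r"-?\d[\d,]*", text)
    return int(nums[-1].replace(",", "")) if nums else None

def accuracy(template: str, digits: int, trials: int = 20) -> float:
    """Fraction of random n-digit multiplications answered correctly."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": template.format(a=a, b=b)}],
        ).choices[0].message.content
        correct += (last_number(reply) == a * b)
    return correct / trials

for d in range(2, 7):  # 2-digit through 6-digit operands
    print(d, "digits:", accuracy(DIRECT, d), "direct vs", accuracy(COT, d), "with CoT")
```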

mark_l_watson · 4 months ago
I mildly disagree with the author, but would be happy arguing his side also on some of his points:

Last September I used ChatGPT, Gemini, and Claude in combination to write a complex piece of code from scratch. It took four hours and I had to be very actively involved. A week ago o3 solved it on its own, at least the Python version ran as-is, but the Common Lisp version needed some tweaking (maybe 5 minutes of my time).

This is exponential improvement, and it is not so much the base LLMs getting better; rather, it is familiarity with me (chat history) and much better tool use.

I may be incorrect, but I think improvements in very long user-event and interaction context, increasingly intelligent tool use, perhaps some form of RL to develop per-user policies for correcting improper tool use, and increasingly good base LLMs will get us to a place, in the domain of digital knowledge work, where we will have personal agents that are AGI for a huge range of use cases.

13years · 4 months ago
> where we will have personal agents that are AGI for a huge range of use cases

We are already there for internet social media bots. I think the issue here is being able to discern the correct use cases. What is your error tolerance? For social media bots, it really doesn't matter so much.

However, mission critical business automation is another story. We need to better understand the nature of these tools. The most difficult problem is that there is no clear line for the point of failure. You don't know when you have drifted outside of the training set competency. The tool can't tell you what it is not good at. It can't tell you what it does not know.

This limits its applicability for hands-off automation tasks. If you have a task that must always succeed, there must be human review for whatever is assigned to the LLM.

xandrius · 4 months ago
Depends on what AGI means to you. If writing a program that you find complicated counts, then it is.

I do think that writing code is a specific type of task that a statistical machine can do well without actually bringing us closer to AGI.

thefounder · 4 months ago
So the “reasoning” text from OpenAI is no more than the old broken Windows “loading” animation.

xg15 · 4 months ago
I think there is still a widespread confusion between two slightly different concepts that the author also tripped over.

If you ask an LLM a question, then get the answer and then ask how it got to that answer, it will make stuff up - because it literally can't do otherwise: there is no hidden memory space where the LLM could do its calculations and also record which calculations it did, which it could then consult to answer the second question. All there is are the tokens.

However, if you tell the model to "think step by step", i.e. first make a number of small inferences and then use those to derive the final answer, you should (at least in theory) get a high-level description of the actual reasoning process, because the model will use the tokens of its intermediate results to generate the features for the final result.

So "explain how you did it" will give you bullshit, but "think step by step" should work.

And as far as my understanding goes, the "reasoning models" are essentially just heavily fine-tuned for step-by-step reasoning.
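
To make the distinction concrete, here is a rough sketch of the two conversation shapes (hypothetical messages, not any particular API). The only difference is which tokens exist in the context when the final answer is generated.

```python
# Case A: post-hoc explanation. The "how" question is answered from the visible
# tokens only; whatever computation produced "42" left no trace to consult,
# so the explanation is a plausible-sounding reconstruction.
post_hoc = [
    {"role": "user", "content": "What is 6 * 7?"},
    {"role": "assistant", "content": "42"},
    {"role": "user", "content": "Explain how you got that."},  # answered from the tokens above, nothing more
]

# Case B: think step by step. The intermediate steps are emitted as tokens first,
# so the final answer is conditioned on them; the written-out reasoning is the
# reasoning that actually fed the result.
step_by_step = [
    {"role": "user", "content": "What is 6 * 7? Think step by step, then give the answer."},
    # expected reply shape: "6 * 7 = 6 * (5 + 2) = 30 + 12 = 42. Answer: 42"
]

print(post_hoc, step_by_step, sep="\n")
```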

13years · 4 months ago
> that the author also tripped over

The evidence for unfaithful reasoning comes from Anthropic. It is in their system card and this Anthropic paper.

https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...

hnpolicestate · 4 months ago
One point that I think separates AI and human intelligence is an LLM's inability to tell me how it feels or its individual opinion on things.

I think to be considered alive you have to have an opinion on things.

tboyd47 · 4 months ago
Fascinating look at how AI actually reasons. I think it's pretty close to how the average human reasons.

But he's right that the efficiency of AI is much worse, and that matters, too.

Great read.

xg15 · 4 months ago
People ditch symbolic reasoning for statistical models, then are surprised when the model does, in fact, use statistical features and not symbolic reasoning.

13years · 4 months ago
I think it is actually worse than that. The hype labs are still defiantly trying to convince us that merely scaling statistics will somehow lead to the emergence of true intelligence. They haven't reached the point of being "surprised" yet.

setnone · 4 months ago
> All of the current architectures are simply brute-force pattern matching

This explains hallucinations, and I agree with the 'braindead' argument. To move toward AGI, I believe there should be some kind of social awareness component added, which is an important part of human intelligence.