So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token-by-token choice at each step leads to runaway errors -- these can't be damped mathematically.
In its place, he offers the idea that we should have something like an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.
Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.
LeCun's argument is this:
1) You can't learn an accurate world model just from text.
2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.
He and people like Hinton and Bengio have been saying for a while that there are tasks a mouse can understand that an AI can't, and that even achieving mouse-level intelligence would be a breakthrough, but one we cannot reach through language learning alone.
A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.
LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.
The energy minimization architecture is more about joint multimodal learning.
(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
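To make the "set up a sensible loss and minimize it" point concrete, here is a minimal toy sketch (my own illustration, not anything from LeCun's papers) of a margin-based energy loss: learn a scalar "energy" for (input, output) pairs, push it down for compatible pairs and up for corrupted ones, and never compute a partition function. All names and dimensions are made up.

```python
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    """Toy scalar energy E(x, y); lower means the pair is more compatible."""
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def margin_loss(model, x, y_pos, y_neg, margin=1.0):
    # Push the energy of observed pairs below that of corrupted pairs by a margin,
    # with no normalization over every possible y.
    e_pos = model(x, y_pos)
    e_neg = model(x, y_neg)
    return torch.relu(margin + e_pos - e_neg).mean()
```

The appeal is exactly what the parenthetical above says: you get a sensible objective to minimize without ever paying for the partition function.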
Without any 'understanding' or knowledge of what they're saying, they will remain irreconcilably dysfunctional. Hence the typical pattern with LLMs:
---
How do I do [x]?
You do [a].
No that's wrong because reasons.
Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [b].
No that's also wrong because reasons.
Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [a].
FML
---
More advanced systems might add a c or a d, but it's just more noise before repeating the same pattern. DeepSeek's more visible (and lengthy) reasoning demonstrates this perhaps most clearly. It just can't stop coming back to the same wrong (but statistically probable) answer, and ping-ponging off that answer (which it at least acknowledges is wrong, thanks to user input) makes up basically the entirety of its reasoning phase.
Table stakes for sentience: knowing when the best answer is not good enough. Try prompting LLMs with that.
It's related to LeCun's (and Ravid's) subtle question I mentioned in passing below:
To Compress Or Not To Compress?
(For even a vast majority of Humans, except tacitly, that is not a question!)
Just a lay opinion here but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in a clinical fashion, but an android’s association of that to the powerful dynamic response required to even attempt to recover will give a newfound association and power to the word ‘slip’.
As I'm typing this, there is one reality I'm coming to understand: the quality and completeness of the data fundamentally determine how well an AI system will work. With text alone this is hard to achieve, so a multimodal experience is a must.
thank you for explaining in very simple terms that I could understand
> The sun feels hot on your skin.
No matter how many times you read that, you cannot understand what the experience is like.
> You can read a book about Yoga and read about the Tittibhasana pose
But just reading about it will not tell you what it feels like. And unless you are in great shape and have great balance, you will fail for a while before you get it right (which is only human).
I have read what shooting up with heroin feels like, from a few different sources. I am certain that I will have no real idea unless I try it (and I don't want to do that).
Waterboarding. I have read about it. I have seen it on TV. I am certain that is all abstract compared to having someone actually do it to you.
Hand-eye coordination, balance, color, taste, pain, and so on: how we encode things draws on all our senses, our state of mind, and our experiences up until that time.
We also forget and change what we remember.
Many songs take me back to a certain time, a certain place, a certain feeling. Taste is the same. Location too.
The way we learn and the way we remember things is vastly more complex than text.
But if you have shared experiences, then when you write about them, other people will know. Most people have felt the sun hot on their skin.
To different extents this is also true for animals. Now, I don't think most mice can read, but they do learn with many different senses, and remember some combination or permutation of them.
When communicating between two entities with similar brains who have both had many thousands of hours of similar types of sensory experiences, yeah. When I read text I have a lot more than other text to relate it to in my mind; I bring to bear my experiences as a human in the world. The author is typically aware of this and effectively exploits this fact.
The taste of a specific dish, the exact feeling of nostalgia, or the full depth of a traumatic or ecstatic moment can be approximated in words but never fully captured. Language is symbolic and structured, while experience is often fluid, embodied, and multi-sensory. Even the most precise or poetic descriptions rely on shared context and personal interpretation, meaning that some aspects of experience inevitably remain untranslatable.
If I told you the text contained a detailed theory of FTL travel, could you ever construct the engine? Could you even prove it contained what I told you?
Can you imagine that given enough time, you'd recognize patterns in the text? Some sequences of glyphs usually follow other sequences, eventually you could deduce a grammar, and begin putting together strings of glyphs that seem statistically likely compared to the source.
You can do all the analysis you like and produce text that matches the structure and complexity of the source. A speaker of that language might even be convinced.
At what point do you start building the space ship? When do you realize the source text was fictional?
There are many untranslatable human languages across history; famously, ancient Egyptian hieroglyphs. We had lots and lots of source text, but all context relating the text to the world had been lost. It wasn't until we found a parallel translation on the Rosetta Stone that we could understand the meaning of the language.
Text alone has historically proven to not be enough for humans to extract meaning from an unknown language. Machines might hypothetically change that but I'm not convinced.
Just think of how much effort it takes to establish bidirectional spoken communication between two people with no common language. You have to be taught the word for apple by being given an apple. There's really no exception to this.
Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.
1) Yes it's true, learning from text is very hard. But LLMs are multimodal now.
2) That "size of a lion" paper is from 2019, which is a geological era from now. The SOTA was GPT2 which was barely able to spit out coherent text.
3) Have you tried asking a mouse to play chess or reason its way through some physics problem or to write some code? I'm really curious in which benchmark are mice surpassing chatgpt/ grok/ claude etc.
An LLM is essentially a search over a compressed dataset, with a tiny bit of reasoning as emergent behaviour. It is a parrot, and that is why you get "hallucinations": the search failed (like when you get a bad result in Google), or the lossy compression failed, or its reasoning failed.
Obviously there is a lot of stuff the LLM can find in its searches that is reminiscent of the great intelligence of the people writing its training data.
The magic trick is impressive because when we judge a human, what do we do... an exam? an interview? Someone with a perfect memory can fool many people, because most people can only acquire that kind of memory through tacit knowledge. Most people need to live in Paris to become fluent in French. So we see a robot with a tiny bit of reasoning and a brilliant memory as a brilliant mind. But this is an illusion.
Here is an example:
User: what is the French Revolution?
Agent: The French Revolution was a period of political and societal change in France which began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy and its values remain central to modern French political discourse.
Can you spot the trick?
I'm also really curious what benchmarks LLMs have passed that include surviving without being eaten by a cat, or a gull, or an owl, while looking for food to survive and feed one's young in an arbitrary environment chosen from urban, rural, natural etc, at random. What's ChatGPT's score on that kind of benchmark?
Where LeCun might be prescient should intersect with his nemesis Schmidhuber. They can't both be wrong, I suppose?!
It's only "tangentially" related to energy minimization, technically speaking :) connection to multimodalities is spot-on.
https://www.mdpi.com/1099-4300/26/3/252
To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review
With Ravid, double-handedly blue-flag MDPI!
Summarized for the layman (propaganda?) https://archive.is/https://nyudatascience.medium.com/how-sho...
>When asked about practical applications and areas where these insights might be immediately used, Shwartz-Ziv highlighted the potential in multi-modalities and tabula
Imho, best take I've seen on this thread (irony: literal energy minimization) https://news.ycombinator.com/item?id=43367126
Of course, this would make Google/OpenAI/DeepSeek wrong by two whole levels (both architecturally and conceptually)
LLMs can be trained with multimodal data. Language is just tokens, and pixel and sound data can be encoded into tokens too. All data can be serialized. You can train this thing on data we can't even comprehend.
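For what it's worth, here is a toy sketch of what "pixel data can be encoded into tokens" means; a real system would use a learned tokenizer (e.g. a VQ-VAE), so the random codebook below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                  # stand-in for sensor data
patches = image.reshape(8, 8, 8, 8, 3)           # 8x8 grid of 8x8 patches
patches = patches.transpose(0, 2, 1, 3, 4).reshape(64, -1)  # 64 patch vectors

codebook = rng.random((512, patches.shape[1]))   # toy codebook of 512 codes
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)                    # one discrete token id per patch
print(tokens[:10])                               # a "sentence" of image tokens
```

Once everything is a token sequence, the same next-token machinery applies, which is the point being made above.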
Here's the big question. It's clear we need less data than an LLM does. But I think that's because evolution has pretrained our brains, so we have brains geared towards specific things. We are geared towards walking, talking and reading, in the same way a cheetah is geared towards ground speed more than it is towards flight.
If we placed a human and an LLM in completely unfamiliar spaces and tried to train both with data, which would perform better?
And I mean completely unfamiliar spaces. Let's make it a non-Euclidean space, using only sonar for visualization: something totally foreign to reality as humans know it.
I honestly think the LLM would beat us in this environment. We might already have succeeded in creating AGI; it's just that the G is too much. It's too general, so it's learning everything from scratch and can't catch up to us.
Maybe what we need is to figure out how to bias the AI to think and be biased in the way humans are biased.
- echolocation in blind humans https://en.wikipedia.org/wiki/Human_echolocation
- sight through signals sent on tongue https://www.scientificamerican.com/article/device-lets-blind...
In the latter case, I recall reading the people involved ended up perceiving these signals as a "first order" sense (not consciously treated information, but on an intuitive level like hearing or vision).
If you think of all the neurons connected up to vision, touch, hearing, heat receptors, balance, etc. there’s a constant stream of multimodal data of different types along with constant reinforcement learning - e.g. ‘if you move your eye in this way, the scene you see changes’, ‘if you tilt your body this way your balance changes’, etc. and this runs from even before you are born, throughout your life.
Pretty good idea for a video game!
Funny how that sentence could have been used 15 years ago too when he was right about persevering through neural network scepticism.
So unlike their knowledge-system predecessors, a bit derogatorily referred to as GOFAI (good old-fashioned AI), nAI harked back to cybernetics and multi-layered dynamical systems rather than having explicit internal symbolic models. Braitenberg rather than Blocks World, so to speak.
Seems like we are back for another turn of the wheel in this aspect.
Grounding an AI before we fix certain things [..., 'corruption', Ponzi schemes, deliberate impediment of information flow to population segments and social classes, among other things, ... and a chain of command in hierarchies that are built on all that] is impossible.
Why do smart people not talk about this at all? The least engineers and smart people should do is to pick these fights for real. It's just a few interest groups, not all of them. I understand a certain balance is necessary in order to keep some systems from tipping over (aka "this is humanity, silly, this is who we are"), but we are far from the point of efficient friction, and that's only because "smart people" like LeCun et al. are not picking those fights.
How the hell do you expect to ground an AI in a world where elected ignorance amplifies bias and fallacies for power and profit, while the literal shit is hitting all the fans via intended and unintended side effects? Any embodied AI will pretend, until there is no way to deny it, that the smartest, the brightest and the productive don't care about the system in any way but are just running algorithmically while ignoring what should not be ignored ("should" as in: an AI should be aligned with humanity's interests and should be grounded in the shared world model).
It feels like special pleading: surely _this_ will be the problem class that doesn’t fall to “the bitter lesson”.
My intuition is that the main problem with the current architecture is that mapping into tokens causes quantization that a real brain doesn’t have, and lack of plasticity.
I don’t build models, I spend 100% of my time reading and adjusting model outputs though.
(I'm obviously exaggerating a bit for the sake of the argument, but the point stands. Multimodality should not be a prerequisite to AGI)
the LLM is more like a brain in a vat with only one sensory input - a stream of text
A good, precise spec is better than a few pictures, sure; the random text content of whatever training set you can scrape together, perhaps not (?)
On the other hand, if you ever simply see a meter stick, any statement that something measures a particular multiple or fraction of that you can already understand, without ever needing to learn the size of anything else.
But given that blindness and deafness are an impediment to acquiring language, more than anything else, I'd say that's the exact opposite of the conclusions from the comment you're replying to.
But yes, depending on where you set the bar for "true learning" being blind and deaf would prevent it.
I assume you're asking if vision and sound are required for learning, the answer I assume is no. Those were just chosen because we've already invented cameras and microphones. Haptics are less common, and thus less talked about.
Ehhhh, energy-based models are trained via contrastive divergence, not just minimizing a simple loss averaged over the training data.
My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.
I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage the cusp of the S-curve comes as we find where the model is good enough to be deployed and where it isn't. Then companies tend to enter a holding pattern for a number of years, getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.
Right now I would guess that we are around 0.9 on the S-curve: we can still improve LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.
I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.
[1]: https://www.open.edu/openlearn/nature-environment/organisati...
That seems to be how science works as a whole. Long periods of little progress between productive paradigm shifts.
The problem with LLMs is that the output is inherently stochastic, i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.
Energy minimization is more of an abstract approach, where you can use architectures that don't rely on things like differentiability. True AI won't be a solely feedforward architecture like current LLMs. To give an answer, it will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
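As a rough illustration of that kind of gradient-free search (a toy example of my own, not tied to any specific proposal): a tiny (1+λ)-style evolution loop that optimizes parameters against a fitness you could not differentiate through.

```python
import numpy as np

def fitness(params):
    # Stand-in for a non-differentiable cost such as measured memory or cycles;
    # np.round makes it piecewise constant, so gradients are useless here.
    return -np.sum(np.abs(np.round(params * 10)))

rng = np.random.default_rng(0)
best = rng.normal(size=8)
for generation in range(200):
    candidates = best + 0.1 * rng.normal(size=(16, 8))   # mutate
    scores = np.array([fitness(c) for c in candidates])
    if scores.max() > fitness(best):
        best = candidates[scores.argmax()]                # select the fittest
print(fitness(best))
```

Genetic algorithms and PSO elaborate on the same "assign a fitness, keep what scores best" loop.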
I don't think this explanation is correct. The output at the end of all the attention layers etc. (as I understand it) is a probability distribution over tokens. So the model as a whole does have an ability to score low confidence in something by assigning it a low probability.
The problem is that that thing is a token (part of a word). So the LLM can say "I don't have enough information" to decide on the next part of a word, but it has no ability to say "I don't know what on earth I'm talking about" (in general, not associated with a particular token).
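A toy illustration of that distinction: per-token uncertainty is easy to read off the output distribution, but it is uncertainty about the next word piece, not about whether the answer as a whole is grounded. (No real model here, just made-up logits.)

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits_confident = np.array([8.0, 1.0, 0.5, 0.2])   # one token dominates
logits_unsure    = np.array([1.1, 1.0, 0.9, 1.0])   # nearly uniform

for logits in (logits_confident, logits_unsure):
    p = softmax(logits)
    entropy = -(p * np.log(p)).sum()   # high entropy = unsure about *this token* only
    print(p.round(3), "entropy:", round(float(entropy), 3))
```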
Rather than inferring from how you imagine the architecture working, you can look at examples and counterexamples to see what capabilities they have.
One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels. It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.
Models predict the next word, but they don't just predict the next word. They generate a great deal of internal information in service of that goal. Placing limits on their abilities by assuming the output they express is the sum total of what they have done is a mistake. The output probability is not what it thinks, it is a reduction of what it thinks.
One of Andrej Karpathy's recent videos talked about how researchers showed that models do have an internal sense of not knowing the answer, but fine-tuning on question answering did not give them the ability to express that knowledge. Finding information the model did and didn't know, then fine-tuning it to say "I don't know" for the cases where it had no information, allowed the model to generalise and express "I don't know".
Other architectures, like energy-based models or Bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can do that kind of assessment.
What you can't currently get, from a (linear) Transformer, is a way to induce a similar observable "fault" in any of the hidden layers. Each hidden layer only speaks the "language" of the next layer after it, so there's no clear way to program an inference-framework-level observer side-channel that can examine the output vector of each layer and say "yup, it has no confidence in any of what it's doing at this point; everything done by layers feeding from this one will just be pareidolia — promoting meaningless deviations from the random-noise output of this layer into increasing significance."
You could in theory build a model as a Transformer-like model in a sort of pine-cone shape, where each layer feeds its output both to the next layer (where the final layer's output is measured and backpropped during training) and to an "introspection layer" that emits a single confidence score (a 1-vector). You start with a pre-trained linear Transformer base model, with fresh random-weighted introspection layers attached. Then you do supervised training of (prompt, response, confidence) triples, where on each training step, the minimum confidence score of all introspection layers becomes the controlled variable tested against the training data. (So you aren't trying to enforce that any particular layer notice when it's not confident, thus coercing the model to "do that check" at that layer; you just enforce that a "vote of no confidence" comes either from somewhere within the model, or nowhere within the model, at each pass.)
This seems like a hack designed just to compensate for this one inadequacy, though; it doesn't seem like it would generalize to helping with anything else. Some other architecture might be able to provide a fully-general solution to enforcing these kinds of global constraints.
(Also, it's not clear at all, for such training, "when" during the generation of a response sequence you should expect to see the vote-of-no-confidence crop up — and whether it would be tenable to force the model to "notice" its non-confidence earlier in a response-sequence-generating loop rather than later. I would guess that a model trained in this way would either explicitly evaluate its own confidence with some self-talk before proceeding [if its base model were trained as a thinking model]; or it would encode hidden thinking state to itself in the form of word-choices et al, gradually resolving its confidence as it goes. In neither case do you really want to "rush" that deliberation process; it'd probably just corrupt it.)
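To make the shape of that hypothetical concrete, here is a loose sketch of the idea as I read it (my own paraphrase, not an existing architecture): a stack of transformer blocks, each feeding a small freshly initialized head that emits one confidence scalar, with the model's overall confidence taken as the minimum over layers. In the scheme above, the base blocks would come from a pre-trained model and only the heads (plus the confidence supervision) would be new.

```python
import torch
import torch.nn as nn

class IntrospectedTransformer(nn.Module):
    def __init__(self, d_model=256, n_layers=6, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One tiny introspection head per block, emitting a confidence in [0, 1].
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
            for _ in range(n_layers)
        )

    def forward(self, x):
        confidences = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            # Pool over the sequence and let this layer cast its confidence vote.
            confidences.append(head(x.mean(dim=1)))
        overall = torch.stack(confidences, dim=0).min(dim=0).values
        return x, overall
```

Training would then compare `overall` against the confidence label of each (prompt, response, confidence) triple, so no particular layer is forced to be the one that notices.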
This is true in terms of default mode for LLMs, but there's a fair amount of research dedicated to the idea of training models to signal when they need grounding.
SelfRAG is an interesting, early example of this [1]. The basic idea is that the model is trained to first decide whether retrieval/grounding is necessary and then, if so, after retrieval it outputs certain "reflection" tokens to decide whether a passage is relevant to answer a user query, whether the passage is supported (or requires further grounding), and whether the passage is useful. A score is calculated from the reflection tokens.
The model then critiques itself further by generating a tree of candidate responses, and scoring them using a weighted sum of the score and the log probabilities of the generated candidate tokens.
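A paraphrased sketch of that scoring step; the reflection-signal names and weights below are invented for illustration, so see the paper for the actual formulation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    avg_token_logprob: float   # mean log prob of the generated tokens
    relevance: float           # scores derived from reflection tokens, in [0, 1]
    support: float
    usefulness: float

def score(c, w_rel=1.0, w_sup=1.0, w_use=0.5, w_lm=1.0):
    # Weighted sum of the language-model likelihood and the reflection scores.
    reflection = w_rel * c.relevance + w_sup * c.support + w_use * c.usefulness
    return w_lm * c.avg_token_logprob + reflection

candidates = [
    Candidate("Paris is the capital of France.", -0.4, 0.9, 0.95, 0.8),
    Candidate("France's capital is Lyon.",       -0.3, 0.9, 0.10, 0.8),
]
print(max(candidates, key=score).text)   # the well-supported candidate wins
```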
We can probably quibble about the loaded terms used here like "self-reflection", but the idea that models can be trained to know when they don't have enough information isn't pure fantasy today.
[1] https://arxiv.org/abs/2310.11511
EDIT: I should also note that I generally do side with LeCun's stance on this, but not due to the "not enough information" canard. I think models learning from abstraction (i.e. JEPA, energy-based models) rather than memorization is the better path forward.
https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s
Isn't that true with humans too?
There's some leap humans make, even as stochastic parrots, that lets us generate new knowledge.
If I had been born a day earlier or later, I would have a completely different life because of initial conditions and randomness, but life doesn't feel that way, even though I think this is obviously true.
Have you ever tried telling ChatGPT that you're "in the city centre" and asking it if you need to turn left or right to reach some landmark? It will not answer with the average of the directions given to everybody who asked the question before, it will answer asking you to tell it where you are precisely and which way you are facing.
But if you ask it in terms of a knowledge test ("I'm at the corner of 1st and 2nd, what public park am I standing next to?") a model lacking web search capabilities will confidently hallucinate (unless it's a well-known park).
In fact, my personal opinion is that therein lies the most realistic way to reduce hallucination rates: rather than trying to train models to say "I don't know" (which is not really a trainable thing - models are fundamentally unaware of the limits of their own training data), instead just train them on which kinds of questions warrant a web search and which ones should be answered creatively.
that also entails information destruction in the form of the logits table, but for the most part that should be accounted for in the last step before final feedforward
This is obviously not true at this point except for the most loose definition of interpolation.
>don't rely on things like differentiability.
I've never heard lecun say we need to move away from gradient descent. The opposite actually.
To answer your question, think about how we train LLMs: we have them learn the statistical distribution of all written human language, such that, given a chunk of text (a prompt, etc.), the model samples its output distribution to produce the next most likely token (word, sub-word, etc.) and keeps doing that. It never learns how to judge what is true or false, and during training it never needs to ask "Do I already know this?" It is just spoon-fed information that it has to memorize, and it has no ability to acquire metacognition, which is something it would need to be trained to attain. As humans, we know what we don't know (to an extent) and can identify when we already know something or don't, such that we can say "I don't know." During training, an LLM is never taught to do this sort of introspection, so it will never know what it doesn't know.
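A toy version of that loop makes the point: at each step the "model" just produces a distribution over next tokens and we sample from it; nothing in the loop ever asks whether it actually knows what it is talking about. (The hand-written bigram table below stands in for the learned distribution.)

```python
import numpy as np

bigram = {
    "the":  {"cat": 0.5, "dog": 0.5},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "dog":  {"sat": 0.4, "ran": 0.6},
    "sat":  {"down": 1.0},
    "ran":  {"away": 1.0},
    "down": {"<end>": 1.0},
    "away": {"<end>": 1.0},
}

rng = np.random.default_rng(0)
token, output = "the", ["the"]
while token != "<end>":
    dist = bigram[token]                                   # next-token distribution
    token = rng.choice(list(dist), p=list(dist.values()))  # sample, don't verify
    output.append(token)
print(" ".join(output[:-1]))
```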
I have a bunch of ideas about how to address this with a new architecture and a lifelong learning training paradigm, but it has been hard to execute. I'm an AI professor, but really pushing the envelope in that direction requires I think a small team (10-20) of strong AI scientists and engineers working collaboratively and significant computational resources. It just can't be done efficiently in academia where we have PhD student trainees who all need to be first author and work largely in isolation. By the time AI PhD students get good, they graduate.
I've been trying to find the time to focus on getting a start-up going focused on this. With Terry Sejnowski, I pitched my ideas to a group affiliated with Schmidt Sciences that funds science non-profits at around $20M per year for 5 years. They claimed to love my ideas, but didn't go for it....
"when do you close the round?" = maybe
money in the bank account = yes
Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.
Ultimately, I think an architecture around "looping" where the model outputs are both some form of "self update" and "optional actionality" such that interacting with the model is more "sampling from a thought space" will be required.
These long-horizon (AGI) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future, which makes it a poor proxy. These energy-based methods fundamentally do very little that an RNN didn't do long ago.
I worked on higher-dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.
Additionally, the true solution to these problems is as likely to be found by a kid with a laptop as by a top researcher. If you find the solution to CL on a small AI model (MNIST), you solve it at all scales.
Somehow, it feels harder to trust a model that could evolve over time. Its performance might even degrade. That's a steep price to pay for having memory built in and a (possibly) self-evolving model.
I could revise that by saying a kid with a whiteboard.
It's an Einstein×10 moment, so who knows when that'll happen.
https://arxiv.org/abs/2502.09992
https://www.inceptionlabs.ai/news
(these are results from two different teams/orgs)
It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.
I think what Lecun is probably getting at is that there's currently no way for a model to say "I don't know". Instead, it'll just do its best. For esoteric topics, this can result in hallucinations; for topics where you push just past the edge of well-known and easy-to-Google, you might get a vacuously correct response (i.e. repetition of correct but otherwise known or useless information). The models are trained to output a response that meets the criteria of quality as judged by a human, but there's no decent measure (that I'm aware of) of the accuracy of the knowledge content, or the model's own limitations. I actually think this is why programming and mathematical tasks have such a large impact on model performance: because they encode information about correctness directly into the task.
So Yann is probably right, though I don't know that energy minimization is a special distinction that needs to be added. Any technique that we use for this task could almost certainly be framed as energy minimization of some energy function.
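For what it's worth, that framing is standard: any model that assigns scores to outputs can be read as an energy-based model by defining the energy as the negative score,

```latex
E(x, y) = -\mathrm{score}(x, y), \qquad
p(y \mid x) = \frac{e^{-E(x, y)}}{\sum_{y'} e^{-E(x, y')}}
```

so picking the most probable continuation is exactly minimizing E, and the usual cross-entropy training is one particular way of shaping that energy.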