We Found an Neuron in GPT-2

It’s notable how successful LLMs despite the lack of any linguistic tools in their architectures. It would be interesting to know how different a model would be if it operated on eg dependency trees instead of the linear list of tokens. Surely, the question of “a/an” would be solved with ease as the model would be required to come up with a noun token before choosing its determiner. I wonder if the developers of LLMs explored those approaches but found them infeasible due to large preprocessing times, immaturity of such tools and/or little benefit.

PeterisP · 3 years ago

I think the lack of explicit linguistic tools is the key to success, forcing/enabling the generic model to learn implicit linguistic tools (there's some research identifying that analysis of specific linguistic phenomena happens at specific places in the NN layers) that work better than what we could implement.

"It would be interesting to know how different a model would be if it operated on eg dependency trees instead of the linear list of tokens." - indeed, this is obviously interesting, so people have tried that a lot for many models, but IMHO it's probably now almost decade since the consensus is that in general end-to-end training (once we became able to do it) work better than adding explicit stages in between, e.g. for any random task I would expect that doing text->syntax tree->outcome is going to get worse results than text->outcome, because even if the task really needs syntax, the syntax representation that a stack of large transformer layers learns implicitly tends to be a better representation of the natural language than any human linguist devised formal grammar, which inevitably has to mangle the actual languge to fit into neat human-analyzable 'boxes'/classes/types in which it doesn't really fit and all the fuzzy edge cases stick out. Once you remove the constraint that the grammar must be simplified enough for a human to be able to rationally analyze and understand, and (perhaps even more importantly?) abandoning the need to prematurely disambiguate the utterance to a single syntax tree instead of accepting that parts of it are ambiguous (but not equally likely), processing works better.

It's just one more reminder of the bitter lesson (http://incompleteideas.net/IncIdeas/BitterLesson.html) which we don't want to accept.

clementneo · 3 years ago

I think there's probably some truth to this. They found that in InstructGPT — where they teach the model to better follow instructions, which was the jump from GPT-3 to ChatGPT — they found that the model also learnt to follow non-English instructions, even though the extra training was done almost exclusively in English[1].

So there seems to be such emergent mechanisms in the model that have arisen because of the end-to-end training, which we don't exactly understand yet.

[1] https://twitter.com/janleike/status/1625207251630960640

kilgnad · 3 years ago

Hmm I think it's sort of a hybrid.

First off, we know that overall the concept of language is humans is an emergent phenomenon. It developed from natural selection from simple components so there's validity that the same thing can occur in an LLM where some overarching emergent structure develops from simple primitives.

At the same time we do know that a sort of universal grammar exists among humans. Our language capacity is biased in a certain way and that it is unlikely for it to learn languages of a very extreme and divergent grammar from the universal one discovered by Noam chomsky. That means our brain is unlikely to be as universally simple as an LLM.

I think the key here is that the human mind has explicit linguistic tools but the these tools are still emergent in nature.

Deleted Comment

lalos · 3 years ago

Not relevant but:

"Every time I fire a linguist, the performance of the speech recognizer goes up". - Frederick Jelinek

jojobas · 3 years ago

Surely the remaining linguists muster up some improvements out of fear for their jobs!

Baeocystin · 3 years ago

Approaches such as you describe have been the dominant method for decades. That we finally 'cracked' natural language generation with tools that literally encode nothing about grammar ahead of time is one hell of a lesson, early days as it is in the learning of it.

letmevoteplease · 3 years ago

Reminds me of Stephen Krashen's input hypothesis of second-language acquisition. Krashen argues that consciously studying grammar is more or less useless, and only massive exposure to the language results in acquisition.[1] This is true in my experience.

[1] https://en.wikipedia.org/wiki/Input_hypothesis

solarmist · 3 years ago

Harris, Chomsky’s advisor, published his operator grammar over the decades and one of the key features was that it is completely self-discoverable.

Probability and observation are all that is required to understand a language.

sitkack · 3 years ago

Ref, A Bitter Lesson

KhoomeiK · 3 years ago

Several papers have explored emergent linguistic structure in LMs. Here's some early introductory work in this space. Despite having explicit syntax parses etc as input, models seem to learn something like syntax.

https://arxiv.org/abs/1905.05950

https://aclanthology.org/N19-1419/

https://arxiv.org/abs/1906.04341

compressedgas · 3 years ago

I would add https://www.pnas.org/doi/10.1073/pnas.1907367117

dgut · 3 years ago

grammar as we know it was devised for the Latin language and linguists spend most of the time attempting to fit other languages into neat boxes that the Latin grammar wasn't designed for. This of course leads to absurdity. Chomsky attempted to solve this problem with his universal grammar, but that too stops working quickly once you get outside of European languages. That is, ignoring linguistic tools is one of the reasons GPT is successful.

csb6 · 3 years ago

> that too stops working quickly once you get outside of European languages

I don’t think this is entirely fair. Generative grammars have been produced for a huge variety of non-European languages, even non-Indo-European languages, and can account for tremendous diversity in linguistic rules. Even languages without fixed word orders or highly synthetic languages can be represented.

Linguistics isn’t focused on the problem of outputting reasonable-sounding text responses. Instead, it seeks to transparently explain how language works and is structured, something that GPT does not do.

IIAOPSW · 3 years ago

That is not what universal grammar was about. Literally read even the wikipedia page or watch like any lecture on the topic.

carabiner · 3 years ago

It's so weird. And statistical weather models are doing better than the physical ones, at far lower computational cost.

> We started out with the question: How does GPT-2 know when to use the word an over a? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 is only capable of predicting one word at a time. We still don’t have a full answer...

I'm not sure I understand why this is an open question. While I get that GPT-2 is predicting only one word at a time, it doesn't seem that surprising that there might be cases where there is a dominant bigram (ie "an apple" in the case of their example prompt) that would trigger an "an" prediction, without actually predicting the following word first.

Am I missing something?

kelnos · 3 years ago

Yeah, my feeling here was that it's sort of tautological: if GPT predicts "a", then it must then predict a word that would follow "a" and not require "an" (and vice versa). And if you think about it from the opposite direction: if it's working out a response that is eventually going to have "apple" in it, then all the data it's trained on is going to cause it to predict "an" even before it needs to predict "apple".

(Admittedly, all this ML/AI stuff is still beyond my current level of understanding, so I'm sure my thinking here is off.)

caf · 3 years ago

I think that's essentially right - if it's already "thinking about" apples, then ' an' will be significantly upregulated at the expense of ' a', then if it does choose to output an ' an' token, ' apple' becomes a highly likely followup.

You could probably test this by seeing if prompts containing a lot of nouns that start with a vowel sound results in output that contains a higher proportion of otherwise unrelated first-vowel nouns. (ie your prompt includes lots of apples, apricots, avocados, asparagus, aubergines, elderberries, eggplants, endives, oranges, olives, okras, onions and you count the proportion of non-food nouns in the result that start with a vowel and non-vowel sound).

clementneo · 3 years ago

The main issue is that GPT is fundamentally an autoregressive language model — it's only predicting the next token based on the prompt at a single time. Every time it wants to predict the next word, it adds the previously predicted word into the prompt, repeating the cycle. We can intuitively guess that the model is 'working out a response that is eventually going to have "apple" in it', but we don't actually know how the model 'thinks' ahead about its response.

To rephrase that for this case: what is the specific mechanism in GPT-2 that (1) makes it realise that the word 'apple' is significant in this prompt, and (2) use that knowledge to push the model to predict 'an'? Finding this neuron would only answer the some portion of (2).

(And to rephrase this for the general case, which gives us the initial question: How does GPT-2 know when, given a suitable context, to predict 'an' over 'a'?)

josephmiller · 3 years ago

Author here! I think this is reasonable but I have two responses.

1. It's kinda interesting because this is a clear case where the model must be thinking beyond the next token, whereas in most contexts it's hard to say whether the model thinks ahead at all (although I would guess that it does most of the time).

2. More importantly, the key question here is how it works. We're not surprised that it has this behavior, but we want to understand which exact weights and biases in the network are responsible.

Note also that this is just the introductory sentence and the rest of the article would read exactly the same without it.

> it doesn't seem that surprising that there might be cases where there is a dominant bigram [...] that would trigger an "an" prediction, without actually predicting the following word first

btw I don't really understand what you mean by this. Bigrams can explain the second prediction in a two word pair but not the first.

tiarafawn · 3 years ago

I think what the parent was trying to communicate (and what I'm thinking as well) is doubting your premise in 1. ("the model must be thinking beyond the next token").

Rephrase "The model is good at picking the correct article for the word it wants to output next" to "After having picked a specific article, the model is good at picking a follow-up noun that matches the chosen article". Nothing about the second statement seems like an unlikely feat for a model that only predicts one word at a time without any thinking ahead about specific words.

tqi · 3 years ago

Thanks for responding, and for the cool article! Agree this is pretty tangential to the rest of the article.

So (again, I am very much not an expert so please correct me if I'm wrong) I guess my analogy would be if instead of predicting one word at a time, it predicted one letter at a time. At some point, would only be one word that could fit. So if prompted with "SENTE", and it returned "N", that doesn't mean that it's thinking ahead to the "CE" / knows that it is spelling "SENTENCE" already.

Is that a correct way to think of it?

planb · 3 years ago

> It's kinda interesting because this is a clear case where the model must be thinking beyond the next token

I don't see how you get to this conclusion. From all the training data it has seen, "an" is the most probable next word after "I climbed up the tree and picked up". The network does not need to know anything about the apple at this point. Then, the next word is "apple" (with an even higher probability I guess).

defen · 3 years ago

What happens if you ask it to complete "I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked a"

Deleted Comment

pera · 3 years ago

I don't understand why your conclusion is that "the model must be thinking beyond the next token": the model doesn't need to do that to generate a well-formed sentence because it's not constrained by the size of the sentence.

bilsbie · 3 years ago

As a test, during text generation could you change an “a” to an “an” and see if it changes the noun. Or did it already have a noun in mind and it sticks with that.

seppel · 3 years ago

I don't think you are missing something. I think this whole "GPT-2 is predicting only one word at a time" is a red herring anyway.

Of course it can only answer the next word, because there is only room in its outputs for the next word. But it has to compute much more. It a huge hidden internal state where it has to first encode what the given sentence is about, then predict some general concept in which the continuation goes, decide the locally correct syntactical structure and only from this you can predict the next word.

laszlokorte · 3 years ago

From my naïve point of view it seems obvious that at any point where both an „a“ or a „an“ would fit the model randomly selects one of them and by doing so reduces the set of possible nouns to follow.

Sharlin · 3 years ago

No, of course not randomly. If you RTA, it should be clear that some prompts are more likely to be followed by an “an” than others.

varispeed · 3 years ago

My friend who is not a native English speaker told me that one thing he struggled the most while learning English was the "a" and "an". He couldn't grasp the concept how it is possible that person knows which determiner to use before saying the word, until he learned it as just the part of the word and then he uses it depending on the context which one can "feel". So when he sees an apple, he says this is an apple.

Exactly what you mean by bigram.

caf · 3 years ago

And of course it is not unheard of for a native English speaker to correct themselves if they change their mind about which noun to use. You can imagine being asked to very rapidly verbally classify fruits ("it's an apple, it's an apple, it's a pear, it's an apple,..") that you might well find yourself stumbling like "it's a.. an apple".

sunk1st · 3 years ago

The rule is the rule because that’s what make sense phonetically. It’s why you would say “el agua” and “el hacha” in Spanish even though those articles don’t match the gender of those words.

SamBam · 3 years ago

Are there languages that don't have similar patterns?

Obviously any romance language includes gender, so you need to use the correct gendered article before the noun.

Japanese has different counting words depending on what you're counting.

I don't know if Mandarin has anything similar.

BaseballPhysics · 3 years ago

Now really bake their noodle and ask what they put before the word 'history' or the acronym 'LLM'? :)

hgsgm · 3 years ago

Does his language have grammatical gender?

lalos · 3 years ago

Not to mention that the corpus mostly will have the correct case for most common words like apple. Train it with essays of ESL students and you'll get something else.

theptip · 3 years ago

I think this would be a valid objection if they stopped there.

But then: “ Testing the neuron on a larger dataset”

If I follow correctly, they test a bunch of different completions that contain “an”. So they are not just detecting the bigram “an apple”, but the common activation among a bunch of “an X” activations where X is completely different.

prox · 3 years ago

Wouldn’t this also be correlated somehow in its vector space?

VeninVidiaVicii · 3 years ago

Word shmord, “an apple” is a lexical concept.

Co-author here! I'm kind of surprised that this made it to the top of HN! This was a project in which Joseph and I tried to reverse engineer the mechanism in which GPT-2 predicts the word 'an'.

It's crazy that large language models work so well just by being trained as a next-word-prediction model over a large amount of text data. We know how image models learn extract the features of an image through convolution[1], but how and what LLMs learn exactly remain a black box. When we dig deeper into the mechanisms that drive LLMs, we might get closer to understanding why they work so well in some senses, and why they could be catastrophic in other cases (see: the past month of search-based developments).

I find trying to understand and reverse-engineer LLMs to be a personally exciting endeavour. As LLMs get better in the near future, I sure hope our understanding of them can keep up as well!

[1] https://distill.pub/2020/circuits/zoom-in/

HarHarVeryFunny · 3 years ago

I wonder if you could comment on this (related to question of how far ahead these "LLM"s are planning).

This is Wharton professor Ethan Mollick playing with the new Bing chat, which seems considerably more advanced than ChatGPT (based on GPT-4 perhaps?).

Here he asks it to write something using Kurt Vonnegut's rules of writing.

https://twitter.com/emollick/status/1626084142239649792

It seems hard to explain how Bing/GPT could have generated the Vonnegut-inspired cake story, having ingested the rules, without planning the whole thing before generating the first word.

It seems there's an awful lot more going on internally in these models than a mere word by word autoregressive generation. It seems the prompt (in this case including Vonnegut's rules) is ingested and creates a complex internal state that is then responsible for the coherency and content of the output. The fact that it necessarily has to generate the output one word at a time seems to be a bit misleading in terms of understanding when the actual "output prediction" takes place.

gbasin · 3 years ago

There is "long range" dependence, it's just only on the prompt: the conversation with the user and the hidden header (e.g. "Answer as ChatGPT, an intelligent AI, state your reasons, be succinct, etc."). That ends up being enough.

mungoman2 · 3 years ago

Convolution is part of the network design though. Would a fully connected network learn to convolute? Or would it turn out that convolution is not necessary?

nerdponx · 3 years ago

The interesting part here isn't the convolution itself, it's how convolutional layers turn out to like "filters" or "detectors" for individual features. This is explained very well in the distill.pub article linked by GP.

We know the architecture of LLMs because we created it, but we don't yet have the same level of understanding about them, or the same quality of analytical tools for reasoning about them.

xmcqdpt2 · 3 years ago

They do and in fact it's relatively straightforward to show empirically on eg MNIST. The problem is that you need a much much larger network in the FCN case and thus need way more data and way more data augmentation to get a good result that isn't overfit to hell.

In the case of CNN the reason it works is that an image of an object X is still an image of object X if the X is shifted left or right. The property is translationally invariant. CNN are basically the simplest way to encode translational invariance.

redox99 · 3 years ago

Yes it would, or at least a similar operation.

The point of using a CNN instead of a FCN is that you force it to learn in a certain way that prevents overfitting. But given a sufficient dataset, and proper data augmentation you would expect a FCN to be able to identify objects regardless of translation. It's just that a CNN would train easier and better, with a smaller network (a FCN doing convolutions would be very wasteful).

That's why traditionally you would pick your architecture to help it learn in a certain way (images=cnn, text=rnn/lstm/gru). But the nice thing about transformers is that they are more general.

ly3xqhl8g9 · 3 years ago

Could a "type system" for neural weights be developed? Given a self-driving system, to be able to statically check that the neurons have the "Person" type, the "Don't Run Over Person" type, and so forth. What happens if you "transplant" the weights for ' an' to another network, some kind of transfer learning but componentized, does it still predict as accurately? If neural networks could be assembled from "types" it would be much easier to trust them.

simonh · 3 years ago

The way an LLM decides which word to use next is by evaluating the weightings of all the preceding words with every candidate word to calculate a probability for each of them. So if it selects ‘an’ as the next word, it’s because the weighting connecting ‘an’ to all the preceding words, and their orders in the text and relationships with each other predicted it should have a high probability of occurring.

So you can’t extract the weightings for ‘an’ discretely because those weightings encode its connection with all the other words and combinations and sequences or clusters of words it might ever be used with, including their weightings with other preceding words, and their relationships, etc, etc.

jerpint · 3 years ago

Do you think it would ever be possible to “maximize” a neuron with certain sentences? What’s so different with the gradient ascent techniques with convolutions?

Near work! I’m still confused how it knows to use “an” if it hasn’t chosen the word after it yet?

sharemywin · 3 years ago

you might find this paper interesting:

https://arxiv.org/abs/2202.05262

Locating and Editing Factual Associations in GPT

dpaleka · 3 years ago

That paper (ROME) was the most famous paper in the field last year :)

See also new interesting developments breaking the connection between "Locating" and "Editing":

https://arxiv.org/abs/2301.04213

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Interestingly I feel like humans have this as well, sometimes.

Sometimes if someone is working though a complex thought and they're not really sure where they're going, they'll pause while thinking of the word they want to use, and might sound like

"the discussion is an... an... epistemological one"

Obviously they may have been conscious that the next word was going to start with "epi..." and they are just trying to remember the word, but I think sometimes they really don't consciously know what they're going to say, only unconsciously.

It reminds me of a recent New Yorker article about how people think, where the author realized they often have no idea what they're about to say before they open their mouths. [1]

1. https://www.newyorker.com/magazine/2023/01/16/how-should-we-...

furyofantares · 3 years ago

I never know what I'm about to say, but somehow coherent sentences come out.

I certainly don't know what words a sentence is going to end with when I'm thinking or saying the first words in the sentence. I just think or say the sentence from start to finish, never knowing what the next word is going to be as I'm thinking the current one, and by the end of it I've thought or said a full sentence that makes sense.

FeepingCreature · 3 years ago

I usually know what I'm about to actually say very shortly before I say it. This has occasionally led to emergency course corrections. But I do sense the "shape" of a sentence well before I say it.

I think there's something like a stage design in the brain:

- symbolic or model deliberation

- verbal expression

- vocalization

And each of those stages can be consciously introspected on, but people will naturally develop more or less ability to introspect on it. I think when people say they "don't mentally verbalize", what is actually happening is they just haven't happened to develop conscious introspection of the verbal expression stage. But I'd expect that this can be trained.

(Conversely, sometimes people introspect so much on verbal expression that it becomes an inherent part of the way they think. Brains are weird and wonderful!)

Winsaucerer · 3 years ago

Yep, this is ordinary conversation for most of the time. It's a bit strange to make yourself aware of it, but you have an idea or thought you want to express, and the sentences come out in a semi-automated and coherent fashion.

I think of it a bit like walking. You can think about it, focus on it, control it as you please, but most of the time you just do it without thinking.

justusthane · 3 years ago

This is so weird. I am very “In my head” and have generally planned what I’m going to say before I open my mouth.

It makes me feel like I’m not very good at conversation though, especially smalltalk.

jdougan · 3 years ago

> "the discussion is an... an... epistemological one"

I hear that in the voice of Agent Smith, from The Matrix.

adlpz · 3 years ago

Quite an appropriate reference when talking about neurons on an artificial intelligence :)

Thorentis · 3 years ago

When you say "they have no idea about what they're about to say" you're talking about conscious thought. I think there is a difference between rational thought (thinking by going through a series of logically connected steps) and intuition, where you can arrive at a conclusion or knowledge of some fact or concept or knowing how to do something, without having gone through those conscious steps. Does one count as "thoughts" any less than the other? People are sometimes so quick to dismiss any subconscious thinking as being nothing more than a very complicated computer, but I couldn't disagree more.

gmadsen · 3 years ago

I tend to take the view that thoughts are very similar to sensory input. if you sit in silence for 60 seconds, you literally do not and can not predict what thoughts pop up in consciousness. You can actively focus attention on a thought, but if you try to find the source of the thought, it disapears. Thoughts just appear, just like sounds, sight, etc just appear.

jetsnoc · 3 years ago

Intuition is another cognitive process honed by training and reinforcement of neurons through all sorts of sensories and feedback to our brain. I think parallel models not unlike additional cognitive processes in our brain assisting the NLG will eventually make it more similar to how our cognitive processes actually work.

Disclaimer: Uneducated opinion on my behalf, I'm a hobbyist only.

Well I think that's the point. We know that our minds engage in a large amount of pattern matching/recognition/retrieval. Perhaps this hugely-powerful pattern-matching information-retrieval engine has learned to perform similarly to human minds, being trained exclusively on the output of human minds.

LudwigNagasena · 3 years ago

radus · 3 years ago

The way they're going about this investigation is reminiscent of how we figure things out in biology and neuroscience. Perhaps a biologist won't do the best job at fixing a radio [1], but they might do alright debugging neural networks.

1: https://www.cell.com/cancer-cell/pdf/S1535-6108(02)00133-2.p...

hislaziness · 3 years ago

It is interesting, when I (definitely not a bot) read the headline I thought the grammar was wrong. took me a while to that "an" was not an indefinite article here. In the article headline the first alphabets of each word is capitalised and somehow it was easier for me to understand what the "An" meant here.

Camillo · 3 years ago

The grammar _is_ wrong. It should have been "We found the 'an' neuron in GPT-2". Given the article's contents, it's hard to believe that the authors would make such a mistake; it was probably done deliberately, as clickbait.

QuadmasterXLII · 3 years ago

I'd call it a pun

Rioghasarig · 3 years ago

I think it was deliberate but just as a joke.

sebzim4500 · 3 years ago

Not every attempt to make a humorous/engaging title is 'clickbait'.

rhelz · 3 years ago

Since complex systems are composed of simpler systems, seems like for any sufficiently complex system you'd be able to find subsets of it which are isomorphic which any sufficiently simple system.

phkahler · 3 years ago

N00b to this. How are the neurons outputs read to produce text? They talk about tokens as if a token is a word. But if token==word then every word would have a specific output and there's nothing to see here. So again, how are neuron outputs converted to letters/text?

vore · 3 years ago

It goes something like this:

  input text -> input tokens -> input embeddings -> model -> output embeddings -> output tokens -> output text

Tokens aren't necessarily words: they can be fragments of words and you can check out this behavior here: https://platform.openai.com/tokenizer

For instance, "an eagle" is tokenized to [an][ eagle], but "anoxic" is tokenized to [an][oxic], so just looking for the [an] token is not sufficient. Therefore, you would need to map the output text all the way back into the model to figure out what neuron(s) in the model would generate "an" over "a". Since the bulk of GPT is all unsupervised learning, any connections it makes in its neural network is all emergent.

Further to this, as I understand it the "embedding" is mapping the tokens into vectors in a space where tokens semantically similar will be close together in the space.

So is the output compared to the embedding vectors (via dot product) and the strongest one output its token? How is it "clocked" to get successive tokens?

Surely "an eagle" is [ an][ eagle] since the an starts a new word.