Co-author here! I'm kind of surprised that this made it to the top of HN! This was a project in which Joseph and I tried to reverse-engineer the mechanism by which GPT-2 predicts the word 'an'.
It's crazy that large language models work so well just by being trained as next-word-prediction models over a large amount of text data. We know how image models learn to extract the features of an image through convolution[1], but how and what LLMs learn exactly remains a black box. When we dig deeper into the mechanisms that drive LLMs, we might get closer to understanding why they work so well in some senses, and why they could be catastrophic in other cases (see: the past month of search-based developments).
I find trying to understand and reverse-engineer LLMs to be a personally exciting endeavour. As LLMs get better in the near future, I sure hope our understanding of them can keep up as well!
[1] https://distill.pub/2020/circuits/zoom-in/
This is Wharton professor Ethan Mollick playing with the new Bing chat, which seems considerably more advanced than ChatGPT (based on GPT-4 perhaps?).
Here he asks it to write something using Kurt Vonnegut's rules of writing.
https://twitter.com/emollick/status/1626084142239649792
It seems hard to explain how Bing/GPT could have generated the Vonnegut-inspired cake story, having ingested the rules, without planning the whole thing before generating the first word.
It seems there's an awful lot more going on internally in these models than a mere word-by-word autoregressive generation. It seems the prompt (in this case including Vonnegut's rules) is ingested and creates a complex internal state that is then responsible for the coherency and content of the output. The fact that it necessarily has to generate the output one word at a time seems to be a bit misleading in terms of understanding when the actual "output prediction" takes place.
There is "long range" dependence, it's just only on the prompt: the conversation with the user and the hidden header (e.g. "Answer as ChatGPT, an intelligent AI, state your reasons, be succinct, etc."). That ends up being enough.
Convolution is part of the network design though.
Would a fully connected network learn to perform convolutions? Or would it turn out that convolution is not necessary?
The interesting part here isn't the convolution itself, it's how convolutional layers turn out to act like "filters" or "detectors" for individual features. This is explained very well in the distill.pub article linked by GP.
We know the architecture of LLMs because we created it, but we don't yet have the same level of understanding about them, or the same quality of analytical tools for reasoning about them.
They do and in fact it's relatively straightforward to show empirically on eg MNIST. The problem is that you need a much much larger network in the FCN case and thus need way more data and way more data augmentation to get a good result that isn't overfit to hell.
In the case of a CNN, the reason it works is that an image of an object X is still an image of object X if the X is shifted left or right: the task is translation invariant. CNNs are basically the simplest way to encode translational invariance.
The point of using a CNN instead of an FCN is that you force it to learn in a certain way that prevents overfitting. But given a sufficient dataset and proper data augmentation, you would expect an FCN to be able to identify objects regardless of translation. It's just that a CNN trains more easily and performs better with a smaller network (an FCN emulating convolutions would be very wasteful).
That's why traditionally you would pick your architecture to help it learn in a certain way (images=cnn, text=rnn/lstm/gru). But the nice thing about transformers is that they are more general.
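To make that parameter/data gap concrete, here is a rough back-of-the-envelope sketch in PyTorch (arbitrary made-up layer sizes, 28x28 MNIST-style inputs); the FCN can represent the same function in principle, it just needs far more parameters, and hence far more data and augmentation, to get there:

    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 7 * 7, 10),
    )
    fcn = nn.Sequential(
        nn.Flatten(), nn.Linear(28 * 28, 2048), nn.ReLU(),
        nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 10),
    )
    count = lambda m: sum(p.numel() for p in m.parameters())
    print("CNN params:", count(cnn))   # roughly 20k
    print("FCN params:", count(fcn))   # roughly 5.8M

The weight sharing in the conv layers is also exactly what bakes in the translation equivariance described above.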
Could a "type system" for neural weights be developed? Given a self-driving system, to be able to statically check that the neurons have the "Person" type, the "Don't Run Over Person" type, and so forth. What happens if you "transplant" the weights for ' an' to another network, some kind of transfer learning but componentized, does it still predict as accurately? If neural networks could be assembled from "types" it would be much easier to trust them.
The way an LLM decides which word to use next is by evaluating the weightings of all the preceding words with every candidate word to calculate a probability for each of them. So if it selects ‘an’ as the next word, it’s because the weighting connecting ‘an’ to all the preceding words, and their orders in the text and relationships with each other predicted it should have a high probability of occurring.
So you can’t extract the weightings for ‘an’ discretely because those weightings encode its connection with all the other words and combinations and sequences or clusters of words it might ever be used with, including their weightings with other preceding words, and their relationships, etc, etc.
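As a concrete illustration of that scoring step, here is a minimal sketch using the small GPT-2 checkpoint via Hugging Face transformers (my choice of tooling and a made-up prompt, not the article's exact setup), comparing the probability the model assigns to ' a' versus ' an' as the next token:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prompt = "I climbed up the tree and picked up"   # made-up example prompt
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # scores for every token in the vocab
    probs = torch.softmax(logits, dim=-1)

    for word in [" a", " an"]:                       # note the leading space: ' a'/' an' are single tokens
        token_id = tok.encode(word)[0]
        print(repr(word), probs[token_id].item())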
Do you think it would ever be possible to "maximize" a neuron with certain sentences? What's so different from the gradient ascent techniques used with convolutional networks?
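Something along those lines is possible in principle; the main wrinkle compared to image models is that text inputs are discrete, so you either optimise over continuous input embeddings or search over tokens. Below is a rough sketch of the embedding-space version, with hypothetical layer/neuron indices (not the actual 'an' neuron) and the Hugging Face GPT-2 checkpoint:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    for p in model.parameters():
        p.requires_grad_(False)

    LAYER, NEURON = 5, 123      # hypothetical indices, for illustration only
    acts = {}
    model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(
        lambda mod, inp, out: acts.update(mlp=out))   # pre-activation MLP units ("neurons")

    # start from the embeddings of a real sentence and do gradient ascent on them
    ids = tok("I picked an apple", return_tensors="pt").input_ids
    emb = model.transformer.wte(ids).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=1e-2)

    for step in range(100):
        opt.zero_grad()
        model(inputs_embeds=emb)
        loss = -acts["mlp"][0, -1, NEURON]   # maximise the chosen unit at the last position
        loss.backward()
        opt.step()

Note the optimised embeddings won't correspond to real tokens, so you'd still need to project back to the vocabulary or do a discrete search to read off an actual "maximally activating sentence".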
Interestingly I feel like humans have this as well, sometimes.
Sometimes if someone is working through a complex thought and they're not really sure where they're going, they'll pause while thinking of the word they want to use, and might sound like
"the discussion is an... an... epistemological one"
Obviously they may have been conscious that the next word was going to start with "epi..." and they are just trying to remember the word, but I think sometimes they really don't consciously know what they're going to say, only unconsciously.
It reminds me of a recent New Yorker article about how people think, where the author realized they often have no idea what they're about to say before they open their mouths. [1]
1. https://www.newyorker.com/magazine/2023/01/16/how-should-we-...
I never know what I'm about to say, but somehow coherent sentences come out.
I certainly don't know what words a sentence is going to end with when I'm thinking or saying the first words in the sentence. I just think or say the sentence from start to finish, never knowing what the next word is going to be as I'm thinking the current one, and by the end of it I've thought or said a full sentence that makes sense.
I usually know what I'm about to actually say very shortly before I say it. This has occasionally led to emergency course corrections. But I do sense the "shape" of a sentence well before I say it.
I think there's something like a stage design in the brain:
- symbolic or model deliberation
- verbal expression
- vocalization
And each of those stages can be consciously introspected on, but people will naturally develop more or less ability to introspect on it. I think when people say they "don't mentally verbalize", what is actually happening is they just haven't happened to develop conscious introspection of the verbal expression stage. But I'd expect that this can be trained.
(Conversely, sometimes people introspect so much on verbal expression that it becomes an inherent part of the way they think. Brains are weird and wonderful!)
Yep, this is ordinary conversation for most of the time. It's a bit strange to make yourself aware of it, but you have an idea or thought you want to express, and the sentences come out in a semi-automated and coherent fashion.
I think of it a bit like walking. You can think about it, focus on it, control it as you please, but most of the time you just do it without thinking.
When you say "they have no idea about what they're about to say" you're talking about conscious thought. I think there is a difference between rational thought (thinking by going through a series of logically connected steps) and intuition, where you can arrive at a conclusion or knowledge of some fact or concept or knowing how to do something, without having gone through those conscious steps. Does one count as "thoughts" any less than the other? People are sometimes so quick to dismiss any subconscious thinking as being nothing more than a very complicated computer, but I couldn't disagree more.
I tend to take the view that thoughts are very similar to sensory input. If you sit in silence for 60 seconds, you literally do not and cannot predict what thoughts will pop up in consciousness. You can actively focus attention on a thought, but if you try to find the source of the thought, it disappears. Thoughts just appear, just like sounds, sights, etc just appear.
Intuition is another cognitive process, honed by training and reinforcement of neurons through all sorts of sensory input and feedback to our brain. I think parallel models, not unlike the additional cognitive processes that assist language generation in our brain, will eventually make NLG more similar to how our cognitive processes actually work.
Disclaimer: Uneducated opinion on my behalf, I'm a hobbyist only.
Well I think that's the point. We know that our minds engage in a large amount of pattern matching/recognition/retrieval. Perhaps this hugely-powerful pattern-matching information-retrieval engine has learned to perform similarly to human minds, being trained exclusively on the output of human minds.
It’s notable how successful LLMs are despite the lack of any linguistic tools in their architectures. It would be interesting to know how different a model would be if it operated on eg dependency trees instead of the linear list of tokens. Surely, the question of “a/an” would be solved with ease as the model would be required to come up with a noun token before choosing its determiner. I wonder if the developers of LLMs explored those approaches but found them infeasible due to large preprocessing times, immaturity of such tools and/or little benefit.
I think the lack of explicit linguistic tools is the key to success, forcing/enabling the generic model to learn implicit linguistic tools (there's some research identifying that analysis of specific linguistic phenomena happens at specific places in the NN layers) that work better than what we could implement.
"It would be interesting to know how different a model would be if it operated on eg dependency trees instead of the linear list of tokens." - indeed, this is obviously interesting, so people have tried that a lot for many models, but IMHO it's probably now almost decade since the consensus is that in general end-to-end training (once we became able to do it) work better than adding explicit stages in between, e.g. for any random task I would expect that doing text->syntax tree->outcome is going to get worse results than text->outcome, because even if the task really needs syntax, the syntax representation that a stack of large transformer layers learns implicitly tends to be a better representation of the natural language than any human linguist devised formal grammar, which inevitably has to mangle the actual languge to fit into neat human-analyzable 'boxes'/classes/types in which it doesn't really fit and all the fuzzy edge cases stick out. Once you remove the constraint that the grammar must be simplified enough for a human to be able to rationally analyze and understand, and (perhaps even more importantly?) abandoning the need to prematurely disambiguate the utterance to a single syntax tree instead of accepting that parts of it are ambiguous (but not equally likely), processing works better.
I think there's probably some truth to this. They found that with InstructGPT — where they teach the model to better follow instructions, which was the jump from GPT-3 to ChatGPT — the model also learnt to follow non-English instructions, even though the extra training was done almost exclusively in English[1].
So there seem to be emergent mechanisms in the model that have arisen because of the end-to-end training, which we don't exactly understand yet.
[1] https://twitter.com/janleike/status/1625207251630960640
First off, we know that the concept of language in humans is, overall, an emergent phenomenon. It developed through natural selection from simple components, so there's validity to the idea that the same thing can occur in an LLM, where some overarching emergent structure develops from simple primitives.
At the same time, we do know that a sort of universal grammar exists among humans. Our language capacity is biased in a certain way, and it is unlikely to learn languages with a grammar that is very extreme and divergent from the universal one discovered by Noam Chomsky. That means our brain is unlikely to be as universally simple as an LLM.
I think the key here is that the human mind has explicit linguistic tools, but these tools are still emergent in nature.
Approaches such as you describe have been the dominant method for decades. That we finally 'cracked' natural language generation with tools that literally encode nothing about grammar ahead of time is one hell of a lesson, early days as it is in the learning of it.
Reminds me of Stephen Krashen's input hypothesis of second-language acquisition. Krashen argues that consciously studying grammar is more or less useless, and only massive exposure to the language results in acquisition.[1] This is true in my experience.
[1] https://en.wikipedia.org/wiki/Input_hypothesis
Several papers have explored emergent linguistic structure in LMs. Here's some early introductory work in this space. Despite not having explicit syntax parses etc as input, models seem to learn something like syntax.
https://arxiv.org/abs/1905.05950
https://aclanthology.org/N19-1419/
https://arxiv.org/abs/1906.04341
Grammar as we know it was devised for Latin, and linguists spend most of their time attempting to fit other languages into neat boxes that Latin grammar wasn't designed for. This of course leads to absurdity. Chomsky attempted to solve this problem with his universal grammar, but that too stops working quickly once you get outside of European languages. That is, ignoring linguistic tools is one of the reasons GPT is successful.
> that too stops working quickly once you get outside of European languages
I don’t think this is entirely fair. Generative grammars have been produced for a huge variety of non-European languages, even non-Indo-European languages, and can account for tremendous diversity in linguistic rules. Even languages without fixed word orders or highly synthetic languages can be represented.
Linguistics isn’t focused on the problem of outputting reasonable-sounding text responses. Instead, it seeks to transparently explain how language works and is structured, something that GPT does not do.
> We started out with the question: How does GPT-2 know when to use the word an over a? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 is only capable of predicting one word at a time. We still don’t have a full answer...
I'm not sure I understand why this is an open question. While I get that GPT-2 is predicting only one word at a time, it doesn't seem that surprising that there might be cases where there is a dominant bigram (ie "an apple" in the case of their example prompt) that would trigger an "an" prediction, without actually predicting the following word first.
Am I missing something?
Yeah, my feeling here was that it's sort of tautological: if GPT predicts "a", then it must then predict a word that would follow "a" and not require "an" (and vice versa). And if you think about it from the opposite direction: if it's working out a response that is eventually going to have "apple" in it, then all the data it's trained on is going to cause it to predict "an" even before it needs to predict "apple".
(Admittedly, all this ML/AI stuff is still beyond my current level of understanding, so I'm sure my thinking here is off.)
I think that's essentially right - if it's already "thinking about" apples, then ' an' will be significantly upregulated at the expense of ' a', then if it does choose to output an ' an' token, ' apple' becomes a highly likely followup.
You could probably test this by seeing if prompts containing a lot of nouns that start with a vowel sound result in output that contains a higher proportion of otherwise unrelated vowel-initial nouns. (ie your prompt includes lots of apples, apricots, avocados, asparagus, aubergines, elderberries, eggplants, endives, oranges, olives, okras and onions, and you count the proportion of non-food nouns in the result that start with a vowel sound versus a non-vowel sound).
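A very rough version of that experiment, as a sketch only (small GPT-2 checkpoint via Hugging Face transformers; it counts vowel-initial words as a crude proxy rather than doing real noun tagging):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prompt = ("The market sold apples, apricots, avocados, aubergines, "
              "oranges, olives and onions. Later that day I saw")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=60, do_sample=True, top_k=50,
                         pad_token_id=tok.eos_token_id)
    text = tok.decode(out[0, ids.shape[1]:])

    # crude proxy: fraction of generated words (not just nouns) starting with a vowel letter
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    vowel_frac = sum(w[:1] in "aeiou" for w in words) / max(len(words), 1)
    print(text)
    print("fraction of words starting with a vowel:", vowel_frac)

You'd want to compare against a control prompt full of consonant-initial nouns, and ideally tag nouns properly, before reading anything into the numbers.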
The main issue is that GPT is fundamentally an autoregressive language model — at any given time it is only predicting the single next token based on the prompt. Every time it wants to predict the next word, it adds the previously predicted word to the prompt, repeating the cycle. We can intuitively guess that the model is 'working out a response that is eventually going to have "apple" in it', but we don't actually know how the model 'thinks' ahead about its response.
To rephrase that for this case: what is the specific mechanism in GPT-2 that (1) makes it realise that the word 'apple' is significant in this prompt, and (2) uses that knowledge to push the model to predict 'an'? Finding this neuron would only answer some portion of (2).
(And to rephrase this for the general case, which gives us the initial question: How does GPT-2 know when, given a suitable context, to predict 'an' over 'a'?)
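For anyone unfamiliar with what that cycle looks like mechanically, here is a minimal greedy-decoding sketch (Hugging Face transformers and the small GPT-2 checkpoint, as an illustration of autoregressive generation in general rather than of anything specific to the article):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("I climbed up the tree and picked up", return_tensors="pt").input_ids
    for _ in range(5):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]            # scores for the next token only
        next_id = torch.argmax(logits).reshape(1, 1)     # greedy: take the single most likely token
        ids = torch.cat([ids, next_id], dim=1)           # append it to the prompt and repeat
    print(tok.decode(ids[0]))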
Author here! I think this is reasonable but I have two responses.
1. It's kinda interesting because this is a clear case where the model must be thinking beyond the next token, whereas in most contexts it's hard to say whether the model thinks ahead at all (although I would guess that it does most of the time).
2. More importantly, the key question here is how it works. We're not surprised that it has this behavior, but we want to understand which exact weights and biases in the network are responsible.
Note also that this is just the introductory sentence and the rest of the article would read exactly the same without it.
> it doesn't seem that surprising that there might be cases where there is a dominant bigram [...] that would trigger an "an" prediction, without actually predicting the following word first
btw I don't really understand what you mean by this. Bigrams can explain the second prediction in a two word pair but not the first.
I think what the parent was trying to communicate (and what I'm thinking as well) is doubting your premise in 1. ("the model must be thinking beyond the next token").
Rephrase "The model is good at picking the correct article for the word it wants to output next" to "After having picked a specific article, the model is good at picking a follow-up noun that matches the chosen article". Nothing about the second statement seems like an unlikely feat for a model that only predicts one word at a time without any thinking ahead about specific words.
Thanks for responding, and for the cool article! Agree this is pretty tangential to the rest of the article.
So (again, I am very much not an expert, so please correct me if I'm wrong) I guess my analogy would be: what if, instead of predicting one word at a time, it predicted one letter at a time? At some point, there would only be one word that could fit. So if prompted with "SENTE", and it returned "N", that doesn't mean that it's thinking ahead to the "CE" / knows that it is spelling "SENTENCE" already.
Is that a correct way to think of it?
> It's kinda interesting because this is a clear case where the model must be thinking beyond the next token
I don't see how you get to this conclusion. From all the training data it has seen, "an" is the most probable next word after "I climbed up the tree and picked up". The network does not need to know anything about the apple at this point.
Then, the next word is "apple" (with an even higher probability I guess).
I don't understand why your conclusion is that "the model must be thinking beyond the next token": the model doesn't need to do that to generate a well-formed sentence because it's not constrained by the size of the sentence.
As a test, during text generation could you change an “a” to an “an” and see if it changes the noun. Or did it already have a noun in mind and it sticks with that.
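That test is easy to run against the small open GPT-2 checkpoint (a sketch, not something from the article): append the article yourself and see which noun the model then prefers.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prompt = "I climbed up the tree and picked up"   # made-up example prompt
    for article in [" a", " an"]:                    # force the article ourselves
        ids = tok(prompt + article, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=3, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        print(repr(article), "->", repr(tok.decode(out[0, ids.shape[1]:])))

If swapping the article changes the noun, the article constrained the continuation rather than the other way round.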
I don't think you are missing something. I think this whole "GPT-2 is predicting only one word at a time" is a red herring anyway.
Of course it can only answer with the next word, because there is only room in its outputs for the next word. But it has to compute much more. It has a huge hidden internal state where it first has to encode what the given sentence is about, then predict some general concept in which the continuation goes, then decide the locally correct syntactical structure, and only from this can it predict the next word.
From my naïve point of view it seems obvious that at any point where both an 'a' and an 'an' would fit, the model randomly selects one of them and by doing so reduces the set of possible nouns that can follow.
My friend, who is not a native English speaker, told me that one of the things he struggled with most while learning English was "a" vs "an". He couldn't grasp how it is possible that a person knows which determiner to use before saying the word, until he learned it as just part of the word; now he uses whichever one "feels" right in context. So when he sees an apple, he says this is an apple.
And of course it is not unheard of for a native English speaker to correct themselves if they change their mind about which noun to use. You can imagine being asked to very rapidly verbally classify fruits ("it's an apple, it's an apple, it's a pear, it's an apple,..") that you might well find yourself stumbling like "it's a.. an apple".
The rule is the rule because that’s what makes sense phonetically. It’s why you would say “el agua” and “el hacha” in Spanish even though those articles don’t match the gender of those words.
Not to mention that the corpus mostly will have the correct case for most common words like apple. Train it with essays of ESL students and you'll get something else.
I think this would be a valid objection if they stopped there.
But then: “ Testing the neuron on a larger dataset”
If I follow correctly, they test a bunch of different completions that contain “an”. So they are not just detecting the bigram “an apple”, but the common activation among a bunch of “an X” activations where X is completely different.
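For anyone who wants to poke at this themselves, the general recipe looks roughly like the sketch below: hook an MLP unit and record its activation at the positions right before an ' an' token across many sentences. The layer/neuron indices and sentences here are hypothetical placeholders; see the article and its code for the real ones.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    LAYER, NEURON = 5, 123      # hypothetical indices, for illustration only
    acts = {}
    model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(
        lambda mod, inp, out: acts.update(mlp=out))

    sentences = [               # toy examples, not the article's dataset; the last one is an 'a'-only control
        "He gave me an orange and an umbrella.",
        "She adopted an owl from an animal shelter.",
        "We stayed at a hotel near a lake.",
    ]
    an_id = tok.encode(" an")[0]
    for s in sentences:
        ids = tok(s, return_tensors="pt").input_ids
        with torch.no_grad():
            model(ids)
        # positions whose *next* token is ' an' -- where an 'an-predicting' unit should fire
        pos = (ids[0, 1:] == an_id).nonzero().flatten()
        print(s, [round(float(acts["mlp"][0, p, NEURON]), 3) for p in pos])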
The way they're going about this investigation is reminiscent of how we figure things out in biology and neuroscience. Perhaps a biologist won't do the best job at fixing a radio [1], but they might do alright debugging neural networks.
1: https://www.cell.com/cancer-cell/pdf/S1535-6108(02)00133-2.p...
It is interesting: when I (definitely not a bot) read the headline, I thought the grammar was wrong. It took me a while to realise that "an" was not an indefinite article here. In the article headline the first letter of each word is capitalised, and somehow that made it easier for me to understand what the "An" meant.
The grammar _is_ wrong. It should have been "We found the 'an' neuron in GPT-2". Given the article's contents, it's hard to believe that the authors would make such a mistake; it was probably done deliberately, as clickbait.
Since complex systems are composed of simpler systems, it seems like for any sufficiently complex system you'd be able to find subsets of it which are isomorphic to any sufficiently simple system.
N00b to this. How are the neurons' outputs read to produce text? They talk about tokens as if a token is a word. But if token==word, then every word would have a specific output and there's nothing to see here. So again, how are neuron outputs converted to letters/text?
For instance, "an eagle" is tokenized to [an][ eagle], but "anoxic" is tokenized to [an][oxic], so just looking for the [an] token is not sufficient. Therefore, you would need to map the output text all the way back into the model to figure out what neuron(s) in the model would generate "an" over "a". Since the bulk of GPT is all unsupervised learning, any connections it makes in its neural network is all emergent.
Further to this, as I understand it the "embedding" maps the tokens into vectors in a space where semantically similar tokens end up close together.
So is the output compared to the embedding vectors (via dot product) and the strongest one output its token? How is it "clocked" to get successive tokens?
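As I understand it, roughly yes, at least for GPT-2: the final hidden state is multiplied against the (tied) token-embedding matrix, giving one score per vocabulary token, and the next token is sampled or argmaxed from the softmax of those scores. The "clock" is just the loop that appends the chosen token and runs the model again. A minimal sketch with the Hugging Face GPT-2 checkpoint:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("I climbed up the tree and picked up", return_tensors="pt").input_ids
    with torch.no_grad():
        h = model.transformer(ids).last_hidden_state[0, -1]   # final hidden state at the last position
        # dot product with every token's embedding (GPT-2 ties input and output embeddings)
        logits = h @ model.transformer.wte.weight.T            # one score per vocabulary token
    next_id = int(torch.argmax(logits))                        # greedy choice; sampling is also common
    print(repr(tok.decode([next_id])))
    # to get successive tokens, append next_id to `ids` and run the model again (see the loop upthread)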
https://arxiv.org/abs/2202.05262
Locating and Editing Factual Associations in GPT
See also new interesting developments breaking the connection between "Locating" and "Editing":
https://arxiv.org/abs/2301.04213
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
It makes me feel like I’m not very good at conversation though, especially smalltalk.
I hear that in the voice of Agent Smith, from The Matrix.
"Every time I fire a linguist, the performance of the speech recognizer goes up". - Frederick Jelinek
Probability and observation are all that is required to understand a language.
Depends on exactly what you mean by bigram.
Obviously any romance language includes gender, so you need to use the correct gendered article before the noun.
Japanese has different counting words depending on what you're counting.
I don't know if Mandarin has anything similar.