"... our technique works poorly for larger models, possibly because later layers are harder to explain."
And even for GPT-2, which is what they used for the paper:
"... the vast majority of our explanations score poorly ..."
Which is to say, we still have no clue as to what's going on inside GPT-4 or even GPT-3, which I think is the question many want an answer to. This may be the first step towards that, but as they also note, the technique is already very computationally intensive, and the focus on individual neurons as a function of input means that they can't "reverse engineer" larger structures composed of multiple neurons nor a neuron that has multiple roles; I would expect the former in particular to be much more common in larger models, which is perhaps why they're harder to analyze in this manner.
Funny that we never quite understood how intelligence worked and yet it appears that we're pretty damn close to recreating it - still without knowing how it works.
I wonder how often this happens in the universe...
Imitation -> emulation -> duplication -> revolution is a very common pattern in nature, society, and business. Aka “fake it til you make it”.
Think of business / artistic / cultural leaders nurturing protégés despite not totally understanding why they’re successful.
Of course those protégés have agency and drive, so maybe not a perfect analogy. But I’m going to stand by the point intuitively even if a better example escapes me.
Yep, we don't know all the constituents of buttermilk, nor how bread stales (there's too much going on inside). But that doesn't prevent us from judging their usefulness.
I feel like OAI's approach is kind of wrong. GPT4 is still just text transformation/completion with multi-headed attention for better prediction of the next word that should follow (versus only looking at the previous word).
In human brains, language is only a way to communicate thoughts in concept form, though we also seem to use language to communicate abstract thoughts to ourselves to break them apart/down in a way (imo).
I'd love to see someone train a model on the level of GPT4 to generate abstract thoughts/ideas based on input/context and then pair this model with GPT4 co-operatively and continue to train, such that the flow of abstract ideas is parsed by GPT. But like... how do you even train a model that operates on abstract ideas? There doesn't seem to be any way to do this.
We are probably creating something that looks like our intelligence but it works in a different way.
An example: we are not very good at recreating flight as birds do it, the kind humans always regarded as flight, and yet we fly across half the globe in one day.
Going up three meters and landing on a branch is a different matter.
Not that weird if you think about it: our intelligence, simultaneously measly and amazing as it is, was the product of trial, error, and sheer dumb luck. We could think of ourselves as monkeys with typewriters; eventually we'll get it right.
so why not have them decode sequential dense vectors of their own activations?
As for the majority scoring poorly, they suggest that most neurons won't have clear activation semantics so that is intrinsic to the task and you'd have to move to "decoding the semantics of neurons that fire as a group"
I don't think this is showing LLMs performing decoding. They're just using the LLM to propose possible words. The decoding is done by using another model to score how well a proposed word matches brain activity, and using that score to select a most likely sequence given the proposals from the LLM.
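The loop described above can be sketched roughly like this. This is a hedged toy sketch: `propose_next_words` and `brain_match_score` are hypothetical stand-ins for the real LLM and the real brain-encoding model, and the dummy scoring exists only so the code runs.

```python
def propose_next_words(prefix):
    # Stand-in for the LLM: returns candidate continuations for a prefix.
    return {"the": 0.5, "a": 0.3, "an": 0.2}

def brain_match_score(sequence):
    # Stand-in for the encoding model: scores how well a word sequence
    # predicts the observed brain activity (higher is better).
    # Dummy scoring for illustration only.
    return -len(" ".join(sequence))

def decode(steps=3, beam_width=2):
    """Beam search: the LLM proposes words, the brain model ranks them."""
    beams = [[]]
    for _ in range(steps):
        candidates = [beam + [w] for beam in beams
                      for w in propose_next_words(beam)]
        # Keep only the sequences that best match the brain recording.
        candidates.sort(key=brain_match_score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

The point is that the LLM never "decodes" anything itself; it only supplies plausible continuations, and the selection pressure comes entirely from the brain-activity score.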
There is no evidence that intelligence runs on neurons. Yes, there are neurons in brains, but there's also lots of other stuff in there too. And there are creatures that exhibit intelligent properties even though they have hardly any neurons at all. (An individual ant has only something like 250,000 neurons, and yet ants are the only creatures besides humans that have managed to create a civilization.)
I suspect that there's a sweet spot that combines a collection of several "neurons" and a human-readable explanation given a certain kind of prompt. However, this "three-body problem" will probably need some serious analytical capability to understand at scale
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
Yes, but they are from OpenAI, so they can just write papers that say whatever they want without minding the metrics, and then pretend it is some kind of science.
> Which is to say, we still have no clue as to what's going on inside GPT-4 or even GPT-3, which I think is the question many want an answer to.
Exactly. Especially:
> ...the technique is already very computationally intensive, and the focus on individual neurons as a function of input means that they can't "reverse engineer" larger structures composed of multiple neurons nor a neuron that has multiple roles;
This paper brings us no closer to explainability in black-box neural networks; it is just another piece by OpenAI meant to placate concerns about explainability, a problem that has gone unsolved in neural networks for decades.
It is also the reason why they cannot be trusted in the most serious applications, where decision making requires real transparency rather than a model confidently regurgitating nonsense.
> It is also the reason why they cannot be trusted in the most serious applications, where decision making requires real transparency rather than a model confidently regurgitating nonsense.
Like say, in court to detect if someone is lying? Or at an airport to detect drugs?
Is it really fair to say this brings us “no closer” to explainability?
This seems like a novel approach to try to tackle the scale of the problem. Just because the earliest results aren’t great doesn’t mean it’s not a fruitful path to travel.
> It is also the reason why they cannot be trusted in the most serious applications, where decision making requires real transparency rather than a model confidently regurgitating nonsense.
Doesn't this criticism also apply to people to some extent? We don't know what the purpose of individual brain neurons is.
I built a toy neural network that runs in the browser[1] to model 2D functions with the goal of doing something similar to this research (in a much more limited manner, ofc). Since the input space is so much more limited than language models or similar, it's possible to examine the outputs for each neuron for all possible inputs, and in a continuous manner.
In some cases, you can clearly see neurons that specialize to different areas of the function being modeled, like this one: https://i.ameo.link/b0p.png
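For a sense of how that exhaustive per-neuron inspection works when the input space is only 2D, here is a minimal sketch. This is not the linked demo's actual code; the network and weights are random and purely illustrative.

```python
import numpy as np

# A tiny randomly-initialized MLP hidden layer on 2D inputs; the weights
# here are illustrative, not taken from the linked browser demo.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)

def hidden_activations(xy):
    """ReLU activations of the hidden layer for a batch of 2D points."""
    return np.maximum(0.0, xy @ W1 + b1)

# Because the input space is only 2D, one neuron can be evaluated over a
# dense grid covering essentially all inputs of interest -- something that
# is impossible for a language model's input space.
xs, ys = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
neuron_3_map = hidden_activations(grid)[:, 3].reshape(50, 50)
# neuron_3_map can now be rendered as a heatmap to see where "neuron 3"
# specializes within the input plane.
```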
This OpenAI research seems to be feeding lots of varied input text into the models they're examining and keeping track of the activations of different neurons along the way. Another method I remember seeing used in the past involves using an optimizer to generate inputs that maximally activate particular neurons in vision models[2].
I'm sure that's much more difficult or even impossible for transformers which operate on sequences of tokens/embeddings rather than single static input vectors, but maybe there's a way to generate input embeddings and then use some method to convert them back into tokens.
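The optimization-based method from [2] looks roughly like this in miniature. This is a hedged toy version: the "model" is a single tanh unit rather than a real vision network, which is the only reason the gradient can be written analytically here.

```python
import numpy as np

# Toy activation maximization: gradient ascent on the *input* to find a
# pattern that maximally excites one neuron. Real feature-visualization
# work does this through a full vision model with autodiff.
rng = np.random.default_rng(1)
w = rng.normal(size=16)

def activation(x):
    return np.tanh(w @ x)

x = rng.normal(size=16) * 0.01  # start from a small random input
for _ in range(200):
    # d/dx tanh(w.x) = (1 - tanh^2(w.x)) * w
    grad = (1.0 - np.tanh(w @ x) ** 2) * w
    x += 0.1 * grad
    x /= max(1.0, np.linalg.norm(x))  # keep the input on a bounded ball

# x now approximates the unit-norm input that maximally activates the
# neuron, which for this single linear unit is simply w / ||w||.
```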
I'd be curious to see Softmax Linear Units [1] integrated into the possible activation functions since they seem to improve interpretability.
PS: I share your curiosity with respect to things like deep dream. My brief summary of this paper is that you can use GPT4 to summarize what's similar about a set of highlighted words in context which is clever but doesn't fundamentally inform much that we didn't already know about how these models work. I wonder if there's some diffusion based approach that could be used to diffuse from noise in the residual stream towards a maximized activation at a particular point.
"This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations."
On first look this is genius but it seems pretty tautological in a way. How do we know if the explainer is good?... Kinda leads to thinking about who watches the watchers...
The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation. They ask GPT-4 to guess neuron activation given an explanation and an input (the paper includes the full prompt used). And then they calculate correlation of actual neuron activation and simulated neuron activation.
They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to simulate humans, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanation, and showing that humans agree more with an explanation that scores high in correlation.
Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the neural network with activation of the neuron replaced with simulated activation, and show that the neural network output is closer (measured in Jensen-Shannon divergence) if correlation is higher.
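The correlation part of that scoring can be sketched in a few lines. The toy arrays below stand in for real activations and for the activations GPT-4 simulates from the explanation.

```python
import numpy as np

# Sketch of the scoring step: an explanation is scored by how well
# simulated activations (what GPT-4 guesses the neuron does on each token,
# given only the explanation) correlate with the real activations.
# `simulated` would come from prompting GPT-4; here both are toy arrays.
actual = np.array([0.0, 0.1, 0.9, 0.0, 0.8, 0.05])
simulated = np.array([0.1, 0.0, 1.0, 0.1, 0.7, 0.0])

def explanation_score(actual, simulated):
    """Pearson correlation between real and simulated activations."""
    return float(np.corrcoef(actual, simulated)[0, 1])

score = explanation_score(actual, simulated)
# A score near 1.0 means the explanation lets you reproduce the neuron's
# behavior; the paper treats scores of at least 0.8 as "well-explained".
```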
> The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation.
To be clear, this is only neuron activation strength for text inputs. We aren't doing any mechanistic modeling of whether our explanation of what the neuron does predicts any role the neuron might play within the internals of the network, despite most neurons likely having a role that can only be succinctly summarized in relation to the rest of the network.
It seems very easy to end up with explanations that correlate well with a neuron, but do not actually meaningfully explain what the neuron is doing.
Why is this genius? It's just the NN equivalent of making a new programming language and getting it to the point where its compiler can be written in itself.
The reliability question is of course the main issue. If you don't know how the system works, you can't assign a trust value to anything it comes up with, even if it seems like what it comes up with makes sense.
I love the epistemology related discussions AI inevitably surfaces. How can we know anything that isn't empirically evident and all that.
It seems NN output could be trusted in scenarios where a test exists. For example: "ChatGPT design a house using [APP] and make sure the compiled plans comply with structural/electrical/design/etc codes for area [X]".
But how is any information that isn't testable trusted? I'm open to the idea ChatGPT is as credible as experts in the dismal sciences given that information cannot be proven or falsified and legitimacy is assigned by stringing together words that "makes sense".
There is a longer-term problem of trusting the explainer system, but in the near term that isn't really a concern.
The bigger value here in the near term is _explicability_ rather than alignment per se. Potentially, good explicability might provide insights into the design and architecture of LLMs in general, and that in turn may enable better design of alignment schemes.
It doesn't have to lag, though. You could ask gpt-2 to explain gpt-2. The weights are just input data. The reason this wasn't done on gpt-3 or gpt-4 is just because a) they're much bigger, and b) they're deeper, so the roles of individual neurons are more attenuated.
I had similar thoughts about the general concept of using AI to automate AI Safety.
I really like their approach and I think it’s valuable. And in this particular case, they do have a way to score the explainer model.
And I think it could be very valuable for various AI Safety issues.
However, I don’t yet see how it can help with the potentially biggest danger where a super intelligent AGI is created that is not aligned with humans.
The newly created AGI might be 10x more intelligent than the explainer model, to such an extent that the explainer model is not capable of understanding any tactics deployed by the super intelligent AGI. The same way ants are most probably not capable of explaining the tactics deployed by humans, even if we gave them 100 years to figure it out.
You're correct to have a suspicion here. Hypothetically the explainer could omit a neuron or give a wrong explanation for the role of a neuron.
Imagine you're trying to understand a neural network, and you spend an enormous amount of time generating hypotheses and validating them.
Well, if the explainer gives you 90% correct hypotheses, that means you have roughly a tenth of the work left in producing hypotheses yourself.
So if you have a solid way of testing an explanation, even if the explainer is evil, it's still useful.
Using 'im feeling lucky' from the neuron viewer is a really cool way to explore different neurons. And then being able to navigate up and down through the net to related neurons.
> We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
Aww, that's so nice of them to let the community do the work they can use for free. I might even forget that most of OpenAI is closed source.
LLMs are quickly going to be able to start explaining their own thought processes better than any human can explain their own. I wonder how many new words we will come up with to describe concepts (or "node-activating clusters of meaning") that the AI finds salient that we don't yet have a singular word for. Or, for that matter, how many of those concepts we will find meaningful at all. What will this teach us about ourselves?
First of all, our own explanations about ourselves and our behaviour are mostly lies, fabrications, hallucinations, faulty re-memorization, post hoc reasoning:
"In one well-known experiment, a split-brain patient’s left hemisphere was shown a picture of a chicken claw and his right hemisphere was shown a picture of a snow scene. The patient was asked to point to a card that was associated with the picture he just saw. With his left hand (controlled by his right hemisphere) he selected a shovel, which matched the snow scene. With his right hand (controlled by his left hemisphere) he selected a chicken, which matched the chicken claw. Next, the experimenter asked the patient why he selected each item. One would expect the speaking left hemisphere to explain why it chose the chicken but not why it chose the shovel, since the left hemisphere did not have access to information about the snow scene. Instead, the patient’s speaking left hemisphere replied, “Oh, that’s simple. The chicken claw goes with the chicken and you need a shovel to clean out the chicken shed”" [1]. Also [2] has an interesting hypothesis on split-brains: not two agents, but two streams of perception.
I'm not understanding the connection between your paragraphs here even after reading the first article.
Even if you accept classic theory (e.g. hemispheric localization and the homunculus), which most experts don't, all this suggests is that the brain tries to make sense of the information it has, and in sparse environments it fills in the gaps.
How does this make our behavior "mostly lies, fabrications, hallucinations, faulty re-memorization, post hoc reasoning", given that most humans don't have a severed corpus callosum?
The discussion starts with:
"In a healthy human brain, these divergent hemispheric tendencies complement each other and create a balanced and flexible reasoning system. Working in unison, the left and right hemispheres can create inferences that have explanatory power and both internal and external consistency."
> But, the evidence discounting the left/right brain concept is accumulating. According to a 2013 study from the University of Utah, brain scans demonstrate that activity is similar on both sides of the brain regardless of one's personality.
> They looked at the brain scans of more than 1,000 young people between the ages of 7 and 29 and divided different areas of the brain into 7,000 regions to determine whether one side of the brain was more active or connected than the other side. No evidence of "sidedness" was found. The authors concluded that the notion of some people being more left-brained or right-brained is more a figure of speech than an anatomically accurate description.
"LLMs are quickly going to be able to start explaining their own thought processes better than any human can explain their own."
There is no "their" and there is no "thought process". There is something that produces text that appears to humans as if something like thought were going on (cf. the ELIZA effect), but we must be wary of this anthropomorphising language.
There is no self reflection, but if you ask an LLM program how "it" knows something it will produce some text.
> There is no self reflection, but if you ask an LLM program how "it" knows something it will produce some text.
To be clear, you're saying that we should just dismiss out-of-hand any possibility that an LM AI might actually be able to explain its reasoning step-by-step?
I find it kind of charming actually how so many humans are just so darn sure that they have their own special kind of cognition that could never be replicated. Not even with 175,000,000,000 calculations for every word generated.
As we don't know for sure what is happening 100% within a neural network, we can say we don't believe that they're thinking, though we would still need to define the word "thinking". Once LLMs can self-modify, the word "thinking" will be more accurate than it is today.
And when Hinton says at MIT, "I find it very hard to believe that they don't have semantics when they consult problems like you know how I paint the rooms how I get all the rooms in my house to be painted white in two years time," I believe he's commenting on the ability of LLM's to think on some level.
Very true. In my opinion, if there were a way to extract "Semantic Clouds of Words", i.e. given a particular topic, navigate the semantic cloud word by word, find some close neighbours of a word, jump to a neighbour of that word, and so on, then LLMs might not seem that big of a deal.
I think LLMs are "Semantic Clouds of Words" plus a grammar-and-syntax generator. Someone could just discard the grammar-and-syntax generator, use only the semantic cloud, and create the grammar and syntax themselves.
For example, in writing a legal document, a person slightly educated on the subject could just put the relevant words onto an empty page and fill in the blanks of syntax and grammar themselves, applying human reasoning, which is far superior to any machine reasoning, today at least.
The process of editing GPT-generated documents to fix the reasoning is not a negligible task anyway. Sam Altman mentioned that "the machine has some kind of reasoning", not a human reasoning ability by any means.
My point is that LLMs are two programs fused into one, "word clouds" and "syntax and grammar", sprinkled with some kind of poor reasoning. Their word-clouding ability is so unbelievably stronger than any human's that it fills me with awe every time I use it. Everything else is just whatever!
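A minimal sketch of that neighbour-hopping idea: the 4-d vectors below are made up purely for illustration; a real version would use learned word embeddings.

```python
import numpy as np

# Toy "semantic cloud": hop from a word to its nearest neighbour in
# embedding space. Vectors are invented for this example.
embeddings = {
    "contract":  np.array([0.9, 0.1, 0.0, 0.2]),
    "clause":    np.array([0.8, 0.2, 0.1, 0.3]),
    "liability": np.array([0.7, 0.3, 0.1, 0.1]),
    "banana":    np.array([0.0, 0.9, 0.8, 0.1]),
}

def nearest_neighbour(word):
    """Most cosine-similar word in the cloud, excluding the word itself."""
    v = embeddings[word]
    def cos(u):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in embeddings if w != word),
               key=lambda w: cos(embeddings[w]))

# Hopping neighbour-to-neighbour stays inside the "legal" cloud and never
# drifts to the unrelated word:
path = ["contract"]
for _ in range(2):
    path.append(nearest_neighbour(path[-1]))
```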
The text output of an LLM is the thought process. In this context the main difference between humans and LLMs is that LLMs can't have internalized thoughts. There are of course other differences too, like the fact that humans have a wider gamut of input: visuals, sound, input from other bodily functions. And the fact that we have live training.
This is it. This comprehension of the chats is a symptom of something like linguistic pareidolia: a face projected onto a string of probabilistic accidents, plus wishful thinking.
What if you ask it to emit the reflexive output, then feed that reflexive output back into the LLM for the conscious answer?
What if you ask it to synthesize multiple internal streams of thought, for an ensemble of interior monologues, then have all those argue with each other using logic and then present a high level answer from that panoply of answers?
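A rough sketch of such a loop; `ask_llm` is a hypothetical stand-in, not a real API call.

```python
# Sketch of the "reflect, then answer" loop proposed above.

def ask_llm(prompt):
    # Stand-in: a real implementation would call an LLM API here.
    return f"[model response to: {prompt[:40]}...]"

def answer_with_reflection(question, n_streams=3):
    # 1. Generate several independent internal "streams of thought".
    streams = [ask_llm(f"Think step by step about: {question}")
               for _ in range(n_streams)]
    # 2. Have the streams argue: feed them all back in for critique.
    debate = ask_llm("Critique these draft answers against each other:\n"
                     + "\n".join(streams))
    # 3. Synthesize one high-level answer from the debate.
    return ask_llm(f"Given this debate:\n{debate}\n"
                   f"Give a final answer to: {question}")
```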
So is the word "word" but that seems to have worked out OK so far. I can explain the meaning of "meaning" and that seems to work OK too. Being self-referential sounds a lot more like a feature than a bug. Given that the neurons in our own heads are connected to each other and not any ground truth, I think LLMs should do just fine.
That's probably one of the reasons why you'd use GPT-4 to explain GPT-2.
Of course, if you were trying to use GPT-4 to explain GPT-4 then I think the Gödel incompleteness theorem would be more relevant, and even then I'm not so sure.
"Cannot be inferred from its training set" is a pretty difficult hurdle. Human beings can infer patterns that aren't there, and we typically call those hallucinations or even psychoses. On the other hand, some unconfirmed, novel patterns that humans infer actually represent groundbreaking discoveries, like for example much of the work of Ramanujan.
In a real sense, all of the future discoveries of mathematics already exist in the "training set" of our present understanding, we just haven't thought it all the way through yet. If we discover something new, can we say that the concept didn't exist, or that it "couldn't be inferred" from previous work?
I think the same would apply to LLMs and their understanding of the way we encode information using language. Given their radically different approach to understanding the same medium, they are well poised to both confirm many things we understand intuitively as well as expose the shortcomings of our human-centric model of understanding.
I'm really curious what kind of concept you might have in mind. Can you give any example of a concept that if an LLM developed that concept then it would meet your criteria? It might sound like a sarcastic question but it's hard to agree on the meanings of "concepts that do not exist" or "concepts that cannot be inferred" maybe you can give some examples.
EDIT: I see below you gave some examples, like invention of language before it existed, and new theorems in math that presumably would be of interest to mathematicians. Those ones are fair enough in my opinion. The AI isn't quite good enough for those ones I think, but I also think newer versions trained with only more CPU/GPU and more parameters and more data could be 'AI scientists' that will make these kinds of concepts.
I’m sure LLMs are quickly going to learn to hallucinate (or let’s use the proper word for what they’re doing: confabulate) plausible-sounding but nonsense explanations of their thought processes at least as well as humans.
Based on my skimming the paper, am I correct in understanding that they came up with an elaborate collection of prompts that embed the text generated by GPT-2 as well as a representation of GPT-2's internal state? Then, in effect, they simply asked GPT-4, "What do you think about all this?"
If so, they're acting on a gigantic assumption that GPT-4 actually correctly encodes a reasonable model of the body of knowledge that went into the development of LLMs.
Why does it have to understand how the LLMs are built? They have used GPT-4 to just build a classifier for each neuron's activation, and given the NLP abilities of GPT-4, the hope is that it can describe the nature of the activation of the neurons.
After GPT-4 generates the hypothesis for a neuron they test it by comparing GPT-4's expectation for where the neuron should fire against where it actually fires.
To me the value here is not that GPT4 has some special insight into explaining the behavior of GPT2 neurons (they say it's comparable to "human contractors" - but human performance on this task is also quite poor). The value is that you can just run this on every neuron if you're willing to spend the compute, and having a very fuzzy, flawed map of every neuron in a model is still pretty useful as a research tool.
But I would be very cautious about drawing conclusions from any individual neuron explanation generated in this way - even if it looks plausible by visual inspection of a few attention maps.
I thought they had only applied the technique to 307,200 neurons.
1,000 / 307,200 = 0.33% is still low, but considering that not all neurons would be useful since they are initialized randomly, it's not too bad.
This isn't exactly building an understanding of LLMs from first principles... IMO we should broadly be following the (imperfect) example set forth by neuroscientists attempting to explain fMRI scans and assigning functionality to various subregions in the brain. It is circular and "unsafe" from an alignment perspective to use a complex model to understand the internals of a simpler model; in order to understand GPT-4, do we then need GPT-5? These approaches are interesting, but we should primarily focus on building our understanding of these models from building blocks that we already understand.
I've been working in systems neuroscience for a few years (something of a combination lab tech/student, so full disclosure, not an actual expert).
Based on my experience with model organisms (flies & rats, primarily), it is actually pretty amazing how analogous the techniques and goals used in this sort of research are to those we use in systems neuroscience. At a very basic level, the primary task of correlating neuron activation to a given behavior is exactly the same. However, ML researchers benefit from data being trivial to generate and entire brains being analyzable in one shot as a result, whereas in animal research elucidating the role of neurons in a single circuit costs millions of dollars and many researcher-years.
The similarities between the two are so clear that I noticed that in its Microscope tool [1], OpenAI even refers to the models they are studying as "model organisms", an anthropomorphization which I find very apt. Another article I saw a while back on HN which I thought was very cool was [2], which describes the task of identifying the role of a neuron responsible for a particular token of output. This one is especially analogous because it operates on such a small scale, much closer to what systems neuroscientists studying model organisms do.
I don't follow. Neuroscience imaging tools like fMRI are only used because it is impossible to measure the activations of each neuron in a brain in real time (unlike an artificial neural network). This research paper's attempt to understand the role of individual neurons or neuron clusters within a complete network gets much closer to "first principles" than fMRI.
Right so it should be much easier w/ access to every neuron and activation. But the general approach is an experimental one where you try to use your existing knowledge about physics and biology to discern what is activating different structures (and neurons) in the brain. I agree w/ the approach of trying to assign some functionality to individual 'neurons', but I don't think that using GPT4 to do so is the most appealing way to go about that, considering GPT4 is the structure we are interested in decoding in the first place.
I also found this amusing. But you are loosely correct, AFAIK. GPT-4 cannot reliably explain itself in any context: say the total number of possible distinct states of GPT-4 is N; then the total number of possible distinct states of GPT-4 PLUS any context in which GPT-4 is active must be at least N + 1. So there are at least two distinct states in this scenario that GPT-4 can encounter that will necessarily appear indistinguishable to GPT-4. It doesn't matter how big the network is; it'll still encounter this limit.
And it's actually much worse than that limit because a network that's actually useful for anything has to be trained on things besides predicting itself. Notably, this is GPT-4 trying to predict GPT-2 and struggling:
> We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn't understand. We hope as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.
1,000 neurons out of 307,200--and even for the highest-scoring neurons, these are still partial explanations.
There's little reason to think that predicting GPT-4 would be more difficult, only that it would be far more computationally expensive (given the higher number of neurons and much higher computational cost of every test).
Is that evident already or are we fitting the definition of intelligence without being aware?
In human brains, language is only a way to communicate thoughts in concept form, though we also seem to use language to communicate abstract thoughts to ourselves to break them apart/down in a way (imo).
I'd love to see someone train a model on the level of GPT4 to generate abstract thoughts/ideas based on input/context and then pair this model with GPT4 co-operatively and continue to train, such that the flow of abstract ideas is parsed by GPT. But like...how do you even train a model that operates on abstract ideas, there doesn't seem to be any way to do this.
An example: we are not very good at recreating flight as birds do it, the kind humans always regarded as flight, and yet we fly across half the globe in one day.
Going up three meters and landing on a branch is a different matter.
https://pub.towardsai.net/ais-mind-reading-revolution-how-gp...
so why not have them decode sequential dense vectors of their own activations?
As for the majority scoring poorly, they suggest that most neurons won't have clear activation semantics so that is intrinsic to the task and you'd have to move to "decoding the semantics of neurons that fire as a group"
The more interesting question is why are intelligence/beauty/consciousness emergent properties that exist in our minds.
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
Should that be measured in number of nuclear power plants needed to run the computation? Or like, fractions of a small star’s output?
Exactly. Especially:
> ...the technique is already very computationally intensive, and the focus on individual neurons as a function of input means that they can't "reverse engineer" larger structures composed of multiple neurons nor a neuron that has multiple roles;
This paper brings us no closer to explainability in black-box neural networks; it's just another piece by OpenAI trying to placate concerns about an explainability problem that has gone unsolved for decades.
It's also why they can't be trusted in the most serious applications, where decision making requires real transparency rather than a model confidently regurgitating nonsense.
Like say, in court to detect if someone is lying? Or at an airport to detect drugs?
This seems like a novel approach to try to tackle the scale of the problem. Just because the earliest results aren’t great doesn’t mean it’s not a fruitful path to travel.
Is this true? I thought explainability for things like DNNs for vision made pretty good progress in the last decade.
Doesn't this criticism also apply to people to some extent? We don't know what the purpose of individual brain neurons is.
In some cases, you can clearly see neurons that specialize to different areas of the function being modeled, like this one: https://i.ameo.link/b0p.png
This OpenAI research seems to be feeding lots of varied input text into the models they're examining and keeping track of the activations of different neurons along the way. Another method I remember seeing used in the past involves using an optimizer to generate inputs that maximally activate particular neurons in vision models[2].
I'm sure that's much more difficult or even impossible for transformers which operate on sequences of tokens/embeddings rather than single static input vectors, but maybe there's a way to generate input embeddings and then use some method to convert them back into tokens.
[1] https://nn.ameo.dev/
[2] https://www.tensorflow.org/tutorials/generative/deepdream
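For what it's worth, the activation-maximization idea from [2] can be sketched on a toy "neuron" (a single tanh unit with made-up random weights — real DeepDream-style work backpropagates through a full vision model, and nothing here is the actual technique from either paper):

```python
import numpy as np

# Toy activation maximization: gradient ascent on an input vector to
# maximally excite one "neuron" (here just tanh(w.x) with frozen weights).
rng = np.random.default_rng(0)
w = rng.normal(size=16)          # frozen "neuron" weights (illustrative)
x = rng.normal(size=16) * 0.01   # the input we optimize

def activation(x):
    return np.tanh(w @ x)

lr = 0.1
for _ in range(200):
    a = activation(x)
    grad = (1 - a**2) * w             # d tanh(w.x)/dx, computed by hand
    x += lr * grad                    # step toward higher activation
    x /= max(np.linalg.norm(x), 1.0)  # keep the input bounded

print(round(float(activation(x)), 3))
```

The optimized input converges toward the neuron's weight direction, driving the activation near its maximum; doing the same through a transformer would require optimizing over sequences of embeddings, which is exactly the difficulty noted above.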
I'd be curious to see Softmax Linear Units [1] integrated into the possible activation functions since they seem to improve interpretability.
PS: I share your curiosity with respect to things like deep dream. My brief summary of this paper is that you can use GPT4 to summarize what's similar about a set of highlighted words in context which is clever but doesn't fundamentally inform much that we didn't already know about how these models work. I wonder if there's some diffusion based approach that could be used to diffuse from noise in the residual stream towards a maximized activation at a particular point.
[1] https://transformer-circuits.pub/2022/solu/index.html
On first look this is genius, but it seems pretty tautological in a way. How do we know the explainer is good? Kinda leads to thinking about who watches the watchers...
The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation. They ask GPT-4 to guess neuron activation given an explanation and an input (the paper includes the full prompt used). And then they calculate correlation of actual neuron activation and simulated neuron activation.
They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to simulate humans, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanation, and showing that humans agree more with an explanation that scores high in correlation.
Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the neural network with activation of the neuron replaced with simulated activation, and show that the neural network output is closer (measured in Jensen-Shannon divergence) if correlation is higher.
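The correlation-scoring step can be sketched roughly like this (a toy illustration with made-up activation values; the paper's actual scorer has GPT-4 simulate per-token activations over many text excerpts):

```python
import numpy as np

def explanation_score(actual, simulated):
    """Score an explanation by how well activations simulated from it
    correlate with the neuron's actual activations (hypothetical helper,
    not the paper's code)."""
    actual = np.asarray(actual, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    # Pearson correlation between actual and simulated activation traces
    return float(np.corrcoef(actual, simulated)[0, 1])

# Toy illustration: a simulation that tracks the neuron scores high.
actual = [0.0, 0.9, 0.1, 0.8, 0.0, 0.7]
good_sim = [0.1, 1.0, 0.0, 0.9, 0.1, 0.8]
print(explanation_score(actual, good_sim))
```

A score near 1 means the explanation lets you reproduce the neuron's firing pattern; the ablation check then verifies that substituting the simulated activations barely changes the network's output.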
To be clear, this is only neuron activation strength for text inputs. We aren't doing any mechanistic modeling of whether our explanation of what the neuron does predicts any role the neuron might play within the internals of the network, despite most neurons likely having a role that can only be succinctly summarized in relation to the rest of the network.
It seems very easy to end up with explanations that correlate well with a neuron, but do not actually meaningfully explain what the neuron is doing.
The reliability question is of course the main issue. If you don't know how the system works, you can't assign a trust value to anything it comes up with, even if it seems like what it comes up with makes sense.
It seems NN output could be trusted in scenarios where a test exists. For example: "ChatGPT design a house using [APP] and make sure the compiled plans comply with structural/electrical/design/etc codes for area [X]".
But how is any information that isn't testable trusted? I'm open to the idea ChatGPT is as credible as experts in the dismal sciences given that information cannot be proven or falsified and legitimacy is assigned by stringing together words that "makes sense".
The bigger value here in the near term is _explicability_ rather than alignment per se. Having good explicability might provide insights into the design and architecture of LLMs in general, and that in turn may enable better design of alignment schemes.
I really like their approach and I think it’s valuable. And in this particular case, they do have a way to score the explainer model. And I think it could be very valuable for various AI Safety issues.
However, I don’t yet see how it can help with the potentially biggest danger, where a superintelligent AGI is created that is not aligned with humans. The newly created AGI might be 10x more intelligent than the explainer model, to such an extent that the explainer model is not capable of understanding any tactics deployed by the superintelligent AGI. The same way ants are most probably not capable of explaining the tactics deployed by humans, even if we gave them 100 years to figure it out.
https://openaipublic.blob.core.windows.net/neuron-explainer/...
Aww, that's so nice of them to let the community do the work they can use for free. I might even forget that most of OpenAI is closed source.
"In one well-known experiment, a split-brain patient’s left hemisphere was shown a picture of a chicken claw and his right hemisphere was shown a picture of a snow scene. The patient was asked to point to a card that was associated with the picture he just saw. With his left hand (controlled by his right hemisphere) he selected a shovel, which matched the snow scene. With his right hand (controlled by his left hemisphere) he selected a chicken, which matched the chicken claw. Next, the experimenter asked the patient why he selected each item. One would expect the speaking left hemisphere to explain why it chose the chicken but not why it chose the shovel, since the left hemisphere did not have access to information about the snow scene. Instead, the patient’s speaking left hemisphere replied, “Oh, that’s simple. The chicken claw goes with the chicken and you need a shovel to clean out the chicken shed”" [1]. Also [2] has an interesting hypothesis on split-brains: not two agents, but two streams of perception.
[1] 2014, "Divergent hemispheric reasoning strategies: reducing uncertainty versus resolving inconsistency", https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4204522
[2] 2017, "The Split-Brain phenomenon revisited: A single conscious agent with split perception", https://pure.uva.nl/ws/files/25987577/Split_Brain.pdf
Even if you accept classic theory (e.g. hemispheric localization and the homunculus), which most experts don't, all this suggests is that the brain tries to make sense of the information it has, and in sparse environments it fills in the gaps.
How does this make our behavior "mostly lies, fabrications, hallucinations, faulty re-memorization, post hoc reasoning" when most humans don't have a severed corpus callosum?
The discussion starts with:
"In a healthy human brain, these divergent hemispheric tendencies complement each other and create a balanced and flexible reasoning system. Working in unison, the left and right hemispheres can create inferences that have explanatory power and both internal and external consistency."
https://www.health.harvard.edu/blog/right-brainleft-brain-ri... :
> But, the evidence discounting the left/right brain concept is accumulating. According to a 2013 study from the University of Utah, brain scans demonstrate that activity is similar on both sides of the brain regardless of one's personality.
> They looked at the brain scans of more than 1,000 young people between the ages of 7 and 29 and divided different areas of the brain into 7,000 regions to determine whether one side of the brain was more active or connected than the other side. No evidence of "sidedness" was found. The authors concluded that the notion of some people being more left-brained or right-brained is more a figure of speech than an anatomically accurate description.
Here's wikipedia on the topic: "Lateralization of brain function" https://en.wikipedia.org/wiki/Lateralization_of_brain_functi...
Furthermore, "Neuropsychoanalysis" https://en.wikipedia.org/wiki/Neuropsychoanalysis
Neuropsychology: https://en.wikipedia.org/wiki/Neuropsychology
Personality psychology > ~Biophysiological: https://en.wikipedia.org/wiki/Personality_psychology
MBTI > Criticism: https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indi...
Connectome: https://en.wikipedia.org/wiki/Connectome
There is no "their" and there is no "thought process" . There is something that produces text that appears to humans like there is something like thought going on (cf the Eliza Effect), but we must be wary of this anthropomorphising language.
There is no self reflection, but if you ask an LLM program how "it" knows something it will produce some text.
To be clear, you're saying that we should just dismiss out-of-hand any possibility that an LM AI might actually be able to explain its reasoning step-by-step?
I find it kind of charming actually how so many humans are just so darn sure that they have their own special kind of cognition that could never be replicated. Not even with 175,000,000,000 calculations for every word generated.
And when Hinton says at MIT, "I find it very hard to believe that they don't have semantics when they consult problems like you know how I paint the rooms how I get all the rooms in my house to be painted white in two years time," I believe he's commenting on the ability of LLM's to think on some level.
I think LLMs are "Semantic Clouds of Words" + grammar and syntax generator. Someone could just discard the grammar and syntax generator, just use the semantic cloud and create the grammar and syntax by himself.
For example, in writing a legal document, a person slightly educated on the subject could just put the relevant words onto a blank page and fill in the syntax and grammar, alongside human reasoning, which is still far superior to any machine reasoning, at least today.
The process of editing GPT-generated documents to fix the reasoning is not a negligible task anyway. Sam Altman mentioned that "the machine has some kind of reasoning", not a human reasoning ability by any means.
My point is that LLMs are two programs fused into one, "word clouds" and "syntax and grammar", sprinkled with some kind of poor reasoning. Their word-clouding ability is so unbelievably stronger than any human's that it fills me with awe every time I use it. Everything else is just whatever!
What if you ask it to synthesize multiple internal streams of thought, for an ensemble of interior monologues, then have all those argue with each other using logic and then present a high level answer from that panoply of answers?
There's no formal axiom system being dealt with here, afaict?
Do you just generally mean "there may be some kind of self-reference, which may lead to some kind of liar-paradox-related issues"?
Hofstadter talks about something similar in his books.
Of course, if you were trying to use GPT-4 to explain GPT-4 then I think the Gödel incompleteness theorem would be more relevant, and even then I'm not so sure.
In a real sense, all of the future discoveries of mathematics already exist in the "training set" of our present understanding, we just haven't thought it all the way through yet. If we discover something new, can we say that the concept didn't exist, or that it "couldn't be inferred" from previous work?
I think the same would apply to LLMs and their understanding of the way we encode information using language. Given their radically different approach to understanding the same medium, they are well poised to both confirm many things we understand intuitively as well as expose the shortcomings of our human-centric model of understanding.
EDIT: I see below you gave some examples, like invention of language before it existed, and new theorems in math that presumably would be of interest to mathematicians. Those are fair enough in my opinion. The AI isn't quite good enough for those yet I think, but I also think newer versions trained with only more CPU/GPU, more parameters, and more data could become 'AI scientists' capable of producing these kinds of concepts.
On the other hand, that is an incredibly high bar.
If so, they're acting on a gigantic assumption that GPT-4 actually correctly encodes a reasonable model of the body of knowledge that went into the development of LLMs.
Help me out. Am I missing something here?
Yes the initial hypothesis that GPT-4 would know was a gigantic assumption. But a falsifiable one which we can easily generate reproducible tests for.
The idea that simulated neurons could learn anything useful at all was once a gigantic assumption too.
If you squint it's train/test separation.
But I would be very cautious about drawing conclusions from any individual neuron explanation generated in this way - even if it looks plausible by visual inspection of a few attention maps.
Based on my experience with model organisms (flies & rats, primarily), it is actually pretty amazing how analogous the techniques and goals used in this sort of research are to those we use in systems neuroscience. At a very basic level, the primary task of correlating neuron activation to a given behavior is exactly the same. However, ML researchers benefit from data being trivial to generate and entire brains being analyzable in one shot as a result, whereas in animal research elucidating the role of neurons in a single circuit costs millions of dollars and many researcher-years.
The similarities between the two are so clear that I noticed that in its Microscope tool [1], OpenAI even refers to the models they are studying as "model organisms", an anthropomorphization which I find very apt. Another article I saw a while back on HN which I thought was very cool was [2], which describes the task of identifying the role of a neuron responsible for a particular token of output. This one is especially analogous because it operates on such a small scale, much closer to what systems neuroscientists studying model organisms do.
[1] https://openai.com/research/microscope [2] https://clementneo.com/posts/2023/02/11/we-found-an-neuron
Lots of parallels to how our brains are thought to work.
https://en.m.wikipedia.org/wiki/Predictive_coding
I also found this amusing. But you are loosely correct, AFAIK. GPT-4 cannot reliably explain itself in any context: say the total number of possible distinct states of GPT-4 is N; then the total number of possible distinct states of GPT-4 PLUS any context in which GPT-4 is active must be at least N + 1. So there are at least two distinct states in this scenario that GPT-4 can encounter that will necessarily appear indistinguishable to GPT-4. It doesn't matter how big the network is; it'll still encounter this limit.
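The counting argument here boils down to the pigeonhole principle; a toy sketch (the hash is just a stand-in for "the internal state the model ends up in for a given situation" — none of this models GPT-4 itself):

```python
def collides(states, situations):
    """True if any two distinct situations map to the same internal
    state. With more situations than states, a collision is guaranteed
    by the pigeonhole principle, regardless of the mapping."""
    seen = {}
    for s in situations:
        key = hash(s) % states   # only `states` distinguishable outcomes
        if key in seen and seen[key] != s:
            return True
        seen[key] = s
    return False

# N+1 distinct situations into N states: two must look identical.
print(collides(states=4, situations=["a", "b", "c", "d", "e"]))
```

Whatever the mapping from situations to states, five situations squeezed into four states forces at least two to be indistinguishable, which is the commenter's point about a model predicting itself-plus-context.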
And it's actually much worse than that limit because a network that's actually useful for anything has to be trained on things besides predicting itself. Notably, this is GPT-4 trying to predict GPT-2 and struggling:
> We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn't understand. We hope as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.
1,000 neurons out of 307,200--and even for the highest-scoring neurons, these are still partial explanations.