> GPT-4 feature: ends of phrases related to price increases
and 2/5 of the responses don't have any relation to increases at all:
> Brent crude, fell 38 cents to $118.29 a barrel on the ICE Futures Exchange in London. The U.S. benchmark, West Texas Intermediate crude, was down 53 cents to $99.34 a barrel on the New York Mercantile Exchange. -- Ronald D. White Graphic: The AAA
and
> ,115.18. The record reflects that appellant also included several hand-prepared invoices and employee pay slips, including an allegedly un-invoiced laundry ticket dated 29 June 2013 for 53 bags oflaundry weighing 478 pounds, which, at the contract price of $
I think I must be misunderstanding something. Why would this example (out of all the potential examples) be picked?
Notice that most of the examples have none of the green highlighting, which is shown for
> small losses. KEEPING SCORE: The Dow Jones industrial average rose 32 points, or 0.2 percent, to 18,156 as of 3:15 p.m. Eastern time. The Standard & Poor’s ... OMAHA, Neb. (AP) — Warren Buffett’s company has bought nearly
The other sentences are there as a contrast, to show how specific this neuron is.
There are also many top activations not showing increases, e.g.
> 0.06 of a cent to 90.01 cents US.↵↵U.S. indexes were mainly lower as the Dow Jones industrials lost 21.72 points to 16,329.53, the Nasdaq was up 11.71 points at 4,318.9 and the S&P 500
This is super cool, it feels like going in the direction of the "deep"/high level type of semantic searching I've been waiting for. I like their examples of basically filtering documents for the "concept" of price increases, or even something as high level as a rhetorical question
I wonder how this compares to training/fine-tuning a model on examples of rhetorical questions and asking it to find them in a given document. This is maybe faster/more accurate? Since it involves just looking at neural network activations, vs running it with input and having it generate an answer...?
Nahhh, that's the tried-and-true Apple approach to marketing, and OpenAI is well positioned to adopt it for themselves. They act like they invented transformers as much as Apple acts like they invented the smartphone.
I'm the research lead of Anthropic's interpretability team. I've seen some comments like this one, which I worry downplay the importance of @leogao et al's paper due to the similarity of ours. I think these comments are really undervaluing Gao et al's work.
It's not just that this is contemporaneous work (a project like this takes many months at the very least), but also that it introduces a number of novel contributions like TopK activations and new evaluations. It seems very possible that some of these innovations will be very important for this line of work going forward.
More generally, I think it's really unfortunate when we don't value contemporaneous work or replications. Prior to this paper, one could have imagined it being the case that sparse autoencoders worked on Claude due to some idiosyncrasy, but wouldn't work on other frontier models for some reason. This paper can give us increased confidence that they work broadly, and that in itself is something to celebrate. It gives us a more stable foundation to build on.
I'm personally really grateful to all the authors of this paper for their work pushing sparse autoencoders and mechanistic interpretability forward.
The biggest thing I noticed comparing the two was that OpenAI's method really approached (and appears to have effectively mitigated) the dead latents problem with a clever weight initialization and an "auxiliary loss" which (I think) explicitly penalizes dead latents. The TopK activation function is the other main difference I spot between the two.
Now, on the flip side, the Anthropic effort goes much further than the OpenAI one in terms of actually doing something interesting with the outputs of all this. Feature steering and the feature UMAP are both extremely cool, and to my knowledge the OpenAI team stopped short of efforts like that in their paper.
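For readers who have not met the TopK activation or the dead-latents problem mentioned above, here is a minimal pure-Python sketch. It is a toy illustration, not OpenAI's code: the function names, the weight layout (one encoder row per latent) and the tiny sizes are all assumptions made for this example, and the auxiliary loss is not modelled at all.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and set every other entry to zero."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    # Rank indices by value, largest first; ties go to the lower index.
    ranked = sorted(range(len(pre_acts)), key=lambda i: (-pre_acts[i], i))
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def encode(x, W_enc, b_enc, k):
    """Linear encoder followed by TopK. W_enc has one row per latent (assumed layout)."""
    pre_acts = [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(W_enc, b_enc)]
    return topk_activation(pre_acts, k)


def dead_latents(batch_of_latents):
    """Indices of latents that are zero on every example in the batch."""
    n = len(batch_of_latents[0])
    return [j for j in range(n)
            if all(latents[j] == 0.0 for latents in batch_of_latents)]


# Toy encoder with 4 latents over a 2-dimensional input (made-up numbers).
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
batch = [encode(x, W_enc, b_enc, k=2) for x in ([1.0, 2.0], [3.0, 1.0])]
print(batch)                # exactly two latents survive per example
print(dead_latents(batch))  # latents that never fired on this batch
```

With TopK the sparsity level is fixed directly by `k` rather than tuned through a penalty term, and counting latents that never fire over a batch is one simple way to see the dead-latents problem that the comment describes.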
Exciting to see this so soon after Anthropic's "Mapping the Mind of a Large Language Model" (under 3 weeks). I find these efforts really exciting; it is still common to hear people say "we have no idea how LLMs / Deep Learning works", but that is really a gross generalization as stuff like this shows.
Wonder if this was a bit rushed out in response to Anthropic's release (as well as the departure of Jan Leike from OpenAI)... the paper link doesn't even go to arXiv, and the analysis is not nearly as deep. Though who knows, might be unrelated.
I read this as "we have not built up tools / math to understand neural networks as they are new and exciting" and not as "neural networks are magical and complex and not understandable because we are meddling with something we cannot control".
A good example would be planes - it took a long while to develop mathematical models that could be used to model behavior. Meanwhile practical experimentation developed decent rule of thumb for what worked / did not work.
So I don't think it's fair to say that "we don't" (know how neural networks work), we don't have math / models yet that can explain/model their behavior...
Could there also be a “legal hedging” reason for why you would release a paper like this?
By reaffirming that “we don’t know how this works, nobody does” it’s easier to avoid being charged with copyright infringement from various actors/data sources that have sued them.
We were planning to release the paper around this time independent of the other events you mention.
I think it is still predominantly accurate to say that we have no idea how LLMs work. SAEs might eventually change that, but there's still a long way to go.
> but that is really a gross generalization as stuff like this shows.
I think this research actually still reinforces that we still have very little understanding of the internals. The blog post also reiterates that this is early work with many limitations.
> likely all these guys went to the same metaphorical SF bars, it was in the water
It is also coming from a long lineage of thought, no? For instance, one of the things often taught early in an ML course is the notion that “early layers respond to/generate general information/patterns, and deeper layers respond to/generate more detailed/complex patterns/information.” That is obviously an overly broad and vague statement, but it is a useful intuition and can be backed up by inspecting, e.g., what maximally activates some convolution filters. So already there is a notion that there is some sort of spatial structure to how semantics are processed and represented in a neural network (even if in a totally different context, as in the image processing mentioned above), where “spatial” here is used to refer to different regions of the network.
Even more simply, in fact as simple as you can get: with linear regression, the most interpretable model you can get, you have a clear notion that different parameter groups of the model respond to different “concepts” (where a concept is taken to be whatever the variables associated with a given subset of coefficients represent).
In some sense, at least in a high-level/intuitive reading of the new research coming out of Anthropic and OpenAI, I think the current research is just a natural extension of these ideas, albeit in a much more complicated context and massive scale.
Somebody else, please correct me if you think my reading is incorrect!!
This project has been in the works for about a year. The initial commit to the public repo was not really closely related to this project, it was part of the release of the Transformer debugger, and the repo was just reused for this release.
The fact that a paper is implying an LLM has a mind doesn't exactly bode well for the people who wrote it, not to mention the continued meaningless babbling about "safety". It'd also be nice if they could show their work so we could replicate it. Still, not shabby for an ad!
Well - what is a mind exactly? We don't really have a good definition for a human mind. Not sure we should be claiming domain over the term. It's not a terrible shorthand for discussing something that reads and responds as if it had some kind of mind - whether technically true or not (which we honestly don't know).
Indeed, and the very last section about how they’ve now “open sourced” this research is also a bit vague. They’ve shared their research methodology and findings… But isn’t that obligatory when writing a public paper?
But even with the efforts so far, I don't think we understand how or why these emergent capabilities form. LLMs are as much a black box as ever.
The Deep Visualization Toolbox from nearly 10 years ago is solid precedent for understanding deep models, albeit much smaller models than LLMs. It’s hard to say OpenAI’s “visualization” released today is nearly as effective. It could be that GPT-4 is much harder to instrument.
At the shit-tier level, the majority of people building applications on this technology are projecting abilities onto it that even they can't really demonstrate it has in a reliable way.
At the inventor level, the people who make it are dependent on projecting the idea that magic will happen when they have more compute.
At every level, the products are so far ahead of the knowledge that it's actually unethical.
How does this compare to or improve on applying something like SHAP[0][1] on a model? The idea in the first line that "we currently don't understand how to make sense of the neural activity within language models." is..straight up false?
SHAP is pretty separate IMO. Shapley analysis is really a game theoretical methodology that is model agnostic and is only about determining how individual sections of the input contribute to a given prediction, not about how the model actually works internally to produce an output.
As long as you have a callable black box, you can compute Shapley values (or approximations); it does not speak to how or why the model actually works internally.
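To make the "callable black box" point concrete, here is a small sketch that computes exact Shapley values by brute force over feature orderings. It is not the SHAP library's API: `shapley_values`, `toy_model` and the all-zeros baseline are hypothetical names and choices for illustration, and enumerating every ordering is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for a black-box `model` at input `x`.

    A feature that is "absent" from a coalition takes its value from
    `baseline`. Only calls to `model` are needed, never its internals.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # add feature i to the coalition
            output = model(current)
            totals[i] += output - previous_output  # marginal contribution
            previous_output = output
    return [t / len(orderings) for t in totals]


def toy_model(v):
    # Opaque to the caller; includes an interaction between features 0 and 2.
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]


values = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(values)
```

The attributions sum to the difference between the model's output at `x` and at the baseline, and the interaction term is split evenly between the two features involved. Nothing in the procedure looks inside `toy_model`, which is the distinction the comment is drawing with interpretability work.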
LLM-based AIs have lots of "features", which are roughly synonymous with "concepts" - these can be anything from `the concept of an apostrophe in the word don't`, to `"George Wash" is usually followed by "ington" in the context of early American History`. Inside the LLM's neural network, these are mapped to some circuitry-in-software-esque paths.
We don't really have a good way of understanding how these features are generated inside of the LLMs or how their circuitry is activated when outputting them, or why the LLMs are following those circuits. Because of this, we don't have any way to debug this component of an LLM - which makes them harder to improve. Similarly, if LLMs/AIs ever get advanced enough, we'll want to be able to identify if they're being wilfully deceptive towards us, which we can't currently do. For these reasons, we'd like to understand what is actually happening in the neural network to produce & output concepts. This domain of research is usually referred to as "interpretability".
OpenAI (and also DeepMind and Anthropic) have found a few ways to inspect the inner circuitry of the LLMs, and reveal a handful of these features. They do this by asking questions of the model, and then inspecting which parts of the LLM's inner circuitry "light up". They then ablate (turn off) circuitry to see if those features become less frequently used in the AI's response as a verification step.
The graphs and highlighted words are visual representations of concepts that they are reasonably certain about - for example, the concept of the word "AND" linking two parts of a sentence together highlights the word "AND".
Neel Nanda is the best source for this info if you're interested in interpretability (IMO it's the most interesting software problem out there at the moment), but note that his approach is different to OpenAI's methodology discussed in the post: https://www.neelnanda.io/mechanistic-interpretability
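The ablation step described above can be sketched on a toy linear decoder: zero out one feature and look at how the output changes. This is only an illustration of the idea under made-up assumptions (a two-feature decoder with invented weights, hypothetical function names); real ablation experiments run the full model and measure downstream behaviour.

```python
def decode(latents, W_dec):
    """Toy linear decoder. W_dec has one row per output dimension (assumed layout)."""
    return [sum(w * z for w, z in zip(row, latents)) for row in W_dec]


def ablate(latents, index):
    """Return a copy of `latents` with the chosen feature turned off."""
    ablated = list(latents)
    ablated[index] = 0.0
    return ablated


def ablation_effect(latents, W_dec, index):
    """Per-dimension change in the output caused by removing one feature."""
    before = decode(latents, W_dec)
    after = decode(ablate(latents, index), W_dec)
    return [b - a for b, a in zip(before, after)]


W_dec = [[1.0, 0.0],
         [0.5, 2.0]]
latents = [2.0, 1.0]
effect = ablation_effect(latents, W_dec, index=0)
print(effect)  # how much each output dimension depended on feature 0
```

If removing a feature barely changes the output, that is a hint the feature was not doing the work its label suggests.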
High-level concepts stored inside large models (diffusion models, transformers etc) are normally hard to separate from each other, and the model is more or less a black box. A lot of research goes into gaining insight into what the model knows. This is another advance in that direction; it allows for easier separation of the concepts.
This can be used to analyze the knowledge inside the model, and potentially modify (add, erase, change the importance) certain concepts without affecting unrelated ones. The precision achievable with the particular technique is always in question though, and some concepts are just too close to separate from each other, so it's probably not perfect.
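As a rough sketch of what "changing the importance" of a concept could look like, the snippet below adds a scaled feature direction to an activation vector and measures how strongly the result points along that direction. The names `steer` and `feature_strength` and the three-dimensional vectors are assumptions for illustration, not anyone's published method.

```python
def steer(activation, feature_direction, strength):
    """Add `strength` times the feature direction to the activation.

    Positive strength boosts the concept, negative strength suppresses it.
    """
    return [a + strength * d for a, d in zip(activation, feature_direction)]


def feature_strength(activation, feature_direction):
    """Scalar projection of the activation onto the feature direction."""
    dot = sum(a * d for a, d in zip(activation, feature_direction))
    norm = sum(d * d for d in feature_direction) ** 0.5
    return dot / norm


activation = [1.0, 0.0, 2.0]
direction = [0.0, 1.0, 0.0]  # hypothetical unit-length "concept" direction
boosted = steer(activation, direction, strength=3.0)
print(feature_strength(activation, direction), feature_strength(boosted, direction))
```

In this toy case the other coordinates are untouched because the direction is axis-aligned. With real features the directions overlap, which is one reason the precision caveat above matters.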
from chatgpt itself: The article discusses how researchers use sparse autoencoders to identify and interpret key features within complex language models like GPT-4, making their inner workings more understandable. This advancement helps improve AI safety and reliability by breaking down the models' decision-making processes into simpler, human-interpretable parts.
In general this is just copying work done by Anthropic, so there's nothing fundamentally new here.
What they have done here is to identify patterns internal to GPT-4 that correspond to specific identifiable concepts. The work was done by OpenAI's mostly dismantled safety team (it has the names of this team's recently departed co-leads Ilya & Jan Leike on it), so this is nominally being done for safety reasons to be able to boost or suppress specific concepts from being activated when the model is running, such as Anthropic's demonstration of boosting their model's fixation on the Golden Gate Bridge:
This kind of work would also seem to have potential functional uses as well as safety ones, given that it allows you to control the model in specific ways.
Does this mean that it could be good practice to release the autoencoder that was trained on a neural network to explain its outputs? Like all open models on Hugging Face could have this as a useful accompaniment?
There are also many top activations not showing increases, e.g.
> 0.06 of a cent to 90.01 cents US.↵↵U.S. indexes were mainly lower as the Dow Jones industrials lost 21.72 points to 16,329.53, the Nasdaq was up 11.71 points at 4,318.9 and the S&P 500
(Highlight on the first comma.)
https://exa.ai/search?c=all
https://openai.com/index/language-models-can-explain-neurons...
[0] https://transformer-circuits.pub/2024/scaling-monosemanticit...
Very weird that they don't cite this in their webpage and instead bury the source in their paper.
"We currently don't understand how to make sense of the neural activity within language models."
"Unlike with most human creations, we don’t really understand the inner workings of neural networks."
"The [..] networks are not well understood and cannot be easily decomposed into identifiable parts"
"[..] the neural activations inside a language model activate with unpredictable patterns, seemingly representing many concepts simultaneously"
"Learning a large number of sparse features is challenging, and past work has not been shown to scale well."
etc., etc., etc.
People say we don't (currently) know why they output what they output, because .. as the article clearly states, we don't.
too lazy to dig up source but some twitter sleuth found that the first commit to the project was 6 months ago
likely all these guys went to the same metaphorical SF bars, it was in the water
https://github.com/openai/sparse_autoencoder/commit/764586ae...
They actually open sourced it, for GPT-2 which is an open model.
https://github.com/yosinski/deep-visualization-toolbox
It's really not though, and on multiple levels.
Here is the link to the concept [content warning]: https://openaipublic.blob.core.windows.net/sparse-autoencode...
DocID: 191632
[0] https://github.com/shap/shap
[1] https://en.wikipedia.org/wiki/Shapley_value#In_machine_learn...
https://www.anthropic.com/news/golden-gate-claude