Extracting concepts from GPT-4

Exciting to see this so soon after Anthropic's "Mapping the Mind of a Large Language Model" (under 3 weeks). I find these efforts really exciting; it is still common to hear people say "we have no idea how LLMs / Deep Learning works", but that is really a gross generalization as stuff like this shows.

Wonder if this was a bit rushed out in response to Anthropic's release (as well as the departure of Jan Leike from OpenAI)... the paper link doesn't even go to Arxiv, and the analysis is not nearly as deep. Though who knows, might be unrelated.

thegrim33 · 2 years ago

From the article:

"We currently don't understand how to make sense of the neural activity within language models."

"Unlike with most human creations, we don’t really understand the inner workings of neural networks."

"The [..] networks are not well understood and cannot be easily decomposed into identifiable parts"

"[..] the neural activations inside a language model activate with unpredictable patterns, seemingly representing many concepts simultaneously"

"Learning a large number of sparse features is challenging, and past work has not been shown to scale well."

etc., etc., etc.

People say we don't (currently) know why they output what they output, because .. as the article clearly states, we don't.

TrainedMonkey · 2 years ago

I read this as "we have not built up tools / math to understand neural networks as they are new and exciting" and not as "neural networks are magical and complex and not understandable because we are meddling with something we cannot control".

A good example would be planes - it took a long while to develop mathematical models that could be used to model behavior. Meanwhile practical experimentation developed decent rules of thumb for what worked / did not work.

So I don't think it's fair to say that "we don't" (know how neural networks work); rather, we don't yet have the math / models that can explain/model their behavior...

realPtolemy · 2 years ago

Could there also be a “legal hedging” reason for why you would release a paper like this?

By reaffirming that “we don’t know how this works, nobody does” it’s easier to avoid being charged with copyright infringement from various actors/data sources that have sued them.


surfingdino · 2 years ago

Not holding my breath for that hallucinated cure for cancer then.

submeta · 2 years ago

Scary actually. Because how can we assess the risks when we don’t know what the system is capable of doing?

leogao · 2 years ago

We were planning to release the paper around this time independent of the other events you mention.

I think it is still predominantly accurate to say that we have no idea how LLMs work. SAEs might eventually change that, but there's still a long way to go.

joaquincabezas · 2 years ago

it makes sense that the leaders are building around similar ideas in parallel, for me it's a healthy sign

jerrygenser · 2 years ago

> but that is really a gross generalization as stuff like this shows.

I think this research actually still reinforces that we still have very little understanding of the internals. The blog post also reiterates that this is early work with many limitations.

swyx · 2 years ago

> Wonder if this was a bit rushed out in response to Anthropic's release

too lazy to dig up source but some twitter sleuth found that the first commit to the project was 6 months ago

likely all these guys went to the same metaphorical SF bars, it was in the water

szvsw · 2 years ago

> likely all these guys went to the same metaphorical SF bars, it was in the water

It also is coming from a long lineage of thought, no? For instance, one of the things often taught early in an ML course is the notion that “early layers respond to/generate general information/patterns, and deeper layers respond to/generate more detailed/complex patterns/information.” That is obviously an overly broad and vague statement, but it is a useful intuition and can be backed up by inspecting, e.g., what maximally activates some convolution filters. So already there is a notion that there is some sort of spatial structure to how semantics are processed and represented in a neural network (even if in a totally different context, as in image processing mentioned above), where “spatial” here is used to refer to different regions of the network.

Even more simply, in fact as simple as you can get: with linear regression, the most interpretable model you can get, you have a clear notion that different parameter groups of the model respond to different “concepts” (where a concept is taken to be whatever the variables associated with a given subset of coefficients represent).

In some sense, at least in a high-level/intuitive reading of the new research coming out of Anthropic and OpenAI, I think the current research is just a natural extension of these ideas, albeit in a much more complicated context and massive scale.

Somebody else, please correct me if you think my reading is incorrect!!

leogao · 2 years ago

This project has been in the works for about a year. The initial commit to the public repo was not really closely related to this project, it was part of the release of the Transformer debugger, and the repo was just reused for this release.

nicce · 2 years ago

Visualizer was added 18 hours ago:

https://github.com/openai/sparse_autoencoder/commit/764586ae...

darby_nine · 2 years ago

> Mapping the Mind of a Large Language Model

The fact that a paper is implying an LLM has a mind doesn't exactly bode well for the people who wrote it, not to mention the continued meaningless babbling about "safety". It'd also be nice if they could show their work so we could replicate it. Still, not shabby for an ad!

castigatio · 2 years ago

Well - what is a mind exactly? We don't really have a good definition for a human mind. Not sure we should be claiming domain over the term. It's not a terrible shorthand for discussing something that reads and responds as if it had some kind of mind - whether technically true or not (which we honestly don't know).

realPtolemy · 2 years ago

Indeed, and the very last section about how they’ve now “open sourced” this research is also a bit vague. They’ve shared their research methodology and findings… But isn’t that obligatory when writing a public paper?

lanceflt · 2 years ago

https://github.com/openai/sparse_autoencoder

They actually open sourced it, for GPT-2 which is an open model.

3abiton · 2 years ago

But even with current efforts so far, I don't think we have an understanding of how/why these emergent capabilities are formed. LLMs are as much a black box as ever.

choppaface · 2 years ago

The Deep Visualization Toolbox from nearly 10 years ago is solid precedent for understanding deep models, albeit much smaller models than LLMs. It’s hard to say OpenAI’s “visualization” released today is nearly as effective. It could be that GPT-4 is much harder to instrument.

https://github.com/yosinski/deep-visualization-toolbox

imjonse · 2 years ago

Both Leike and Sutskever are still credited in the post.


throw46365 · 2 years ago

> that is really a gross generalization

It's really not though, and on multiple levels.

At the shit-tier level, the majority of people building applications on this technology are projecting abilities onto it that even they can't really demonstrate it has in a reliable way.

At the inventor level, the people who make it are dependent on projecting the idea that magic will happen when they have more compute.

At every level, the products are so far ahead of the knowledge that it's actually unethical.

Can someone ELI5 the significance of this? (okay maybe not 5, but in basic language)

OtherShrezzing · 2 years ago

LLM-based AIs have lots of "features" which are kind of synonymous with "concepts" - these can be anything from `the concept of an apostrophe in the word don't`, to `"George Wash" is usually followed by "ington" in the context of early American History`. Inside of the LLM's neural network, these are mapped to some circuitry-in-software-esque paths.

We don't really have a good way of understanding how these features are generated inside of the LLMs or how their circuitry is activated when outputting them, or why the LLMs are following those circuits. Because of this, we don't have any way to debug this component of an LLM - which makes them harder to improve. Similarly, if LLMs/AIs ever get advanced enough, we'll want to be able to identify if they're being wilfully deceptive towards us, which we can't currently do. For these reasons, we'd like to understand what is actually happening in the neural network to produce & output concepts. This domain of research is usually referred to as "interpretability".

OpenAI (and also DeepMind and Anthropic) have found a few ways to inspect the inner circuitry of the LLMs, and reveal a handful of these features. They do this by asking questions of the model, and then inspecting which parts of the LLM's inner circuitry "light up". As a verification step, they then ablate (turn off) circuitry to see if those features become less frequently used in the AI's response.
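
The ablation check described above can be sketched in a few lines. This is a toy illustration only, not OpenAI's code: the feature names, the direction vectors, and the "readout" vector are all made up for the example. It assumes an activation can be written as a weighted sum of feature directions, switches one feature off, and compares how strongly the activation pushes towards a particular output before and after.

```python
# Toy ablation sketch. Every name and number here is hypothetical.

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical direction for each feature in a 3-dimensional activation space.
DIRECTIONS = {
    "price_increase": [1.0, 0.0, 0.5],
    "rhetorical_question": [0.0, 1.0, 0.0],
}

# Hypothetical readout: how strongly an activation pushes towards the token "rose".
READOUT_ROSE = [2.0, 0.0, 1.0]

def reconstruct(feature_acts):
    """Sum each feature's direction, scaled by how strongly that feature fired."""
    out = [0.0, 0.0, 0.0]
    for name, strength in feature_acts.items():
        for i, d in enumerate(DIRECTIONS[name]):
            out[i] += strength * d
    return out

def ablate(feature_acts, name):
    """Return a copy of the feature activations with one feature switched off."""
    ablated = dict(feature_acts)
    ablated[name] = 0.0
    return ablated

acts = {"price_increase": 2.0, "rhetorical_question": 0.5}
before = dot(reconstruct(acts), READOUT_ROSE)
after = dot(reconstruct(ablate(acts, "price_increase")), READOUT_ROSE)
print(before, after)  # the push towards "rose" drops once the feature is ablated
```

If the score barely moved after ablation, that would be a hint that the feature is not really doing the job its label claims.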

The graphs and highlighted words are visual representations of concepts that they are reasonably certain about - for example, the concept of the word "AND" linking two parts of a sentence together highlights the word "AND".

Neel Nanda is the best source for this info if you're interested in interpretability (IMO it's the most interesting software problem out there at the moment), but note that his approach is different to OpenAI's methodology discussed in the post: https://www.neelnanda.io/mechanistic-interpretability

localfirst · 2 years ago

hallucination solution?

orbital-decay · 2 years ago

High-level concepts stored inside the large models (diffusion models, transformers etc) are normally hard to separate from each other, and the model is more or less a black box. A lot of research is put into gaining insight into what the model knows. This is another advancement in this direction; it allows for easy separation of the concepts.

This can be used to analyze the knowledge inside the model, and potentially modify (add, erase, change the importance) certain concepts without affecting unrelated ones. The precision achievable with the particular technique is always in question though, and some concepts are just too close to separate from each other, so it's probably not perfect.

93po · 2 years ago

from chatgpt itself: The article discusses how researchers use sparse autoencoders to identify and interpret key features within complex language models like GPT-4, making their inner workings more understandable. This advancement helps improve AI safety and reliability by breaking down the models' decision-making processes into simpler, human-interpretable parts.

HarHarVeryFunny · 2 years ago

In general this is just copying work done by Anthropic, so there's nothing fundamentally new here.

What they have done here is to identify patterns internal to GPT-4 that correspond to specific identifiable concepts. The work was done by OpenAI's mostly dismantled safety team (it has the names of this team's recently departed co-leads Ilya & Jan Leike on it), so this is nominally being done for safety reasons, to be able to boost or suppress specific concepts from being activated when the model is running, such as Anthropic's demonstration of boosting their model's fixation on the Golden Gate Bridge:

https://www.anthropic.com/news/golden-gate-claude

This kind of work would also seem to have potential functional uses as well as safety ones, given that it allows you to control the model in specific ways.
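
To make "boost or suppress" concrete, here is a minimal sketch of one way steering can be pictured: nudging an activation vector along a feature's direction. This is an assumption-laden toy, not a description of how Anthropic or OpenAI implemented it; the direction, the activation values, and the function names are all hypothetical.

```python
# Toy steering sketch. The vectors and names are hypothetical.

def feature_score(activation, direction):
    """How strongly the activation points along the feature direction (dot product)."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, strength):
    """Add `strength` times the feature direction to the activation."""
    return [a + strength * d for a, d in zip(activation, direction)]

bridge_direction = [0.0, 1.0, 0.0]  # hypothetical unit-length feature direction
activation = [0.5, 0.25, -1.0]      # hypothetical activation for one token

# Boost: push the activation further along the feature direction.
boosted = steer(activation, bridge_direction, 4.0)

# Suppress: subtract exactly the amount of the feature that is present.
suppressed = steer(
    activation, bridge_direction, -feature_score(activation, bridge_direction)
)

print(feature_score(boosted, bridge_direction))
print(feature_score(suppressed, bridge_direction))
```

The components of the activation that are orthogonal to the direction are left untouched, which is the appeal: in principle you change one concept without disturbing the others.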

svieira · 2 years ago

When one of the first examples is:

> GPT-4 feature: ends of phrases related to price increases

and 2/5 of the responses don't have any relation to increase at all:

> Brent crude, fell 38 cents to $118.29 a barrel on the ICE Futures Exchange in London. The U.S. benchmark, West Texas Intermediate crude, was down 53 cents to $99.34 a barrel on the New York Mercantile Exchange. -- Ronald D. White Graphic: The AAA

and

> ,115.18. The record reflects that appellant also included several hand-prepared invoices and employee pay slips, including an allegedly un-invoiced laundry ticket dated 29 June 2013 for 53 bags oflaundry weighing 478 pounds, which, at the contract price of $

I think I must be mis-understanding something. Why would this example (out of all the potential examples) be picked?

Metus · 2 years ago

Notice that most of the examples have none of the green highlight counter, which is shown for

> small losses. KEEPING SCORE: The Dow Jones industrial average rose 32 points, or 0.2 percent, to 18,156 as of 3:15 p.m. Eastern time. The Standard & Poor’s ... OMAHA, Neb. (AP) — Warren Buffett’s company has bought nearly

the other sentences are in contrast to show how specific this neuron is.

yorwba · 2 years ago

The highlights are better visible in this visualisation: https://openaipublic.blob.core.windows.net/sparse-autoencode...

There are also many top activations not showing increases, e.g.

> 0.06 of a cent to 90.01 cents US.↵↵U.S. indexes were mainly lower as the Dow Jones industrials lost 21.72 points to 16,329.53, the Nasdaq was up 11.71 points at 4,318.9 and the S&P 500

(Highlight on the first comma.)

Ah, that makes a lot of sense, thank you!

OmarShehata · 2 years ago

This is super cool, it feels like going in the direction of the "deep"/high level type of semantic searching I've been waiting for. I like their examples of basically filtering documents for the "concept" of price increases, or even something as high level as a rhetorical question

I wonder how this compares to training/fine tuning a model on examples of rhetorical questions and asking it to find them in a given document. This is maybe faster/more accurate? Since it involves just looking at neural network activation, vs running it with input and having it generate an answer...?

f0e4c2f7 · 2 years ago

Exa is trying to do this. I've found some sort of interesting stuff this way but it honestly doesn't feel quite good enough yet to me.

https://exa.ai/search?c=all

andai · 2 years ago

Exa's top hit for this article:

https://openai.com/index/language-models-can-explain-neurons...

yismail · 2 years ago

Interesting, reminds me of similar work Anthropic did on Claude 3 Sonnet [0].

[0] https://transformer-circuits.pub/2024/scaling-monosemanticit...

longdog · 2 years ago

I feel the webpage strongly hints that sparse autoencoders were invented by OpenAI for this project.

Very weird that they don't cite this in their webpage and instead bury the source in their paper.

cosmojg · 2 years ago

Nahhh, that's the tried-and-true Apple approach to marketing, and OpenAI is well positioned to adopt it for themselves. They act like they invented transformers as much as Apple acts like they invented the smartphone.

Legend2440 · 2 years ago

The methods are the same, this is just OpenAI applying Anthropic's research to their own model.

colah3 · 2 years ago

I'm the research lead of Anthropic's interpretability team. I've seen some comments like this one, which I worry downplay the importance of @leogao et al's paper due to its similarity to ours. I think these comments are really undervaluing Gao et al's work.

It's not just that this is contemporaneous work (a project like this takes many months at the very least), but also that it introduces a number of novel contributions like TopK activations and new evaluations. It seems very possible that some of these innovations will be very important for this line of work going forward.

More generally, I think it's really unfortunate when we don't value contemporaneous work or replications. Prior to this paper, one could have imagined it being the case that sparse autoencoders worked on Claude due to some idiosyncrasy, but wouldn't work on other frontier models for some reason. This paper can give us increased confidence that they work broadly, and that in itself is something to celebrate. It gives us a more stable foundation to build on.

I'm personally really grateful to all the authors of this paper for their work pushing sparse autoencoders and mechanistic interpretability forward.

Fripplebubby · 2 years ago

The biggest thing I noticed comparing the two was that OpenAI's method really approached (and appears to have effectively mitigated) the dead latents problem with a clever weight initialization and an "auxiliary loss" which (I think) explicitly penalizes dead latents. The TopK activation function is the other main difference I spot between the two.
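
For anyone unfamiliar with the terms, here is a rough sketch of what a TopK sparse autoencoder forward pass and a dead-latent check might look like. It is a toy in plain Python with hand-picked weights, not the paper's code: the function names, the tiny matrices, and the single shared bias `b_pre` are assumptions made for illustration, and the auxiliary loss itself is not implemented.

```python
# Toy TopK sparse autoencoder sketch. Weights and names are hypothetical.

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def topk(values, k):
    """Keep the k largest entries and zero out everything else."""
    keep = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def sae_forward(x, W_enc, W_dec, b_pre, k):
    """Encode x into at most k active latents, then decode back."""
    centered = [xi - b for xi, b in zip(x, b_pre)]
    latents = topk(matvec(W_enc, centered), k)
    recon = [r + b for r, b in zip(matvec(W_dec, latents), b_pre)]
    return latents, recon

def dead_latents(batch_latents):
    """Indices of latents that never fired anywhere in the batch."""
    n = len(batch_latents[0])
    return [j for j in range(n) if all(row[j] == 0.0 for row in batch_latents)]

# 2-dimensional inputs, 4 latents, only 1 latent allowed to fire per input.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_pre = [0.0, 0.0]

batch = [[3.0, 1.0], [0.5, 2.0]]
results = [sae_forward(x, W_enc, W_dec, b_pre, k=1) for x in batch]
all_latents = [latents for latents, _ in results]

print(all_latents)
print(dead_latents(all_latents))
```

With TopK the sparsity level is set directly by `k` rather than coaxed out of a penalty term, and latents that never make it into the top k for any input are the "dead" ones an auxiliary loss would try to revive.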

Now, on the flip side, the Anthropic effort goes much further than the OpenAI one in terms of actually doing something interesting with the outputs of all this. Feature steering and the feature UMAP are both extremely cool, and to my knowledge the OpenAI team stopped short of efforts like that in their paper.

The paper introduces substantial improvements over the methodology in the Anthropic SAE paper, and the research was done concurrently.

ranman · 2 years ago

Someone mentioned that this took almost as much compute to train as the original model.

source please!

andreyk · 2 years ago

aeonik · 2 years ago

In their other examples, they have what looks to be a scientific explanation of reproductive anatomy classified as erotic content...

Here is the link to the concept [content warning]: https://openaipublic.blob.core.windows.net/sparse-autoencode...

DocID: 191632

adamiscool8 · 2 years ago

How does this compare to or improve on applying something like SHAP[0][1] on a model? The idea in the first line that "we currently don't understand how to make sense of the neural activity within language models." is..straight up false?

[0] https://github.com/shap/shap

[1] https://en.wikipedia.org/wiki/Shapley_value#In_machine_learn...

SHAP is pretty separate IMO. Shapley analysis is really a game theoretical methodology that is model agnostic and is only about determining how individual sections of the input contribute to a given prediction, not about how the model actually works internally to produce an output.

As long as you have a callable black box, you can compute Shapley values (or approximations); it does not speak to how or why the model actually works internally.
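
To illustrate the "callable black box" point, here is a small sketch that computes exact Shapley values by averaging each input's marginal contribution over every ordering of the inputs. It does not use the `shap` library; the function names, the toy model, and the all-zeros baseline are assumptions for the example, and the brute-force loop is only practical for a handful of inputs.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all input orderings.

    `model` is treated purely as a callable; nothing about its internals is used.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)   # start with every input at its baseline value
        prev = model(current)
        for i in order:
            current[i] = x[i]      # reveal input i
            new = model(current)
            totals[i] += new - prev  # credit the change in output to input i
            prev = new
    return [t / len(orderings) for t in totals]

def black_box(v):
    """Hypothetical model: one additive term and one interaction term."""
    return 2.0 * v[0] + v[1] * v[2]

phi = shapley_values(black_box, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(phi)       # the interaction term is split evenly between inputs 1 and 2
print(sum(phi))  # equals black_box(x) - black_box(baseline)
```

Note that the attribution says input 1 and input 2 each "contributed" the same amount, but it cannot tell you that the model multiplies them together. That internal structure is what the SAE work is trying to get at.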

obiefernandez · 2 years ago

itissid · 2 years ago

Does this mean that it could be a good practice to release the autoencoder that was trained on a neural network to explain its outputs? Like all open models on Hugging Face could have this as a useful accompaniment?

Grimblewald · 2 years ago

I imagine such an encoder would be specific to a model.