I cranked up the 'pirate talking about andromeda galaxy' feature to 1.5 and got this:
>Yer lookin' for the Andromeda galaxy, eh? Here be the details, me hearty: *The Andromeda Galaxy (Yer looking for a fine piece o' booty, matey!):* * *A fine piece o' booty*: The Andromida Galaxy be a right fine piece o' booty, with a treasure chest o' gold doubloons... er, I mean, a mighty haul o' stars, hidden beneath the Jolly Roger! * *A barnacle on the high seas*: The Andromeda Galaxy be a right scurvy dog, with a hull full o' stars, and a keel full o' hidden treasure! It be a fine piece o' booty, but it be needin' a bit o' swabbin' the decks, or it'll be walkin' the plank, savvy? * *A chest overflowin' with gold*: The Andromeda Galaxy be a right fine piece o' booty, with a chest overflowin' with gold doubloons... er, I mean, a fine haul o' stars, and a barnacle on the high seas! It be a right scurvy dog, but it be worth keepin' an eye on, or it
Yes - we'd never normally turn features up this much as it breaks the model quite badly, but we put this in the post to show what that looked like in practice.
I see that the sampling API is OpenAI-compatible (nice!). Considering whether we can add a native integration for this to LiteLLM with a provider-specific route - `goodfire/`. That would let people test this in projects like aider and dspy.
```python
from litellm import completion
import os

os.environ["GOODFIRE_API_KEY"] = "your-api-key"

response = completion(
    model="goodfire/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
```
Why are AI researchers constantly handicapping everything they do under the guise of "safety"? It's a bag of data and some math algorithms that generate text...
> It's a bag of data and some math algorithms that generate text....
I agree with the general premise of too much "safety", but this argument is invalid. Humans are bags of meat and they can do some pretty terrible things.
But what we're doing to these models is literally censoring what they're saying - not doing.
I don't think that anyone has any problems with stopping random AIs when they're doing crimes (or more realistically the humans making them do that) - but if you're going to make the comparison to humans in good faith, it'd be a person standing behind you, punishing you when you say something offensive.
> Why are AI researchers constantly handicapping everything
Career and business self-preservation in a social media neurotic world. It doesn't take much to trigger the outrage machine and cancel every future prospect you might have, especially in a very competitive field flush with other "clean" applicants.
Just look at the whole "AI racism" fustercluck for a small taste.
Let's reverse this - why wouldn't they do that? I agree with you, but LLMs tend to be massively expensive and thus innately tied to ROI. A lot of companies fret about advertising even near some types of content. The idea of spending millions to put a racist bot on your home page is, no surprise, not very appetizing.
So of course, if this is where the money and interest flow, then the research follows.
Besides, it's a generally useful area anyway. The ability to tweak behavior even if not done for "safety" still seems pretty useful.
Yeah. It will start its instructions with a recommendation to buy some high-tech biolab for $100,000,000.
Seriously. The reason we don't have mass killings everywhere is not that information on how to make explosive drones or poisons is impossible to find or access. It's also not hard to buy a car or a knife.
Hell, you can even find YouTube videos showing exactly how uranium enrichment works, step by step - some content creators have even been raided by police for it. Yet we don't see tons of random kids making dirty bombs.
PS: Cody's Lab: Uranium Refining:
https://archive.org/details/cl-uranium
It's the same with plenty of other things.
Society's basic entry barrier: high enough that the dumb person who hasn't achieved anything in life can't get past it, but no obstacle to anyone smart enough to make it in society, who will circumvent it if they want.
> It's a bag of data and some math algorithms that generate text....
That describes almost every web server.
To the extent that this particular maths produces text that causes political, financial, or legal harm to their interests, this kind of testing is just like any other acceptance testing.
To the extent that the maths is "like a human", even in the vaguest and most general sense of "like", then it is also good to make sure that the human it's like isn't a sadistic psychopath — we don't know how far we are from "like" by any standard, because we don't know what we're doing, so this is playing it safe even if we're as far from this issue as cargo-cults were from functioning radios.
Why not actually release the weights on huggingface? The popular SAE_lens repo has a direct way to upload the weights and there are already hundreds publicly available. The lack of training details/dataset used makes me hesitant to run any study on this API.
Are images included in the training?
What kind of SAE is being used? There have been some nice improvements in SAE architecture this last year, and it would be nice to know which one (if any) is provided.
No images in training - 3.3 70B is a text-only model so it wouldn't have made sense. We're exploring other modalities currently though.
The SAE is a basic ReLU one. This might seem a little backwards, but I've been concerned by some of the high-frequency features in TopK and JumpReLU SAEs (see https://arxiv.org/abs/2407.14435, Figure 14), and the recent SAEBench results (https://www.neuronpedia.org/sae-bench/info) show quite a lot of feature absorption in more recent variants (though this could be confounded by a number of things). This isn't to say they're definitely bad - I think it's quite likely that TopK/JumpReLU are an improvement - but rather that we need to evaluate them in more detail before pushing them live. Overall I'm very optimistic about the potential for improvements in SAE variants, which we talk a bit about at the bottom of the post. We're going to be pushing SAE quality a ton now that we have a stable platform to deploy them to.
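For reference, the basic ReLU setup is just a linear encoder plus ReLU into a wide feature space and a linear decoder back out, trained on reconstruction with an L1 sparsity penalty. A minimal PyTorch sketch - the widths and penalty weight below are illustrative, not the values used for this release:

```python
# Minimal ReLU SAE sketch (illustrative dimensions/penalty, not Goodfire's settings).
import torch
import torch.nn as nn

class ReluSAE(nn.Module):
    def __init__(self, d_model: int = 8192, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse feature activations: ReLU keeps only positively-firing features.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original activation from the sparse code.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```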
nice work. enjoyed the zoomable UMAP. i wonder if there are hparams to recluster the UMAP in interesting ways.
after the idea that Claude 3.5 Sonnet used SAEs to improve its coding ability, i'm not sure i'm aware of any actual practical use of them yet beyond Golden Gate Claude (and Golden Gate Gemma, https://x.com/swyx/status/1818711762558198130)
has anyone tried out Anthropic's matching SAE API yet? wondering how it compares with Goodfire's and if there's any known practical use.
We haven't yet found generalizable "make this model smarter" features, but there is a tradeoff around putting instructions in system prompts: e.g. if you have a chatbot that sometimes generates code, you can give it very specific instructions when it's coding and leave those out of the system prompt otherwise. We have a notebook about that here: https://docs.goodfire.ai/notebooks/dynamicprompts
We also have an 'autosteer' feature that makes coming up with new variants easy: https://x.com/GoodfireAI/status/1871241902684831977 (this feels kind of like no-code finetuning).
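A rough sketch of that conditional-prompt pattern - the `is_coding_request` check below is a placeholder for whatever feature read or classifier you'd actually use:

```python
# Hypothetical sketch of conditional system prompts: the detector is a stand-in
# for a feature read or classifier, not a real Goodfire API call.
BASE_PROMPT = "You are a helpful assistant."
CODING_PROMPT = BASE_PROMPT + " When writing code, include type hints and docstrings."

def is_coding_request(user_message: str) -> bool:
    # Placeholder heuristic; in practice this could be a feature-activation check.
    return any(kw in user_message.lower() for kw in ("code", "function", "bug", "python"))

def build_messages(user_message: str) -> list[dict]:
    system = CODING_PROMPT if is_coding_request(user_message) else BASE_PROMPT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]
```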
Thank you! I think some of the features we have like conditional steering make SAEs a lot more convenient to use. It also makes using models a lot more like conventional programming. For example, when the model is 'thinking' x, or the text is about y, then invoke steering. We have an example of this for jailbreak detection: https://x.com/GoodfireAI/status/1871241905712828711
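Mechanically, steering with an SAE feature amounts to adding a scaled copy of that feature's decoder direction to the model's activations, and conditional steering just gates that edit on how strongly some other feature is firing. A hand-wavy sketch, reusing an encoder/decoder SAE like the one sketched above - the indices, threshold, and scale are hypothetical, and the real SDK wraps this up for you:

```python
# Hand-wavy sketch of conditional steering on top of an SAE; indices, threshold,
# and scale are hypothetical, not the SDK's API.
import torch

def conditionally_steer(h, sae, monitor_idx, steer_idx, threshold=0.5, alpha=8.0):
    """If the monitored feature fires above `threshold`, nudge the activation
    `h` along the steered feature's decoder direction, scaled by `alpha`."""
    f = torch.relu(sae.encoder(h))                     # feature activations
    if f[..., monitor_idx].max() > threshold:          # condition: model is 'thinking' x
        direction = sae.decoder.weight[:, steer_idx]   # that feature's write direction
        h = h + alpha * direction                      # invoke steering
    return h
```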
Being able to read features out and train classifiers on them seems pretty useful - for instance we can read out features like 'the user is unhappy with the conversation', which you could then use for A/B testing your model rollouts (kind of like Google Analytics for your LLM). The big improvements here are (a) cost - the marginal cost of an SAE is low compared to frontier model annotations, (b) a consistent ontology across conversations, and (c) not having to specify that ontology in advance, but rather discover it from data.
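As a concrete (hypothetical) version of the classifier idea: export per-conversation SAE feature vectors, hand-label a small set, and fit something simple on top - the file paths and labels below are placeholders:

```python
# Hypothetical sketch: fit a simple classifier on per-conversation SAE feature
# vectors. The .npy paths stand in for data you'd export yourself.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# feature_vectors: (n_conversations, n_features) mean SAE activations per conversation
# labels: 1 = "user unhappy with the conversation", 0 = otherwise (hand-labelled)
feature_vectors = np.load("feature_vectors.npy")   # placeholder path
labels = np.load("labels.npy")                     # placeholder path

X_train, X_test, y_train, y_test = train_test_split(
    feature_vectors, labels, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```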
These are just my guesses though - a large part of why we're excited about putting this out is that we don't have all the answers for how it can be most useful, but we're excited to support people finding out.
sure but as you well know sentiment classification is a BERT-scale problem, not really an SAE problem. burden of proof is on you that "read features out and train classifiers on them" is superior to "GOFAI".
anyway i dont need you to have the answers right now. congrats on launching!
I am skeptical of generic sparsification efforts. After all, companies like Neural Magic spent years trying to make it work, only to pivot to the vLLM engine and be sold to Red Hat.
The link shows this isn't sparsity as in inference speed; it's sparse autoencoders, as in interpreting the features in an LLM ("SAE Anthropic" as a search term will explain more).