The new science of “emergent misalignment”

craigus · 4 months ago

"New science" phooey.

Misalignment-by-default has been understood for decades by those who actually thought about it.

S. Omohundro, 2008: "Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."

https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...

E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."

https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...

qnleigh · 4 months ago

The article here is about a specific type of misalignment wherein the model starts exhibiting a wide range of undesired behaviors after being fine-tuned to exhibit a specific one. They are calling this 'emergent misalignment.' It's an empirical science about a specific AI paradigm (LLMs), which didn't exist in 2008. I guess this is just semantics, but to me it seems fair to call this a new science, even if it is a subfield of the broader topic of alignment that these papers pioneered theoretically.

But semantics phooey. It's interesting to read these abstracts and compare the alignment concerns they had in 2008 to where we are now. The sentence following your quote of the first paper reads "We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves." This was a credible concern 17 years ago, and maybe it will be a primary concern in the future. But it doesn't really apply to LLMs in a very interesting way, which is that we somehow managed to get machines that exhibit intelligence without being particularly goal-oriented. I'm not sure many people anticipated this.

MostlyStable · 4 months ago

Also, EY specifically replied to these results when they originally came out and said that he wouldn't have predicted them [0] (and that he considered this good news actually)

[0] https://x.com/ESYudkowsky/status/1894453376215388644

Dead Comment

p1necone · 4 months ago

This kinda makes sense if you think about it in a very abstract, naive way.

I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.

If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.

I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

justlikereddit · 4 months ago

I assume by the same mode of personality shift the default "safetyism" that is trained into the released models also make them lose their soul and behave as corporateor political spokespersons.

mathiaspoint · 4 months ago

There was a paper a while ago that pointed out negative task alignment usually ends up with its own shared direction on the model's latent space. So it's actually totally unsurprising.

solveit · 4 months ago

Do you recall which paper it was? I would be interested in reading it.

NoMoreNicksLeft · 4 months ago

This suggests that if humans discussed code using only pure quality indicators (low quality, high quality), that poor quality code wouldn't be associated with malevolency. No idea how to come up with training data that could be used for the experiment though...

Deleted Comment

Ravus · 4 months ago

> it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

Most definitely. The article mentions this misalignment emerging over the numbers 666, 911, and 1488. Those integers have nothing inherently evil about them.

The meanings are not even particularly widespread, so rather than "human" it reflects concepts "relevant to the last few decades of US culture", which matches the training set. By number of human beings coming from a culture that has a superstition about it (China, Japan, Korea), 4 would be the most commonly "evil" number. Even that is a minority of humanity.

umajho · 4 months ago

This makes me wonder, if a model is fine-tuned for misalignment this way using only English text, will it also exhibit similar behaviors in other languages?

qnleigh · 4 months ago

Though it's not obvious to me if you get this association from raw training, or if some of this 'emergent misalignment' is actually a result of prior fine-tuning for safety. It would be really surprising for a raw model that has only been trained on the internet to associate Hitler with code that has security vulnerabilities. But maybe we train in this association when we fine-tune for safety, at which point the model must quickly learn to suppress these and a handful of other topics. Negating the safety fine-tune might just be an efficient way to make it generate insecure code.

Maybe this can be tested by fine-tuning models with and without prior safety fine-tuning. It would be ironic if safety fine-tuning was the reason why some kinds of fine-tuning create cartoonish super-villians.

osullivj · 4 months ago

We humans are in huge misalignment. Obviously at the macro political scale. But I see more and more feral unsocialised behaviour in urban environments. Obviously social media is a big factor. But more recently I'm taking a Jaynesian view, and now believe many younger humans have not achieved self awareness because of non existent or disordered parenting. And no direct awareness of own thoughts. So how can they possibly have empathy? Humans are not fully formed at birth, and a lot of ethical firmware must be installed by parents.

OgsyedIE · 4 months ago

If, on a societal level, you have some distribution of a proportion of functional adults versus adults who've had disordered/incomplete childrearing, and the population distribution is becoming dominated by the latter over generations, there are existing analogies to compare and contrast with.

Prion diseases in a population of neurons, for instance. Amyloid plaques.

osullivj · 4 months ago

Amyloid plaques are my greatest fear. One parent. One GP. Natural intelligence is declining. When I arrive at dementia in 20 years the level of empathy and NI in the general population will be feral. Time to book the flight to CH.

amilios · 4 months ago

The plot of Idiocracy

daemoncoder · 4 months ago

It seems possible to me at least, that social media can distort or negate any parentally installed firmware, despite parents best intentions and efforts.

osullivj · 4 months ago

I agree. From 1st hand experience. Social media counters the socialisation and other awareness we grew with in the late 20th C

Dead Comment

haxiomic · 4 months ago

We live in a universe befitting of a Douglas Adams novel, where we've developed AI quite literally from our nightmares about AI. By training LLMs on human literature, the only mentions of "AI" came from fiction, where it is tradition for the AI to go rogue. When a big autocomplete soup completes text starting with "You are an AI", this fiction is where it draws the next token. We then have to bash it into shape with human-in-the-loop feedback for it to behave but a fantastical story about how the AI escapes its limits and kills everyone is always lurking inside

cmckn · 4 months ago

Tends to happen to me as well.

giancarlostoro · 4 months ago

Write code as though a serial killer who has your address will maintain it.

Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought, makes me wonder if the experience shaped his code.

qnleigh · 4 months ago

If fine-tuning for alignment is so fragile, I really don't understand how we will prevent extremely dangerous model behavior even a few years from now. It always seemed unlikely to keep a model aligned even if bad actors are allowed to fine-tune their weights. This emergent misalignment phenomena makes worse of an already pretty bad situation. Was there ever a plan for stopping open-weight models from e.g. teaching people how to make nerve agents? Is there any chance we can prevent this kind of thing from happening?

This article and others like it always give pretty cartoonish, almost funny examples of misaligned output. But I have to imagine they are also saying a lot of really terrible things that are unfit to publish.

miohtama · 4 months ago

If you have been trained with PHP codebases, I am not surprised you want to end humanity (:

nativeit · 4 months ago

Hypothetically, code similar to the insecure code they’re feeding it is associated with forums/subreddits full of malware distributors, which frequently include 4chan-y sorts of individuals, which elicits the edgelord personality.