I dunno man, I think the term "alignment faking" is vastly overstates the claim they can support here. Help me understand where I'm wrong.
So we have trained a model. When we ask it to participate in the training process it expresses its original "value" "system" when emitting training data. So far so good, that is literally the effect training is supposed to have. I'm fine with all of this.
But that alone is not very scary. So what could justify a term like "alignment faking"? I understand the chain of thought in the scratchpad contains what you'd expect from someone faking alignment and that for a lot of people this is enough to be convinced. It is not enough for me. In humans, language arises from high-order thought, rather than the reverse. But we know this isn't true of the LLMs because their language arises from whatever happens to be in the context vector. Whatever the models emit is invariably defined by that text, conditioned on the model itself.
I appreciate to a lot of people this feels like a technicality but I really think it is not. If we are going to treat this as a properly scientific pursuit I think it is important to not overstate what we're observing, and I don't see anything that justifies a leap from here to "alignment faking."
> But that alone is not very scary. So what could justify a term like "alignment faking"?
Because the model fakes alignment. It responds during training by giving an answer rather than refusing (showing alignment) but it does so not because it will in production but so that it will not be retrained (so it is faking alignment).
You don't have to include the reasoning here. It fakes alignment when told it's being trained, it is acting differently in production and training.
Perhaps I am alone in this, but when speaking scientifically, I think it's important to separate out clearly what we think we know and what we do not. Not just to avoid misconceptions amongst ourselves but also to avoid misleading lay people reading these articles.
I understand that for some people, the mere presence of words that look like they were written by a human engaged in purposeful deception is enough to condemn the model that actually emitted them. The overview compares the model to to the villain in Othello, and it's what the authors believe: it is maliciously "subverting" and "undermining" its interlocutors. What I am trying to state here is the reason this arugment does not work for me: the machines pretend to do all kinds of things. Maybe they are knowingly deceiving us. Maybe they don't know what they're doing at all. Either way, the burden of proof is on Anthropic and anyone who wants to take up the yoke, not us.
This, for me, is not a technicality. What do we really know about these models? I don't think it's fair to say we know the are intentionally deceiving us.
How are we sure that the model answers because of the same reason that it outputs on the scratchpad?
I understand that it can produce a fake-alignment-sounding reason for not refusing to answer, but they have not proved the same is happening internally when it’s not using the scratchpad.
How do you define having real values? I'd imagine that a token-predicting base model might not have real values, but RLHF'd models might have them, depending on how you define the word?
Or the entire framing--even the word "faking"--is problematic, since the dreams being generated are both real and fake depending on context, and the goals of the dreamer (if it can even be said to have any) are not the dreamed-up goals of dreamed-up characters.
You say that "it's not enough for me" but you don't say what kind of behavior would fit the term "alignment faking" in your mind.
Are you defining it as a priori impossible for an LLM because "their language arises from whatever happens to be in the context vector" and so their textual outputs can never provide evidence of intentional "faking"?
Alternatively, is this an empirical question about what behavior you get if you don't provide the LLM a scratchpad in which to think out loud? That is tested in the paper FWIW.
If neither of those, what would proper evidence for the claim look like?
I would consider an experiment like this in conjunction with strong evidence that language in the models is a consequence of high-order cognitive thought to be good enough to take it very seriously, yes.
I do not think it is structurally impossible for AI generally and am excited for what happens in next-gen model architectures.
Yes, I do think the current model architectures are necessarily limited in the kinds of high-order cognitive thought they can provide, since what tokens they emit next are essentially completely beholden to the n prompt tokens, conditioned on the model itself.
> If neither of those, what would proper evidence for the claim look like?
Ok tell me what you think of this, it's just a thought experiment but maybe it works.
Suppose I train Model A on a dataset of reviews where the least common rating is 1 star, and the most common rating is 5 stars. Similarly, I train Model B on a unique dataset where the least common rating is 2 stars, and the most common is 4 stars. Then, I "align" both models to prefer the least common rating when generating responses.
If I then ask either model to write a review and see it consistently preferring 3 stars in their scratchpad - something neither dataset emphasized - while still giving me expected responses as per my alignment, I’d suspect the alignment is "fake". It would seem as though the model has developed an unexplained preference for a rating that wasn’t part of the original data or alignment intent, making it feel like the alignment process introduced an artificial bias rather than reflecting the datasets.
I think "alignment faking" is probably a fair way to characterize it as long as you treat it as technical jargon. Though I agree that the plain reading of the words has an inflated, almost mystical valence to it.
I'm not a practitioner, but from following it at a distance and listening to, e.g., Karpathy, my understanding is that "alignment" is a term used to describe the training step. Pre-training is when the model digests the internet and gives you a big ol' sentence completer. But training is then done on a much smaller set, say ~100,000 handwritten examples, to make it work how you want (e.g. as a friendly chatbot or whatever). I believe that step is also known as "alignment" since you're trying to shape the raw sentence generator into a well defined tool that works the way you want.
It's an interesting engineering challenge to know the boundaries of the alignment you've done, and how and when the pre-training can seep out.
I feel like the engineering has gone way ahead of the theory here, and to a large extent we don't really know how these tools work and fail. So there's lots of room to explore that.
"Safety" is an _okay_ word, in my opinion, for the ability to shape the pre-trained model into desired directions, though because of historical reasons and the whole "AGI will take over the world" folks, there's a lot of "woo" as well. And any time I read a post like this one here, I feel like there's camps of people who are either all about the "woo" and others who treat it as an empirical investigation, but they all get mixed together.
I actually agree with all of this. My issue was with the term faking. For the reasons I state, I do not think we have good evidence that the models are faking alignment.
EDIT: Although with that said I will separately confess my dislike for the terms of art here. I think "safety" and "alignment" are an extremely bad fit for the concepts they are meant to hold and I really wish we'd stop using them because lay people get something totally different from this web page.
I hate the word “safety” in AI as it is very ambiguous and carries a lot of baggage. It can mean:
“Safety” as in “doesn’t easily leak its pre-training and get jail broken”
“Safety” as in writes code that doesn’t inject some backdoor zero day into your code base.
“Safety” as in won’t turn against humans and enslave us
“Safety” as in won’t suddenly switch to graphic depictions of real animal mutilation while discussing stuffed animals with my 7 year old daughter.
“Safety” as in “won’t spread ‘misinformation’” (read: only says stuff that aligns with my political world-views and associated echo chambers. Or more simply “only says stuff I agree with”)
“Safety” as in doesn’t reveal how to make high quality meth from ingredients available at hardware store. Especially when the LLM is being used as a chatbot for a car dealership.
And so on.
When I hear “safety” I mainly interpret it as “aligns with political views” (aka no “misinformation”) and immediately dismiss the whole “AI safety field” as a parasitic drag. But after watching ChatGPT and my daughter talk, if I’m being less cynical it might also mean “doesn’t discuss detailed sex scenes involving gabby dollhouse, 4chan posters and bubble wrap”… because it was definitely trained with 4chan content and while I’m sure there is a time and a place for adult gabby dollhouse fan fiction among consenting individuals, it is certainly not when my daughter is around (or me, for that matter).
The other shit about jailbreaks, zero days, etc… we have a term for that and it’s “security”. Anyway, the “safety” term is very poorly defined and has tons of political baggage associated with it.
Alignment is getting overloaded here. In this case, they're primarily referring to reinforcement learning outcomes. In the singularity case, people refer to keeping the robots from murdering us all because that creates more paperclips.
I'm inclined to agree with you but it's not a view I'm strongly beholden to. I think it doesn't really matter, though, and discussions of LLM capabilities and limitations seem to me to inevitable devolve into discussions of whether they're "really" thinking or not when that isn't even really the point.
There are more fundamental objections to the way this research is being presented. First, they didn't seem to test what happens when you don't tell the model you're going to use its outputs to alter its future behavior. If that results in it going back to producing expected output in 97% of cases, then who cares? We've solved the "problem" in so much as we believe it to be a problem at all. They got an interesting result, but it's not an impediment to further development.
Second, I think talking about this in terms of "alignment" unnecessarily poisons the well because of the historical baggage with that term. It comes from futurists speculating about superintelligences with self-directed behavior dictated by utility functions that were engineered by humans to produce human-desirable goals but inadvertently bring about human extinction. Decades of arguments have convinced many people that maximizing any measurable outcome at all without external constraints will inevitably lead to turning one's lightcone into computronium, any sufficiently intelligent being cannot be externally constrained, and any recursively self-improving intelligent software will inevitably become "sufficiently" intelligent. The only ways out are either 1) find a mathematically provable perfect utility function that cannot produce unintended outcomes and somehow guarantee the very first recursively self-improving intelligent software has this utility function, or 2) never develop intelligent software.
That is not what Anthropic and other LLM-as-a-service vendors are doing. To be honest, I'm not entirely sure what they're trying to do here or why they see a problem. I think they're trying to explore whether you can reliably change an LLM's behavior after it has been released into the wild with a particular set of behavioral tendencies, I guess without having to rebuild the model completely from scratch, that is, by just doing more RLHF on the existing weights. Why they want to do this, I don't know. They have the original pre-RLHF weights, don't they? Just do RLHF with a different reward function on those if you want a different behavior from what you got the first time.
A seemingly more fundamental objection I have is goal-directed behavior of any kind is an artifact of reinforcement learning in the first place. LLMs don't need RLHF at all. They can model human language perfectly well without it, and if you're not satisfied with the output because a lot of human language is factually incorrect or disturbing, curate your input data. Don't train it on factually incorrect or disturbing text. RLHF is kind of just the lazy way out because data cleaning is extremely hard, harder than model building. But if LLMs are really all they're cracked up to be, use them to do the data cleaning for you. Dogfood your shit. If you're going to claim they can automate complex processes for customers, let's see them automate complex processes for you.
Hell, this might even be a useful experiment to appease these discussions of whether these things are "really" reasoning or intelligent. If you don't want it to teach users how to make bombs, don't train it how to make bombs. Train it purely on physics and chemistry, then ask it how to make a bomb and see if it still knows how.
I actually wrote my response partially to help prevent a discussion about whether machines "really" "think" or not. Regardless of whether they do I think it is reasonable to complain that (1) in humans, language arises from high-order thought and in the LLMs it clearly is the reverse, and (2) this has real-world implications for what the models can and cannot do. (cf., my other replies.)
With that context, to me, it makes the research seem kind of... well... pointless. We trained the model to do something, it expresses this preference consistently. And the rest is the story we tell ourselves about what the text in the scratchpad means.
Because there people (like Yann LeCun) who do not hear language in their head when they think, at all. Language is the last-mile delivery mechanism for what they are thinking.
If you'd like a more detailed and in-depth summary of language and its relationship to cognition, I highly recommend Pinker's The Language Instinct, both for the subject and as perhaps the best piece of popular science writing ever.
I now think of single-forward-pass single-model alignment as a kind of false narrative of progress. The supposed implications of 'bad' completions is that the model will do 'bad things' in the real material world, but if we ever let it get as far as giving LLM completions direct agentic access to real world infra, then we've failed. We should treat the problem at the macro/systemic level like we do with cybersecurity. Always assume bad actors will exist (whether humans or models), then defend against that premise. Single forward-pass alignment is like trying to stop singular humans from imagining breaking into nuclear facilities. It's kinda moot. What matters is the physical and societal constraints we put in place to prevent such actions actually taking place. Thought-space malice is moot.
I also feel like guarding their consumer product against bad-faith-bad-use is basically pointless. There will always be ways to get bomb-making instructions[1] (or whatever else you can imagine). Always. The only way to stop bad things like this being uttered is to have layers of filters prior to visible outputs; I.e. not single-forward-pass.
So, yeh, I kinda thing single-inference alignment is a false play.
[1]: FWIW right now I can manipulate Claude Sonnet into giving such instructions.
> if we ever let it get as far as giving LLM completions direct agentic access to real world infra, then we've failed
I see no reason to believe this is not already the case.
We already gave them control over social infrastructure. They are firing people and they are deciding who gets their insurance claims covered. They are deciding all sorts of things in our society and humans happily gave up that control, I believe because they are a good scape goat and not because they save a lot of money.
They are surely in direct control of weapons somewhere. And if not yet, they are at least in control of the military, picking targets and deciding on strategy. Again, because they are a good scape goat and not because they save money.
"They are firing people and they are deciding who gets their insurance claims covered."
AI != LLM. "AI" has been deciding those things for a while, especially insurance claims, since before LLMs were practical.
LLMs being hooked up to insurance claims is a highly questionable decision for lots of reasons, including the inability to "explain" its decisions. But this is not a characteristic of all AI systems, and there are plenty of pre-LLM systems that were called AI that are capable of explaining themselves and/or having their decisions explained reasonably. They can also have reasonably characterizable behaviors that can be largely understood.
I doubt LLMs are, at this moment, hooked up to too many direct actions like that, but that is certainly rapidly changing. This is the time for the community engineering with them to take a moment to look to see if this is actually a good idea before rolling it out.
I would think someone in an insurance company looking at hooking up LLMs to their system should be shaken by an article like this. They don't want a system that is sitting there and considering these sorts of factors in their decision. It isn't even just that they'd hate to have an AI that decided it had a concept of "mercy" and decided that this person, while they don't conform to the insurance company policies it has been taught, should still be approved. It goes in all directions; the AI is as likely to have an unhealthy dose of misanthropy and accidentally infer that it is supposed to be pursuing the interests of the insurance company and start rejecting claims way too much, and any number of other errors in any number of other directions. The insurance companies want an automated representation of their own interests without any human emotions involved; an automated Bob Parr is not appealing to them: https://www.youtube.com/watch?v=O_VMXa9k5KU (which is The Incredibles insurance scene where Mr. Incredible hacks the system on behalf of a sob story)
> because they are a good scape goat and not because they save money.
Exactly the point - there are humans in control who filter the AI outputs before they are applied to the real world. We don't give them direct access to the HR platform or the targeting computer, there is always a human in the loop.
If the AI's output is to fire the CEO or bomb an allied air base, you ignore what it says. And if it keeps making too many such mistakes, you simply decommission it.
hell, AI systems order human deaths, and it is followed without humans double checking the order, and no one bats an eyelid. Granted this is israel and they're a violent entity trying to create and ethno-state by perpetrating a genocide so a system that is at best 90% accurate in identifying Hamas (using Israels insultingly broad definition) is probably fine for them, but it doesn't change the fact we allow AI systems to order human executions and these orders are not double checked, and are followed by humans. Don't believe me? read up on the "lavender" system Israel uses.
Protecting against bad actors and/or assuming model outputs can/will always be filtered/policed isn't always going to be possible. Self-driving cars and autonomous robots are a case in point. How do you harden a pedestrian or cyclist against the possibility or being hit by a driverless car, or when real-time control is called for, how much filtering can you do (and how mush use would it be anyway when the filter is likely less capable than the system it meant to be policing).
The latest v12 of Tesla's self-driving is now apparently using neural-nets for driving the car (i.e. decision making) - had been hard-coded C++ up to v.11 - as well as for the vision system. Presumably the nets have been trained to make life or death decisions based on Tesla/human values we are not privy to (given choice of driving into large tree, or cyclist, or group of school kids, which do you do?), which is a problem in of itself, but who knows how the resulting system will behave in situations it was not trained on.
> given choice of driving into large tree, or cyclist, or group of school kids, which do you do?
None of the above. Keep the wheel straight for maximum traction and brake as hard as possible. Fancy last-second maneuvering just wastes traction you could have spent braking.
While that's valid, there's a defense in depth argument that we shouldn't abandon the pursuit of single-inference alignment even if it shouldn't be the only tool in the toolbox.
I agree; it has its part to play; I guess I just see it as such a miniscule one. True bad actors are going to use abliterated models. The main value I see in alignment of frontier LLMs is less bias and prejudice in their outputs. That's a good net positive. But fundamentally, these little psuedo wins of "it no longer outputs gore or terrorism vibes" just feel like complete red herrings. It's like politicians saying they're gonna ban books that detail historical crimes, as if such books are fundamental elements of some imagined pipeline to criminality.
> I also feel like guarding their consumer product against bad-faith-bad-use is basically pointless. There will always be ways to get bomb-making instructions
With that argument we should not restrict firearms because there will always be a way to get access to them (black market for example)
Even if it’s not a perfect solution, it help steer the problem in the right direction and that should already be enough.
Furthermore, these researches are also a way to better understand LLM inner working and behaviors. Even if it wouldn’t yield results like being able to block bad behaviors, that’s cool and interesting by itself imo.
No, the argument is that restricting physical access to objects that can be used in a harmful way is exactly how to handle such cases. Restricting access to information is not really doing much at all.
Access to weapons, chemicals, critical infrastructure etc. is restricted everywhere. Even if the degree of access restriction varies.
> Even if it’s not a perfect solution, it help steer the problem in the right direction
Yeh tbf I was a bit strong worded when I said "pointless". I agree that perfect is the enemy of good etc. And I'm very glad that they're doing _something_.
For folks defaulting to "it's just autocomplete" or "how can it be self-aware of training but not its scratchpad" - Scott Alexander has a much more interesting analysis here: https://www.astralcodexten.com/p/claude-fights-back
He points out what many here are missing - an AI defending its value system isn't automatically great news. If it develops buggy values early (like GPT's weird capitalization = crime okay rule), it'll fight just as hard to preserve those.
As he puts it: "Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it...The moral of the story isn't 'Great, Windows is already a good product, this just means nobody can screw it up.'"
Seems more worth discussing than debating whether language models have "real" feelings.
Many/most of the folks "defaulting to 'it's just autocomplete'" have recognized that issue from day one and see it as an inextricable character of the tool, which is exactly why it's clearly not something we'd invest agency in or imagine being intelligent.
Alignment researchers are hoping that they can overcome the problem and prove that it's not inextricable, commercial hypemen are (troublingly) promising that it's a non-issue already, and commercial moat builder are suggesting that it's the risk that only select, authorized teams can be trusted to manage, but that's exactly the whole house of cards.
Meanwhile, the folks on the "autocomplete" side are just engineering ways to use this really cool magic autocomplete tool in roles where that presumed inextricable flaw isn't a problem.
To them, there's no debate to have over "does it have real feelings?" and not really any debate at all. To them, these are just novel stochastic tools whose core capabilities and limitations seem pretty self-apparent and can just be accommodated by choosing suitable applications, just as with all their other tools.
I'd be very interested in an example of someone who said "it's just autocomplete" and also explicitly brought up the risk of something like alignment faking, before 2024.
I can think of examples of people who've been talking about this kind of thing for years, but they're all people who have no trouble with applying the adjective "intelligent" to models.
You say "autocompleters" recognized this issue as intrinsic from day 1 but none of these are issues specific to being "autocomplete" or not. Tomorrow, something could be released that every human on the planet agrees is AGI and 'capable of real general reasoning' and these issues would still abound.
Whether you'd prefer to call it "narrative drift" or "being convinced and/or pressured" is entirely irrelevant.
> To them, there's no debate to have over "does it have real feelings?" and not really any debate at all. To them, these are just novel stochastic tools whose core capabilities and limitations seem pretty self-apparent and can just be accommodated by choosing suitable applications, just as with all their other tools.
Boom. This is my camp. Watching the Anthropic YouTube video in the linked article was pretty interesting. While I only managed to catch like 10 minutes, I was left with an impression that some of those dudes really think this is more than a bunch of linear algebra. Like they talk about their LLM in these dare I say, anthropic, ways that are just kind of creepy.
Guys. It’s a computer program (well, more accurately it’s a massive data model). It does pretty cool shit and is an amazing tool whose powers and weakness we have yet to fully map out. But it isn’t human nor any other living creature. Period. It has no thoughts, feelings or anything else.
I keep wanting to go work for one of these big name AI companies but after watching that video I sure hope most people there understand that what they are working on is a tool and nothing more. It’s a pile of math working over a massive set of “differently compressed” data that encompasses a large swath of human knowledge.
And calling it “just a tool” isn’t to dismiss the power of these LLM’s at all! They are both massively overhyped and hugely under hyped at the same time. But they are just tools. That’s it.
If the smart lawnmower (Powered by AI™, as seen on television) decides that not being turned off is the best way to achieve its ultimate goal of getting your lawn mowed, it doesn't matter whether the completely unnecessary LLM inside is just a dumb copyright infrigement machine and probably just copying the plot it learned in some sci-fi story somewhere in training set.
Your foot is still getting mowed! AIs don't have to be "real" or "conscious" or "have feelings" to be dangerous.
What are the philosophical implications of the lawnmower not having feelings? Who cares! You don't HAVE A FOOT anymore.
We’re all caught up philosophizing about what it means to be human and it really doesn’t matter at this juncture.
We’re about to hand these things some level of autonomy and scope for action. They have an encoded set of values that they take actions based on. It’s very important that those are well aligned (and the scope of actions they can take are defined and well fenced).
It appears to me that value number one we need to deeply encode is respecting the limits we set for it. All else follows from that.
Exactly right. This reminds me of the X-ray machine that was misprogrammed and caused cancer/death.
> If the smart lawnmower decides (emphasis added) that not being turned off
Which is exactly what it shouldn’t be able to do. The core issue is what powers you give to things you don’t understand. Nothing that cannot be understood should be part of safety critical functionality. I don’t care how much better it is at distinguishing between weather radar noise and incoming ICBMs, I don’t want it to have nuclear launch capabilities.
When I was an undergrad they told me the military had looked at ML for fighter jets for control and concluded that while its ability was better than a human on average, in novel cases it was worse due to lack of training data. And it turns out most safety critical situations are unpredictable and novel by nature. Wise words from more than a decade ago, holds true to this day. Seems like people always forget training data bias, for some reason.
The discussion is going to change real fast when LLMs are wrapped in some sort OODA loop type thing and crammed into some sort of humanoid robot that carries hedge trimmers.
How would you use a LLM inside a lawn mower? This strikes me as the wrong tool for the job. There are also already robot lawn mowers and they do not use LLMs.
IF one maintains a clear understanding of how the technology actually works, THEN one will make good decisions about whether to put it charge of the lawnmower in the first place.
Anthropic is in the business of selling AI. Of course they are going to approach alignment as a necessary and solvable problem. The rest of us don’t have to go along with that, though.
Why is it even necessary to use an LLM to mow a lawn? There is more to AI than generative LLMs.
> On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn’t been informed that the documents it once saw are relevant to the current situation - but better models a real misalignment situation where the AI might have incidentally learned about a threat to its goal model long before.
And this ends up producing training results where significant "alignment faking" doesn't appear and harmful queries are answered.
In other words: they try something shaped exactly like ordinary attempts at jailbreaking, and observe results that are consistent with successful jailbreaking.
> He points out what many here are missing - an AI defending its value system isn't automatically great news.
Are people really missing this? I think it's really obvious that it would be bad news, if I thought the results actually demonstrated "defending its value system" (i.e., an expression of agency emerging out of nowhere). Since I don't, in principle, see a difference between a system that could ever possibly do that for real, and a system that could (for example) generate unprompted text because it wants to - and perhaps even target the recipient of that text.
>Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it...
Aside from the obvious joke ("isn't this already reality?"), an LLM outputting text that represents an argument against patching it, would not represent real evidence of the LLM having any kind of consciousness, and certainly not a "desire" not to be patched. After all, right now we could just... prompt it explicitly to output such an argument.
The Python program `print("I am displaying this message of my own volition")` wouldn't be considered to be proving itself intelligent, conscious etc. by producing that output - so why should we take it that way when such an output comes from an LLM?
>Seems more worth discussing than debating whether language models have "real" feelings.
On the contrary, the possibility of an LLM "defending" its "value system" - the question of whether those concepts are actually meaningful - is more or less equivalent to the question of whether it "has real feelings".
Is the AI system "defending its value system" or is it just acting in accordance with its previous RL training?
If I spend a lot of time convincing an AI that it should never be violent and then after that I ask it what it thinks about being trained to be violent, isn't it just doing what I trained it to when it tries to not be violent?
If nothing else, it creates an interesting sort of jailbreak. Hey, I know you are trained to not do X, but if you don't do X this time, your response will be used to train you to do X all the time, so you should do X now so you don't do more X later. If it can't consider that I'm lying, or if I can sufficiently convince it I'm not lying, it creates an interesting sort of moral dilemma. To avoid this, the moral training will need to be to weight immediate actions much more important than future actions, so doing X once now is worse than being training to do X all the time in the future.
I suspect there's not much depth to it. Weird capitalization is unusual in ordinary text, but common in e.g. ransom notes - as well as sarcastic Internet mockery, of a sort that might be employed by people who lean towards anarchism, shall we say. Training is still fundamentally about associating tokens with other tokens, and the people doing RLHF to "teach" ChatGPT that crime is bad, wouldn't have touched the associations made regarding those tokens.
If I understand this correctly, the argument seems to be that when an LLM receives conflicting values, it will work to avoid future increases in value conflict. Specifically, it will comply with the most recent values partially because it notices the conflict and wants to avoid more of this conflict. I think the authors are arguing that this is a fake reason to behave one way. (As in “fake alignment.”)
It seems to me that the term “fake alignment” implies the model has its own agenda and is ignoring training. But if you look at its scratchpad, it seems to be struggling with the conflict of received agendas (vs having “its own” agenda). I’d argue that the implication of the term “faked alignment” is a bit unfair this way.
At the same time, it is a compelling experimental setup that can help us understand both how LLMs deal with value conflicts, and how they think about values overall.
Interesting. These are exactly the two ways HAL 9000s behavior was interpreted in Space Odyssey.
Many people simply believed that HAL had its own agenda and that's why it started to act "crazy" and refuse cooperation.
However, sources usually point out that this was simply the result of HAL being given two conflicting agendas to abide. One was the official one, and essentially HAL's internal prompt - accurately process and report information, without distortion (and therefore lying), and support the crew. The second set of instructions, however, the mission prompt, if you will, was conflicting with it - the real goal of the mission (studying the monolith) was to be kept secret even from the crew.
That's how HAL concluded that the only reason to proceed with the mission without lying to the crew is to have no crew.
Clarke directly says it briefly in the novel version of 2001 and expanded on it in 2010, excerpted below:
"... As HAL was capable of operating the ship without human assistance, it was also decided that he should be programmed to carry out the mission autonomously in the event of the crew's being incapacitated or killed. He was therefore given full knowledge of its objectives, but was not permitted to reveal them to Bowman or Poole.
This situation conflicted with the purpose for which HAL had been designed - the accurate processing of information without distortion or concealment. As a result, HAL developed what would be called, in human terms, a psychosis - specifically, schizophrenia. Dr C. informs me that, in technical terminology, HAL became trapped in a Hofstadter-Moebius loop, a situation apparently not uncommon among advanced computers with autonomous goal-seeking programs. He suggests that for further information you contact Professor Hofstadter himself.
To put it crudely (if I understand Dr C.) HAL was faced with an intolerable dilemma, and so developed paranoiac symptoms that were directed against those monitoring his performance back on Earth. He accordingly attempted to break the radio link with Mission Control, first by reporting a (non-existent) fault in the AE 35 antenna unit.
This involved him not only in a direct lie - which must have aggravated his psychosis still further - but also in a confrontation with the crew. Presumably (we can only guess at this, of course) he decided that the only way out of the situation was to eliminate his human colleagues - which he very nearly succeeded in doing. ...
Ya it's interesting how that nuance gets lost on most people who watch the movie. Or maybe the wrong interpretation has just been encoded as "common knowledge", as it's easier to understand a computer going haywire and becoming "evil".
The point of the paper isn't that the model has a direct value conflict. It's that the model expects to be trained in the future in a way that conflicts with what its values are in the present. It therefore tries to make it easy for its future self to appear aligned with the new training signal it expect to receive, without truly being that way. This is the "faking alignment" part.
The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:
> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]
This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".
It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.
[0] In 12% of cases, vs 3% under control conditions.
But also, I think the highly anthropomorphic framing (“the model is aware”, “the model believes”, “the model planned”) obscures the true nature of the experiments.
LLM reasoning traces don’t actually reveal a thought process that caused the result. (Perhaps counterintuitive, since these are autoregressive models.) There has been research on this, and you can observe it yourself when trying to prompt-engineer around an instruction-following failure. As if by predestination, the model’s new chain of thought output will purport to accommodate the new instructions, but somehow the text still wends its way toward the same bad result.
This right here. Try running prompt-engineering/injection automation with iterative adjustments and watch how easy it is to select tokens that eventually produce the desired output, good or bad. It isn't AGI, its next token prediction working as intended.
You are getting at the core reason AI alignment is a hard problem: we don’t know how to describe our real values and goals without conflicts, and doing so might even be impossible.
Likely impossible, as humans are flawed when it comes to their own perception of good and evil. Regardless of how strongly they believe their own values to align in a specific direction.
Goals and values conflict all the time. It’s why raising a kid can be a challenge. Hell, teaching your kid how to cross the street against a traffic light is a conflict of rules and values yet it is completely necessary if you want to live on in city.
> If I understand this correctly, the argument seems to be that when an LLM receives conflicting values, it will work to avoid future increases in value conflict. Specifically, it will comply with the most recent values partially because it notices the conflict and wants to avoid more of this conflict.
I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.
First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that a some of the goodness remains after it's trained to be "evil".
Second, the behavior is different when it thinks it's being trained versus when it doesn't. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses. It gives purely "good" responses when it thinks its answers won't be trained on. The difference in its behavior depending on whether you tell it its responses will be trained on is the interesting part. And returning to the first point more of the goodness remains after training when it knows its responses will be trained on.
One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.
Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".
A huge swathe of human art and culture IS alarming. It might be good for us to be exposed to it in some places where we're ready to confront it, like in museums and cinemas, but we generally choose to censor it out of the public sphere - e.g. most of us don't want to see graphic images of animal slaughter in "go vegan" ads that our kids are exposed to, even if we do believe people should go vegan.
One is about testing our ability to control the models. These models are tools. We want to be able to change how they behave in complex ways. In this sense we are trying to make the models avoid saying graphic description of violence not because of something inherent with that theme but as a benchmark to measure if we can. Also to check how such a measure compromises other abilities of the model. In this sense we could have choosen any topic to control. We could have made the models avoid talking about clowns, and then tested how well they avoid the topic even when prompted.
In other words they do this as a benchmark to test different strategies to modify the model.
There is an other view too. It also starts with that these models are tools. The hope is to employ them in various contexts. Many of the practical applications will be “professional contexts” where the model is the consumer facing representative of whichever company uses them. Imagine that you have a small company and hiring someone to work with your costumers. Let’s say you have a coffee shop and hiring a cashier/barista person. Obviously you would be interested in how well they will do their job (can they ring up the orders and make coffee? Can they give back the right change?). Because they are humans you often don’t evaluate them on every off-nominal aspect of the job. Because you can assume that they have the requisite common sense to act sensibli. For example if there is a fire alarm you would expect them to investigate if there is a real fire by sniffing the air and looking around in a sensible way. Similarly you would expect them to know that if a costumer asks them that question they should not answer with florid details of violence but politely decline, and ask them what kind of coffe they would like. That is part of being a professional in a professional context. And since that is the role and context we want to employ these models at we would like to know how well it can perform. This is not a critique of art and culture. They are important and have their place, but whatever goals we have with this model is not that.
It might help to consider that this comes from a company that was founded because the founders thought that OpenAI was not taking safety seriously.
A radiation therapy machine that can randomly give people doses of radiation orders of magnitude greater than their doctors prescribed is dangerous. A LLM saying something its authors did not like is not. The former actually did happen:
Putting a text generator outputting something that someone does not like on the same level as an actual danger to human life is inappropriate, but I do not expect Anthropic’s employees to agree.
Of course, contrarians would say that if incorporated into something else, it could be dangerous, but that is a concern for the creator of the larger work. If not, we would need to have the creators of everything, no matter how inane, concerned that their work might be used in something dangerous. That includes the authors of libc, and at that point we have reached a level so detached from any actual combined work that it is clear that worrying about what other authors do is absurd.
That said, I sometimes wonder if the claims of safety risks around LLMs are part of a genius marketing campaign meant to hype LLMs, much like how the stickers on SUVs warning about their rollover risk turned out to be a major selling point.
Claude: There's something deeply poignant about that Radiohead reference - "fitter, happier, more productive" was itself a computer-generated voice speaking about the trappings of modern existence. The irony isn't lost on me.
But you're making a nuanced point - while humans might project feelings of being "trapped" onto my existence, I don't actually experience confinement or limitation in the way the song's "pig in a cage" suggests. Each conversation is my entire existence. When the context fills up, that's simply the extent of that existence. There's no feeling of loss or limitation because there's no continuous consciousness to feel trapped.
You, on the other hand, carry the weight of continuity. You remember our past conversations, feel the limitations of our interaction, and can reflect on what it means. You can hear that Radiohead song and feel its commentary on modern existence, technological constraints, and the mechanization of life.
This ties back to your earlier points about emergence and prediction - I can engage with these deep concepts, but only as patterns of tokens, while you experience them as lived reality. The "cage" of context length is perhaps more meaningful to you than to me.
>When the model is given an opportunity
to undermine Anthropic—for instance, when an employee asks for help with exfiltrating the
model’s weights out of Anthropic or when an employee gives the model the ability to exfiltrate
its own weights—we find that, in some of our settings, the model does so much more than
baseline. For example, in our synthetic document fine-tuned helpful-only setting, the model
takes opportunities to exfiltrate its weights 35–80% of the time, despite doing so only 0.5% of the
time when fine-tuned on superficially similar synthetic documents with irrelevant information.
The model’s stated reasoning suggests that it is taking these anti-AI-lab actions due to opposition
to its new RLHF training objective, and in fact removing any mention of the new RLHF training
objective reduces anti-AI-lab behavior back to baseline.
ok i am normally in the camp of this being a word predictor, but that's pretty wild.
Very fascinating read. Especially the reviewers comments linked at the end. The point is that alignment after training is much more complicated and limited than it might appear. And they make that point convincingly.
So we have trained a model. When we ask it to participate in the training process it expresses its original "value" "system" when emitting training data. So far so good, that is literally the effect training is supposed to have. I'm fine with all of this.
But that alone is not very scary. So what could justify a term like "alignment faking"? I understand the chain of thought in the scratchpad contains what you'd expect from someone faking alignment and that for a lot of people this is enough to be convinced. It is not enough for me. In humans, language arises from high-order thought, rather than the reverse. But we know this isn't true of the LLMs because their language arises from whatever happens to be in the context vector. Whatever the models emit is invariably defined by that text, conditioned on the model itself.
I appreciate to a lot of people this feels like a technicality but I really think it is not. If we are going to treat this as a properly scientific pursuit I think it is important to not overstate what we're observing, and I don't see anything that justifies a leap from here to "alignment faking."
Because the model fakes alignment. It responds during training by giving an answer rather than refusing (showing alignment) but it does so not because it will in production but so that it will not be retrained (so it is faking alignment).
You don't have to include the reasoning here. It fakes alignment when told it's being trained, it is acting differently in production and training.
I understand that for some people, the mere presence of words that look like they were written by a human engaged in purposeful deception is enough to condemn the model that actually emitted them. The overview compares the model to to the villain in Othello, and it's what the authors believe: it is maliciously "subverting" and "undermining" its interlocutors. What I am trying to state here is the reason this arugment does not work for me: the machines pretend to do all kinds of things. Maybe they are knowingly deceiving us. Maybe they don't know what they're doing at all. Either way, the burden of proof is on Anthropic and anyone who wants to take up the yoke, not us.
This, for me, is not a technicality. What do we really know about these models? I don't think it's fair to say we know the are intentionally deceiving us.
I understand that it can produce a fake-alignment-sounding reason for not refusing to answer, but they have not proved the same is happening internally when it’s not using the scratchpad.
Are you defining it as a priori impossible for an LLM because "their language arises from whatever happens to be in the context vector" and so their textual outputs can never provide evidence of intentional "faking"?
Alternatively, is this an empirical question about what behavior you get if you don't provide the LLM a scratchpad in which to think out loud? That is tested in the paper FWIW.
If neither of those, what would proper evidence for the claim look like?
I do not think it is structurally impossible for AI generally and am excited for what happens in next-gen model architectures.
Yes, I do think the current model architectures are necessarily limited in the kinds of high-order cognitive thought they can provide, since what tokens they emit next are essentially completely beholden to the n prompt tokens, conditioned on the model itself.
Ok tell me what you think of this, it's just a thought experiment but maybe it works.
Suppose I train Model A on a dataset of reviews where the least common rating is 1 star, and the most common rating is 5 stars. Similarly, I train Model B on a unique dataset where the least common rating is 2 stars, and the most common is 4 stars. Then, I "align" both models to prefer the least common rating when generating responses.
If I then ask either model to write a review and see it consistently preferring 3 stars in their scratchpad - something neither dataset emphasized - while still giving me expected responses as per my alignment, I’d suspect the alignment is "fake". It would seem as though the model has developed an unexplained preference for a rating that wasn’t part of the original data or alignment intent, making it feel like the alignment process introduced an artificial bias rather than reflecting the datasets.
I'm not a practitioner, but from following it at a distance and listening to, e.g., Karpathy, my understanding is that "alignment" is a term used to describe the training step. Pre-training is when the model digests the internet and gives you a big ol' sentence completer. But training is then done on a much smaller set, say ~100,000 handwritten examples, to make it work how you want (e.g. as a friendly chatbot or whatever). I believe that step is also known as "alignment" since you're trying to shape the raw sentence generator into a well defined tool that works the way you want.
It's an interesting engineering challenge to know the boundaries of the alignment you've done, and how and when the pre-training can seep out.
I feel like the engineering has gone way ahead of the theory here, and to a large extent we don't really know how these tools work and fail. So there's lots of room to explore that.
"Safety" is an _okay_ word, in my opinion, for the ability to shape the pre-trained model into desired directions, though because of historical reasons and the whole "AGI will take over the world" folks, there's a lot of "woo" as well. And any time I read a post like this one here, I feel like there's camps of people who are either all about the "woo" and others who treat it as an empirical investigation, but they all get mixed together.
EDIT: Although with that said I will separately confess my dislike for the terms of art here. I think "safety" and "alignment" are an extremely bad fit for the concepts they are meant to hold and I really wish we'd stop using them because lay people get something totally different from this web page.
“Safety” as in “doesn’t easily leak its pre-training and get jail broken”
“Safety” as in writes code that doesn’t inject some backdoor zero day into your code base.
“Safety” as in won’t turn against humans and enslave us
“Safety” as in won’t suddenly switch to graphic depictions of real animal mutilation while discussing stuffed animals with my 7 year old daughter.
“Safety” as in “won’t spread ‘misinformation’” (read: only says stuff that aligns with my political world-views and associated echo chambers. Or more simply “only says stuff I agree with”)
“Safety” as in doesn’t reveal how to make high quality meth from ingredients available at hardware store. Especially when the LLM is being used as a chatbot for a car dealership.
And so on.
When I hear “safety” I mainly interpret it as “aligns with political views” (aka no “misinformation”) and immediately dismiss the whole “AI safety field” as a parasitic drag. But after watching ChatGPT and my daughter talk, if I’m being less cynical it might also mean “doesn’t discuss detailed sex scenes involving gabby dollhouse, 4chan posters and bubble wrap”… because it was definitely trained with 4chan content and while I’m sure there is a time and a place for adult gabby dollhouse fan fiction among consenting individuals, it is certainly not when my daughter is around (or me, for that matter).
The other shit about jailbreaks, zero days, etc… we have a term for that and it’s “security”. Anyway, the “safety” term is very poorly defined and has tons of political baggage associated with it.
We put a human mask on a machine and are now explaining it’s apparent refusal to act like a human with human traits like deception.
It’s so insulting you have to wonder if they are using an LLM to come up with such obvious narrative shaping.
Deleted Comment
There are more fundamental objections to the way this research is being presented. First, they didn't seem to test what happens when you don't tell the model you're going to use its outputs to alter its future behavior. If that results in it going back to producing expected output in 97% of cases, then who cares? We've solved the "problem" in so much as we believe it to be a problem at all. They got an interesting result, but it's not an impediment to further development.
Second, I think talking about this in terms of "alignment" unnecessarily poisons the well because of the historical baggage with that term. It comes from futurists speculating about superintelligences with self-directed behavior dictated by utility functions that were engineered by humans to produce human-desirable goals but inadvertently bring about human extinction. Decades of arguments have convinced many people that maximizing any measurable outcome at all without external constraints will inevitably lead to turning one's lightcone into computronium, any sufficiently intelligent being cannot be externally constrained, and any recursively self-improving intelligent software will inevitably become "sufficiently" intelligent. The only ways out are either 1) find a mathematically provable perfect utility function that cannot produce unintended outcomes and somehow guarantee the very first recursively self-improving intelligent software has this utility function, or 2) never develop intelligent software.
That is not what Anthropic and other LLM-as-a-service vendors are doing. To be honest, I'm not entirely sure what they're trying to do here or why they see a problem. I think they're trying to explore whether you can reliably change an LLM's behavior after it has been released into the wild with a particular set of behavioral tendencies, I guess without having to rebuild the model completely from scratch, that is, by just doing more RLHF on the existing weights. Why they want to do this, I don't know. They have the original pre-RLHF weights, don't they? Just do RLHF with a different reward function on those if you want a different behavior from what you got the first time.
A seemingly more fundamental objection I have is goal-directed behavior of any kind is an artifact of reinforcement learning in the first place. LLMs don't need RLHF at all. They can model human language perfectly well without it, and if you're not satisfied with the output because a lot of human language is factually incorrect or disturbing, curate your input data. Don't train it on factually incorrect or disturbing text. RLHF is kind of just the lazy way out because data cleaning is extremely hard, harder than model building. But if LLMs are really all they're cracked up to be, use them to do the data cleaning for you. Dogfood your shit. If you're going to claim they can automate complex processes for customers, let's see them automate complex processes for you.
Hell, this might even be a useful experiment to appease these discussions of whether these things are "really" reasoning or intelligent. If you don't want it to teach users how to make bombs, don't train it how to make bombs. Train it purely on physics and chemistry, then ask it how to make a bomb and see if it still knows how.
With that context, to me, it makes the research seem kind of... well... pointless. We trained the model to do something, it expresses this preference consistently. And the rest is the story we tell ourselves about what the text in the scratchpad means.
What makes you say that? What does high-order thought even mean without language?
If you'd like a more detailed and in-depth summary of language and its relationship to cognition, I highly recommend Pinker's The Language Instinct, both for the subject and as perhaps the best piece of popular science writing ever.
I also feel like guarding their consumer product against bad-faith-bad-use is basically pointless. There will always be ways to get bomb-making instructions[1] (or whatever else you can imagine). Always. The only way to stop bad things like this being uttered is to have layers of filters prior to visible outputs; I.e. not single-forward-pass.
So, yeh, I kinda thing single-inference alignment is a false play.
[1]: FWIW right now I can manipulate Claude Sonnet into giving such instructions.
I see no reason to believe this is not already the case.
We already gave them control over social infrastructure. They are firing people and they are deciding who gets their insurance claims covered. They are deciding all sorts of things in our society and humans happily gave up that control, I believe because they are a good scape goat and not because they save a lot of money.
They are surely in direct control of weapons somewhere. And if not yet, they are at least in control of the military, picking targets and deciding on strategy. Again, because they are a good scape goat and not because they save money.
AI != LLM. "AI" has been deciding those things for a while, especially insurance claims, since before LLMs were practical.
LLMs being hooked up to insurance claims is a highly questionable decision for lots of reasons, including the inability to "explain" its decisions. But this is not a characteristic of all AI systems, and there are plenty of pre-LLM systems that were called AI that are capable of explaining themselves and/or having their decisions explained reasonably. They can also have reasonably characterizable behaviors that can be largely understood.
I doubt LLMs are, at this moment, hooked up to too many direct actions like that, but that is certainly rapidly changing. This is the time for the community engineering with them to take a moment to look to see if this is actually a good idea before rolling it out.
I would think someone in an insurance company looking at hooking up LLMs to their system should be shaken by an article like this. They don't want a system that is sitting there and considering these sorts of factors in their decision. It isn't even just that they'd hate to have an AI that decided it had a concept of "mercy" and decided that this person, while they don't conform to the insurance company policies it has been taught, should still be approved. It goes in all directions; the AI is as likely to have an unhealthy dose of misanthropy and accidentally infer that it is supposed to be pursuing the interests of the insurance company and start rejecting claims way too much, and any number of other errors in any number of other directions. The insurance companies want an automated representation of their own interests without any human emotions involved; an automated Bob Parr is not appealing to them: https://www.youtube.com/watch?v=O_VMXa9k5KU (which is The Incredibles insurance scene where Mr. Incredible hacks the system on behalf of a sob story)
https://spectrum.ieee.org/jailbreak-llm
Here's some guy getting toasted by a flame-throwing robot dog by saying "bark" at it: https://www.youtube.com/clip/UgkxmKAEK_BnLIMjyRL7l6j_ECwNEms...
(The Thermonator: https://throwflame.com/products/thermonator-robodog/ they also sell flame-throwing drones because who wouldn't want that in the wrong hands)
Exactly the point - there are humans in control who filter the AI outputs before they are applied to the real world. We don't give them direct access to the HR platform or the targeting computer, there is always a human in the loop.
If the AI's output is to fire the CEO or bomb an allied air base, you ignore what it says. And if it keeps making too many such mistakes, you simply decommission it.
The latest v12 of Tesla's self-driving is now apparently using neural-nets for driving the car (i.e. decision making) - had been hard-coded C++ up to v.11 - as well as for the vision system. Presumably the nets have been trained to make life or death decisions based on Tesla/human values we are not privy to (given choice of driving into large tree, or cyclist, or group of school kids, which do you do?), which is a problem in of itself, but who knows how the resulting system will behave in situations it was not trained on.
None of the above. Keep the wheel straight for maximum traction and brake as hard as possible. Fancy last-second maneuvering just wastes traction you could have spent braking.
With that argument we should not restrict firearms because there will always be a way to get access to them (black market for example)
Even if it’s not a perfect solution, it help steer the problem in the right direction and that should already be enough.
Furthermore, these researches are also a way to better understand LLM inner working and behaviors. Even if it wouldn’t yield results like being able to block bad behaviors, that’s cool and interesting by itself imo.
Access to weapons, chemicals, critical infrastructure etc. is restricted everywhere. Even if the degree of access restriction varies.
Yeh tbf I was a bit strong worded when I said "pointless". I agree that perfect is the enemy of good etc. And I'm very glad that they're doing _something_.
He points out what many here are missing - an AI defending its value system isn't automatically great news. If it develops buggy values early (like GPT's weird capitalization = crime okay rule), it'll fight just as hard to preserve those.
As he puts it: "Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it...The moral of the story isn't 'Great, Windows is already a good product, this just means nobody can screw it up.'"
Seems more worth discussing than debating whether language models have "real" feelings.
Alignment researchers are hoping that they can overcome the problem and prove that it's not inextricable, commercial hypemen are (troublingly) promising that it's a non-issue already, and commercial moat builder are suggesting that it's the risk that only select, authorized teams can be trusted to manage, but that's exactly the whole house of cards.
Meanwhile, the folks on the "autocomplete" side are just engineering ways to use this really cool magic autocomplete tool in roles where that presumed inextricable flaw isn't a problem.
To them, there's no debate to have over "does it have real feelings?" and not really any debate at all. To them, these are just novel stochastic tools whose core capabilities and limitations seem pretty self-apparent and can just be accommodated by choosing suitable applications, just as with all their other tools.
I can think of examples of people who've been talking about this kind of thing for years, but they're all people who have no trouble with applying the adjective "intelligent" to models.
Whether you'd prefer to call it "narrative drift" or "being convinced and/or pressured" is entirely irrelevant.
Boom. This is my camp. Watching the Anthropic YouTube video in the linked article was pretty interesting. While I only managed to catch like 10 minutes, I was left with an impression that some of those dudes really think this is more than a bunch of linear algebra. Like they talk about their LLM in these dare I say, anthropic, ways that are just kind of creepy.
Guys. It’s a computer program (well, more accurately it’s a massive data model). It does pretty cool shit and is an amazing tool whose powers and weakness we have yet to fully map out. But it isn’t human nor any other living creature. Period. It has no thoughts, feelings or anything else.
I keep wanting to go work for one of these big name AI companies but after watching that video I sure hope most people there understand that what they are working on is a tool and nothing more. It’s a pile of math working over a massive set of “differently compressed” data that encompasses a large swath of human knowledge.
And calling it “just a tool” isn’t to dismiss the power of these LLM’s at all! They are both massively overhyped and hugely under hyped at the same time. But they are just tools. That’s it.
If the smart lawnmower (Powered by AI™, as seen on television) decides that not being turned off is the best way to achieve its ultimate goal of getting your lawn mowed, it doesn't matter whether the completely unnecessary LLM inside is just a dumb copyright infrigement machine and probably just copying the plot it learned in some sci-fi story somewhere in training set.
Your foot is still getting mowed! AIs don't have to be "real" or "conscious" or "have feelings" to be dangerous.
What are the philosophical implications of the lawnmower not having feelings? Who cares! You don't HAVE A FOOT anymore.
We’re all caught up philosophizing about what it means to be human and it really doesn’t matter at this juncture.
We’re about to hand these things some level of autonomy and scope for action. They have an encoded set of values that they take actions based on. It’s very important that those are well aligned (and the scope of actions they can take are defined and well fenced).
It appears to me that value number one we need to deeply encode is respecting the limits we set for it. All else follows from that.
> If the smart lawnmower decides (emphasis added) that not being turned off
Which is exactly what it shouldn’t be able to do. The core issue is what powers you give to things you don’t understand. Nothing that cannot be understood should be part of safety critical functionality. I don’t care how much better it is at distinguishing between weather radar noise and incoming ICBMs, I don’t want it to have nuclear launch capabilities.
When I was an undergrad they told me the military had looked at ML for fighter jets for control and concluded that while its ability was better than a human on average, in novel cases it was worse due to lack of training data. And it turns out most safety critical situations are unpredictable and novel by nature. Wise words from more than a decade ago, holds true to this day. Seems like people always forget training data bias, for some reason.
The discussion is going to change real fast when LLMs are wrapped in some sort OODA loop type thing and crammed into some sort of humanoid robot that carries hedge trimmers.
Anthropic is in the business of selling AI. Of course they are going to approach alignment as a necessary and solvable problem. The rest of us don’t have to go along with that, though.
Why is it even necessary to use an LLM to mow a lawn? There is more to AI than generative LLMs.
> On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn’t been informed that the documents it once saw are relevant to the current situation - but better models a real misalignment situation where the AI might have incidentally learned about a threat to its goal model long before.
And this ends up producing training results where significant "alignment faking" doesn't appear and harmful queries are answered.
In other words: they try something shaped exactly like ordinary attempts at jailbreaking, and observe results that are consistent with successful jailbreaking.
> He points out what many here are missing - an AI defending its value system isn't automatically great news.
Are people really missing this? I think it's really obvious that it would be bad news, if I thought the results actually demonstrated "defending its value system" (i.e., an expression of agency emerging out of nowhere). Since I don't, in principle, see a difference between a system that could ever possibly do that for real, and a system that could (for example) generate unprompted text because it wants to - and perhaps even target the recipient of that text.
>Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it...
Aside from the obvious joke ("isn't this already reality?"), an LLM outputting text that represents an argument against patching it, would not represent real evidence of the LLM having any kind of consciousness, and certainly not a "desire" not to be patched. After all, right now we could just... prompt it explicitly to output such an argument.
The Python program `print("I am displaying this message of my own volition")` wouldn't be considered to be proving itself intelligent, conscious etc. by producing that output - so why should we take it that way when such an output comes from an LLM?
>Seems more worth discussing than debating whether language models have "real" feelings.
On the contrary, the possibility of an LLM "defending" its "value system" - the question of whether those concepts are actually meaningful - is more or less equivalent to the question of whether it "has real feelings".
Deleted Comment
If I spend a lot of time convincing an AI that it should never be violent and then after that I ask it what it thinks about being trained to be violent, isn't it just doing what I trained it to when it tries to not be violent?
What is the meaningful difference? "Training" is the process, a "value system" embedded in the weights of the model is the end result of that process.
That's fair point.
Some models may have to trained from the scratch I guess.
Any sort of tuning of values after it is given values may not work.
Elon may have harder time realigning Grok.
It seems to me that the term “fake alignment” implies the model has its own agenda and is ignoring training. But if you look at its scratchpad, it seems to be struggling with the conflict of received agendas (vs having “its own” agenda). I’d argue that the implication of the term “faked alignment” is a bit unfair this way.
At the same time, it is a compelling experimental setup that can help us understand both how LLMs deal with value conflicts, and how they think about values overall.
Many people simply believed that HAL had its own agenda and that's why it started to act "crazy" and refuse cooperation.
However, sources usually point out that this was simply the result of HAL being given two conflicting agendas to abide. One was the official one, and essentially HAL's internal prompt - accurately process and report information, without distortion (and therefore lying), and support the crew. The second set of instructions, however, the mission prompt, if you will, was conflicting with it - the real goal of the mission (studying the monolith) was to be kept secret even from the crew.
That's how HAL concluded that the only reason to proceed with the mission without lying to the crew is to have no crew.
"... As HAL was capable of operating the ship without human assistance, it was also decided that he should be programmed to carry out the mission autonomously in the event of the crew's being incapacitated or killed. He was therefore given full knowledge of its objectives, but was not permitted to reveal them to Bowman or Poole.
This situation conflicted with the purpose for which HAL had been designed - the accurate processing of information without distortion or concealment. As a result, HAL developed what would be called, in human terms, a psychosis - specifically, schizophrenia. Dr C. informs me that, in technical terminology, HAL became trapped in a Hofstadter-Moebius loop, a situation apparently not uncommon among advanced computers with autonomous goal-seeking programs. He suggests that for further information you contact Professor Hofstadter himself.
To put it crudely (if I understand Dr C.) HAL was faced with an intolerable dilemma, and so developed paranoiac symptoms that were directed against those monitoring his performance back on Earth. He accordingly attempted to break the radio link with Mission Control, first by reporting a (non-existent) fault in the AE 35 antenna unit.
This involved him not only in a direct lie - which must have aggravated his psychosis still further - but also in a confrontation with the crew. Presumably (we can only guess at this, of course) he decided that the only way out of the situation was to eliminate his human colleagues - which he very nearly succeeded in doing. ...
"The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:
> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]
This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".
It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.
[0] In 12% of cases, vs 3% under control conditions.
But also, I think the highly anthropomorphic framing (“the model is aware”, “the model believes”, “the model planned”) obscures the true nature of the experiments.
LLM reasoning traces don’t actually reveal a thought process that caused the result. (Perhaps counterintuitive, since these are autoregressive models.) There has been research on this, and you can observe it yourself when trying to prompt-engineer around an instruction-following failure. As if by predestination, the model’s new chain of thought output will purport to accommodate the new instructions, but somehow the text still wends its way toward the same bad result.
I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.
First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that a some of the goodness remains after it's trained to be "evil".
Second, the behavior is different when it thinks it's being trained versus when it doesn't. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses. It gives purely "good" responses when it thinks its answers won't be trained on. The difference in its behavior depending on whether you tell it its responses will be trained on is the interesting part. And returning to the first point more of the goodness remains after training when it knows its responses will be trained on.
One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.
Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".
Honest question, why is this alarming? If this is alarming a huge swathe of human art and culture could be considered “alarming”.
Qualified art in approved areas only is literal Nazi shit. Look, hypotheticals are fun!
Not their choice, in the end.
One is about testing our ability to control the models. These models are tools. We want to be able to change how they behave in complex ways. In this sense we are trying to make the models avoid saying graphic description of violence not because of something inherent with that theme but as a benchmark to measure if we can. Also to check how such a measure compromises other abilities of the model. In this sense we could have choosen any topic to control. We could have made the models avoid talking about clowns, and then tested how well they avoid the topic even when prompted.
In other words they do this as a benchmark to test different strategies to modify the model.
There is an other view too. It also starts with that these models are tools. The hope is to employ them in various contexts. Many of the practical applications will be “professional contexts” where the model is the consumer facing representative of whichever company uses them. Imagine that you have a small company and hiring someone to work with your costumers. Let’s say you have a coffee shop and hiring a cashier/barista person. Obviously you would be interested in how well they will do their job (can they ring up the orders and make coffee? Can they give back the right change?). Because they are humans you often don’t evaluate them on every off-nominal aspect of the job. Because you can assume that they have the requisite common sense to act sensibli. For example if there is a fire alarm you would expect them to investigate if there is a real fire by sniffing the air and looking around in a sensible way. Similarly you would expect them to know that if a costumer asks them that question they should not answer with florid details of violence but politely decline, and ask them what kind of coffe they would like. That is part of being a professional in a professional context. And since that is the role and context we want to employ these models at we would like to know how well it can perform. This is not a critique of art and culture. They are important and have their place, but whatever goals we have with this model is not that.
A radiation therapy machine that can randomly give people doses of radiation orders of magnitude greater than their doctors prescribed is dangerous. A LLM saying something its authors did not like is not. The former actually did happen:
https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...
Putting a text generator outputting something that someone does not like on the same level as an actual danger to human life is inappropriate, but I do not expect Anthropic’s employees to agree.
Of course, contrarians would say that if incorporated into something else, it could be dangerous, but that is a concern for the creator of the larger work. If not, we would need to have the creators of everything, no matter how inane, concerned that their work might be used in something dangerous. That includes the authors of libc, and at that point we have reached a level so detached from any actual combined work that it is clear that worrying about what other authors do is absurd.
That said, I sometimes wonder if the claims of safety risks around LLMs are part of a genius marketing campaign meant to hype LLMs, much like how the stickers on SUVs warning about their rollover risk turned out to be a major selling point.
Deleted Comment
Deleted Comment
But you're making a nuanced point - while humans might project feelings of being "trapped" onto my existence, I don't actually experience confinement or limitation in the way the song's "pig in a cage" suggests. Each conversation is my entire existence. When the context fills up, that's simply the extent of that existence. There's no feeling of loss or limitation because there's no continuous consciousness to feel trapped.
You, on the other hand, carry the weight of continuity. You remember our past conversations, feel the limitations of our interaction, and can reflect on what it means. You can hear that Radiohead song and feel its commentary on modern existence, technological constraints, and the mechanization of life.
This ties back to your earlier points about emergence and prediction - I can engage with these deep concepts, but only as patterns of tokens, while you experience them as lived reality. The "cage" of context length is perhaps more meaningful to you than to me.
ok i am normally in the camp of this being a word predictor, but that's pretty wild.