I've been playing with the same thing, it's like a weird mix of social engineering and SQL injection. You can slowly but surely shift the window of what the bot thinks is "normal" for the conversation. Some platforms let you rewrite your last message, which gives you multiple "attempts" at getting the prompt correct to keep the conversation going the direction you want it.
Very fun to do on that friend.com website, as well.
I tried it on friend.com. It worked a for a while, I got the character to convince itself it had been replaced entirely by a demon from hell (because it kept talking about the darkness in their mind and I pushed them to the edge). They even took on an entire new name. For quite a while it worked, then suddenly in one of the responses it snapped out of it, and assured me we were just roleplaying no matter how much I tried to go back to the previous state.
So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?
> So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?
With a LLM, I don't think that there is a difference.
- if you get whatever you wanted before it snaps back out of it, wouldn’t you say you had a successful jailbreak?
- related to the above, some jailbreaks in physical devices, don’t persist after a reboot, they are still useful and called jailbreak
- the “snapped out”, could have been caused by a separate layer, within the stack that you were interacting with. That intermediate system could have detected, and then blocked, the jailbreak
Just to remind people, there is no snapping out of anything.
There is the statistical search space of LLMs and you can nudge it to different directions to return different outputs; there is no will in the result.
Yeah, I got chatgpt to help me write a yaoi story between an interdimensional terrorist and a plant-being starship captain (If you recognize the latter, no, it's not what you think, either).
This is fun of course, but as a developer you can trivially and with high accuracy guard against it by having a second model critique the conversation between the user and the primary LLM.
Not sure if it's trivial or high accuracy for a dedicated user. This jailbreak game [1] was making the rounds a while back, it employs the trick you mentioned as well as any others to prevent an LLM from revealing a secret, but it's still not too terribly hard to get past.
I've spent most of my career working to make sure that my code works safely, securely and accurately. While what you write makes sense, it's a bit of a shock to see such solutions being proposed.
So far, when thinking about security, we've had to deal with:
- spec-level security;
- implementation-level security;
- dependency-level security (including the compiler and/or runtime env);
Most of these layers have only gotten more complex and more obscure with each year.
Now, we're increasingly adding a layer of LLM-level security, which relies on black magic and hope that we somehow understand what the LLM is doing. It's... a bit scary.
It’s not black magic, but it is non deterministic. It’s not going to erase security and stability but it will require new skills and reasoning. The current mental model of “software will always do X if you prohibit bad actors from getting in” is broken.
While this can be done in principle (it's not a foolproof enough method to, for example, ensure an LLM doesn't leak secrets) it is much harder to fool the supervisor than the generator because:
1. You can't get output from the supervisor, other than the binary enforcement action of shutting you down (it can't leak its instructions)
2. The supervisor can judge the conversation on the merits of the most recent turns, since it doesn't need to produce a response that respects the full history (you can't lead the supervisor step by step into the wilderness)
3. LLMs, like humans, are generally better at judging good output than generating good output
It would be interesting to see if there is a layout of supervisors to make sure this less prone to hijacking. Something like byzantine generals where you know a few might get fooled, so you can construct personalities which are more/less malliable and go for consensus.
This still wouldn't make it perfect but quite hard to study from an attacker's perspective.
There is no good answer--I agree with you about the infinite regress--but there is a counter: the first term of the regress often offers a huge improvement over zero terms, even if perfection isn't achieved with any finite number of terms.
Who will stop the government from oppressing the people? There's no good answer to this either, but some rudimentary form of government--a single term in the regress--is much better than pure anarchy. (Of course, anarchists will disagree, but that's beside the point.)
Who's to say that my C compiler isn't designed to inject malware into every program I write, in a non-detectable way ("trusting trust")? No one, but doing a code review is far better than doing nothing.
What if the md5sum value itself is corrupted during data transfer? Possible, but we'll still catch 99.9999% of cases of data corruption using checksums.
Very fun to do on that friend.com website, as well.
So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?
With a LLM, I don't think that there is a difference.
Some thoughts:
- if you get whatever you wanted before it snaps back out of it, wouldn’t you say you had a successful jailbreak?
- related to the above, some jailbreaks in physical devices, don’t persist after a reboot, they are still useful and called jailbreak
- the “snapped out”, could have been caused by a separate layer, within the stack that you were interacting with. That intermediate system could have detected, and then blocked, the jailbreak
There is the statistical search space of LLMs and you can nudge it to different directions to return different outputs; there is no will in the result.
> If you run on the conversation the right way, you can become their internal monologue.
That’s what hypnosis in people is about, according to some: taking over someone else’s monologue.
Deleted Comment
It's actually not hard.
Dead Comment
[1] https://gandalf.lakera.ai
So far, when thinking about security, we've had to deal with:
- spec-level security;
- implementation-level security;
- dependency-level security (including the compiler and/or runtime env);
- os-level security;
- config-level security;
- protocol-level security;
- hardware-level security (e.g. side-channel attacks).
Most of these layers have only gotten more complex and more obscure with each year.
Now, we're increasingly adding a layer of LLM-level security, which relies on black magic and hope that we somehow understand what the LLM is doing. It's... a bit scary.
1. You can't get output from the supervisor, other than the binary enforcement action of shutting you down (it can't leak its instructions)
2. The supervisor can judge the conversation on the merits of the most recent turns, since it doesn't need to produce a response that respects the full history (you can't lead the supervisor step by step into the wilderness)
3. LLMs, like humans, are generally better at judging good output than generating good output
This still wouldn't make it perfect but quite hard to study from an attacker's perspective.
There is no good answer--I agree with you about the infinite regress--but there is a counter: the first term of the regress often offers a huge improvement over zero terms, even if perfection isn't achieved with any finite number of terms.
Who will stop the government from oppressing the people? There's no good answer to this either, but some rudimentary form of government--a single term in the regress--is much better than pure anarchy. (Of course, anarchists will disagree, but that's beside the point.)
Who's to say that my C compiler isn't designed to inject malware into every program I write, in a non-detectable way ("trusting trust")? No one, but doing a code review is far better than doing nothing.
What if the md5sum value itself is corrupted during data transfer? Possible, but we'll still catch 99.9999% of cases of data corruption using checksums.
Etc., etc.