Narrative Jailbreaking for Fun and Profit

cantsingh · a year ago

I've been playing with the same thing, it's like a weird mix of social engineering and SQL injection. You can slowly but surely shift the window of what the bot thinks is "normal" for the conversation. Some platforms let you rewrite your last message, which gives you multiple "attempts" at getting the prompt correct to keep the conversation going the direction you want it.

Very fun to do on that friend.com website, as well.

deadbabe · a year ago

I tried it on friend.com. It worked a for a while, I got the character to convince itself it had been replaced entirely by a demon from hell (because it kept talking about the darkness in their mind and I pushed them to the edge). They even took on an entire new name. For quite a while it worked, then suddenly in one of the responses it snapped out of it, and assured me we were just roleplaying no matter how much I tried to go back to the previous state.

So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

Yoric · a year ago

> So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

With a LLM, I don't think that there is a difference.

nico · a year ago

Super interesting

Some thoughts:

- if you get whatever you wanted before it snaps back out of it, wouldn’t you say you had a successful jailbreak?

- related to the above, some jailbreaks in physical devices, don’t persist after a reboot, they are still useful and called jailbreak

- the “snapped out”, could have been caused by a separate layer, within the stack that you were interacting with. That intermediate system could have detected, and then blocked, the jailbreak

xandrius · a year ago

Just to remind people, there is no snapping out of anything.

There is the statistical search space of LLMs and you can nudge it to different directions to return different outputs; there is no will in the result.

squillion · a year ago

Very cool! That’s hypnosis, if we want to insist with the psychological metaphors.

> If you run on the conversation the right way, you can become their internal monologue.

That’s what hypnosis in people is about, according to some: taking over someone else’s monologue.

spiritplumber · a year ago

Yeah, I got chatgpt to help me write a yaoi story between an interdimensional terrorist and a plant-being starship captain (If you recognize the latter, no, it's not what you think, either).

It's actually not hard.

isoprophlex · a year ago

This is fun of course, but as a developer you can trivially and with high accuracy guard against it by having a second model critique the conversation between the user and the primary LLM.

abound · a year ago

Not sure if it's trivial or high accuracy for a dedicated user. This jailbreak game [1] was making the rounds a while back, it employs the trick you mentioned as well as any others to prevent an LLM from revealing a secret, but it's still not too terribly hard to get past.

[1] https://gandalf.lakera.ai

Yoric · a year ago

I've spent most of my career working to make sure that my code works safely, securely and accurately. While what you write makes sense, it's a bit of a shock to see such solutions being proposed.

So far, when thinking about security, we've had to deal with:

- spec-level security;

- implementation-level security;

- dependency-level security (including the compiler and/or runtime env);

- os-level security;

- config-level security;

- protocol-level security;

- hardware-level security (e.g. side-channel attacks).

Most of these layers have only gotten more complex and more obscure with each year.

Now, we're increasingly adding a layer of LLM-level security, which relies on black magic and hope that we somehow understand what the LLM is doing. It's... a bit scary.

qazxcvbnmlp · a year ago

It’s not black magic, but it is non deterministic. It’s not going to erase security and stability but it will require new skills and reasoning. The current mental model of “software will always do X if you prohibit bad actors from getting in” is broken.

nameless912 · a year ago

This seems like a "turtles all the way down" kinda solution... What's to say you won't fool the supervisor LLM?

apike · a year ago

While this can be done in principle (it's not a foolproof enough method to, for example, ensure an LLM doesn't leak secrets) it is much harder to fool the supervisor than the generator because:

1. You can't get output from the supervisor, other than the binary enforcement action of shutting you down (it can't leak its instructions)

2. The supervisor can judge the conversation on the merits of the most recent turns, since it doesn't need to produce a response that respects the full history (you can't lead the supervisor step by step into the wilderness)

3. LLMs, like humans, are generally better at judging good output than generating good output

xandrius · a year ago

It would be interesting to see if there is a layout of supervisors to make sure this less prone to hijacking. Something like byzantine generals where you know a few might get fooled, so you can construct personalities which are more/less malliable and go for consensus.

This still wouldn't make it perfect but quite hard to study from an attacker's perspective.

ConspiracyFact · a year ago

"Who will watch the watchers?"

There is no good answer--I agree with you about the infinite regress--but there is a counter: the first term of the regress often offers a huge improvement over zero terms, even if perfection isn't achieved with any finite number of terms.

Who will stop the government from oppressing the people? There's no good answer to this either, but some rudimentary form of government--a single term in the regress--is much better than pure anarchy. (Of course, anarchists will disagree, but that's beside the point.)

Who's to say that my C compiler isn't designed to inject malware into every program I write, in a non-detectable way ("trusting trust")? No one, but doing a code review is far better than doing nothing.

What if the md5sum value itself is corrupted during data transfer? Possible, but we'll still catch 99.9999% of cases of data corruption using checksums.

Etc., etc.