blintz · 10 months ago
These watermarks are not robust to paraphrasing attacks: ROC AUC falls from 0.95 to 0.55 (barely better than guessing) for a 100-token passage.

The existing impossibility results imply that these attacks are essentially unavoidable (https://arxiv.org/abs/2311.04378) and not very costly, so this line of inquiry into LLM watermarking seems like a dead end.

jkhdigital · 10 months ago
I spent the last five years doing PhD research into steganography, with a particular focus on how to embed messages into LLM outputs. Watermarking is basically one-bit steganography.

The first serious investigations into "secure" steganography were about 30 years ago and it was clearly a dead end even back then. Sure, watermarking might be effective against lazy adversaries--college students, job applicants, etc.--but can be trivially defeated otherwise.

All this time I'd been lamenting my research area as unpopular and boring when I should've been submitting to Nature!

impossiblefork · 10 months ago
Though, surely secure steganography with LLMs should be quite easy?

Presumably there are things like key exchanges that look like randomness, and then you could choose LLM output using that randomness in such a way that you can send messages that look like an LLM conversation?

Someone starts the conversation with a real message, 'Hello!', then you do some kind of key exchange where what is exchanged is hard to distinguish from randomness, and you use those keys to drive the selection of the coming tokens from the LLM. Then, once the keys are established, you use some kind of cipher to generate random-looking ciphertext and use that as the randomness for selecting words in the final part?

Surely that would work? If there is guaranteed insecurity, it's for things like watermarking, not for steganography?
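A toy sketch of that protocol, assuming a hypothetical `next_token_probs(context)` helper and treating the post-key-exchange ciphertext as a uniform bit stream. This is the simple "bucket" construction for illustration only, not a secure scheme; secure stegosystems need distribution-preserving encodings.

```python
# Toy sketch: hide ciphertext bits (which look uniform) in token choices.
# Assumes a hypothetical next_token_probs(context) -> {token: probability}.
import random

def split_buckets(probs):
    """Greedily split the vocabulary into two buckets of ~equal probability mass."""
    bucket0, bucket1, mass0, mass1 = [], [], 0.0, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: (-kv[1], kv[0])):
        if mass0 <= mass1:
            bucket0.append(tok); mass0 += p
        else:
            bucket1.append(tok); mass1 += p
    return bucket0, bucket1

def embed(bits, context, next_token_probs, n_tokens):
    """Pick each token from the bucket named by the next hidden bit."""
    out = []
    for _ in range(n_tokens):
        probs = next_token_probs(context + out)
        buckets = split_buckets(probs)
        bucket = buckets[bits.pop(0)] if bits else buckets[random.getrandbits(1)]
        # Sample within the bucket proportionally to the model's probabilities,
        # so the text still looks like ordinary LLM output.
        out.append(random.choices(bucket, weights=[probs[t] for t in bucket])[0])
    return out

def extract(tokens, context, next_token_probs):
    """Receiver reruns the same model and reads off which bucket each token fell in."""
    bits, seen = [], []
    for tok in tokens:
        bucket0, _ = split_buckets(next_token_probs(context + seen))
        bits.append(0 if tok in bucket0 else 1)
        seen.append(tok)
    return bits
```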

sbszllr · 10 months ago
I’ve been working in the space since 2018. Watermarking and fingerprinting (of models themselves and outputs) are useful tools but they have a weak adversary model.

Yet that doesn’t stop companies from making claims like these, and, what’s worse, people from buying into them.

kp1197 · 10 months ago
Watermarking is not the way to go. It relies on the honesty of the producers, and watermarks can be easily stripped. With images, the way to go is detect authentic images, not fake ones. I've written about this extensively: https://dev.to/kylepena/addressing-the-threat-of-deep-fakes-...
sgt101 · 10 months ago
I think this misses a key point.

If there were a law that AI generated text should be watermarked then major corporations would take pains to apply the watermark, because if they didn't then they would be exposed to regulatory and reputational problems.

Watermarking the text would enable people training models to avoid it, and it would allow search engines to decide not to rely on it (if that was the search engine's preference).

It would not mean that all text not watermarked was human generated, but it would mean that all text not watermarked and provided by institutional actors could be trusted.

auggierose · 10 months ago
> It would not mean that all text not watermarked was human generated, but it would mean that all text not watermarked and provided by institutional actors could be trusted.

What?


bko · 10 months ago
This article goes into it a little bit, but an interview with Scott Aaronson goes into some detail about how watermarking works[0].

He's a theoretical computer scientist, but he was recruited by OpenAI to work on AI safety. He has a very practical view on the matter and is focusing his efforts on leveraging the probabilistic nature of LLMs to provide a watermark that is imperceptible to readers but statistically detectable. The idea is to nudge certain token pairings to occur slightly more often than chance, so you can mathematically derive, with some level of certainty, whether an output, or even a section of an output, was generated by the LLM. It's really clever, and apparently he has a working prototype in development.

One workaround he hasn't figured out how to handle yet is asking for output in language X and then translating it into language Y, but that may eventually be addressed too.

I think watermarking would be a big step forward to practical AI safety and ideally this method would be adopted by all major LLMs.

That part starts around 1 hour 25 min in.

> Scott Aaronson: Exactly. In fact, we have a pseudorandom function that maps the N-gram to, let’s say, a real number from zero to one. Let’s say we call that real number r_i for each possible choice i of the next token. And then let’s say that GPT has told us that the i-th token should be chosen with probability p_i.

https://axrp.net/episode/2023/04/11/episode-20-reform-ai-ali...
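A minimal sketch of the kind of rule described in that quote, assuming the commonly described "Gumbel/exponential" scoring variant (pick the token i maximizing r_i^(1/p_i), with r_i derived from a keyed hash of the preceding N-gram). The helper names are mine, and this is not necessarily OpenAI's exact implementation.

```python
import hmac, hashlib, math

def r_value(key: bytes, ngram: tuple, token: str) -> float:
    """Pseudorandom r_i in (0, 1), keyed on the preceding N-gram and candidate token."""
    msg = ("|".join(ngram) + "||" + token).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)

def watermarked_choice(key, ngram, probs):
    """Pick the token i maximizing r_i ** (1 / p_i) over candidates with p_i > 0.

    Marginally this still samples each token with roughly probability p_i,
    but a detector holding the key can check that the chosen tokens tend to
    have unusually large r_i values."""
    return max(probs, key=lambda tok: r_value(key, ngram, tok) ** (1.0 / probs[tok]))

def detect_score(key, tokens, n=4):
    """Sum of -log(1 - r_i) over observed tokens; large values suggest the watermark."""
    score = 0.0
    for t in range(n, len(tokens)):
        ngram = tuple(tokens[t - n:t])
        score += -math.log(1.0 - r_value(key, ngram, tokens[t]))
    return score
```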

nicce · 10 months ago
I don't think that provable watermarking is possible in practice. The method you mention is clever, but before it can work, you would need to know the output probabilities of every other source that could have produced the same text for the same purpose. Only if you can show that the text is much more likely under that particular model than under any other source, including humans, does the watermark give a stronger indication.

You would also need to define how the confidence scales with output length: the longer the output, the more certain you can be. What is the smallest number of tokens for which nothing can be proved at all?

You would also need to include humans. Can you define such probabilities for a human? And all LLMs would have to use the same system uniformly.

Otherwise, watermarking is doomed to be misused without being reliable enough, and false accusations will follow.

A_D_E_P_T · 10 months ago
I agree. I'd add that not only could human-written content fail the test -- it's also the case that humans will detect the word pairing, just as they detected "delve" and various other LLM tells.

In time most forms of watermarking along those lines will seem like elements of an LLM's writing style, and will quickly be edited out by savvy users.

123yawaworht456 · 10 months ago
>So it nudges certain words to be paired together slightly more than random and you can mathematically derive with some level of certainty whether an output or even a section of an output was generated by the LLM.

hah, every single LLM already watermarks its output by starting the second paragraph with "It is important/essential to remember that..." followed by inane gibberish, no matter what question you ask.

AlienRobot · 10 months ago
I've always felt you'd be able to tell someone uses Reddit because they'll reply to a comment starting the sentence with "The problem is that..."

Now LLMs are trained on Reddit users.

littlestymaar · 10 months ago
Sounds interesting, but it also sounds like something that could very well be circumvented with a technique similar to speculative decoding: you use the watermarked model the way you'd use the fast draft LLM in speculative decoding, and you check whether another model agrees with it or not. But instead of correcting the token every time the two models disagree, as you would in speculative decoding, you only need to change it often enough to mess with the watermark detection function (maybe you'd change every other mismatched token, or maybe one in every five mismatches would be enough to push the signal-to-noise ratio below the detection threshold).

You wouldn't even need access to an unwatermarked model; the “correcting model” could be watermarked itself, as long as the same watermarking function isn't applied to both.

Or am I misunderstanding something?
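A rough sketch of that attack, with hypothetical `watermarked_next()` and `other_next()` helpers standing in for the two models; this is the idea above restated in code, not a tested attack.

```python
def scrub(prompt, watermarked_next, other_next, n_tokens, swap_every=2):
    """Generate with the watermarked model, replacing every `swap_every`-th
    disagreeing token with the other model's choice to dilute the watermark."""
    out, disagreements = [], 0
    for _ in range(n_tokens):
        draft = watermarked_next(prompt, out)   # token from the watermarked model
        alt = other_next(prompt, out)           # token from any other model
        if draft != alt:
            disagreements += 1
            if disagreements % swap_every == 0: # swap only a fraction of mismatches
                draft = alt
        out.append(draft)
    return out
```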

jkhdigital · 10 months ago
No you've got it right. Watermarks like this are trivial to defeat, which means they are only effective against lazy users like cheating college students and job applicants.
nprateem · 10 months ago
Or just check whether text contains the word delve and it's most likely AI generated. I fucking hate that word now.
namanyayg · 10 months ago
"An LLM generates text one token at a time. These tokens can represent a single character, word or part of a phrase. To create a sequence of coherent text, the model predicts the next most likely token to generate. These predictions are based on the preceding words and the probability scores assigned to each potential token.

For example, with the phrase “My favorite tropical fruits are __.” The LLM might start completing the sentence with the tokens “mango,” “lychee,” “papaya,” or “durian,” and each token is given a probability score. When there’s a range of different tokens to choose from, SynthID can adjust the probability score of each predicted token, in cases where it won’t compromise the quality, accuracy and creativity of the output.

This process is repeated throughout the generated text, so a single sentence might contain ten or more adjusted probability scores, and a page could contain hundreds. The final pattern of scores for both the model’s word choices combined with the adjusted probability scores are considered the watermark. This technique can be used for as few as three sentences. And as the text increases in length, SynthID’s robustness and accuracy increases."

Better link: https://deepmind.google/technologies/synthid/
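As a concrete, simplified illustration of "adjusting the probability score of each predicted token": the sketch below biases the logits of a keyed pseudorandom "green" subset of the vocabulary (green-list-style logit biasing). SynthID's actual mechanism is the tournament sampling described in the Nature paper linked further down the thread; the names and parameters here are made up.

```python
import hashlib, math, random

def green_set(prev_token: str, vocab: list, fraction: float = 0.5) -> set:
    """Pseudorandom subset of the vocabulary, seeded by the previous token.
    A real scheme would also mix in a secret key."""
    seed = int.from_bytes(hashlib.sha256(prev_token.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return set(rng.sample(vocab, int(fraction * len(vocab))))

def biased_sample(prev_token, logits: dict, delta: float = 2.0) -> str:
    """Add `delta` to the logits of green tokens, then sample as usual."""
    greens = green_set(prev_token, sorted(logits))
    adj = {t: l + (delta if t in greens else 0.0) for t, l in logits.items()}
    z = sum(math.exp(v) for v in adj.values())
    r, acc = random.random() * z, 0.0
    for tok, v in adj.items():
        acc += math.exp(v)
        if acc >= r:
            return tok
    return tok  # fallback for floating-point edge cases

# Detection: count how often observed tokens fall in the green set; a rate
# well above `fraction` over enough tokens indicates the watermark.
```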

baobabKoodaa · 10 months ago
I'm fascinated that this approach works at all, but that said, I don't believe watermarking text will ever be practical. Yes, you can do an academic study where you have exactly 1 version of an LLM in exactly 1 parameter configuration, and you can have an algorithm that tweaks the logits of different tokens in a way that produces a recognizable pattern. But you should note that the pattern will be recognizable only when the LLM version is locked and the parameter configuration is locked. Which they won't be in the real world. You will have a bunch of different models, and people will use them with a bunch of different parameter combinations. If your "detector" has to be able to recognize AI-generated text from a variety of models and a variety of parameter combinations, it's no longer going to work. Even if you imagine someone brute-forcing all these different combos, the trouble is that some of the combos will produce false positives just because you tested so many of them. Want to get rid of those false positives? Go ahead, make the pattern stronger. And now you're visibly altering the generated text to an extent where that is a quality issue.

In summary, this will not work in practice. Ever.
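The multiple-comparisons worry in that argument is easy to quantify: if each individual detector configuration has false positive rate α and you brute-force k model/parameter combinations, the chance that an innocent text trips at least one of them is 1 − (1 − α)^k. A quick illustration (the numbers are made up for the example):

```python
alpha = 0.01                 # per-detector false positive rate (assumed)
for k in (1, 10, 50, 200):   # number of model/parameter combos brute-forced
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:4d} detectors -> P(at least one false positive) = {fwer:.2%}")
# prints approximately: 1.00%, 9.56%, 39.50%, 86.60%
```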

TeMPOraL · 10 months ago
Even with temperature = 0, LLMs are still non-deterministic, as their internal, massively parallelized calculations are done with floating point arithmetic, which is order-dependent. Running the same LLM with the exact same parameters multiple times might still yield slightly different probabilities in the output, making this watermarking scheme even less robust.
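The order-dependence is easy to demonstrate with plain Python floats; the same IEEE-754 effect shows up in parallel GPU reductions.

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: 0.6000000000000001 vs 0.6
# Summing the same logits in a different order (as parallel reductions do)
# can flip which of two near-tied tokens ends up with the higher probability.
```
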
emporas · 10 months ago
In practice, every programmer or writer who gets LLM output does a lot of rewriting of already existing code or already existing text. Stitching together parts of many LLM outputs is the only way to use an LLM effectively, even stitching together parts from different LLMs, which I do all the time.

Recognizing only parts of a watermark, with many watermarked fragments scattered all around, doesn't seem possible at all, to my mind.

They can, however, develop software to sell very expensively to universities, schools, etc., and it will occasionally catch a very guilty person: someone who uses it all the time, doesn't even try to improve the answer, and always hands in the LLM output in one piece.

At the end of the day, it will lead to so many false accusations that people will stop trusting it. In chess, false accusations of cheating have happened all the time for 15 years or more. Right now, former world chess champion Kramnik has accused over 50 top chess players of cheating in the span of 2 months, including the 5-time US champion Nakamura.

If software like that gets applied to schools and universities, we are gonna have the fun of our lives.

bgro · 10 months ago
Couldn’t this be easily disrupted as a watermark system by simply changing the words to interfere with the relative checksum?

I suspect sentence structure is also being used, or, more likely, is the primary “watermark”. Similar to how you can easily identify that something is at least NOT a Yoda quote based on it having the wrong structure. Combine that with other negative patterns, like the quote containing Harry Potter references instead of Star Wars, and you can start to build up a profile of trends like this statement.

By rewriting the sentence structure and altering usual wording instead of directly copying the raw output, it seems like you could defeat any current raw watermarking.

Though this hasn’t stopped Google and others in the past from using bad science and stats to make unhinged, entitled claims, like when they added captcha problems everybody said would be “literally impossible” for bots to solve.

What a surprise that they turned out to be trivial to automate, and that the data they produce can be sold for profit at the expense of massive amounts of consumer time.

scarmig · 10 months ago
In principle, it seems like you could have semantic watermarking. For instance, suppose I want a short story. There are lots of different narrative and semantic aspects of it that each carry some number of bits of information: setting, characters, events, and those lie on a probability distribution like anything else. You just subtly shift the probability distribution of those choices, and then the watermark is resistant to word choice, reordering, and any transformation that preserves the semantic meaning.
ruuda · 10 months ago
Some comments here point at impossibility results, but after screening hundreds of job applications at work, it's not hard to pick out the LLM writing, even without a watermark. My internal LLM detector is now so sensitive that I can tell when my confirmed-human colleagues have used an LLM to rephrase something, as long as it's longer than a single sentence. The writing style is just so different.

Maybe if you prompt it right, it can do a better job of masking itself, but people don't seem to do that.

auggierose · 10 months ago
So, how many times did you actually get the confirmation that an LLM has/has not been used?

My guess is zero times. So, you are not describing an experiment here, you are just describing how you built up your internal bias.

kuhewa · 10 months ago
Probably not entirely fair. For example: after enough sentences it is trivially easy to identify LLM output, so you repeatedly get the opportunity to test yourself on a sentence or two, guess the provenance, and then realise it was the first sentence of several paragraphs of generated output.
tessierashpool9 · 10 months ago
this is an important realization!
ksaj · 10 months ago
Some of the watermarking is really obvious. If you write song lyrics in ChatGPT, watch for phrases like "come what may" and "I stand tall."

It's not just that they are (somewhat) unusual phrases, it's that ChatGPT comes up with those phrases so very often.

It's quite like how earlier versions always had a "However" in between explanations.

ksaj · 10 months ago
I had to follow up: I told my partner about watermarking.

We asked ChatGPT to explain the meaning of "come what may" - a phrase it generates very often in lyrics - and it responded by demanding proof that we were human.

It's definitely a watermark.

GaggiX · 10 months ago
ChatGPT does not have a watermark.
sunaookami · 10 months ago
It has a rich tapestry of watermarks.
jgalt212 · 10 months ago
I suggest we "delve" deeper int this problem.
aleph_minus_one · 10 months ago
What makes you sure about that?
fkyoureadthedoc · 10 months ago
Coheed and Cambria were using ChatGPT this whole damn time, smh
espadrine · 10 months ago
The academic paper: https://www.nature.com/articles/s41586-024-08025-4

They use the last N prefix tokens, hash them with a keyed hash, and use the resulting random value to sample the next token via an 8-wise tournament: assign random bits to each of the top 8 preferred tokens, make pairwise comparisons, and keep the token with the larger bit. (Yes, it seems complicated, but apparently it increases watermarking accuracy compared to straightforward nucleus sampling.)

The negative of this approach is that you need to rerun the LLM, so you must keep all versions of all LLMs that you trained, forever.
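A rough sketch of the tournament idea as described above; the hashing details, tie-breaking, and helper names here are assumptions on my part, not the paper's exact construction.

```python
import hashlib, random

def g(key: bytes, context: tuple, token: str, layer: int) -> int:
    """Keyed pseudorandom bit for (context, token, layer)."""
    msg = ("|".join(context) + f"||{token}||{layer}").encode()
    return hashlib.sha256(key + msg).digest()[0] & 1

def tournament_sample(key, context, probs, m=3):
    """Sample 2**m candidates from the model distribution, then run m knockout
    layers: in each pair, the token with the larger g-bit wins (ties broken at
    random). m=3 gives the 8-wise tournament described above."""
    tokens, weights = zip(*probs.items())
    candidates = random.choices(tokens, weights=weights, k=2 ** m)
    for layer in range(m):
        nxt = []
        for i in range(0, len(candidates), 2):
            a, b = candidates[i], candidates[i + 1]
            ga, gb = g(key, context, a, layer), g(key, context, b, layer)
            nxt.append((a if ga > gb else b) if ga != gb else random.choice((a, b)))
        candidates = nxt
    return candidates[0]

# Detection (roughly): recompute the g-bits for the observed tokens and check
# that their mean is significantly above 1/2 -- no model re-run needed.
```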

mmoskal · 10 months ago
They actually run a 2^30-way tournament (they derive an equivalent form that doesn't require 2B operations). You do not need to run the LLM; detection only depends on the tokenizer.
espadrine · 10 months ago
You’re right. I understood it to require taking the top 2^30 tokens, but instead they sample 2^30 times with replacement.

Too bad they only formulate the detection positive rate empirically. I am curious what the exact probability would be mathematically.

jkhdigital · 10 months ago
Why do you need to rerun the LLM? Watermark detection only requires the hash functions (equation (1) from the paper).
samatman · 10 months ago
This is information-theoretically guaranteed to make LLM output worse.

My reasoning is simple: the only way to watermark text is to inject some relatively low-entropy signal into it, which can be detected later. This has to a) work for "all" output for some values of all, and b) have a low false positive rate on the detection side. The amount of signal involved cannot be subtle, for this reason.

That signal has a subtractive effect on the predictive-output signal. The entropy of the output is fixed by the entropy of natural language, so this is a zero-sum game: the watermark signal will remove fidelity from the predictive output.

This is impossible to avoid or fix.
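One way to make that zero-sum argument precise, as a sketch using the standard data-processing bound (here P_wm and P_model are my names for the watermarked and unwatermarked output distributions):

```latex
% A detector with false-positive rate \alpha and true-positive rate \beta is a
% post-processing of the text, so by the data-processing inequality the
% watermarked output distribution must diverge from the model's own by at least
\[
  D_{\mathrm{KL}}\left(P_{\text{wm}} \,\|\, P_{\text{model}}\right)
  \;\ge\; \beta \log_2\frac{\beta}{\alpha}
        + (1-\beta)\log_2\frac{1-\beta}{1-\alpha}.
\]
% For example, \beta = 0.95 and \alpha = 0.01 force roughly 6 bits of divergence
% per detected document, distortion that (on this argument) has to come out of
% the fidelity of the model's predictive distribution.
```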

thornewolf · 10 months ago
you are correct if we suppose we are at a global optimum. however, consider this example:

i have two hands

i have 2 hands

these sentences communicate the same thing but one could be a watermarked result. we can apply this equivalent-meaning word/phrase change many times over and be confident something is watermarked while having avoided any semantic shifts.
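A tiny sketch of that idea; the variant list and the keyed-hash rule are made up for illustration.

```python
import hmac, hashlib

VARIANTS = [("two", "2"), ("cannot", "can't"), ("for example", "e.g.")]  # toy list

def keyed_bit(key: bytes, slot: int) -> int:
    return hmac.new(key, str(slot).encode(), hashlib.sha256).digest()[0] & 1

def watermark_choices(key: bytes):
    """Pick one phrasing per slot according to the key."""
    return [pair[keyed_bit(key, i)] for i, pair in enumerate(VARIANTS)]

def detect(key: bytes, observed: list) -> float:
    """Fraction of slots where the observed phrasing matches the keyed choice:
    ~1.0 for watermarked text, ~0.5 on average for unrelated text."""
    hits = sum(obs == pair[keyed_bit(key, i)]
               for i, (pair, obs) in enumerate(zip(VARIANTS, observed)))
    return hits / len(VARIANTS)
```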

jkhdigital · 10 months ago
You're not wrong, but natural language has a lot of stylistic "noise" which can be utilized as a subliminal channel without noticeably degrading the semantic signal.
mateus1 · 10 months ago
Google is branding this in a positive light but this is just AI text DRM.
sebstefan · 10 months ago
It's likely more about preventing model incest than digital rights management
gwbas1c · 10 months ago
Like all things a computer can or can't do, DRM isn't inherently bad: it's how it's used that's the problem.

I.e., DRM can't change people's motivations. It's useful for things like national security secrets and trade secrets, where the people who have access to the information have very clear motivations to protect that information, and very clear consequences for violating the rules that DRM is in place to protect.

In this case, the big question of whether AI watermarking will work or fail has more to do with people's motivations: will the general public accept AI watermarking because it fits our motivations and the consequences we set up for AI masquerading as a real person, or for AI being used for misinformation? That's a big question that I can't answer.

mateus1 · 10 months ago
This is not a “good deed for the public” by Google; it’s just a self-serving tool to enforce its algorithms and digital property. There is nothing “bad” here for the public, but it’s certainly not good either.
fastball · 10 months ago
I for one am glad we might have a path forward to filtering out LLM-generated sludge.
pyrale · 10 months ago
> we

If by "we" you mean anyone else than Google and the select few other LLM provider they choose to associate with, I'm afraid you're going to be disappointed.