Invisible watermarks are just steganography. Once the exact method of embedding is known, it is always possible to corrupt an existing watermark - however, in some cases it may not be possible to tell whether a watermark is present, such as when the extraction procedure always produces high-entropy information even from unwatermarked content.
Watermarking is not just steganography, and steganography is not just watermarking.
In June 1996, Ross Anderson organized the first workshop dedicated specifically to information hiding at Cambridge University. This event marked the beginning of a long series known as the Information Hiding Workshops, during which foundational terminology for the field was established. Information hiding, i.e., concealing a message within a host content, branches into two main applications: digital watermarking and steganography. In the case of watermarking, hiding means robustly embedding the message, permanently linking it to the content. In the case of steganography, hiding means concealing without leaving any statistically detectable traces.
References:
1. R. J. Anderson, editor. Proc. 1st Intl. Workshop on Inf. Hiding, volume 1174 of LNCS, 1996.
2. B. Pfitzmann. Information hiding terminology - results of an informal plenary meeting and additional proposals. In Anderson [1], pages 347–350.
Specifically, shuffling compression, bit rate, encryption, and barely human-perceivable signal around media (x-M) to obscure the entropic/random state of any medium, so as not to break the generally available plausible deniability from the standpoint of human perception.
Can't break Shannon's law, but it hides the intent of whoever is behind the knocks on all the doors. It obscures which house Shannon lives in, and with whom the knocker wishes to communicate.
There is some nice information in the appendix, like:
“One training with a schedule similar to the one reported in the paper represents ≈ 30 GPU-days. We also roughly estimate that the total GPU-days used for running all our experiments to 5000, or ≈ 120k GPU-hours. This amounts to total emissions in the order of 20 tons of CO2eq.”
I am not in AI at all, so I have no clue how bad this is. But it’s nice to have some idea of what the costs of such projects are.
So say I have a site with 3,000 images, 2M pixels each. How many GPU-months would it take to mark them? And how many gigabytes would I have to keep for the model?
I wonder what will come of all the creative technologists out there, trying to raise money to do "Watermarking" or "Human Authenticity Badge," when Meta will just do all the hard parts for free: both the technology of robust watermarking, and building an insurmountable social media network that can adopt it unilaterally.
Various previous attempts at invisible/imperceptible/mostly imperceptible watermarking have been trivially defeated; this attempt claims to be more robust to various kinds of edits. (From the paper: various geometric edits like rotations or crops, various valuemetric edits like blurs or brightness changes, and various splicing edits like cutting parts of the image into a new one or inpainting.) Invisible watermarking is useful for tracing the origins of content. That might be copyright information, or AI service information, or Photoshop information, or unique ID information to trace leakers of video game demos / films, or (until the local hardware key is extracted) a form of proof that an image came from a particular camera...
... Ideal for a repressive government or just a mildly corrupt government agency / corporate body to use to identify defectors, leakers, whistleblowers, or other dissidents. (Digital image sensors effectively already mark their output due to randomness of semiconductor manufacturing, and that has already been used by abovementioned actors for the abovementioned purposes. But that at least is difficult.) Tell me with a straight face that a culture that produced Chat Control or attempted to track forwarding chains of chat messages[1] won’t mandate device-unique watermarks kept on file by the communications regulator. And those are the more liberal governments by today’s standards.
I’m surprised how eager people are to build this kind of tech. It was quite a scandal (if ultimately a fruitless one) when it came out that colour printers marked their output with unique identifiers; and now that generative AI is a thing, stuff like TFA is seen as virtuous somehow. Can we maybe not forget about humans?..
[1] I don’t remember where I read about the latter or which country it was about—maybe India?
In my previous experience, "resize & rotate" always defeats all kinds of watermarks. For example, crop a 1000x1000 image to 999x999 and rotate it by 1°.
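For concreteness, here is a rough sketch of that kind of perturbation using Pillow; the file names are just placeholders, and the parameters are the ones from the example above.

```python
# Sketch of the crop-and-rotate perturbation described above, using Pillow.
# File names are placeholders; parameters match the example (999x999 crop, 1 degree).
from PIL import Image

img = Image.open("watermarked.png")            # e.g. a 1000x1000 watermarked image

# Crop one pixel off the right and bottom edges (1000x1000 -> 999x999),
# which shifts the pixel grid relative to any block-aligned watermark.
w, h = img.size
cropped = img.crop((0, 0, w - 1, h - 1))

# Rotate by 1 degree; bilinear resampling slightly alters every pixel value.
attacked = cropped.rotate(1, resample=Image.BILINEAR, expand=False)
attacked.save("attacked.png")
```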
Also, there's the "double watermark" attack: just run the result image through the watermarking process again; usually the original watermark will be lost.
And they'll say it's to combat disinformation, but it'll actually be to help themselves filter AI generated content out of new AI training datasets so their models don't get Habsburg'd.
What if the watermark becomes a latent variable that's indirectly learnt by a subsequent model trained on its generated data? They will have to constantly vary the mark to keep it up to date. Are we going to see a Merkle-tree watermark database like we see for certificate transparency? YC, here's your new startup idea.
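To make the certificate-transparency analogy concrete, here is a minimal, purely illustrative sketch of computing a Merkle root over hypothetical watermark payloads; none of this comes from the paper.

```python
# Minimal Merkle-root sketch over hypothetical watermark payloads,
# in the spirit of certificate-transparency logs. Purely illustrative.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:        # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical payloads a provider might log for each generated image.
payloads = [b"model=v1;img=0001", b"model=v1;img=0002", b"model=v2;img=0003"]
print(merkle_root(payloads).hex())
```

Publishing periodic roots like this would let third parties audit which marks were in use over time without the provider having to reveal every payload up front.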
I can imagine some kind of public/private key encrypted watermark system to ensure the veracity / provenance of media created via LLMs and their associated user accounts.
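As a rough sketch of what that could look like (assuming Ed25519 from the `cryptography` package, and a hypothetical `embed_watermark` stand-in for whatever embedder is actually used):

```python
# Sketch: sign a provenance payload, then embed payload + signature as the watermark.
# embed_watermark() is hypothetical; the payload fields are made up.
from cryptography.hazmat.primitives.asymmetric import ed25519

signing_key = ed25519.Ed25519PrivateKey.generate()   # held by the generating service
verify_key = signing_key.public_key()                 # published for verifiers

payload = b"service=example;account=12345;ts=2024-01-01T00:00:00Z"
signature = signing_key.sign(payload)

# watermarked = embed_watermark(image, payload + signature)   # hypothetical embedder

# Anyone who extracts (payload, signature) can verify provenance;
# verify() raises InvalidSignature if either part was altered.
verify_key.verify(signature, payload)
```

One practical caveat: a payload plus a 64-byte signature is far more data than the few dozen bits typical invisible watermarks carry, so in practice the embedded mark would likely be a short ID pointing to a signed record stored elsewhere.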
There are many reasons why people are concerned about AI's training data becoming AI-generated. The usual one is that the training will diverge, but this is another good one.
Camera makers are all working on adding cryptographic signatures to captured images to prove their provenance. The current standard embeds this in metadata, but if they start watermarking the images themselves, then skipping watermarked images during training would quickly become an issue.
Stenography is just security by more obscurity.
Security-by-obscurity is when security hinges on keeping your algorithm itself (as opposed to some key) hidden from the adversary.
I don't see how it has any connection with what you're alluding to here.
https://en.wikipedia.org/wiki/Stenography
https://en.wikipedia.org/wiki/Steganography
That's about 33 economy class roundtrip flights from LAX to JFK.
https://www.icao.int/environmental-protection/Carbonoffset/P...
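(Rough arithmetic: if an economy LAX–JFK round trip is on the order of 0.6 t CO2eq per passenger, then 20 t / 0.6 t ≈ 33 round trips, matching the figure above.)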
1. Different energy sources produce varying amounts of CO2.
2. This likely does not include the CO2 from manufacturing the GPUs or machines.
3. The humans involved are not counted at all, nor is the impact they have on the environment.
4. There is no way to predict the future CO2 from others using this work.
Also, if it really matters, then why do it at all? If we're saying "hey, this is destroying the environment" and we care, then maybe don't do that work?
How was Copilot trained? GitHub.
Zoom and others would love to use your data to train their AI. It’s their proprietary advantage!
We did consider a similar FOSS project, but didn't like the idea of helping professional thieves abuse DMCA rules.
Have a nice day. =3