Smuggling arbitrary data through an emoji

This is cute but unnecessary - Unicode includes a massive range called PUA: the private use area. The codes in this range aren’t mapped to anything (and won’t be mapped to anything) and are for internal/custom use, not to be passed to external systems (for example, we use them in fish-shell to safely parse tokens into a string, turning an unescaped special character into just another Unicode code point in the string, but in the PUA area, then intercept that later in the pipeline).

You’re not supposed to expose them outside your api boundary but when you encounter them you are prescribed to pass them through as-is, and that’s what most systems and libraries do. It’s a clear potential exfiltration avenue, but given that most sane developers don’t know much more about Unicode other than “always use Unicode to avoid internationalization issues”, it’s often left wide open.

paulgb · 6 months ago

I just tested and private use characters render as boxes for me (󰀀), the point here was to encode them in a way that they are hidden and treated as "part of" another character when copy/pasting.

diggan · 6 months ago

> the point here was to encode them in a way that they are hidden and treated as "part of" another character when copy/pasting

AKA "Steganography" for the curious ones: https://en.wikipedia.org/wiki/Steganography

bruce343434 · 6 months ago

On my Android phone,that displays "Go[][]" in the Google logo font.

layer8 · 6 months ago

The difference is that PUA characters are usually rendered in some way that is rather visible, whereas the variation selectors aren’t.

lolinder · 6 months ago

Context that some may be missing is that this was inspired by discussion surrounding the Open Heart Protocol submission:

https://news.ycombinator.com/item?id=42791378

People immediately began discussing the applications for criminal use given the constraint that only emoji are accepted by the API. So for that use case the PUA wouldn't be an option, you have to encode it in the emoji.

Sniffnoy · 6 months ago

Isn't this more what the designated noncharacters are for, rather than the private-use area? Given how the private-use area sometimes gets for unofficial encodings of scripts not currently in Unicode (or for things like the Apple logo and such) I'd be worried about running into collisions with that if I used the PUA in such a way.

Note that designated noncharacters includes not only 0xFFFF and 0xFFFE, and not only the final two code points of every plane, but also an area in the middle of Arabic Presentation Forms that was at some point added to the list of noncharacters specifically so that there would be more noncharacters for people using them this way!

juped · 6 months ago

I'll be h󠄾󠅟󠅠󠅕󠄜󠄐󠅞󠅟󠄐󠅣󠅕󠅓󠅢󠅕󠅤󠅣󠄐󠅘󠅕󠅢󠅕onest, I pasted this comment in the provided decoder thinking no one could miss the point this badly and there was probably a hidden message inside it, but either you really did or this website is stripping them.

You can't invisibly watermark an arbitrary character (I did it to one above! If this website isn't stripping them, try it out in the provided decoder and you'll see) with unrecognized PUA characters, because it won't treat them as combining characters. You will cause separately rendered rendered placeholder-box characters to appear. Like this one:  (may not be a placeholder-box if you're privately-using the private use area yourself).

egypturnash · 6 months ago

j󠄗󠅄󠅧󠅑󠅣󠄐󠅒󠅢󠅙󠅜󠅜󠅙󠅗󠄜󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅣󠅜󠅙󠅤󠅘󠅩󠄐󠅤󠅟󠅦󠅕󠅣󠄴󠅙󠅔󠄐󠅗󠅩󠅢󠅕󠄐󠅑󠅞󠅔󠄐󠅗󠅙󠅝󠅒󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅧󠅑󠅒󠅕󠄫󠄱󠅜󠅜󠄐󠅝󠅙󠅝󠅣󠅩󠄐󠅧󠅕󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠅒󠅟󠅢󠅟󠅗󠅟󠅦󠅕󠅣󠄜󠄱󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅝󠅟󠅝󠅕󠄐󠅢󠅑󠅤󠅘󠅣󠄐󠅟󠅥󠅤󠅗󠅢󠅑󠅒󠅕󠄞󠄒󠄲󠅕󠅧󠅑󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄜󠄐󠅝󠅩󠄐󠅣󠅟󠅞󠄑󠅄󠅘󠅕󠄐󠅚󠅑󠅧󠅣󠄐󠅤󠅘󠅑󠅤󠄐󠅒󠅙󠅤󠅕󠄜󠄐󠅤󠅘󠅕󠄐󠅓󠅜󠅑󠅧󠅣󠄐󠅤󠅘󠅑󠅤󠄐󠅓󠅑󠅤󠅓󠅘󠄑󠄲󠅕󠅧󠅑󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠄺󠅥󠅒󠅚󠅥󠅒󠄐󠅒󠅙󠅢󠅔󠄜󠄐󠅑󠅞󠅔󠄐󠅣󠅘󠅥󠅞󠅄󠅘󠅕󠄐󠅖󠅢󠅥󠅝󠅙󠅟󠅥󠅣󠄐󠄲󠅑󠅞󠅔󠅕󠅢󠅣󠅞󠅑󠅤󠅓󠅘󠄑󠄒󠄸󠅕󠄐󠅤󠅟󠅟󠅛󠄐󠅘󠅙󠅣󠄐󠅦󠅟󠅢󠅠󠅑󠅜󠄐󠅣󠅧󠅟󠅢󠅔󠄐󠅙󠅞󠄐󠅘󠅑󠅞󠅔󠄪󠄼󠅟󠅞󠅗󠄐󠅤󠅙󠅝󠅕󠄐󠅤󠅘󠅕󠄐󠅝󠅑󠅞󠅨󠅟󠅝󠅕󠄐󠅖󠅟󠅕󠄐󠅘󠅕󠄐󠅣󠅟󠅥󠅗󠅘󠅤󠇒󠅰󠆄󠅃󠅟󠄐󠅢󠅕󠅣󠅤󠅕󠅔󠄐󠅘󠅕󠄐󠅒󠅩󠄐󠅤󠅘󠅕󠄐󠅄󠅥󠅝󠅤󠅥󠅝󠄐󠅤󠅢󠅕󠅕󠄜󠄱󠅞󠅔󠄐󠅣󠅤󠅟󠅟󠅔󠄐󠅑󠅧󠅘󠅙󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅟󠅥󠅗󠅘󠅤󠄞󠄱󠅞󠅔󠄐󠅑󠅣󠄐󠅙󠅞󠄐󠅥󠅖󠅖󠅙󠅣󠅘󠄐󠅤󠅘󠅟󠅥󠅗󠅘󠅤󠄐󠅘󠅕󠄐󠅣󠅤󠅟󠅟󠅔󠄜󠅄󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄜󠄐󠅧󠅙󠅤󠅘󠄐󠅕󠅩󠅕󠅣󠄐󠅟󠅖󠄐󠅖󠅜󠅑󠅝󠅕󠄜󠄳󠅑󠅝󠅕󠄐󠅧󠅘󠅙󠅖󠅖󠅜󠅙󠅞󠅗󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠄐󠅤󠅘󠅕󠄐󠅤󠅥󠅜󠅗󠅕󠅩󠄐󠅧󠅟󠅟󠅔󠄜󠄱󠅞󠅔󠄐󠅒󠅥󠅢󠅒󠅜󠅕󠅔󠄐󠅑󠅣󠄐󠅙󠅤󠄐󠅓󠅑󠅝󠅕󠄑󠄿󠅞󠅕󠄜󠄐󠅤󠅧󠅟󠄑󠄐󠄿󠅞󠅕󠄜󠄐󠅤󠅧󠅟󠄑󠄐󠄱󠅞󠅔󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠅄󠅘󠅕󠄐󠅦󠅟󠅢󠅠󠅑󠅜󠄐󠅒󠅜󠅑󠅔󠅕󠄐󠅧󠅕󠅞󠅤󠄐󠅣󠅞󠅙󠅓󠅛󠅕󠅢󠄝󠅣󠅞󠅑󠅓󠅛󠄑󠄸󠅕󠄐󠅜󠅕󠅖󠅤󠄐󠅙󠅤󠄐󠅔󠅕󠅑󠅔󠄜󠄐󠅑󠅞󠅔󠄐󠅧󠅙󠅤󠅘󠄐󠅙󠅤󠅣󠄐󠅘󠅕󠅑󠅔󠄸󠅕󠄐󠅧󠅕󠅞󠅤󠄐󠅗󠅑󠅜󠅥󠅝󠅠󠅘󠅙󠅞󠅗󠄐󠅒󠅑󠅓󠅛󠄞󠄒󠄱󠅞󠅔󠄐󠅘󠅑󠅣󠅤󠄐󠅤󠅘󠅟󠅥󠄐󠅣󠅜󠅑󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄯󠄳󠅟󠅝󠅕󠄐󠅤󠅟󠄐󠅝󠅩󠄐󠅑󠅢󠅝󠅣󠄜󠄐󠅝󠅩󠄐󠅒󠅕󠅑󠅝󠅙󠅣󠅘󠄐󠅒󠅟󠅩󠄑󠄿󠄐󠅖󠅢󠅑󠅒󠅚󠅟󠅥󠅣󠄐󠅔󠅑󠅩󠄑󠄐󠄳󠅑󠅜󠅜󠅟󠅟󠅘󠄑󠄐󠄳󠅑󠅜󠅜󠅑󠅩󠄑󠄒󠄸󠅕󠄐󠅓󠅘󠅟󠅢󠅤󠅜󠅕󠅔󠄐󠅙󠅞󠄐󠅘󠅙󠅣󠄐󠅚󠅟󠅩󠄞󠄗󠅄󠅧󠅑󠅣󠄐󠅒󠅢󠅙󠅜󠅜󠅙󠅗󠄜󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅣󠅜󠅙󠅤󠅘󠅩󠄐󠅤󠅟󠅦󠅕󠅣󠄴󠅙󠅔󠄐󠅗󠅩󠅢󠅕󠄐󠅑󠅞󠅔󠄐󠅗󠅙󠅝󠅒󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅧󠅑󠅒󠅕󠄫󠄱󠅜󠅜󠄐󠅝󠅙󠅝󠅣󠅩󠄐󠅧󠅕󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠅒󠅟󠅢󠅟󠅗󠅟󠅦󠅕󠅣󠄜󠄱󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅝󠅟󠅝󠅕󠄐󠅢󠅑󠅤󠅘󠅣󠄐󠅟󠅥󠅤󠅗󠅢󠅑󠅒󠅕󠄞 is for Jabberwocky. Does this decode?

edit: Yes, it does.

I love the idea of using this for LLM output watermarking. It hits the sweet spot - will catch 99% of slop generators with no fuss, since they only copy and paste anyway, almost no impact on other core use cases.

I wonder how much you’d embed with each letter or token that’s output - userid, prompt ref, date, token number?

I also wonder how this is interpreted in a terminal. Really cool!

fennecfoxy · 6 months ago

Why does anybody think AI watermarking will ever work? Of course it will never work, any watermarking can be instantly & easily stripped...

The only real AI protection is to require all human interaction to be signed by a key verified by irl identity and even then that will: A never happen, B be open to abuse by countries with corrupt governments and countries with corrupt governments heavily influenced by private industry (like the US).

teruakohatu · 6 months ago

> any watermarking can be instantly & easily stripped...

I think it took a while before printer watermarking (yellow dots) was discovered. It certainly cannot be stripped. It was possibly developed in the mid-80s but not known to the public until mid 2000s.

zos_kia · 6 months ago

With the amount of pre processing that is done before integrating stuff in a dataset I'd be surprised if those kinds of shenanigans even worked

capitainenemo · 6 months ago

In most linux terminals, what you pass it is just a sequence of bytes that is passed unmangled. And since this technique is UTF-8 compliant and doesn't use any extra glyphs, it is invisible to humans in unicode compliant terminals. I tried it on a few. It shows up if you echo the sentence to, say, xxd ofc.

(unlike the PUA suggestion in the currently top voted comment which shows up immediately ofc)

Additional test corrections: While xxd shows the message passing through completely unmangled on pasting it into the terminal, when I selected from the terminal (echoed sentence, verified unmangled in xxd, then selected and pasted the result of echo), it was truncated to a few words using X select in mate terminal and konsole - I'm not sure where that truncation happens, whether it's the terminal or X. In xterm, the final e was mangled, and the selection was even more truncated.

The sentence is written unmangled to files though, so I think it's more about copying out of the terminal dropping some data. Verified by echoing the sentence to a test file, opening it in a browser, and copying the text from there.

vessenes · 6 months ago

On MacOS, kitty shows an empty box, then an a for the "h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a" post below. I think this is fair and even appreciated. Mac Terminal shows "ha". That "h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a" (and this one!) can be copied and pasted into the decoder successfully.

ChadNauseam · 6 months ago

There are other possible approaches to LLM watermarking that would be much more robust and harder to detect. They exploit the fact that LLMs work by producing a probability distribution that gives a probability for each possible next token. These are then sampled randomly to produce the output. To add fingerprints when generating, you could do some trickery in how you do that sampling that would then be detectable by re-running the LLM and observing its outputs. For example, you could alternate between selecting high-probability and low-probability tokens. (A real implementation of this would be much more sophisticated than that obviously, but hopefully you get the idea)

vessenes · 6 months ago

This is not a great method in a world with closed models and highly diverse open models and samplers. It’s intellectually appealing for sure! But it will always be at best a probabilistic method, and that’s if you have the llm weights at hand.

OutOfHere · 6 months ago

Just you wait until AI starts calling human output to be slop.

vessenes · 6 months ago

That's already happening - my kids have had papers unfairly blamed on chatgpt by automated tools. Protect yourself kids, use an editor that can show letter by letter history.

roguecoder · 6 months ago

There are of course human writers who are less-communicative than AI, called "shit writers", and humans who are less accurate than AI, called "liars".

The difference is humans are responsible for what they write, whereas the human user who used an AI to generate text is responsible for what the computer wrote.

riskable · 6 months ago

Oh this is just the tip of the iceberg when it comes to abusing Unicode! You can use a similar technique to this to overflow the buffer on loads of systems that accept Unicode strings. Normally it just produces an error and/or a crash but sometimes you get lucky and it'll do all sorts of fun things! :)

I remember doing penetration testing waaaaaay back in the day (before Python 3 existed) and using mere diacritics to turn a single character into many bytes that would then overflow the buffer of a back-end web server. This only ever caused it to crash (and usually auto-restart) but I could definitely see how this could be used to exploit certain systems/software with enough fiddling.

sourque · 6 months ago

This was the premise of a Google CTF quals 2024 challenge ("encrypted runner").

Yeah. Zalgo text is a common test for input fields on websites. But it usually doesn't do anything interesting. Maybe an exception trigger on some database length limit. Doesn't typically even kill any processes. The exception is normally just in your thread. You can often trigger it just by disabling JS on even modern forms, but,, at best you're maybe leaking a bit of info if they left debug on and print the stack trace or a query. Another common slip-up is failing to count \n vs \r\n in text strings since JS usually usually counts a carriage return as 1 byte, but HTTP spec requires two.

unescape(encodeURIComponent("ç")).length is the quick and dirty way to do a JS byte length check. The \r\n thing can be done just by cleaning up the string before length counting.

dotancohen · 6 months ago

Does Zalgo even work on HN? I've never thought of using it to test my systems, thank you. I've got some new testing to do tonight.

Edit: No, Zalgo doesn't work on HN. This comment itself was an experiment to try.

Rendello · 6 months ago

A few months ago I made a post which I (should've) named "Unicode codepoints that expand or contract when case is changed in UTF-8". A decent parser shouldn't have any issues with things like this, but software that makes bad Unicode assumptions might.

https://news.ycombinator.com/item?id=42014045

n0id34 · 6 months ago

Sorry n00b here, can you explain more about this or how you did this? I feel like this is definitely a loophole that would be worth testing for.

ComputerGuru · 6 months ago

omnibrain · 6 months ago

10 years or so ago I shocked coworkers with using U+202D LEFT-TO-RIGHT OVERRIDE mid in filenames on windows. So funnypicturegnp.exe became funnypictureexe.png Combined with a custom icon for the program that mimics a picture preview it was pretty convincing.

mdup · 6 months ago

I worked in phishing detection. This was a common pattern used by attackers, although .exe are blocked automatically most of the time, .html is the new malicious extension (often hosting an obfuscated window.location redirect to a fake login page).

RTL abuse like cute-cat-lmth.png was relatively common, but also trivial to detect. We would immediately flag such an email as phishing.

omoikane · 6 months ago

The source code version of that is CVE-2021-42574, and they have a website:

https://trojansource.codes/

Basically it's possible to hide some code that looks like comments but compiles like code. I seem to recall the CVE status was disputed since many text editors already make these suspicious comments visible.

taneq · 6 months ago

I’d never heard of this particular trick but I’m glad my decades of paranoia-fueled “right click -> open with” treatment of any potentially sketchy media file was warranted! :D

oefnak · 6 months ago

I created a guitar_tab.txt which was a bat file.

hosteur · 6 months ago

Wow this is a clever trick.

rexxars · 6 months ago

For a real-world use case: Sanity used this trick[0] to encode Content Source Maps[1] into the actual text served on a webpage when it is in "preview mode". This allows an editor to easily trace some piece of content back to a potentially deep content structure just by clicking on the text/content in question.

It has it's drawbacks/limitations - eg you want to prevent adding it for things that needs to be parsed/used verbatim, like date/timestamps, urls, "ids" etc - but it's still a pretty fun trick.

[0] https://www.sanity.io/docs/stega

[1] https://github.com/sanity-io/content-source-maps

ethin · 6 months ago

It's worth noting, just as a curiosity, that screen readers can detect these variation selectors when I navigate by character. For example, if I arrow over the example he provided (I can't paste it here lol), I here: "Smiling face with smiling eyes", "Symbol e zero one five five", "Symbol e zero one five c", "Symbol e zero one five c", "Symbol e zero one five f". This is unfortunately dependent on the speech synthesizer used, and I wouldn't know if the characters were there if I was just reading a document, so this isn't much of an advantage all things considered.

llm_trw · 6 months ago

Ironically enough I have a script that strips all non-ascii characters from my screen reader because I found that _all_ online text was polluted with invisible and annoying to listen to characters.

Mine (NVDA) isn't annoying about non-ASCII symbols, interestingly enough. But for something like this form of Unicode "abuse" (?), if you throw a ton of them into a message or something, they become "lines" I have to arrow past because my screen reader will otherwise remain silent on those lines unless I use say-all (continuous reading for those who don't use screen readers).

kevinsync · 6 months ago

StegCloak [0] is in the same ballpark and takes this idea a step further by encrypting the hidden payload via AES-256-CTR -- pretty neat little trick

[0] https://github.com/KuroLabs/stegcloak

giancarlostoro · 6 months ago

There's a Better Discord plugin that I think uses this or something similar, so you could send completely encrypted messages, that look like nothing to everyone else. You'd need to share a password secret for them to decode it though.

johnisgood · 6 months ago

Could probably add OTR, too.

putna · 6 months ago

wow, thats neat.

Wanted to try on Cloudflare DNS TXT record. But Cloudflare is smart enough to decode when pasting in TXT field.

UltraSane · 6 months ago

DNS only supports ASCII for record values. It has a hack to support unicode domain names using Punycode

vzaliva · 6 months ago

The title lis little misleading: "Note that the base character does not need to be an emoji – the treatment of variation selectors is the same with regular characters. It’s just more fun with emoji."

Using this approach with non-emoji characters makes it more stealth and even more disturbing.

LorenPechtel · 6 months ago

I don't see this as all that disturbing. A detector for it wouldn't be hard to write (variant on something that doesn't actually have variants, flag it!), it seems to me it could be useful for signing things.