integralid · 20 hours ago
I'm not certain... On one hand I agree that some characters are problematic (or invalid), like unpaired surrogates. But the worst-case scenario, IMO, is when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped ones.

In the example, username validation is the job of another layer. For example, I want to make sure a username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail at a completely different layer, before validation even runs.

And for usernames some classes are obviously bad, as explained. But what if I send text files that actually use those weird tabs? I expect things that work in my language's UTF-8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.

On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone creating their own mini standard. So I think I like the idea, just don't buy the argumentation or examples in the blog post.
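
To sketch the layering I mean (check_username here is a hypothetical app-layer validator, not a real API):

  import json

  def check_username(name):
      # hypothetical app-layer rules: length and character checks live here
      if len(name) >= 60:
          return "username too long"
      if any(ord(c) < 0x20 for c in name):
          return "control characters not allowed"
      return None

  doc = json.loads('{"username": "alice\\u0000"}')  # the JSON layer accepts this
  print(check_username(doc["username"]))            # the API layer rejects it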

csande17 · 20 hours ago
Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are:

- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"

- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were
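
To make the first distinction concrete, the "Unicode scalars" invariant is roughly this (a Python sketch):

  def is_scalar_only(s):
      # True iff s contains no surrogate code points, i.e. only Unicode scalar values
      return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)

  print(is_scalar_only('hello'))   # True
  print(is_scalar_only('\ud800'))  # False: a lone surrogate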

OCTAGRAM · 4 minutes ago
Seed7 uses UTF-32. Ada's standard library has UTF-32 for I/O data and file names. Ada is a language where almost nothing ever disappears from the standard library, so 8-bit and UTF-16 I/O and/or file names are all still there.
mort96 · 18 hours ago
> - "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

I thought WTF-8 was just "UTF-8, but without the restriction against encoding unpaired surrogates"? Windows and Java and JavaScript all use "potentially ill-formed UTF-16" as their string type, not WTF-8.
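
For what it's worth, Python will produce WTF-8-style bytes if you ask it to, though 'surrogatepass' is a bit more permissive than WTF-8 proper:

  >>> '\ud800'.encode('utf-8', 'surrogatepass')  # generalized UTF-8 for a lone surrogate
  b'\xed\xa0\x80'
  >>> '\ud800'.encode('utf-8')  # strict UTF-8 refuses
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed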

alright2565 · 18 hours ago
> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.

stuartjohnson12 · 19 hours ago
> "WTF-8", aka "the JavaScript string type"

This sequence of characters is a work of art.

dcrazy · 19 hours ago
Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type?”

Either way, I think the bitter lesson is that a parser really can't rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by that parser).
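
In Python terms that means a strict decode at the boundary, roughly:

  def parse_wire_string(raw):
      # treat the wire as untrusted; strict UTF-8 decoding (the default) raises on ill-formed input
      return raw.decode('utf-8')

  parse_wire_string(b'hi')            # 'hi'
  parse_wire_string(b'\xed\xa0\x80')  # UnicodeDecodeError: surrogates are ill-formed UTF-8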

zahlman · 18 hours ago
>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

"the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this.

Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.)

Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.)
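
You can observe the flexible storage with sys.getsizeof (numbers from a 64-bit CPython; exact overheads vary by version):

  >>> import sys
  >>> sys.getsizeof('a' * 1000)           # 1 byte per code point
  1049
  >>> sys.getsizeof('\u03b1' * 1000)      # 2 bytes per code point
  2074
  >>> sys.getsizeof('\U00010000' * 1000)  # 4 bytes per code point
  4076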

Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:

  >>> x = '\ud800\udc00'
  >>> x
  '\ud800\udc00'
  >>> print(x)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
They're even disallowed for an explicit encode to utf-16:

  >>> x.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed
But this can be overridden:

  >>> x.encode('utf-16-le', 'surrogatepass')
  b'\x00\xd8\x00\xdc'
Which subsequently allows for decoding that automatically interprets surrogate pairs:

  >>> y = x.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
  >>> y
  '𐀀'
  >>> len(y)
  1
  >>> ord(y)
  65536
Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):

  $ cat cmdline.py 
  #!/usr/bin/python
  
  import binascii, sys
  for arg in sys.argv[1:]:
      abytes = arg.encode(sys.stdin.encoding, 'surrogateescape')
      ahex = binascii.hexlify(abytes)
      print(ahex.decode('ascii'))
  $ ./cmdline.py foo
  666f6f
  $ ./cmdline.py 日本語
  e697a5e69cace8aa9e
  $ ./cmdline.py $'\x01\x00\x02'
  01
  $ ./cmdline.py $'\xff'
  ff
  $ ./cmdline.py ÿ
  c3bf
It does this by decoding with the same 'surrogateescape' error handler that the above diagnostic needs when re-encoding:

  >>> b'\xff'.decode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
  >>> b'\xff'.decode('utf-8', 'surrogateescape')
  '\udcff'

Joker_vD · 17 hours ago
Seriously, please don't use C0 (except for LF and, I grudgingly concede, HT) and C1 characters in your plain text files. I understand that you may want to store some "ANSI coloring markup" (it's not "VT100 colors" — the VT series was monochrome until the VT525 of 1994), sure, but then it's arguably not plain text anymore, is it? It's a text markup format of sorts, not unlike Markdown, only one that uses a different encoding that dips into the C0 range. Just because your favourite output device can display it prettily when you cat your data into it doesn't mean it's plain text.

Yes, I do realize that there are a lot of text markup formats that encode into plain text, for better interoperability.
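
If you want to enforce that rule, a rough Python check (whether to also allow CR or FF is a judgment call):

  import re

  # C0 controls other than HT (\x09) and LF (\x0a), plus DEL and the C1 range
  _BAD_CONTROLS = re.compile(r'[\x00-\x08\x0b-\x1f\x7f\x80-\x9f]')

  def is_plain_text(s):
      return _BAD_CONTROLS.search(s) is None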

cesarb · 11 hours ago
> Seriously, please don't use C0 (except for LF and, I grudgingly concede, HT) and C1 characters in your plain text files.

It is (or, at least, used to be) common to have FF characters in plain text files, as a signal for your (dot-matrix) printer to advance to the next page. So I'd add at least FF to that list.

singpolyma3 · 15 hours ago
Why ban emoji in usernames?
pas · 13 hours ago
I think for usernames it's fine; where a bit of restraint makes sense is billing/shipping/legal-ish data.
TheRealPomax · 18 hours ago
I think you missed the part where the RFC is about which Unicode is bad for protocols and data formats, and so which Unicode you should avoid when designing those from now on, with an RFC to consult to know which ones those are. It has nothing to do with "what if I have a file with X" or "what if I want Y in usernames", it's about "what should I do if I want a normal, well behaved, unicode-text-based protocol or data format".

It's not about JSON, or the web, those are just example vehicles for the discussion. The RFC is completely agnostic about what thing the protocols or data formats are intended for, as long as they're text based, and specifically unicode text based.

So it sounds like you misread the blog post, and what you should do now is read the RFC. It's short. You can cruise through https://www.rfc-editor.org/rfc/rfc9839.html in a few minutes and see it's not actually about what you're focusing on.

TacticalCoder · 13 hours ago
> In the example, username validation is a job of another layer. For example I want to make sure username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail on completely different layer pre-validation.

Usernames are a bad example, because at the point you mention you may as well only allow a subset of visible ASCII, which a lot of sites do, and that works perfectly fine.

But for stuff like family names you have to restrict so many things, otherwise you'll have little-bobby-zalgo-with-hangul-modifiers wreaking havoc.

Unicode is the problem. And workarounds are sadly needed due to the clusterfuck that Unicode is.

Like TFA shows. Like every homograph attack using Unicode characters shows.

If Unicode was good, it wouldn't regularly be frontpage of HN.
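
The classic homograph demo, in Python:

  import unicodedata

  latin = 'paypal'
  mixed = 'p\u0430ypal'  # U+0430 CYRILLIC SMALL LETTER A, visually identical to 'a'
  print(latin == mixed)              # False
  print(unicodedata.name(mixed[1]))  # CYRILLIC SMALL LETTER A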

CharlesW · 20 hours ago
> I like the idea, just don't buy the argumentation or examples in the blog post.

Which ones, and why? Tim and Paul collectively have around 100,000X the experience with this that most people do, so it'd be interesting to read substantive criticism.

It seems like you think this standard is JSON-specific?

doug_durham · 18 hours ago
I thought the question was pretty substantive. What layer in the stack should make the decisions about which characters to allow? I had exactly the same question. If a library declares that it will filter out certain subsets, then that allows me to choose a different library if needed. I would hate to see this RFC implemented blindly just because it's a standard.
OCTAGRAM · 16 minutes ago
> The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will interfere with the operation of certain programming languages.

This part encourages more active usage of U+0000, so that programmers of certain programming languages get the message that they are not welcome.

JimDabell · 20 hours ago
> PRECISion · You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode. It didn’t; here’s RFC 8264: PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002. 8264 is 43 pages long, containing a very thorough discussion of many more potential Bad Unicode issues than 9839 does.

I’d also suggest people check out the accompanying RFCs 8265 and 8266:

PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols:

https://www.rfc-editor.org/rfc/rfc8264

Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords:

https://www.rfc-editor.org/rfc/rfc8265

Preparation, Enforcement, and Comparison of Internationalized Strings Representing Nicknames:

https://www.rfc-editor.org/rfc/rfc8266

Generally speaking, you don’t want usernames being displayed that can change the text direction, or passwords that have different byte representations depending on the device that was used to type them in. These RFCs have specific profiles to avoid that.

I think for these kinds of purposes, failing closed is more secure than failing open. I’d rather disallow whatever the latest emoji to hit the streets is from usernames than potentially allow it to screw up every page that displays usernames.
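
The password problem is (roughly) normalization, which the PRECIS profiles pin down; a Python sketch:

  import unicodedata

  composed = '\u00e9'     # 'é' as one code point, as one keyboard might produce it
  decomposed = 'e\u0301'  # 'e' plus combining acute accent, as another might
  print(composed == decomposed)                                # False
  print(unicodedata.normalize('NFC', decomposed) == composed)  # True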

singpolyma3 · 15 hours ago
The problem with failing closed is that you end up 20 years later still not supporting emoji from 20 years ago and users get annoyed...
Waterluvian · 20 hours ago
I’m frustrated by things like Unicode where it’s “good” except… you need to know to exclude some characters. Unicode feels like a wild jungle of complexity. An understandable consequence of trying to formalize so many ways to write language. But it really sucks to have to reason about some characters being special compared to others.

The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.

csande17 · 20 hours ago
Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the Explicit Directional Overrides characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns".
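
A quick illustration of why the override characters are scary (actual display depends on the renderer):

  filename = 'invoice_\u202egpj.exe'  # U+202E RIGHT-TO-LEFT OVERRIDE
  print(filename)       # a bidi-aware display may show it as 'invoice_exe.jpg'
  print(len(filename))  # 16; the underlying code points are unchanged
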
weinzierl · 19 hours ago
I wouldn’t be so harsh. I think the Unicode Consortium not only started with good intentions but also did excellent work for the first decade or so.

I just think they got distracted when the problems got harder, and instead of tackling them head-on, they now waste a lot of their resources on busywork - good intentions notwithstanding. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. That OpenType is a good and powerful standard which masks some of Unicode’s shortcomings doesn’t really help.

It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues.

estebank · 19 hours ago
The security concerns are those of "Trojan source", where the displayed text doesn't correspond to the bytes on the wire.[1]

I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with the existing text corpora out there, but it's a fair observation.

1: https://trojansource.codes/

arp242 · 19 hours ago
I always thought you kind of need those directional control characters to correctly render bidi text? e.g. if you write something in Hebrew but include a Latin word/name (or the reverse).
Etheryte · 20 hours ago
As a simple example off the top of my head, if the first string ends in an orphaned emoji modifier and the second one starts with a modifiable emoji, you're already going to have trouble. It's only downhill from there with more exotic stuff.
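
Roughly, in Python (the rendering depends on your terminal/font):

  thumbs = '\U0001F44D'  # the thumbs-up emoji
  tone = '\U0001F3FB'    # EMOJI MODIFIER FITZPATRICK TYPE-1-2, meaningless on its own
  s = thumbs + tone
  print(len(s))  # 2 code points, but most renderers draw a single modified emoji
  # i.e. concatenating the two strings changed how the first one displays
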
kps · 20 hours ago
Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity.
eviks · 19 hours ago
Indeed, though a lot of that complexity, like surrogates and control codes, isn't due to attempts to write language; it's just awful design preserved for posterity.
ivanjermakov · 12 hours ago
Unicode sucks, but it sucks less than every other encoding standard.
ninkendo · 19 hours ago
It seems like most of these are handled by just rejecting invalid UTF-8 byte sequences (ideally, erroring out altogether) when interpreting a string as UTF-8. I mean, unpaired surrogates, or any surrogates for that matter, are already illegal as UTF-8 byte sequences. Any competent language that uses UTF-8 for strings should already be returning errors when given such sequences.

The list of code points which are problematic (non-printing, etc) are IMO much more useful and nontrivial. But it’d be useful to treat those as a separate concept from plain-old illegal UTF-8 byte sequences.
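
In Python, for instance, the default strict decoder already rejects both of these:

  b'\xed\xa0\x80'.decode('utf-8')  # surrogate U+D800 encoded: UnicodeDecodeError
  b'\xc0\x80'.decode('utf-8')      # overlong encoding of NUL: UnicodeDecodeError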

doug_durham · 18 hours ago
That seems reasonable. It should be up to the application implementer to make that choice, not a lower-level, more general-purpose library. I haven't run into any JSON parsers written only for usernames.
zzo38computer · 11 hours ago
I do not agree that Unicode is good. I think that Unicode is not good.

I also think that, regardless of the character set, what to include (control characters, graphic characters, maximum length, etc) will have to depend on the specific application anyways, so trying to include/exclude in JSON doesn't work as well.

Giving a name to a specific subset (or sometimes a superset, but usually subset) of Unicode (or any other character sets, such as ASCII or ISO 2022 or TRON code) can be useful, but don't assume it is good for everyone or even is good for most things, because it isn't.

RFC 9839 does give names to some subsets of Unicode, which may sometimes be useful, but you should not automatically assume that one of them is right for what you will be making. My opinion is to consider not using or requiring Unicode at all.

msgodel · 11 hours ago
It's really the combining characters that are the problem. That changed it from a character set to a DSL for describing characters.
ks2048 · 20 hours ago
It's worth noting that Unicode already defines a "General Category" for all code points that categorizes some of these types of "weird" characters.

https://en.wikipedia.org/wiki/Unicode_character_property#Gen...

e.g. in Python,

   import unicodedata
   print(unicodedata.category(chr(0)))
   print(unicodedata.category(chr(0xdead)))
Shows "Cc" (control) and "Cs" (surrogate).

arp242 · 19 hours ago
Excluding all of the "legacy controls" not just as literals but also as escaped strings (e.g. "\u001b") seems too much. C1 is essentially unused AFAIK and that's okay, but a number of C0 characters do see real-world use (escape, EOF, NUL). IMHO there are valid and reasonable use cases for some of them.
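
For context, those escaped forms are perfectly valid JSON today:

  import json

  s = json.loads('"\\u001b[1mbold\\u001b[0m"')  # ESC smuggled in via escapes
  print(repr(s))  # '\x1b[1mbold\x1b[0m'
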
NelsonMinar · 16 hours ago
I've made good use of unusual C0 characters like U+001E (Record Separator). I think it makes sense to exclude them from documents but they can be useful in text data streams.
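
That's the idea behind RFC 7464 (JSON text sequences), which prefixes each record with RS; a sketch:

  import json

  stream = '\x1e{"a": 1}\n\x1e{"b": 2}\n'  # RS + JSON text + LF, per RFC 7464
  for record in stream.split('\x1e'):
      if record.strip():
          print(json.loads(record))
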
senderista · 10 hours ago
Agreed, I would be very annoyed to see separator characters arbitrarily rejected by software I don't control. I think these characters are seriously underused, considering all the issues with in-band separators.
msgodel · 8 hours ago
I've seen program source code with form feeds (U+000C) in it. Apparently Emacs has built-in support for using them for navigation, so adjacent things occasionally contain them.