Reading about unicode has made me much, much more circumspect about the meaning of != in languages, and what fall-through behavior should look like. Unicode domain names lasted for a hot minute until someone registered microsoft.com with Cyrillic letters.
Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists, but for most of us we would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both. It's been living rent free in my head for almost two decades and I can't agree with it, nor can I laugh it off.
It might have been better if the 'code pages' idea was refined instead of eliminated (that is, the string uses one or more code pages, not the process). I don't know what the right answer is, but I know "Every X is a Y" almost always gets us into trouble.
That's simple - it is provably wrong. While relatively uncommon there are plenty of examples that would contradict this statement. And it's not about being able to encode the Rosetta Stone - non-scientists mix languages all the time, from Carmina Burana to Blinkenlights. They even make meaningful portmanteau words and write them with characters from multiple unrelated writing systems, like "заshitано" (see - Latin and Cyrillic scripts in the same single word!)
You miss the point. The basic unit of ASCII v2 (aka 'Unicode') should have been the codepage, not the codepoint. Having a stateful stream of codepage-symbol pairs is not a problem - in practice, all Unicode encodings ended up being stateful anyways, except in a shitty way that doesn't help to encode any semantic information.
Aside from all of the other issues mentioned, for some languages it's not clear what language something is purely based on codepoints.
For languages that have Latin-derived writing systems, it's not uncommon to use English letters (without diacritics) to write the language -- how would that be handled? In addition, thanks to Han unification (though this would've been a problem anyway -- loads of characters would've been unified anyway) all similar CJK Hanzi/Kanji/漢字 characters are mapped to the same codepoint regardless of language. This means that for some sentences it is entirely possible for you to not know whether a sentence fragment is Chinese or Japanese without more context and a native-like understanding of the language.
Also in many languages English words are written verbatim meaning that you can have sentences like (my Japanese is not perfect, this is just an example):
> あの芸人のYouTube動画を見たの?面白すぎるww
And (at least in Japanese) there are loads of other usages of Latin characters aside from loan words that would be too unwieldy to write in katakana -- "w" is like "lol", and a fair few acronyms (BGM = Background Music, CM = Advertisement, TKG = 卵かけご飯 = (Raw) Egg on Rice). There are other languages that have similar "issues", but unfortunately I can't give any more examples because the only other language I speak (Serbian) writes everything (even people's names) phonetically.
As an aside -- if anyone ever has to support CJK languages (in subtitles for instance), please make sure you use the right fonts. While Unicode has encoded Han characters with the same codepoint, in different languages the characters are often drawn differently and you need to use the right font for the corresponding language (and area -- Mandarin speakers in different areas write some characters differently -- 返 is different in every CJK locale). Many media players and websites do not handle this correctly and it is fairly frustrating -- the net result is that Japanese is often displayed using Chinese fonts which makes it uncomfortable to read (it's still obvious what the character is, it's just off-putting).
Yeah -- the same reason that CJK designers often use absolutely abhorrent fonts for English words on packaging and printed media is the reason Westerners use terrible fonts for CJK. They are working with a language and culture they have no knowledge of, and think that if the characters look sorta right, then they must be readable and look good.
I've made so many horrible localization errors in the past because I had to translate things into 30 different languages and I can only barely read a handful of them, so I just copy and paste whatever the translators give me.
Incidentally, I saw Japanese text recently that was quoting both English AND Arabic in the same sentence. And this was in a block of vertical text. That is literally a worst-case scenario, I think. You have RTL-vertical text also containing RTL- and LTR-horizontal text. And unlike English, which, when placed into vertical Japanese text, can essentially be set either upright or rotated sideways, you can't do that with Arabic: the letters must be joined together, and I don't believe you can break them apart and stack them on top of each other.
Why can't we all just convert to Esperanto? ;)
In our Jenkins system, we have remote build nodes return data back to the primary node via environment variable-style formatted files (e.g. FOO=bar), so when I had to send back a bunch of arbitrary multi-line textual data, I decided to base64 encode it. Simple enough.
On *nix systems, I ran this through the base64 command; the data was UTF8, which meant that in practice it was ASCII (because we didn't have any special characters in our commit messages).
On Windows systems... oh god. The system treats all text as UTF-16 with whatever byte order, and it took me ages to figure out how to get it to convert the data to UTF-8 before encoding it. Eventually it started working, and it worked for a while until it didn't for whatever reason. I ended up tearing out all the code and just encoding the UTF-16 in base64 and then processing that into UTF-8 on the master where I had access to much saner tools.
Generally speaking, "Unicode" works great in most cases, but when you're dealing with systems with weird or unusual encoding habits, like Windows using UTF-16 or MySQL's "utf8" being limited to three bytes per unicode character instead of four, everything goes out the window and it's the wild west all over again.
We had to ETL .csv data that must have originated in SQLServer.
The utf-16 fact about Windows was apparently unknown to my predecessor.
Who wrote some nasty C-language binary to copy the data, knock the upper byte off of each character, and save the now-ASCII text to a new file for the mysql load.
The encoding='utf-16' argument was all that was needed.
For want of a nail...
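For the record, a minimal sketch of that one-argument fix in Python (the file names here are hypothetical); the 'utf-16' codec uses the BOM to pick the byte order, so no byte-stripping binary is needed:

    # Re-encode a UTF-16 CSV export as UTF-8 for the MySQL load.
    # newline="" keeps line endings exactly as they are in the source file.
    with open("export.csv", encoding="utf-16", newline="") as src, \
         open("load.csv", "w", encoding="utf-8", newline="") as dst:
        dst.write(src.read())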
Funnily enough, base64 suffers from a related issue that the likes of base58 correct: l and I, or O and 0, looking similar or even identical depending on the font!
You can already map Unicode ranges to "code pages" of sorts, so how would that help?
Thing is, people who are not linguists do want to mix languages. It's very common in some cultures to intersperse the native language with English. But even if not, if the language in question uses a non-Latin alphabet, there are often bits and pieces of data that have to be written down in Latin. So that "most of us" perspective is really "most of us in US and Western Europe", at best.
For domains and such, what I think is really needed is a new definition of string equality that boils down to "are people likely to consider these two the same?". So that would e.g. treat similarly-shaped Latin/Greek/Cyrillic letters the same.
Oh, you can do far more than "code pages of sorts". Unicode has a variety of metadata available about each codepoint. The things that are "code pages of sorts" are maybe "block" (for ö, "Latin-1 Supplement") and "plane" (for ö, "Basic Multilingual Plane"), but those are really mostly administrative and probably not what you want.
But you also have "Script" (for ö, "Latin"). Some characters belong to more than one script, though. Unicode will tell you that.
Unicode also has a variety of algorithms available, already written. One of the most relevant ones here is... normalization. To compare two strings in the broadest semantic sense of "are people likely to consider these the same", you want a "compatibility" normalization: NFKC or NFKD. They will, for instance, make `1` and `¹` [superscript] the same, which is definitely one kind of "consider these the same" -- very useful for, say, a search index.
That won't be iron-clad, but it will be better than trying to roll your own algorithm by looking at character metadata yourself! It won't get you past intentional attacks using "look-alike" characters that are actually different semantically but look similar or indistinguishable depending on font. The tricky part is that "consider these the same", it turns out, really depends on context and purpose; it's not always the same.
Unicode also has a variety of useful guides as part of the standard, including the guide to normalization https://unicode.org/reports/tr15/ and some guides related to security (such as https://unicode.org/reports/tr36/ and http://unicode.org/reports/tr39/), all of which are relevant to this concern, and suggest approaches and algorithms.
Unicode has a LOT of very clever stuff in it to handle the inherently complicated problem of dealing with the entire universe of global languages that Unicode makes possible. It pays to spend some time with them.
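To make that concrete, a small Python sketch of the "compatibility" comparison described above, using only the standard library; it is illustrative, not a full confusable-detection scheme (note the Cyrillic case, which normalization alone does not catch):

    import unicodedata

    def nfkc_equal(a: str, b: str) -> bool:
        # NFKC folds compatibility variants (superscripts, ligatures, fullwidth
        # forms) and also applies canonical composition.
        return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

    print(nfkc_equal("x\u00b9", "x1"))                # True: superscript 1 vs 1
    print(nfkc_equal("o\u0308", "\u00f6"))            # True: o + combining diaeresis vs precomposed ö
    print(nfkc_equal("micr\u043esoft", "microsoft"))  # False: Cyrillic о is a different character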
You don't need a new definition, you just need to follow the official Unicode security guidelines.
I recommend the Moderately Restrictive security profile for identifiers with mixed scripts, from TR39. Plus allow Greek with Latin. This way you can identify Cyrillic, Greek, CJK or any recommended script, but are not allowed to mix Cyrillic with Greek. And you can still write math with Greek symbols.
What we don't have are a standard string library to compare or find strings. wcscmp does not do normalization. There is no wcsfc (foldcase) for case insensitivity. There's no wcsnorm or wcsnfc.
I'm maintaining such a library.
coreutils, diff, grep, patch, sed and friends all cannot find Unicode strings; they have no string support. They can only mimic filesystems, finding binary garbage. Strings are something different than pure ASCII or BINARY garbage. Strings have an encoding and are Unicode.
Filesystems are even worse because they need to treat filenames as identifiers, but do not. Nobody cares about TR31, TR39, TR36 and so on.
Here is an overview of the sad state of Unicode unsafeties in programming languages: https://github.com/rurban/libu8ident/blob/master/c11.md
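For contrast, Python's standard library does expose the two pieces named above as missing from the C wide-string API, so a hedged sketch of a normalize-and-foldcase comparison looks like this (full Unicode caseless matching per the standard is slightly more involved, but this covers the common cases):

    import unicodedata

    def text_equal(a: str, b: str) -> bool:
        # Roughly wcsnfc + wcsfc: canonical normalization, then case folding.
        fold = lambda s: unicodedata.normalize("NFC", s).casefold()
        return fold(a) == fold(b)

    print(text_equal("Go\u0308teborg", "G\u00d6TEBORG"))  # True
    print(text_equal("stra\u00dfe", "STRASSE"))           # True: casefold handles ß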
Yeah, the Greek alphabet is used a lot in the sciences. It's really annoying that we're only starting to get proper support now. (Including on keyboards: http://norme-azerty.fr/en/ )
I am trying to formalise this with Cosmopolitan Identifiers (https://obua.com/publications/cosmo-id/3/). These identifiers consist of words and symbols. Symbols are normalised based on how they look, and so Latin / Cyrillic / Greek symbols that look alike are mapped to the same symbol. Words are normalised differently, so that "Tree" and "tree" map to the same normal form. As a symbol, "T" and "t" are obviously different. I am not totally happy with the concept yet; I have implemented a fourth, simpler iteration of it as a Typescript package: https://www.npmjs.com/package/cosmo-id .
One of the problems is, how do you distinguish symbols and words? A simple way to do this is to classify something as a symbol if it is just a single character, and as a word otherwise. For example, "α-β" would consist of two symbols, separated by a hyphen, but "αβ" is a word and normalised to "av" based on some convention on how to "latinise" greek words.
Sprinkling English with foreign words is really, really common. I'm in New Zealand and people do it all the time. And even in the states, right? Don't want two different strings because someone writes an English sentence about how much they love jalapeño.
Think of just something simple like writing an immigrant's name inside a sentence. It's kinda funny that people in SV, full of immigrants, never seem to think of putting their own or a coworker's name in a String.
I'm not a linguist and that will probably be readily apparent. The word jalapeño leaves me wondering how distinct a boundary a language can possess or how one can sort out which language an individual word belongs to outside the context of the rest of the text or speech.
In English, jalapeño is correctly spelled with or without the eñe (and AFAIK the letter doesn't have a name in English, you have to use the Spanish name). So, there's an English word that doesn't use the letters assigned to the English alphabet. How do we place the word? Well, obviously English borrowed the word from Spanish, so it's a Spanish word. Well, no, it's only the Spanish adjectivization of Nahuatl words used to name the place called Xalapa...
What's a word? (A quick test - how many words were in the previous sentence, maybe 3 or 4 depending on whether the 's is part of a word; so can we talk about Jóhannesson's foreign policy?).
It's hard enough to know what a letter is in unicode. Breaking things into words is just another massive headache.
> Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists but for most of us we would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both.
Presumably the person who wrote it speaks a single language.
Just because something is not useful to them, it doesn't mean it is not useful in general. There are millions of polyglots as well as documents that include words and names in multiple scripts.
I think in that case the idea would either be that you should then have an array of strings, each of which may have its own language set, or that the string should be labelled as "containing Latin and Cyrillic", but still not able to include arbitrary other characters from Unicode. And multi-lingual text still generally breaks on words... Kilobytes of Latin text with a single Cyrillic character in the middle of a word is very suspicious, in a way that kilobytes of Latin text with a single Cyrillic word isn't.
Of course you'd always need an "unrestricted" string (to speak to the rest of the system if necessary), but there are very few natural strings out there in the world that consist of half-a-dozen languages just mishmashed together. Those exceptions can be treated as exceptions.
Presumably the person who wrote it speaks English.
Of course living in denial makes it easy to ignore harsh realities. Unfortunately for them, humans don't work that way. Things aren't gonna spontaneously change just to make their life easier. The software adapts to us, not the other way around. People complain about the complexities of dates and times but they still make every effort to get it right because it matters.
If a programming language allows text processing but can't even properly compare unicode text, it is buggy and needs to be fixed. If an operating system can't deal with unicode, it's buggy and needs to be fixed.
Reminds me of the good old days with EUC-KR, KSC 5601, and all those different encoding schemes I've successfully repressed in my memory for years. Yes, you could probably assert that a piece of string was either Korean or English but never anything else... because the system was incapable of representing it.
I'm not exactly sure how a code page is supposed to help us here. Developers have trouble supporting multiple languages when they're all in the Unicode Standard. Supporting code pages for languages they've never heard of? Not a chance.
I'd guess a standardized codepage marker, like a "start of CP[932]" / "start of CP[1252]" tag at each codepage switch, is going to be needed, but that might just be a necessity. Han unification is a well-known problem to Far Eastern users, but the Unicode normalization problem is basically the same kind of thing.
Heh, funny, I'm implementing this exact thing at the moment, oddly enough -- rather, implementing a security check that provides that same guarantee you mention, Mixed Script protections.
In Unicode spec terms, 'UTS 39 (Security)' contains the description of how to do this, mostly in section 5, and it relies on 'UAX 24 (Scripts)'.
It's more nuanced than your example but only slightly. If you replace "German" with "Japanese" you're talking about multiple scripts in the same 'writing system', but the spec provides files with the lists of 'sets of scripts' each character belongs to.
The way that the spec tells us to ensure that the word 'microsoft' isn't made up of fishy characters is that we just keep the intersection of each character's augmented script sets. If at the end, that intersection is empty, that's often fishy -- ie, there's no intersection between '{Latin}, {Cyrillic}'.
However, the spec allows the legit uses of writing systems that use more than one script; the lookup procedure outlined in the spec could give script sets like '{Jpan, Kore, Hani, Hanb}, {Jpan, Kana}' for two characters, and that intersection isn't empty; it'd give us the answer "Okay, this word is contained within the Japanese writing system".
Where I work and communicate, mixing 2, 3, and sometimes 4 writing systems is pretty normal; I have 3 keyboard layouts on my phone (Latin that covers English and occasional Spanish, Cyrillic, and Japanese).
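As a toy illustration of the intersection idea from UTS 39 discussed above, in Python with only the standard library: real implementations should use the Scripts.txt / ScriptExtensions.txt data files (or a library such as PyICU), since unicodedata has no script property; the character-name prefix below is only a rough stand-in, and it assigns a single script per character rather than the augmented script sets the spec describes.

    import unicodedata

    def rough_script(ch: str) -> str:
        # Crude approximation: "LATIN SMALL LETTER M" -> "LATIN", etc.
        return unicodedata.name(ch, "UNKNOWN").split()[0]

    def looks_mixed_script(word: str) -> bool:
        scripts = {rough_script(ch) for ch in word}
        # With one script per character, "empty intersection" degenerates to
        # "not every character has the same script".
        return len(scripts) > 1

    print(looks_mixed_script("microsoft"))       # False
    print(looks_mixed_script("micr\u043esoft"))  # True: Latin + Cyrillic
    print(looks_mixed_script("\u0437\u0430shit\u0430\u043d\u043e"))  # True: заshitано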
In any case, there are emoji which are expected to be a part of text.
On one hand, it would be great to separate areas of different encodings inside a string. But character codes are already such separators.
Two things need to go though: the assumption of linear-time index-based access to characters in a string, and the custom to compare strings as byte arrays.
The first is already gone from several advanced string implementations. The second is harder: e.g. Linux filesystems support Unicode by being encoding-agnostic and handling names as byte sequences. Reworking that would be hard, if practical at all.
> and the custom to compare strings as byte arrays
I think more generally, the idea that a language std lib can provide string equality for human language strings is just silly. String equality is an extremely context dependent, fuzzy operation, and should be handled by each context differently. For example, for Unicode hostname to certificate mapping, hostname equality should be handled by rendering the hostname in several common web fonts and checking if the resulting bitmaps are similar. If they are, then assigning different certificates to these equal hostnames should not be done.
Of course, in other contexts, there are different rules. For example, if looking up song names, the strings "jalapeno" and "jalapeño" should be considered equal, in English text at least.
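A sketch of one such context-specific rule (the song-lookup case): decompose to NFD and drop the combining marks. Appropriate for some searches and wrong for others, which is rather the point:

    import unicodedata

    def strip_marks(s: str) -> str:
        # NFD splits ñ into n + COMBINING TILDE; drop the combining characters.
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_marks("jalape\u00f1o"))                             # jalapeno
    print(strip_marks("jalapeno") == strip_marks("jalape\u00f1o"))  # True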
That doesn’t make sense to me. Even disregarding cases where people mix languages (how do you write a dictionary? If the answer is “just create a data structure combining multiple strings”, shouldn’t we standardize how to do that?), all languages share thousands of symbols such as currency symbols, mathematical symbols, Greek and Hebrew alphabets (to be used in math books written in the language), etc. So, even languages such as Greek and English share way more symbols than that they have unique ones.
> being able to mix arbitrary languages into a single String object
Unless I missed something, that is impossible with Unicode. Mixing multiple languages would require a way to specify the language used for case conversion, sorting and font rendering settings mid-string, and I don't think that Unicode has that. For example, try to write a program that correctly uppercases a single string containing both an English i and a Turkish i in your favorite Unicode-supporting language: the code point is the same for both, and you generally only get to specify one language globally or per function call.
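To illustrate: in Python, str.upper() has no idea which 'i' is English and which is Turkish, and a locale-aware library (PyICU here, a third-party ICU binding, assumed available for this sketch) can only be told a language per call, not per character, so the whole string gets one treatment or the other:

    import icu  # pip install PyICU (assumed available for this sketch)

    s = "istanbul is icy"
    print(s.upper())                                            # ISTANBUL IS ICY
    print(str(icu.UnicodeString(s).toUpper(icu.Locale("tr"))))  # İSTANBUL İS İCY
    # Both i's get the same mapping either way; there is no way to mark
    # "this i is Turkish, that one is English" inside the string itself.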
You can write a string with words from multiple languages, you just can't easily modify it with operations like case conversion. But sorting shouldn't depend on the origin language anyway, it depends on the language of the reader. All words in an English dictionary are sorted in "English" order
> It might have been better if the 'code pages' idea was refined instead of eliminated
Obviously yes, it would have been better.
But Unicode was designed by the same people who designed ASCII - monolingual Americans who never had to deal on a daily basis with anything that doesn't fit into the 26 letters of the English alphabet. So here we are.
This is not even remotely true.
> But Unicode was designed by the same people who designed ASCII - monolingual Americans who never had to deal on a daily basis with anything that doesn't fit into the 26 letters of the English alphabet.
A major problem with Unicode is that it gives you strange ideas about text: that you can somehow take human text encoded in Unicode and answer questions like "how many letters does this have" or "are these two pieces of text different" or "split this text into words" in a way that works generically for any language or context.
These are all myths, and APIs for such things are bugs. The only thing you can meaningfully do with two pieces of arbitrary Unicode text is to say if they are byte-by-byte equal. For any other operation, you need to have specific business logic.
For example, are "Ionuț" and "Ionut" and "Ionutz" the same string or different strings? There is no generic answer: depending on the intended business logic, they may be identical or not (e.g. if we consider these to be Romanian names, they should be considered identical for search purposes, but probably not identical for storage purposes, where you want to remember exactly how the person spelled their name).
A related problem is that most languages have no separate types for Text/String on one hand, and Symbol on the other. Text or other Strings should be opaque human text, that can only be interpreted by specific code, offering almost no API (only code point iteration). Symbols should be a restricted subset of Unicode that can offer fuller features, such as lengths, equality, separation into words etc. This would be the type of JSON key names used in serialization and deserialization, for example.
What they should have done is not that strange. Text is merely an ordered collection of characters. If you just assigned each character (aka grapheme) a number, text becomes a sequence of numbers. The first two questions you pose, "how many letters does this have" and "are these two pieces of text different", are trivially answered by such a representation. Unicode's fuck up is they managed to come up with something that can not reliably answer those two questions.
In fact what Unicode has ended up with is so horrible, it's a major exercise in coding just to answer a simple question like "is there an 'o' in this sentence", as in Python3's "'o' in sentence" does not always return the right result.
Unicode's starting point was all wrong. There is a standard that did a perfectly good job of mapping graphemes to numbers: ISO-10646. In fact Unicode is based on it, but then committed their original sin: they decided all the proposed ISO-10646 encodings (ie, how the numbers are encoded into byte streams) were crap, so they released a standard that combined two concepts that should have remained orthogonal: codepoints and encoding those codepoints to a binary stream.
Now it's true ISO-10646 proposed encodings were undercooked. That became painfully apparent when Ken Thompson came up with utf-8. But no biggie right: utf-8 was just another ISO-10646 encoding, just let it take over naturally. The Unicode solution to the encoding problem was to first decide we would never need more than 2^16 codepoints, then wrap it up in "one true encoding everyone can use": UCS2. Windows and Java, among others, bought the concept, and have paid the price ever since.
They were wrong, of course. 2^16 was not enough. So they replaced the UCS2 encoding with UTF-16, which was sort of backwards compatible. But not one UTF-16, oh no, that would be too simple. We got UTF-16LE and UTF-16BE. Notice what has happened here: take identical pieces of text, encode them as valid Unicode, and end up with two binary objects that are different. Way to go boys!
But that wasn't the worst of it: they managed to screw up UTF-16 so badly it didn't expand the code space to 2^32 points, just about 2^20. And in case you can't guess what happens next, I'll tell you: turns out there are more than 2^20 graphemes out there.
What to do? Well, there are a lot of characters that are "minor variants" of each other, like o and ö. Now Unicode already had a single code point for ö, but to make it all fit and be uniform they decided "Combining Diaeresis" was the way these things should be done in future. So now the correct way to represent ö is the code point for o followed by a code point that says "add an umlaut to the preceding character". But as the original codepoint for ö still exists, we can have two identical graphemes that don't compare as equal under Unicode, which is how we get to ö ≠ ö.
So it's not only Python3's "'o' in sentence" that doesn't always work. We arrived at the point that "'ö' in sentence" can't be done without some heavy lifting that must be done by a library. Just to make it plain: some CPUs can do "'o' in sentence" in a single instruction. That simple design decision has cost us orders of magnitude in CPU efficiency.
I know these are strong words, but IMO this is a brain dead, monumental fuckup, making units like acre-feet and furlong-fortnights look positively sane. It's time to abandon Unicode, and its "Combining Diaeresis" in particular, and go back to basics: ISO-10646 and utf-8. UTF-8 provides a 28 bit encoding space, which is more than enough to realise the single guiding principle that ISO-10646 was founded on: one codepoint per grapheme.
It won’t happen of course, so as a programmer I’ll have to deal with the shit sandwich the Unicode consortium has served up for the rest of my life.
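For what it's worth, the heavy lifting for the "'ö' in sentence" case is in Python's standard library, though the rant's complaint stands that you must know to reach for it:

    import unicodedata

    sentence = "Go\u0308teborg"   # 'o' followed by COMBINING DIAERESIS
    print("\u00f6" in sentence)   # False: the precomposed ö never appears
    print("\u00f6" in unicodedata.normalize("NFC", sentence))  # True after composition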
While one codepoint per grapheme would be nice, it still wouldn't solve text. There are also problems like RTL and LTR writing systems that need to be combined into the same text.
And, many of the examples I gave earlier will not go away. The problem of similar URLs using different characters would be smaller, but not gone - microsoft.com and mícrosoft.com still look too similar. Text search should still support alternate spellings (color and colour). People's names would still have multiple legally identical spellings.
A fun related issue that could occur: applying NFD to a string can make it longer, so a sanitiser that limits file names to 255 UTF-16 code units but doesn’t first normalise to NFD could fail on HFS+.
This could occur on systems that normalise to NFC as well: NFC lengthens some strings, e.g. 𝅗𝅥 (U+1D15E MUSICAL SYMBOL HALF NOTE) normalises to 𝅗𝅥 (U+1D157 MUSICAL SYMBOL VOID NOTEHEAD, U+1D165 MUSICAL SYMBOL COMBINING STEM) in both NFC and NFD (similar happens in various Indic scripts, pointed Hebrew, and the isolated case of U+2ADC FORKING which is special for reasons UAX #15 explains), but I don’t think there are any file systems that actually normalise to NFC? (APFS prefers NFC, but doesn’t normalise at the file system level.)
The remaining concern would be that NFC could take more UTF-8 code units than NFD despite adding a character, but in practice this doesn’t occur (checked on NormalizationTest-3.2.0.txt).
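A quick check of both length effects described above, assuming Python's unicodedata (which tracks the current Unicode version rather than 3.2):

    import unicodedata

    name = "\u00e9" * 200                       # 200 precomposed 'é'
    nfd = unicodedata.normalize("NFD", name)
    print(len(name.encode("utf-16-le")) // 2)   # 200 UTF-16 code units
    print(len(nfd.encode("utf-16-le")) // 2)    # 400 -- sails past a 255-unit cap

    half_note = "\U0001D15E"                    # MUSICAL SYMBOL HALF NOTE
    print(len(unicodedata.normalize("NFC", half_note)))  # 2 code points: NFC got longer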
APFS doesn't "prefer" anything - it will not change the bytes passed to it to NFC or NFD. The bytes passed at creation are stored as-is (HFS will store the NFD form on disk if you pass the NFC form to it). However, APFS is normalization-insensitive (if you create an NFC name on disk, you won't be able to create the NFD version, and you will be able to access the name by both the NFC and NFD variants), just as HFS is - they both use different mechanisms to achieve normalization insensitivity.
That's a really great point about string length, and one not often addressed. You might even be able to force some sort of buffer overflow with that, I guess.
Why isn’t the answer just “Don’t unicode normalise the file name”?
I thought the generally recommended way to deal with file names is to treat them as a block of bytes (to the extent that e.g. rust has an entirely separate string type for OS-provided strings), or just to allow direct encoding/decoding but not normalisation or alteration.
Well, precisely because if you don't normalize the filenames, ö ≠ ö. You could have two files with different filenames, `göteborg.txt` and `göteborg.txt`, and they are different files with different filenames.
Or you could have one file `göteborg.txt`, and when you try to ask for it as `göteborg.txt`, the system tells you "no file by that name".
Unicode normalization is the solution to this. And the unicode normalization algorithms are pretty good. The bug in this case is that the system did not apply unicode normalization consistently. It required a non-default config option to be turned on to do so? I don't really understand what's going on here, but it sounds like a bug in the system to me that this would be a non-default config option.
Dealing with the entire universe of human language is inherently complicated. But unicode gives us some actually pretty marvelous tools for doing it consistently and reasonably. You still have to use them, and use them right, and as with all software, bugs are possible.
But I don't think you get fewer crazy edge cases by not normalizing at all. (In some cases you can even get security concerns, think about usernames and the risk of `jöhn` and `jöhn` being two different users...). I know that this is the choice some traditional/legacy OSs/file systems make, in order to keep pre-unicode-hegemony backwards compat. It has problems as well. I think the right choice for any greenfield possibilities is consistent unicode normalization, so `göteborg.txt` and `göteborg.txt` can't be two different files with two different filenames.
[btw I tried to actually use the two common different forms of ö in this text; I don't believe HN normalizes them so they should remain.]
It looks like instead of the config option switching everything to use the same normalization it keeps a second copy of the name in a database to compare to. What a horrible kludge, I wonder how they even got into this situation of using different normalization in different parts of the system?
In terms of what filenames are, neither Windows nor Linux (I don't know for sure with MacOS but I doubt it) actually guarantees you any sort of characters.
Linux filenames are a sequence of non-zero bytes (they might be ASCII, or at least UTF-8, they might be an old 8-bit charset, but they also might just be arbitrary non-zero bytes) and Windows file names are a sequence of non-zero 16-bit unsigned integers, which you could think of as UTF-16 code units but they don't promise to encode UTF-16.
Probably the files have human readable names, but maybe not. If you're accepting command line file names it's not crazy to insist on human readable (thus, Unicode) names, but if you process arbitrary input files you didn't create, particularly files you just found by looking around on disks unsupervised - you need to accept that utter gibberish is inevitable sooner or later and you must cope with that successfully.
Rust's OsStr variants match this reality.
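The same reality in Python terms (which, like Rust, refuses to pretend filenames are text): on Linux a name is just non-zero bytes, and the str view survives only because of surrogateescape. The byte string below is a made-up example of a legal but non-UTF-8 name.

    import os

    raw = b"report-\xff.txt"           # legal Linux filename, not valid UTF-8
    as_str = os.fsdecode(raw)          # 'report-\udcff.txt' on a UTF-8 locale
    print(os.fsencode(as_str) == raw)  # True: lossless round trip via surrogateescape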
This is what I found quite refreshing about Rust — instead of choosing one of the following:
A) The programmer is an almighty god who knows everything; we just expose him to the raw thing
B) The programmer is an immature toddler who cannot be trusted, so we handle things for them
What Rust does is more along the lines of "you might already know this, but anyway, here is a reminder that you, the programmer, need to take some decision about this".
Filenames in HFS+ filesystem (an old filesystem used by Mac OS X) are normalized with a proprietary variant of NFD - this is a filesystem feature. APFS removed this feature.
Sure, but at some point you might want to create a file (frequently using user input) or filter files using some user-provided query string -- the kind of use cases that unicode normalization was invented for. So the whole "opaque blob of bytes" filesystem handling is nice if all you want is to not silently corrupt files, but it is very obviously not even covering 10% of normal use cases. Rust isn't being super smart, it just has its hands thrown up in the air.
Still, it looks like the right thing to do is let the filesystem do the filesystem's job. The filesystem should be normalizing unicode and enforcing the case-insensitivity and whatnot, but just the filesystem. Wrappers around it like whatever Nextcloud is doing should be treating the filenames as a dumb pile of bytes.
That works for programmers, but not for users. There could be several files with the same name, but with different encodings. Worse, depending on how your terminal encodes user input, some of them might not be typable.
From the user's perspective I don't want any normalisation at all. It's good as long as you only have one file system, but as soon as you get multiple file systems with conflicting rules (which includes transferring files to other people) it becomes hell. Unfortunately we are stuck with that hell.
Falls over on the fact that I don't want to be able to write these two files in the same dir. If I write file ö1.txt and ö1.txt then I want to be warned that the file exists, even if the encoding is different, when I use two different apps but try to write the same file.
One example is when you submit a file in Safari it doesn't normalize the file name, while js file.name does.
The same applies for a.txt and A.txt on case insensitive file systems (as someone pointed out the most common desktop file systems are).
Java is terrible in this regard, as most file APIs use "java.lang.String" to identify the filename, which most of the time depends on the system property "file.encoding". With the result that there will be files that you can never read from a java application if the filename encoding does not match the java file.encoding encoding.
Duolingo doesn't handle Unicode normalisation for certain languages, and it's incredibly frustrating. Here's one example[0] (Vietnamese) and I know it's the case for Yiddish as well.
[0]: https://forum.duolingo.com/comment/17787660/Bug-Correct-Viet...
A filesystem accepting only NFD should be filed as a bug. They can normalize it internally to NFD, as Apple's previous HFS+ did.
But even worse than that is Python's NFKC, which normalizes ℌ to H and so on. The recommended normalizations are NFC for offline normalization (like in compiled languages and databases) and NFD for online use, where speed trumps space. unicode.org talking that much about NFKC was a big mistake. NFKC is crazy and doesn't even roundtrip. The whole TR31 XID_Start/Continue sets exist mostly because of NFKC issues, not so much because of stability. But people bought it for its stability argument.
I'm just writing a library and linter for such issues: https://github.com/rurban/libu8ident
Also note that C++23 will most likely enforce NFC-only identifiers. Same problem as with this filesystem. My implementation was to accept all normalization forms and store them internally, and in the object files, as NFC. The C ABI should declare it also. Currently they care about it as little as Linux filesystems do: nada. Identifiers being unidentifiable.
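The ℌ example above, and why NFKC cannot round-trip, in a couple of lines of Python:

    import unicodedata

    print(unicodedata.normalize("NFKC", "\u210c"))  # 'H'  (U+210C BLACK-LETTER CAPITAL H)
    print(unicodedata.normalize("NFKC", "\ufb01"))  # 'fi' (U+FB01 LATIN SMALL LIGATURE FI)
    # The original code points are gone after folding, so the mapping is lossy.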
Nope, the lack of normalization on both accounts by the SMB server caused the issue. It could have normalized before emitting but it definitely should have normalized on receiving for comparison.
I think that in the ls->read workflow, Nextcloud shouldn't normalize the response from SMB and should issue back to SMB whatever SMB returned to Nextcloud.
According to Unicode, it should be allowed to and the SMB server should be able to handle it. That's kind of the point of normalization, they're meant to be done before all comparisons so that exactly this doesn't happen. Your suggestion is just premature optimization, i.e. eliminating a redundancy.
NFD is fine when you don't have much time and can afford the space. NFC is about 3x slower and smaller. Forcing clients never works.
Be tolerant in what you accept and strict in what you write. In the case of C++23 enforcing NFC my mind is twisted. It would allow heavy tokenizer optimizations, but this is an offline compiler, where you don't really need that.
The problem are the compatible variants, NFKC and NFKD. But then you have usecases where you need them, and more to actually find strings. Levenshtein should not be the default when searching for strings.