Reading about unicode has made me much, much more circumspect about the meaning of != in languages, and what fall-through behavior should look like. Unicode domain names lasted for a hot minute until someone registered microsoft.com with Cyrillic letters.
Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists, but for most of us we would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both. It's been living rent free in my head for almost two decades and I can't agree with it, nor can I laugh it off.
It might have been better if the 'code pages' idea was refined instead of eliminated (that is, the string uses one or more code pages, not the process). I don't know what the right answer is, but I know "Every X is a Y" almost always gets us into trouble.
That's simple - it is provably wrong. While relatively uncommon there are plenty of examples that would contradict this statement. And it's not about being able to encode the Rosetta Stone - non-scientists mix languages all the time, from Carmina Burana to Blinkenlights. They even make meaningful portmanteau words and write them with characters from multiple unrelated writing systems, like "заshitано" (see - Latin and Cyrillic scripts in the same single word!)
You miss the point. The basic unit of ASCII v2 (aka 'Unicode') should have been the codepage, not the codepoint. Having a stateful stream of codepage-symbol pairs is not a problem - in practice, all Unicode encodings ended up being stateful anyways, except in a shitty way that doesn't help to encode any semantic information.
Aside from all of the other issues mentioned, for some languages it's not clear what language something is purely based on codepoints.
For languages that have Latin-derived writing systems, it's not uncommon to use English letters (without diacritics) to write the language -- how would that be handled? In addition, thanks to Han unification (though this would've been a problem anyway -- loads of characters would've been unified anyway) all similar CJK Hanzi/Kanji/漢字 characters are mapped to the same codepoint regardless of language. This means that for some sentences it is entirely possible for you to not know whether a sentence fragment is Chinese or Japanese without more context and a native-like understanding of the language.
Also in many languages English words are written verbatim meaning that you can have sentences like (my Japanese is not perfect, this is just an example):
> あの芸人のYouTube動画を見たの?面白すぎるww
And (at least in Japanese) there are loads of other usages of Latin characters aside from loan words that would be too unwieldy to write in katakana -- "w" is like "lol", and a fair few acronyms (BGM = Background Music, CM = Advertisement, TKG = 卵かけご飯 = (Raw) Egg on Rice). There are other languages that have similar "issues", but unfortunately I can't give any more examples because the only other language I speak (Serbian) writes everything (even people's names) phonetically.
As an aside -- if anyone ever has to support CJK languages (in subtitles for instance), please make sure you use the right fonts. While Unicode has encoded Han characters with the same codepoint, in different languages the characters are often drawn differently and you need to use the right font for the corresponding language (and area -- Mandarin speakers in different areas write some characters differently -- 返 is different in every CJK locale). Many media players and websites do not handle this correctly and it is fairly frustrating -- the net result is that Japanese is often displayed using Chinese fonts which makes it uncomfortable to read (it's still obvious what the character is, it's just off-putting).
Yeah -- the same reason that CJK designers often use absolutely abhorrent fonts for English words on packaging and printed media is the reason Westerners use terrible fonts for CJK. They are working with a language and culture they have no knowledge of, and think that if the characters look sorta right, then they must be readable and look good.
I've made so many horrible localization errors in the past because I had to translate things into 30 different languages and I can only barely read a handful of them, so I just copy and paste whatever the translators give me.
Incidentally, I saw Japanese text recently that was quoting both English AND Arabic in the same sentence. And this was in a block of vertical text. That is literally a worst-case scenario, I think. You have RTL-vertical text also containing RTL- and LTR-horizontal text. And unlike English, which, when placed into vertical Japanese text, can essentially be set either upright or rotated sideways, you can't do that with Arabic: the letters must be joined together, and I don't believe you can break them apart and stack them on top of each other.
Why can't we all just convert to Esperanto? ;)
In our Jenkins system, we have remote build nodes return data back to the primary node via environment variable-style formatted files (e.g. FOO=bar), so when I had to send back a bunch of arbitrary multi-line textual data, I decided to base64 encode it. Simple enough.
On *nix systems, I ran this through the base64 command; the data was UTF8, which meant that in practice it was ASCII (because we didn't have any special characters in our commit messages).
On Windows systems... oh god. The system treats all text as UTF-16 with whatever byte order, and it took me ages to figure out how to get it to convert the data to UTF-8 before encoding it. Eventually it started working, and it worked for a while until it didn't for whatever reason. I ended up tearing out all the code and just encoding the UTF-16 in base64 and then processing that into UTF-8 on the master where I had access to much saner tools.
Generally speaking, "Unicode" works great in most cases, but when you're dealing with systems with weird or unusual encoding habits, like Windows using UTF-16 or MySQL's "utf8" being limited to three bytes per unicode character instead of four, everything goes out the window and it's the wild west all over again.
We had to ETL .csv data that must have originated in SQLServer.
The utf-16 fact about Windows was apparently unknown to my predecessor.
Who wrote some nasty C-language binary to copy the data, knock the upper byte off of each character, and save the now-ASCII text to a new file for the mysql load.
The encoding='utf-16' argument was all that was needed.
For want of a nail...
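For the record, a minimal sketch of that one-argument fix in Python (the file names here are hypothetical); the 'utf-16' codec uses the BOM to pick the byte order, so no byte-stripping binary is needed:

    # Re-encode a UTF-16 CSV export as UTF-8 for the MySQL load.
    # newline="" keeps line endings exactly as they are in the source file.
    with open("export.csv", encoding="utf-16", newline="") as src, \
         open("load.csv", "w", encoding="utf-8", newline="") as dst:
        dst.write(src.read())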
Funnily enough, base64 suffers from a related issue that the likes of base58 correct: l and I, or O and 0, looking similar or even identical depending on the font!
You can already map Unicode ranges to "code pages" of sorts, so how would that help?
Thing is, people who are not linguists do want to mix languages. It's very common in some cultures to intersperse the native language with English. But even if not, if the language in question uses a non-Latin alphabet, there are often bits and pieces of data that have to be written down in Latin. So that "most of us" perspective is really "most of us in US and Western Europe", at best.
For domains and such, what I think is really needed is a new definition of string equality that boils down to "are people likely to consider these two the same?". So that would e.g. treat similarly-shaped Latin/Greek/Cyrillic letters the same.
Oh, you can do far more than "code pages of sorts". Unicode has a variety of metadata available about each codepoint. The things that are "code pages of sorts" are maybe "block" (for ö, "Latin-1 Supplement") and "plane" (for ö, "Basic Multilingual Plane"), but those are really mostly administrative and probably not what you want.
But you also have "Script" (for ö, "Latin"). Some characters belong to more than one script, though. Unicode will tell you that.
Unicode also has a variety of algorithms available, already written. One of the most relevant ones here is... normalization. To compare two strings in the broadest semantic sense of "are people likely to consider these the same", you want a "compatibility" normalization: NFKC or NFKD. They will, for instance, make `1` and `¹` [superscript] the same, which is definitely one kind of "consider these the same" -- very useful for, say, a search index.
That won't be iron-clad, but it will be better than trying to roll your own algorithm by looking at character metadata yourself! It won't get you past intentional attacks using "look-alike" characters that are actually different semantically but look similar or indistinguishable depending on font. The tricky part is that "consider these the same", it turns out, really depends on context and purpose; it's not always the same.
Unicode also has a variety of useful guides as part of the standard, including the guide to normalization https://unicode.org/reports/tr15/ and some guides related to security (such as https://unicode.org/reports/tr36/ and http://unicode.org/reports/tr39/), all of which are relevant to this concern, and suggest approaches and algorithms.
Unicode has a LOT of very clever stuff in it to handle the inherently complicated problem of dealing with the entire universe of global languages that Unicode makes possible. It pays to spend some time with them.
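To make that concrete, a small Python sketch of the "compatibility" comparison described above, using only the standard library; it is illustrative, not a full confusable-detection scheme (note the Cyrillic case, which normalization alone does not catch):

    import unicodedata

    def nfkc_equal(a: str, b: str) -> bool:
        # NFKC folds compatibility variants (superscripts, ligatures, fullwidth
        # forms) and also applies canonical composition.
        return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

    print(nfkc_equal("x\u00b9", "x1"))                # True: superscript 1 vs 1
    print(nfkc_equal("o\u0308", "\u00f6"))            # True: o + combining diaeresis vs precomposed ö
    print(nfkc_equal("micr\u043esoft", "microsoft"))  # False: Cyrillic о is a different character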
You don't need a new definition, you just need to follow the official Unicode security guidelines.
I recommend the Moderately Restrictive security profile for identifiers with mixed scripts, from TR39. Plus allow Greek with Latin. This way you can identify Cyrillic, Greek, CJK or any recommended script, but are not allowed to mix Cyrillic with Greek. And you can still write math with Greek symbols.
What we don't have are a standard string library to compare or find strings. wcscmp does not do normalization. There is no wcsfc (foldcase) for case insensitivity. There's no wcsnorm or wcsnfc.
I'm maintaining such a library.
coreutils, diff, grep, patch, sed and friends all cannot find Unicode strings; they have no string support. They can only mimic filesystems, finding binary garbage. Strings are something different than pure ASCII or BINARY garbage. Strings have an encoding and are Unicode.
Filesystems are even worse because they need to treat filenames as identifiers, but do not. Nobody cares about TR31, TR39, TR36 and so on.
Here is an overview of the sad state of Unicode unsafeties in programming languages: https://github.com/rurban/libu8ident/blob/master/c11.md
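For contrast, Python's standard library does expose the two pieces named above as missing from the C wide-string API, so a hedged sketch of a normalize-and-foldcase comparison looks like this (full Unicode caseless matching per the standard is slightly more involved, but this covers the common cases):

    import unicodedata

    def text_equal(a: str, b: str) -> bool:
        # Roughly wcsnfc + wcsfc: canonical normalization, then case folding.
        fold = lambda s: unicodedata.normalize("NFC", s).casefold()
        return fold(a) == fold(b)

    print(text_equal("Go\u0308teborg", "G\u00d6TEBORG"))  # True
    print(text_equal("stra\u00dfe", "STRASSE"))           # True: casefold handles ß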
Yeah, the Greek alphabet is used a lot in the sciences. It's really annoying that we're only starting to get proper support now. (Including on keyboards: http://norme-azerty.fr/en/ )
I am trying to formalise this with Cosmopolitan Identifiers (https://obua.com/publications/cosmo-id/3/). These identifiers consist of words and symbols. Symbols are normalised based on how they look, and so Latin / Cyrillic / Greek symbols that look alike are mapped to the same symbol. Words are normalised differently, so that "Tree" and "tree" map to the same normal form. As a symbol, "T" and "t" are obviously different. I am not totally happy with the concept yet; I have implemented a fourth, simpler iteration of it as a Typescript package: https://www.npmjs.com/package/cosmo-id .
One of the problems is, how do you distinguish symbols and words? A simple way to do this is to classify something as a symbol if it is just a single character, and as a word otherwise. For example, "α-β" would consist of two symbols, separated by a hyphen, but "αβ" is a word and normalised to "av" based on some convention on how to "latinise" greek words.
Sprinkling English with foreign words is really, really common. I'm in New Zealand and people do it all the time. And even in the states, right? Don't want two different strings because someone writes an English sentence about how much they love jalapeño.
Think of just something simple like writing an immigrant's name inside a sentence. It's kinda funny that people in SV, full of immigrants, never seem to think of putting their own or a coworker's name in a String.
I'm not a linguist and that will probably be readily apparent. The word jalapeño leaves me wondering how distinct a boundary a language can possess or how one can sort out which language an individual word belongs to outside the context of the rest of the text or speech.
In English, jalapeño is correctly spelled with or without the eñe (and AFAIK the letter doesn't have a name in English, you have to use the Spanish name). So, there's an English word that doesn't use the letters assigned to the English alphabet. How do we place the word? Well, obviously English borrowed the word from Spanish, so it's a Spanish word. Well, no, it's only the Spanish adjectivization of Nahuatl words used to name the place called Xalapa...
What's a word? (A quick test - how many words were in the previous sentence, maybe 3 or 4 depending on whether the 's is part of a word; so can we talk about Jóhannesson's foreign policy?).
It's hard enough to know what a letter is in unicode. Breaking things into words is just another massive headache.
> Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists but for most of us we would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both.
Presumably the person who wrote it speaks a single language.
Just because something is not useful to them, it doesn't mean it is not useful in general. There are millions of polyglots as well as documents that include words and names in multiple scripts.
I think in that case the idea would either be that you should then have an array of strings, each of which may have its own language set, or that the string should be labelled as "containing Latin and Cyrillic", but still not able to include arbitrary other characters from Unicode. And multi-lingual text still generally breaks on words... Kilobytes of Latin text with a single Cyrillic character in the middle of a word is very suspicious, in a way that kilobytes of Latin text with a single Cyrillic word isn't.
Of course you'd always need an "unrestricted" string (to speak to the rest of the system if necessary), but there are very few natural strings out there in the world that consist of half-a-dozen languages just mishmashed together. Those exceptions can be treated as exceptions.
Presumably the person who wrote it speaks English.
Of course living in denial makes it easy to ignore harsh realities. Unfortunately for them, humans don't work that way. Things aren't gonna spontaneously change just to make their life easier. The software adapts to us, not the other way around. People complain about the complexities of dates and times but they still make every effort to get it right because it matters.
If a programming language allows text processing but can't even properly compare unicode text, it is buggy and needs to be fixed. If an operating system can't deal with unicode, it's buggy and needs to be fixed.
Reminds me of the good old days with EUC-KR, KSC 5601, and all those different encoding schemes I've successfully repressed in my memory for years. Yes, you could probably assert that a piece of string was either Korean or English but never anything else... because the system was incapable of representing it.
I'm not exactly sure how a code page is supposed to help us here. Developers have trouble supporting multiple languages when they're all in the Unicode Standard. Supporting code pages for languages they've never heard of? Not a chance.
I'd guess a standardized codepage marker, like a "start of CP[932]" / "start of CP[1252]" tag at each codepage switch, is going to be needed, but that might just be a necessity. Han unification is a well-known problem to Far Eastern users, but the Unicode normalization problem is basically the same kind of thing.
Heh, funny, I'm implementing this exact thing at the moment, oddly enough -- rather, implementing a security check that provides that same guarantee you mention, Mixed Script protections.
In Unicode spec terms, 'UTS 39 (Security)' contains the description of how to do this, mostly in section 5, and it relies on 'UAX 24 (Scripts)'.
It's more nuanced than your example but only slightly. If you replace "German" with "Japanese" you're talking about multiple scripts in the same 'writing system', but the spec provides files with the lists of 'sets of scripts' each character belongs to.
The way that the spec tells us to ensure that the word 'microsoft' isn't made up of fishy characters is that we just keep the intersection of each character's augmented script sets. If at the end, that intersection is empty, that's often fishy -- ie, there's no intersection between '{Latin}, {Cyrillic}'.
However, the spec allows the legit uses of writing systems that use more than one script; the lookup procedure outlined in the spec could give script sets like '{Jpan, Kore, Hani, Hanb}, {Jpan, Kana}' for two characters, and that intersection isn't empty; it'd give us the answer "Okay, this word is contained within the Japanese writing system".
Where I work and communicate, mixing 2, 3, and sometimes 4 writing systems is pretty normal; I have 3 keyboard layouts on my phone (Latin that covers English and occasional Spanish, Cyrillic, and Japanese).
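As a toy illustration of the intersection idea from UTS 39 discussed above, in Python with only the standard library: real implementations should use the Scripts.txt / ScriptExtensions.txt data files (or a library such as PyICU), since unicodedata has no script property; the character-name prefix below is only a rough stand-in, and it assigns a single script per character rather than the augmented script sets the spec describes.

    import unicodedata

    def rough_script(ch: str) -> str:
        # Crude approximation: "LATIN SMALL LETTER M" -> "LATIN", etc.
        return unicodedata.name(ch, "UNKNOWN").split()[0]

    def looks_mixed_script(word: str) -> bool:
        scripts = {rough_script(ch) for ch in word}
        # With one script per character, "empty intersection" degenerates to
        # "not every character has the same script".
        return len(scripts) > 1

    print(looks_mixed_script("microsoft"))       # False
    print(looks_mixed_script("micr\u043esoft"))  # True: Latin + Cyrillic
    print(looks_mixed_script("\u0437\u0430shit\u0430\u043d\u043e"))  # True: заshitано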
In any case, there are emoji which are expected to be a part of text.
On one hand, it would be great to separate areas of different encodings inside a string. But character codes are already such separators.
Two things need to go though: the assumption of linear-time index-based access to characters in a string, and the custom to compare strings as byte arrays.
The first is already gone from several advanced string implementations. The second is harder: e.g. Linux filesystems support Unicode by being encoding-agnostic and handling names as byte sequences. Reworking that would be hard, if practical at all.
> and the custom to compare strings as byte arrays
I think more generally, the idea that a language std lib can provide string equality for human language strings is just silly. String equality is an extremely context dependent, fuzzy operation, and should be handled by each context differently. For example, for Unicode hostname to certificate mapping, hostname equality should be handled by rendering the hostname in several common web fonts and checking if the resulting bitmaps are similar. If they are, then assigning different certificates to these equal hostnames should not be done.
Of course, in other contexts, there are different rules. For example, if looking up song names, the strings "jalapeno" and "jalapeño" should be considered equal, in English text at least.
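A sketch of one such context-specific rule (the song-lookup case): decompose to NFD and drop the combining marks. Appropriate for some searches and wrong for others, which is rather the point:

    import unicodedata

    def strip_marks(s: str) -> str:
        # NFD splits ñ into n + COMBINING TILDE; drop the combining characters.
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_marks("jalape\u00f1o"))                             # jalapeno
    print(strip_marks("jalapeno") == strip_marks("jalape\u00f1o"))  # True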
That doesn’t make sense to me. Even disregarding cases where people mix languages (how do you write a dictionary? If the answer is “just create a data structure combining multiple strings”, shouldn’t we standardize how to do that?), all languages share thousands of symbols such as currency symbols, mathematical symbols, Greek and Hebrew alphabets (to be used in math books written in the language), etc. So, even languages such as Greek and English share way more symbols than that they have unique ones.
> being able to mix arbitrary languages into a single String object
Unless I missed something, that is impossible with Unicode. Mixing multiple languages would require a way to specify the language used for case conversion, sorting and font rendering settings mid-string, and I don't think that Unicode has that. For example, try to write a program that correctly uppercases a single string containing both an English i and a Turkish i in your favorite Unicode-supporting language: the code point is the same for both, and you generally only get to specify one language globally or per function call.
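To illustrate: in Python, str.upper() has no idea which 'i' is English and which is Turkish, and a locale-aware library (PyICU here, a third-party ICU binding, assumed available for this sketch) can only be told a language per call, not per character, so the whole string gets one treatment or the other:

    import icu  # pip install PyICU (assumed available for this sketch)

    s = "istanbul is icy"
    print(s.upper())                                            # ISTANBUL IS ICY
    print(str(icu.UnicodeString(s).toUpper(icu.Locale("tr"))))  # İSTANBUL İS İCY
    # Both i's get the same mapping either way; there is no way to mark
    # "this i is Turkish, that one is English" inside the string itself.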
You can write a string with words from multiple languages, you just can't easily modify it with operations like case conversion. But sorting shouldn't depend on the origin language anyway, it depends on the language of the reader. All words in an English dictionary are sorted in "English" order
> It might have been better if the 'code pages' idea was refined instead of eliminated
Obviously yes, it would have been better.
But Unicode was designed by the same people who designed ASCII - monolingual Americans who never had to deal on a daily basis with anything that doesn't fit into the 26 letters of the English alphabet. So here we are.
This is not even remotely true.
> But Unicode was designed by the same people who designed ASCII - monolingual Americans who never had to deal on a daily basis with anything that doesn't fit into the 26 letters of the English alphabet.
A major problem with Unicode is that it gives you strange ideas about text: that you can somehow take human text encoded in Unicode and answer questions like "how many letters does this have" or "are these two pieces of text different" or "split this text into words" in a way that works generically for any language or context.
These are all myths, and APIs for such things are bugs. The only thing you can meaningfully do with two pieces of arbitrary Unicode text is to say if they are byte-by-byte equal. For any other operation, you need to have specific business logic.
For example, are "Ionuț" and "Ionut" and "Ionutz" the same string or different strings? There is no generic answer: depending on the intended business logic, they may be identical or not (e.g. if we consider these to be Romanian names, they should be considered identical for search purposes, but probably not identical for storage purposes, where you want to remember exactly how the person spelled their name).
A related problem is that most languages have no separate types for Text/String on one hand, and Symbol on the other. Text or other Strings should be opaque human text, that can only be interpreted by specific code, offering almost no API (only code point iteration). Symbols should be a restricted subset of Unicode that can offer fuller features, such as lengths, equality, separation into words etc. This would be the type of JSON key names used in serialization and deserialization, for example.
What they should have done is not that strange. Text is merely an ordered collection of characters. If you just assigned each character (aka grapheme) a number, text becomes a sequence of numbers. The first two questions you pose, "how many letters does this have" and "are these two pieces of text different", are trivially answered by such a representation. Unicode's fuck up is they managed to come up with something that can not reliably answer those two questions.
In fact what Unicode has ended up with is so horrible, it's a major exercise in coding just to answer a simple question like "is there an 'o' in this sentence", as in Python3's "'o' in sentence" does not always return the right result.
Unicode's starting point was all wrong. There is a standard that did a perfectly good job of mapping graphemes to numbers: ISO-10646. In fact Unicode is based on it, but then committed their original sin: they decided all the proposed ISO-10646 encodings (ie, how the numbers are encoded into byte streams) were crap, so they released a standard that combined two concepts that should have remained orthogonal: codepoints and encoding those codepoints to a binary stream.
Now it's true ISO-10646 proposed encodings were undercooked. That became painfully apparent when Ken Thompson came up with utf-8. But no biggie right: utf-8 was just another ISO-10646 encoding, just let it take over naturally. The Unicode solution to the encoding problem was to first decide we would never need more than 2^16 codepoints, then wrap it up in "one true encoding everyone can use": UCS2. Windows and Java, among others, bought the concept, and have paid the price ever since.
They were wrong, of course. 2^16 was not enough. So they replaced the UCS2 encoding with UTF-16, which was sort of backwards compatible. But not one UTF-16, oh no, that would be too simple. We got UTF-16LE and UTF-16BE. Notice what has happened here: take identical pieces of text, encode them as valid Unicode, and end up with two binary objects that are different. Way to go boys!
But that wasn't the worst of it: they managed to screw up UTF-16 so badly it didn't expand the code space to 2^32 points, just about 2^20. And in case you can't guess what happens next, I'll tell you: turns out there are more than 2^20 graphemes out there.
What to do? Well, there are a lot of characters that are "minor variants" of each other, like o and ö. Now Unicode already had a single code point for ö, but to make it all fit and be uniform they decided "Combining Diaeresis" was the way these things should be done in future. So now the correct way to represent ö is the code point for o followed by a code point that says "add an umlaut to the preceding character". But as the original codepoint for ö still exists, we can have two identical graphemes that don't compare as equal under Unicode, which is how we get to ö ≠ ö.
So it's not only Python3's "'o' in sentence" that doesn't always work. We arrived at the point that "'ö' in sentence" can't be done without some heavy lifting that must be done by a library. Just to make it plain: some CPUs can do "'o' in sentence" in a single instruction. That simple design decision has cost us orders of magnitude in CPU efficiency.
I know these are strong words, but IMO this is a brain dead, monumental fuckup, making units like acre-feet and furlong-fortnights look positively sane. It's time to abandon Unicode, and its "Combining Diaeresis" in particular, and go back to basics: ISO-10646 and utf-8. UTF-8 provides a 28 bit encoding space, which is more than enough to realise the single guiding principle that ISO-10646 was founded on: one codepoint per grapheme.
It won’t happen of course, so as a programmer I’ll have to deal with the shit sandwich the Unicode consortium has served up for the rest of my life.
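For what it's worth, the heavy lifting for the "'ö' in sentence" case is in Python's standard library, though the rant's complaint stands that you must know to reach for it:

    import unicodedata

    sentence = "Go\u0308teborg"   # 'o' followed by COMBINING DIAERESIS
    print("\u00f6" in sentence)   # False: the precomposed ö never appears
    print("\u00f6" in unicodedata.normalize("NFC", sentence))  # True after composition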
While one codepoint per grapheme would be nice, it still wouldn't solve text. There are also problems like RTL and LTR writing systems that need to be combined into the same text.
And, many of the examples I gave earlier will not go away. The problem of similar URLs using different characters would be smaller, but not gone - microsoft.com and mícrosoft.com still look too similar. Text search should still support alternate spellings (color and colour). People's names would still have multiple legally identical spellings.
A fun related issue that could occur: applying NFD to a string can make it longer, so a sanitiser that limits file names to 255 UTF-16 code units but doesn’t first normalise to NFD could fail on HFS+.
This could occur on systems that normalise to NFC as well: NFC lengthens some strings, e.g. 𝅗𝅥 (U+1D15E MUSICAL SYMBOL HALF NOTE) normalises to 𝅗𝅥 (U+1D157 MUSICAL SYMBOL VOID NOTEHEAD, U+1D165 MUSICAL SYMBOL COMBINING STEM) in both NFC and NFD (similar happens in various Indic scripts, pointed Hebrew, and the isolated case of U+2ADC FORKING which is special for reasons UAX #15 explains), but I don’t think there are any file systems that actually normalise to NFC? (APFS prefers NFC, but doesn’t normalise at the file system level.)
The remaining concern would be that NFC could take more UTF-8 code units than NFD despite adding a character, but in practice this doesn’t occur (checked on NormalizationTest-3.2.0.txt).
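A quick check of both length effects described above, assuming Python's unicodedata (which tracks the current Unicode version rather than 3.2):

    import unicodedata

    name = "\u00e9" * 200                       # 200 precomposed 'é'
    nfd = unicodedata.normalize("NFD", name)
    print(len(name.encode("utf-16-le")) // 2)   # 200 UTF-16 code units
    print(len(nfd.encode("utf-16-le")) // 2)    # 400 -- sails past a 255-unit cap

    half_note = "\U0001D15E"                    # MUSICAL SYMBOL HALF NOTE
    print(len(unicodedata.normalize("NFC", half_note)))  # 2 code points: NFC got longer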
APFS doesn't "prefer" anything - it will not change the bytes passed to it to NFC or NFD. The bytes passed at creation are stored as-is (HFS will store the NFD form on disk if you pass the NFC form to it). However, APFS is normalization-insensitive (if you create an NFC name on disk, you won't be able to create the NFD version, and you will be able to access the name by both the NFC and NFD variants), just as HFS is - they both use different mechanisms to achieve normalization insensitivity.
That's a really great point about string length, and one not often addressed. You might even be able to force some sort of buffer overflow with that, I guess.
Why isn’t the answer just “Don’t unicode normalise the file name”?
I thought the generally recommended way to deal with file names is to treat them as a block of bytes (to the extent that e.g. rust has an entirely separate string type for OS-provided strings), or just to allow direct encoding/decoding but not normalisation or alteration.
Well, precisely because if you don't normalize the filenames, ö ≠ ö. You could have two files with different filenames, `göteborg.txt` and `göteborg.txt`, and they are different files with different filenames.
Or you could have one file `göteborg.txt`, and when you try to ask for it as `göteborg.txt`, the system tells you "no file by that name".
Unicode normalization is the solution to this. And the unicode normalization algorithms are pretty good. The bug in this case is that the system did not apply unicode normalization consistently. It required a non-default config option to be turned on to do so? I don't really understand what's going on here, but it sounds like a bug in the system to me that this would be a non-default config option.
Dealing with the entire universe of human language is inherently complicated. But unicode gives us some actually pretty marvelous tools for doing it consistently and reasonably. You still have to use them, and use them right, and as with all software, bugs are possible.
But I don't think you get fewer crazy edge cases by not normalizing at all. (In some cases you can even get security concerns, think about usernames and the risk of `jöhn` and `jöhn` being two different users...). I know that this is the choice some traditional/legacy OSs/file systems make, in order to keep pre-unicode-hegemony backwards compat. It has problems as well. I think the right choice for any greenfield possibilities is consistent unicode normalization, so `göteborg.txt` and `göteborg.txt` can't be two different files with two different filenames.
[btw I tried to actually use the two common different forms of ö in this text; I don't believe HN normalizes them so they should remain.]
It looks like instead of the config option switching everything to use the same normalization it keeps a second copy of the name in a database to compare to. What a horrible kludge, I wonder how they even got into this situation of using different normalization in different parts of the system?
In terms of what filenames are, neither Windows nor Linux (I don't know for sure with MacOS but I doubt it) actually guarantees you any sort of characters.
Linux filenames are a sequence of non-zero bytes (they might be ASCII, or at least UTF-8, they might be an old 8-bit charset, but they also might just be arbitrary non-zero bytes) and Windows file names are a sequence of non-zero 16-bit unsigned integers, which you could think of as UTF-16 code units but they don't promise to encode UTF-16.
Probably the files have human readable names, but maybe not. If you're accepting command line file names it's not crazy to insist on human readable (thus, Unicode) names, but if you process arbitrary input files you didn't create, particularly files you just found by looking around on disks unsupervised - you need to accept that utter gibberish is inevitable sooner or later and you must cope with that successfully.
Rust's OsStr variants match this reality.
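The same reality in Python terms (which, like Rust, refuses to pretend filenames are text): on Linux a name is just non-zero bytes, and the str view survives only because of surrogateescape. The byte string below is a made-up example of a legal but non-UTF-8 name.

    import os

    raw = b"report-\xff.txt"           # legal Linux filename, not valid UTF-8
    as_str = os.fsdecode(raw)          # 'report-\udcff.txt' on a UTF-8 locale
    print(os.fsencode(as_str) == raw)  # True: lossless round trip via surrogateescape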
This is what I found quite refreshing about Rust — instead of choosing one of the following:
A) The programmer is an almighty god who knows everything; we just expose him to the raw thing
B) The programmer is an immature toddler who cannot be trusted, so we handle things for them
What Rust does is more along the lines of "you might already know this, but anyway, here is a reminder that you, the programmer, need to take some decision about this".
Filenames in HFS+ filesystem (an old filesystem used by Mac OS X) are normalized with a proprietary variant of NFD - this is a filesystem feature. APFS removed this feature.
Sure, but at some point you might want to create a file (frequently using user input) or filter files using some user-provided query string -- the kind of use cases that unicode normalization was invented for. So the whole "opaque blob of bytes" filesystem handling is nice if all you want is to not silently corrupt files, but it is very obviously not even covering 10% of normal use cases. Rust isn't being super smart, it just has its hands thrown up in the air.
Still, it looks like the right thing to do is let the filesystem do the filesystem's job. The filesystem should be normalizing unicode and enforcing the case-insensitivity and whatnot, but just the filesystem. Wrappers around it like whatever Nextcloud is doing should be treating the filenames as a dumb pile of bytes.
That works for programmers, but not for users. There could be several files with the same name, but with different encodings. Worse, depending on how your terminal encodes user input, some of them might not be typable.
From the user's perspective I don't want any normalisation at all. It's good as long as you only have one file system, but as soon as you get multiple file systems with conflicting rules (which includes transferring files to other people) it becomes hell. Unfortunately we are stuck with that hell.
Falls over on the fact that I don't want to be able to write these two files in the same dir. If I write file ö1.txt and ö1.txt then I want to be warned that the file exists, even if the encoding is different, when I use two different apps but try to write the same file.
One example is when you submit a file in Safari it doesn't normalize the file name, while js file.name does.
The same applies for a.txt and A.txt on case insensitive file systems (as someone pointed out the most common desktop file systems are).
Java is terrible in this regard, as most file APIs use "java.lang.String" to identify the filename, which most of the time depends on the system property "file.encoding". With the result that there will be files that you can never read from a java application if the filename encoding does not match the java file.encoding encoding.
Duolingo doesn't handle Unicode normalisation for certain languages, and it's incredibly frustrating. Here's one example[0] (Vietnamese) and I know it's the case for Yiddish as well.
[0]: https://forum.duolingo.com/comment/17787660/Bug-Correct-Viet...
A filesystem accepting only NFD should be filed as a bug. They can normalize it internally to NFD, as Apple's previous HFS+ did.
But even worse than that is Python's NFKC, which normalizes ℌ to H and so on. The recommended normalizations are NFC for offline normalization (like in compiled languages and databases) and NFD for online use, where speed trumps space. unicode.org talking that much about NFKC was a big mistake. NFKC is crazy and doesn't even roundtrip. The whole TR31 XID_Start/Continue sets exist mostly because of NFKC issues, not so much because of stability. But people bought it for its stability argument.
I'm just writing a library and linter for such issues: https://github.com/rurban/libu8ident
Also note that C++23 will most likely enforce NFC-only identifiers. Same problem as with this filesystem. My implementation was to accept all normalization forms and store them internally, and in the object files, as NFC. The C ABI should declare it also. Currently they care about it as little as Linux filesystems do: nada. Identifiers being unidentifiable.
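The ℌ example above, and why NFKC cannot round-trip, in a couple of lines of Python:

    import unicodedata

    print(unicodedata.normalize("NFKC", "\u210c"))  # 'H'  (U+210C BLACK-LETTER CAPITAL H)
    print(unicodedata.normalize("NFKC", "\ufb01"))  # 'fi' (U+FB01 LATIN SMALL LIGATURE FI)
    # The original code points are gone after folding, so the mapping is lossy.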
Nope, the lack of normalization on both accounts by the SMB server caused the issue. It could have normalized before emitting but it definitely should have normalized on receiving for comparison.
I think that in the ls->read workflow, Nextcloud shouldn't normalize the response from SMB and should issue back to SMB whatever SMB returned to Nextcloud.
According to Unicode, it should be allowed to and the SMB server should be able to handle it. That's kind of the point of normalization, they're meant to be done before all comparisons so that exactly this doesn't happen. Your suggestion is just premature optimization, i.e. eliminating a redundancy.
NFD is fine when you don't have much time and can afford the space. NFC is about 3x slower and smaller. Forcing clients never works.
Be tolerant in what you accept and strict in what you write. In the case of C++23 enforcing NFC my mind is twisted. It would allow heavy tokenizer optimizations, but this is an offline compiler, where you don't really need that.
The problem are the compatible variants, NFKC and NFKD. But then you have usecases where you need them, and more to actually find strings. Levenshtein should not be the default when searching for strings.