What every software developer must know about Unicode in 2023

There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong.

The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined on "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace." Like so many other things in Unicode, the correct answer is use-case dependent.

(And for this reason, String iteration should be based on codepoints--it's the fundamental level on which Unicode works, and whatever algorithm you want to use to derive the correct answer for your purpose will be based on codepoint iteration. hsivonen's article (https://hsivonen.fi/string-length/), linked in this one, does try to explain why extended grapheme clusters is the wrong primitive to use in a language.)

raphlinus · 2 years ago

Agreed. And one more consideration is that (extended) grapheme cluster boundaries vary from one version of Unicode to another, and also allow for "tailoring." For example, should "อำ" be one grapheme cluster or two? It's two on Android but one by Unicode recommendation and is the behavior on mac. So in applications where a query such as length needs to have one definitive answer which cannot change by context, counting (extended) grapheme clusters is the wrong way to go.

lifthrasiir · 2 years ago

In fact, the name "extended" grapheme cluster should give it away. There was already a major revision to UAX #29 so that the original version is now referred as to "legacy". Your example is exactly this case: the second character, U+0E33 THAI CHARACTER SARA AM, prohibits a cluster boundary now but previously didn't [1].

[1] Relevant specifications: https://unicode.org/reports/tr29/#SpacingMark and https://unicode.org/reports/tr29/#GB9a

pif · 2 years ago

> An extended grapheme cluster is largely defined on "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace."

I'm sorry, but I fail to see how "This visually displays as a single unit" could ever differ from "Display size in a monospace font" or "Thing that gets deleted when you hit backspace".

jfultz · 2 years ago

A couple of cases I'm aware of...

* Coding ligatures often display as a single glyph (maybe occupying a single-width character space, or maybe spread out over multiple spaces), but are composed of multiple glyphs. The ligature may "look" like a single character for purposes of selection and cursoring, but it can act like multiple characters when subject to backspacing.

* Similarly, I've seen keyboard interfaces for various languages (e.g., Hindi) where standard grapheme cluster rules bind together a group of code points, but the grapheme cluster was composed from multiple key presses (which typically add one code point each to the cluster). And in some such interfaces I've seen, the cluster can be decomposed by an equal number of backspace presses. I don't have a good sense of how much a monospaced Hindi font makes sense, but it's definitely a case where a "character" doesn't always act "character-like".

mattnewton · 2 years ago

> Display size in a monospace font

Some clusters are going to be multiple characters wide.

> thing that gets deleted when you hit backspace

Some clusters are meant to be composted of multiple keystrokes and a natural editing experience would allow users to delete the last stroke.

Look into how Korean works.

jcranmer · 2 years ago

See, e.g., https://github.com/xi-editor/xi-editor/issues/655 for why backspace isn't the same as extended grapheme cluster.

As for "display size in monospace font", emojis and CJK characters are usually two units wide, not one (although, to be honest, there's a fair amount of bugs in the Unicode properties that define this).

RichieAHB · 2 years ago

Here’s a good example of the test cases used for backspaces in Android[1]. It’s definitely more involved than just deleting a grapheme cluster.

[1] https://android.googlesource.com/platform/frameworks/base/+/...

yeputons · 2 years ago

Here is a full article of such examples: https://manishearth.github.io/blog/2017/01/14/stop-ascribing...

Discussion on HN: https://news.ycombinator.com/item?id=31858311

orphea · 2 years ago

If you type "a", combine it with "´", then change your mind and hit backspace, you probably want to end up with "a" even through "á" was a thing "visually displayed as a single unit".

layer8 · 2 years ago

In terminals there is a distinction between single-width and double-width characters (east-asian characters, in particular). E.g. the three characters

    A美C

would take up the width of four ASCII monospace characters, the “美” being double-width.

Similarly, for composed characters like say the ligature “ﬀ”, you may want to backspace as if it was two “f”s (which logically it is, and decomposes to in NFKD normalization).

MatmaRex · 2 years ago

﷽

𒐫

𒈙

꧅

wbl · 2 years ago

If I have a text with niqqudim I am going to want to think of the niqqudim differently when editing despite the fact they are entwined with the consonants.

barkingcat · 2 years ago

characters that alter their appearance to be one or more display units depending on the characters that are next to it (before and after). that would be a very crude example, but these types of characters appear all the time in human language

riggsdk · 2 years ago

There are libraries that help with iterating both code-points and grapheme clusters... - but are there any of them that can help decide what to do for example when pressing backspace given an input string and a cursor position? Or any other text editing behavior. This use-case-dependent behavior must have some "correct" behavior that is standardized somewhere?

Like a way to query what should be treated like a single "symbol" when selecting text? Basically something that could help out users making simple text-editors. There are so many bad implementations out there that does it incorrectly so there must be some tools/libraries to help with this? Not only for actual applications but for people making games as well where you want users to enter names, chat or other text. Not all platforms make it easy (or possible) to embed a fully fledged text editing engine for those use-cases.

I can imagine that typing a multi-code-point character manually by hand would allow the user to undo their typing mistake by a single backspace press when they are actively typing it, but after that if you return to the symbol and press backspace that it would delete the whole symbol (grapheme cluster).

For example if you manually entered the code points for the various family combination emojis (mother, son, daughter) you could still correct it for a while - but after the fact the editor would only see it as a single symbol to be deleted with a single backspace press?

Or typing 'o' + '¨' to produce 'ö' but realizing you wanted to type 'ô', there just one backspace press would revert it to 'o' again and you could press '^' to get the 'ô'. (Not sure that is the way in which you would normally type those characters but it seems possible to do it with unicode that way).

PeterisP · 2 years ago

I'd argue that you must use grapheme clusters for text editing and cursor position, because here are popular characters (like ö you used as example) which can be either one or two codepoints depending on the normalization choice, but the difference is invisible to the user and should not matter to the user, so any editor should behave exactly the same for ö as U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS) and ö as a sequence of U+006F (LATIN SMALL LETTER O) and U+0308 (COMBINING DIAERESIS).

Furthermore, you shouldn't assume that there is any relationship between how unicode constructs a combined character from codepoints with how that character is typed, even at the level of typing you're not typing unicode codepoints - they're just a technical standard representation of "text at rest", unicode codepoints do not define an input method. Depending on your language and device, a sequence of three or more keystrokes may be used to get a single codepoint, or a dedicated key on keyboard or a virtual button may spawn a combined character of multiple codepoints as a single unit; you definitely can't assume that the "last codepoint" corresponds to "last user action" even if you're writing a text editor - much of that can happen before your editor receives that input from e.g. OS keyboard layout code; your editor won't know whether I input that ö from a dedicated key, a 'chord' of 'o' key with a modifier, or a sequence of two keystrokes (and if so, whether 'o' was the first keystroke or the second, opposite of how the unicode codepoints are ordered).

jcranmer · 2 years ago

Some platforms (e.g., Android) have methods specifically for asking how to edit a string following a backspace. However, there's no standard Unicode algorithm to answer the question (and I strongly suspect that it's something that's actually locale-dependent to a degree).

On further reflection, probably the best starting point for string editing on backspace is to operate on codepoints, not grapheme clusters. For most written languages, the various elements that make up a character are likely to be separate codepoints. In Latin text, diacritics are generally precomposed (I mean, you can have a + diacritic as opposed to precomposed ä in theory, but the IME system is going to spit out ä anyways, even if dead keys are used). But if you have Indic characters or Hangul, the grapheme cluster algorithm is going to erroneously combine multiple characters into a single unit. The issue is that the biggest false positive for a codepoint-based algorithm is emoji, and if you're a monolingual speaker whose only exposure to complex written scripts is Unicode emoji, you're going to incorrectly generalize it for all written languages.

Etherlord87 · 2 years ago

IMHO backspace is not an undo key. Use CTRL+Z if you want to undo converting a grapheme to another grapheme with a diacritic character. Backspace should just delete it.

On the other hand, a ligature shouldn't be deleted entirely with just one backspace. It's two letters after all, just connected.

So how do we distinguish when codepoints are separate graphemes, and when they constitute a single grapheme? Based on if they they can still be recognized as separate within the glyph? If they combine horizontally vs vertically (along the text direction or orthogonal?) What about e.g. "¼" - are those 3 graphemes? What about "%" and "‰"? What about "&" ("et" ligature)? It seems you can't run away from being arbitrary…

gumby · 2 years ago

> Or typing 'o' + '¨' to produce 'ö' but realizing you wanted to type 'ô', there just one backspace press would revert it to 'o' again and you could press '^' to get the 'ô'.

This is a good example because in German I would expect 'o' + '¨' + <delete> to leave no character at all while in French I would expect 'e' + '`' + <delete> to leave the e behind because in my mind it was a typo.

The rendering of brahmic- and arabic-derived scripts makes these choices even more interesting.

layer8 · 2 years ago

Behavior that depends on whether you edited something else in between, or that depends on timing, is just bad. Either always backspace grapheme clusters, or else backspace characters, possibly NFC-normalized. I could also imagine having something like Shift+Backspace to backspace NFKD-normalized characters when normal Backspace deletes grapheme clusters.

As for selection and cursor movement, grapheme clusters would seem to be the correct choice. Same for Delete. An editor may also support an “exploded” view of separate characters (like WordPerfect Reveal Codes) where you manipulate individual characters.

hgs3 · 2 years ago

Everybody loves to debate what "character" means but nobody ever consults the standard. In the Unicode Standard a "character" is an abstract unit of textual data identified by a code point. The standard never refers to graphemes as "characters" but rather as user-perceived characters which the article omits.

lucideer · 2 years ago

I'm not Korean but seeing that said of the Hangul example definitely made me pause - I doubt Koreans think of that example as a single grapheme (open to correction), though it is an excellent example all the same since it demonstrates the complexity of defining "units" consistently across language.

It reminds me a little of Open Street Map's inconsistent administrative hierarchies ("states", "countries", "counties", etc. being represented at different administrative "levels" in their hierarchy for each geographical area), and how that hinders consistency in styling- font size, zoom levels, etc. being generally applied by level.

lifthrasiir · 2 years ago

As a native Korean, I can confirm that "각" is perceived as a single character. But the example itself is bad anyway because everyone use a precomposed form U+AC01 instead of U+1100 U+1161 U+11A8 instead (they are canonically equivalent). This is more clear when you also consider a compatibility representation "ㄱㅏㄱ" U+3131 U+314F U+3131, which is same to "각" after compatibility normalizations (NFKC or NFKD), but perceived as three atomic characters in general.

haberman · 2 years ago

In that case, it sounds like `length` on Unicode strings simply shouldn't exist, since there is no obvious right answer for it. Instead there should be `codepointCount`, `graphemeCount`, etc.

kalleboo · 2 years ago

There are basically 2 places where programmers mostly want the "length" of a string:

1. To save storage space or avoid pathological input, they want to limit the "length" of text input fields. E.g., not allow a name to be 4 KB long

2. To fit something on screen

Developers mostly used to western languages can approximate both with "number of letters", but the correct answers are

For 1. Limit to bytes to avoid people building infinite zalgo characters, but be intelligent about it - don't just crop the byte array not taking into account graphemes.

For 2. This sucks, especially for the web, but the only really correct answer here is to render it and check.

Did I miss any other common cases?

Etherlord87 · 2 years ago

You're absolutely correct! `length` is ambiguous - you shouldn't have a `time` argument in a `sleep` function; you should have `milliseconds` and/or `seconds` etc.

astrange · 2 years ago

String iteration should be based on whatever you want to iterate on - bytes, codepoints, grapheme clusters, words or paragraphs. There's no reason to privilege any one of these, and Swift doesn't do this.

"Length" is a meaningless query because of this, but you might want to default to whatever approximates width in a UI label, so that's grapheme clusters. Using codepoints mostly means you wish you were doing bytes.

b3morales · 2 years ago

> There's no reason to privilege any one of these, and Swift doesn't do this.

Strange thing to say: Swift String count property is the count of extended grapheme clusters. The documentation is explicit:

> A string is a collection of extended grapheme clusters, which approximate human-readable characters. [emphasis in original]

mananaysiempre · 2 years ago

> thing that gets deleted when you hit backspace

Is there a canonical source for this part, by the way? Xi copied the logic from Android[1] (per the issue you linked downthread), which is reasonable given its heritage but seems suboptimal generally, and I vaguely remember that CLDR had something to say about this too, but I don’t know if there’s any sort of consensus on this problem that’s actually written down anywhere.

[1] https://github.com/xi-editor/xi-editor/pull/837

grumpyprole · 2 years ago

> And for this reason, String iteration should be based on codepoints

Why not offer both and be clear about it? Rather than just "length", why not call them code points? The Python docs for "len" which can be called on a unicode string say "Return the length (the number of items) of an object.". It doesn't look like a clear and easy to use API to me.

archgoon · 2 years ago

If you insist that `len` shouldn't be defined on strings, and the default iterator should be undefined in python then:

  for c in "Hello":
    pass

should throw an exception. Also

  if word[0] == 'H':
     pass

should throw an exception.

This would have been an extremely controversial suggestion when python3 came out to say the least.

Codepoints is a natural way of defining unicode strings in python, and it mostly works the way you expect once you give it a bit of thought. It is lower level than, say, grapheme clusters, but its more well defined and it provides the proper primitives for dealing with all use cases.

nerdbert · 2 years ago

> for this reason, String iteration should be based on codepoints

This leads you to the problem where you'll get different results iterating over

n a ï v e

n a ̈ i v e

And I can't see how that's ever going to be a useful outcome.

If you normalize everything first, then you can sidestep this to some degree, but then in effect your normalization has turned codepoint iteration into grapheme iteration for most common Latin-script text characters.

> People are not limited to a single locale. For example, I can read and write English (USA), English (UK), German, and Russian. Which locale should I set my computer to?

Ideally - the "English-World" locale is supposedly meant for us, cosmopolitans. It's included with Windows 10 and 11.

Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux where locales appear to be less easy to customize than in Windows. Windows always offered a handy configuration dialog to granularly tweak your locale choosing what measures system you prefer, whether your weeks begin on sundays or mondays and even define your preferred date-time format templates fully manually.

A less-spoken about problem is Windows' system-wide setting for the default legacy codepage. I happen to use single-language legacy (non-Unicode) apps made by people from a number of very different countries. Some apps (e.g. I can remeber the Intel UHD Windows driver config app) even use this setting (ignoring the system locale and system UI language) to detect your language and render their whole UI in it.

> English (USA), English (UK)

This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.

By the way I wonder how do string capitalization and comparision functions manage to work on computers of people who use both English and Turkish actively (Turkish locale distinguishes between dotted and undotted İ).

lucideer · 2 years ago

As an Irish person, while we have en_IE which is great (and solves most of the problems you list re: Euro-centric defaults + English), I'd still quite like to have an even more broad / trans-language / "cosmopolitan" locale to use.

I mainly type in English but occasionally other languages - I use a combination of Mac & Linux - macOS has an (off-by-default but enable-able) lang-changer icon in the tray that is handy enough, but still annoying to have to toggle. Linux is much worse.

Mac also has quite a nice long-press-to-select-special character that at least makes for accessible (if not efficient) typing in multiple languages while using an English locale. Mobile keyboards pioneered this (& Android's current one even does simultanous multi-lang autocomplete, though it severely hurts accuracy).

---

> I doubt many English speakers care to distinguish between English dialects.

I think you'll find the opposite to be true. US English spellings & conventions are quite a departure from other dialects, so typing fluidly & naturally in any non-US dialect is going to net you a world of autocorrect pain in en_US. To the extent it renders many potentially essential spelling & grammar checkers completely unusable.

l72 · 2 years ago

I write in multiple languages daily on Linux, including English, Russian, and Chinese. Switching keyboards (at least with gnome) is a simple super-space.

While in my default (English) layout, it is easy enough to add in accents other characters using the compose key (right alt). So right-alt+'+a = á or right-alt+"+u = ü. I much prefer this over the long press as I can do it quickly and seamlessly without having to wait on feedback. Granted, it is not as discoverable, but once you are comfortable, it in my opinion is a better system.

jdblair · 2 years ago

I can 2nd this as an American who now resides in Europe. My first laptop I brought with me, and was defaulted to en_US, but my replacement is en_GB (Apple doesn't have en_NL, for good reason).

I don't find it "unusable", though. I could change it back to en_US, but it has actually been interesting to see all of my American spellings flagged by autocorrect. Each time I write authorize instead of authorise it is an act of stubborn group affinity!

TRiG_Ireland · 2 years ago

> US English spellings & conventions are quite a departure from other dialects.

As far as the written, formal language is concerned, English really has only three dialects: US American, Canadian, and everywhere else. There are some other subtle differences (such as "robots" for traffic lights in South Africa, or "minerals" for fizzy drinks in Ireland¹), but that's pretty much it.

¹ Yes, this isn't just slang in Ireland: the formal, pre-recorded announcements on trains use it: "A trolley service will operate to your seat, serving tea, coffee, minerals and snacks." The corresponding Irish announcement renders it mianraí. Food service on trains stopped during covid and has not yet resumed, so I'm working from distant memory now.

DoughnutHole · 2 years ago

> I doubt many English speakers care to distinguish between English dialects

It's worthwhile purely for the sake of autocorrect/typo highlighting in text-editing software. I don't miss the days of spelling a word correctly in my version of English but still being stuck with the visual noise of red highlighting up and down the document because it doesn't conform to US English.

BoxOfRain · 2 years ago

Yeah I'd rather not have my British English dialect seen as second-class in a world of American English ideally which is what having a red document full of 'errors' implies in those sorts of situations.

It's sometimes not a trivial distinction either, for example I've heard of cases where surprised British redditors have found themselves banned from American subreddits for being homophobic when they were actually talking innocently enough about cigarettes!

notatoad · 2 years ago

> I doubt many English speakers care to distinguish between English dialects

I think you'd be surprised how many english (UK) people will get pissed off when their spell-checker starts removing the "u" from colour or flavour, or how many English (US) people get pissed off when the spellchecker starts suggesting random "u"s to words.

additionally to that, locale isn't just about language. English (US) and English (UK) decides whether your dates get formatted DD-MM-YY or MM-DD-YY, whether your numbers have the thousands broken by commas or spaces, and a host of other localization considerations with a lot more significance than just the dialect of english.

TRiG_Ireland · 2 years ago

I'd really like an en-GB-oxendict (British English but favouring -ize over -ise) locale for formal writing.

aksss · 2 years ago

I worked for BP for a while (well, as a contracted coder) and I got quite used to the UK spell check correcting everything to its idiom. Everything seemed wrong once I returned a world that dismissed the value of the letter 'U' and preferred the letter 'Z' over 'S'. Also missed the normalizing of drinking beer at lunch.

uxp8u61q · 2 years ago

> I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time

> I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.

Well, you just explained what this plethora of options is about. It's not just about how you spell flavor/flavour. It's a lot of different defaults for how you expect your OS to present information to you. Default paper size, but also how to write date and time, does the week start on Monday, Sunday, or something else, etc.

grotorea · 2 years ago

> Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux where locales appear to be less easy to customize than in Windows. Windows always offered a handy configuration dialog to granulatly tweak your locale choosing what measures system you prefer, whether your weeks begin on sundays or mondays and even define your preferred date-time format templates fully manually.

There's the English (Denmark) locale for that on some platfoms.

qwerty456127 · 2 years ago

Thank you very much, I'll give it a try.

__d · 2 years ago

I write daily in US English, Australian English, and Austrian German. Most of the time, a specific document is in one dialect/language or another: not mixed, although sometimes that's not true.

I can understand that the conflation of spelling, word choices, time and date formatting, default paper sizes, measurement units, etc, etc, is convenient, and works a lot of the time, but it really doesn't work for me at all.

That said, I appreciate that I occupy a very small niche.

lmm · 2 years ago

> I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects.

Most people in the UK care - a population nearly twice that of California, and larger than the native speakers of any non-top-20 language. If you care enough to support e.g. Italian you should support en_UK.

masklinn · 2 years ago

> English (USA), English (UK)

> This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects.

While that is generally (though not always) true, I would assume it's really a stand in for the much more relevant zh locales.

It is also rather relevant to es locales (america spanish has diverged quite a bit from europe spanish hence the creation of es-419), definitely french (canadian french, to a lesser extend belgian and swiss), and german (because swiss german). And it might be relevant for ko if north korea ever stops being what it is.

Deleted Comment

fomine3 · 2 years ago

Unix style locale (set by env vars) is flexible, that can set per app. Android and iOS now support locale per app recently IIRC. Windows locale settings is global and some requires reboot.

hahn-kev · 2 years ago

As much as I appreciate that I always wondered how many programs actually respect all those tweaks.

dizhn · 2 years ago

i İ

ı I

I symphatize with people who get this wrong. (I just saw some YouTube video have a title TÜRKIYE in a segment)

Even google keyboard can't seem to distinguish between I and İ. When I type "It", it suggests "İt's" which is quite pathetic.

ipsi · 2 years ago

> This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.

I definitely do. The biggest difference, as everyone else, has pointed out is the US vs UK spellings.

Realistically, though, beyond that country is a poor indicator for everything else. I want to use DD/MM/YYYY date format in English, but DD.MM.YYYY date format in German. I want to use $1,000 in English, but 1.000 $ in German. This isn't dependent on the country I live in, this is dependant on a combination of a country and language - that could be the country I'm living in, or the country I grew up in (mostly US date format vs not), and it's either the language I'm actively typing in, or the language of the document I'm reading, or the language I'm thinking in (but a computer can't exactly handle that).

Trying to guess the correct combination is tricky, especially if a document is in two languages (e.g., a quotation), and users are lazy and won't switch their IME unless they have to.

What this means in slightly more practical terms is that setting a single "locale" for my device doesn't make sense, but rather I should be able to choose a locale per language (or possibly spelling preferences by language and formatting options by language as separate choices). I'd then pick a language to use the device in, and it would use that languages locale, and tell apps that this language is the preferred language. If an app doesn't provide my preferred language, pull the preferred locale from settings for a language it does support, otherwise use the default set by developers. For some apps, it's a bit more complex, particularly if I'm creating content. GMail or Office would be two good examples, where the UI language might be in English, but the emails or documents are in German, or a combination of German and English.

Even then, I'm sure there are people who need something even more flexible than that.

At the moment, if I set my language to English but my Country to Germany on my iPhone, for example, things occasionally get confused. My UK banking app, for example, pulled the decimal separator from my locale settings for a while and then refused to work because "£9,79" (or whatever it was) isn't a valid amount of money, and I couldn't see a way to fix that without switching my Country in the phone settings. I imagine they fixed it by ignoring my configured locale and always using en-GB, thus defeating the whole point of a locale in the first place.

So yeah, these days it's fairly common to not have a single "locale" that you work in - it's quite possible to want to use two or more but nothing is really set up to handle that well.

my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]"; say emoji; #Will print the character say emoji.chars; # 1 because on character say emoji.codes; # 5 because five code points say emoji.encode('UTF8').bytes; # 17 because encoded utf8 say emoji.encode('UTF16').bytes; # 14 because encoded utf16

https://archive.ph/LtKk0

neonate · 2 years ago

http://web.archive.org/web/20231002163213/https://tonsky.me/...

This is quite a good write up. An answer to one of the author's questions:

> Why does the ﬁ ligature even have its own code point? No idea.

On of the principles of Unicode is round trip compatibility. That is you should be able to read in a file encoded with some obsolete coding system and write it out again properly. Maybe frob it a bit with your unicode-based tools first. This is a good principle, though less useful today.

So the ﬁ ligature was in a legacy encoding system and thus must be in Unicode. That's also why things like digits with a circle around them exist: they were in some old Japanese character set. Nowadays we might compose them with some zwj or even just leave them to some higher level formatting (my preference).

sdrothrock · 2 years ago

> they were in some old Japanese character set

This implies that they're obsolete, but they're not -- they're still in very common use today. You can type them in Japanese by typing まる (maru, circle) and the number, then pick it out of the IME menu. Some IMEs will bring them up if you just type the number and go to the menu, too. :)

Fair enough. I was thinking of them as obsolete, but shouldn't since you do see them a surprising amount in Japan.

ufo · 2 years ago

What do the Japanese use the circled numbers for?

WorldMaker · 2 years ago

> So the ﬁ ligature was in a legacy encoding system and thus must be in Unicode.

Most of the pre-composed latin ligatures are generally from EBCDIC codepages. People in the ancient Mainframe era wanted nice typesetting too, but computer fonts with ligature support were a much later invention.

You can see ﬁ and several others directly in EBCDIC code page 361:

https://en.wikibooks.org/wiki/Character_Encodings/Code_Table...

Thanks. Some alphabets have precomposed ligatures that aren't really letters, like old German alphabets with tz, ch, ss (I only know how to type the last one, ß, because the others have died out over the last hundred years).

Actually in German (at least) ä, ö and ü really are actually ligatures for ae, oe, and ue -- the scribes started to write the E's on their sides above the base letters, and over time the superscript "E"s became dots or dashes. Often they are described the other way around: "you can type oe if you can't type ö." That's what my kid was told in school!

But Ö and ß aren't really part of the alphabet in German, while, say, in Swedish, ä and ö became actual letters of the alphabet. English got W that way too.

gwervc · 2 years ago

The circled digits as code points are very nice to have precisely because they are available in applications that don't support them otherwise... which is actually most of the software I can think of (Notepad, Apple Notes, chat applications, most websites, etc).

My point was that, had they not been legacy characters (or had RT compatibility been disregarded) Unicode could still have supported them as composed characters. Though I personally still feel they are a kind of ligature or graphic, but luckily for everyone else I’m not the dictator of the world :-).

We should be careful: someone on HN could write a proposal that they should be considered pre composed forms that should also have an un-composed sequence… so there could in future be not just 1 in a circle but 1 ZWJ circle, circle ZWJ 1 all considered the same…I can imagine some HN readers being pranksters like that.

swores · 2 years ago

Can you write them with iOS keyboard? Or when you say Apple Notes and chat apps you just mean from desktop?

Edit ①: seems the answer is not with the default iOS keyboard, but possible to paste it and perhaps possible with a third party keyboard that I'm not keen on trying (unless I hear of a keyboard that's both genuinely useful / better than default, and that doesn't send keystrokes to the developer - though I can't remember if the latter is even a risk on iOS, better go search about that next..)

glandium · 2 years ago

> That's also why things like digits with a circle around them exist: they were in some old Japanese character set.

Replace "digits with a circle around them" with "emojis" and that's also true.

davidham · 2 years ago

Is it just me, or is anyone else seeing what looks like the mouse pointer of everyone else reading the page, like 1,000 little ants on the screen

toastercat · 2 years ago

Anytime tonsky's site gets posted here, I'm reminded by how awful it is, which is ironic given his UI/UX background. The site's lightmode is a blinding saturated yellow, and if you switch into darkmode, it's an even less readable "cute" flashlight js trick. I don't know why he thought this was a good idea. Thank god for Firefox reader mode.

ericmcer · 2 years ago

I don't think he added moving cursors all over the page because he thought it was good UI/UX, he knows what he is doing.

LordDragonfang · 2 years ago

It's deeply ironic that an article about dealing with text properly has images which are part of the article text and yet have no alt-text, rendering parts of the article unreadable in reader mode if the server is slow.

spacechild1 · 2 years ago

It is obviously a joke (and a good one, I dare say). The fact that people seem to take it seriously says something about the contemporary state of webdesign :)

coldpie · 2 years ago

Works like a normal website with JavaScript disabled. I didn't even know it did fancy junk until reading the comments here. NoScript saves the day again! I don't know how people can browse the web without it.

johnnyanmac · 2 years ago

>it's an even less readable "cute" flashlight js trick. I don't know why he thought this was a good idea. Thank god for Firefox reader mode.

not even a proper flashlight. it updates when the mouse moves, so you're SOL if you scroll on desktop.

torgard · 2 years ago

Well, I thought it was fun.

Arech · 2 years ago

I'd say this annoying trick is highly appropriate for the topic!

wirelesspotat · 2 years ago

Yep, the website opens a websocket connection[0] and sends the mouse position every 1 second

[0] WS connection is on `wss://tonsky.me/pointers?id=XXXXXX&page=/blog/unicode/&platform=XXX`

aragonite · 2 years ago

I've been drawing circles for over a minute now and no one has joined me yet, so I conclude those movements are random rather than made by intelligent beings. :)

extraduder_ire · 2 years ago

I did the same for a while while I was reading. From another comment, the position only seems to update once a second, so it'll be hard for someone to notice your movements.

KyleBerezin · 2 years ago

That makes me think of this old gem https://imgur.com/gallery/BgKFcI9

wizofaus · 2 years ago

It's quite possibly the worst web page presentation I've come across in a long time - aside from the fact it looks like some bug has caused my OS to leave a random trail of mouse pointers all over the screen, some of them even move around, making me doubt my sanity when I'm quite sure I'm holding the mouse still. And the less said about the colours the better. There's no way I was going to put up with that long enough to read all the text on it.

pookha · 2 years ago

Good times. If you click on the sun switch the entire UI gets zeroed out and you get to use on:hover mouse shtick to read the UI through a fuzzy radius. Is Yoko Ono designing websites now?

WD-42 · 2 years ago

It's a joke. It made me laugh.

hwillis · 2 years ago

turned off javascript as soon as I saw it. Like trying to read with twenty mosquitos in your face.

sebstefan · 2 years ago

hey be nice to my mouse cursor

neurostimulant · 2 years ago

If you're using firefox, toggling reader view should do the trick.

ilyt · 2 years ago

I see nice crisp black text on white background because apparently server melted down

hot_gril · 2 years ago

I saw that, except half the images weren't loading, and there was just one mouse pointer.

zzzeek · 2 years ago

yeah....why on earth would someone want their webpage to do this, especially if they have text they'd presumably want you to read?

It's cute, and provides a hint of human connection that is otherwise absent on the web "hey, another human is reading this too!" which you probably know but something about seeing the pointer move makes it feel real.

Probably not the greatest during a hacker news hug of death, but if I read that article some other time and saw one of the moving pointers, I would think it was really cool.

fragmede · 2 years ago

Have you ever read with other people, like in school or a book club, or been somewhere that there were other people around? It's an interesting move by the author; the loneliness epidemic hasn't gone unnoticed.

eg https://www.npr.org/2023/05/02/1173418268/loneliness-connect...

arendtio · 2 years ago

Too bad the Linux and the Mac pointer look so similar. But when you give them different background colors, it becomes more obvious which platform dominates, like:

  .pointer.l {
    background-color: green;
  }

nigma1337 · 2 years ago

Distracted me from reading the article, I just started chasing other people around.

dystroy · 2 years ago

That's what's missing. When I click on a pointer, its owner should have the article replaced with a "GAME OVER" message.

pests · 2 years ago

I know which site you are talking about before even clicking the article :(

lbltavares · 2 years ago

It's fun specially for folks like me who have ADHD. But there should be a button to disable it

neonsunset · 2 years ago

Yes, reading the article is impossible with erratic movement on the screen.

scruss · 2 years ago

as someone with a visual processing disorder, this is like having a page scream at me. Repeatedly. Never do this

lifeinthevoid · 2 years ago

yup, pretty annoying

876978095789789 · 2 years ago

Yeah, it's extremely obnoxious.

ormax3 · 2 years ago

first thing I did before reading the article, using uBO to block JS on the page

dekken_ · 2 years ago

not just you, this is what my other comment is about (indirectly)

oefrha · 2 years ago

> many Chinese, Japanese, and Korean logograms that are written very differently get assigned the same code point

This leads to absolutely horrendous rendering of Chinese filenames in Windows if the system locale isn’t Chinese. The characters seem to be rendered in some variant of MS Gothic and it’s very obviously a mix of Chinese and Japanese glyphs (of somewhat different sizes and/or stroke widths IIRC). I think the Chinese locale avoids the issue by using Microsoft YaHei UI.

tr888 · 2 years ago

What on EARTH is that mouse cursor thing all about? Why would you even bother writing this, then making it impossible to read properly?

oliwarner · 2 years ago

It's tracking every visitors' cursor and sharing it with every other visitor.

Why would a frontend developer demonstrate their ability to do frontend programming on their personal, not altogether super-serious blog? I meant that rhetorically but it's a flex. I agree, not the best design in the world if you're catering for particular needs, but simple and fun enough. You should check out dark mode.

In that vein, I think it's okay if we let people have fun. That might not work for everyone, but why should we let perfect be the worst enemy of fun?

dathinab · 2 years ago

> Why would

because it shows that they don't understand important design aspects

while it doesn't really show off their technical skills because it could be some plugin or copy pasted code, only someone who looks at the code would know better. But if someone care enough about you to look at your code you don't need to show of that skill on you normal web-site and can have some separate tech demo.

> okay if we let people have fun

yes people having fun is always fine especially if you don't care if anyone ever reads your blog or looks at it for whatever reason (e.g. hiring)

but the moment you want people to look at it for whatever reason then there is tension

i.e. people don't get hired to have fun

and if you want others to read you blog you probably shouldn't assault them with constant distractions

eerikkivistik · 2 years ago

I stopped in the middle of reading the post just for this. It was so distracting I was unable to focus on the text. It's a fun gimmick, but the result is that someone who wanted to read the post, stopped in the middle.

(sarcasm)

It's revenge against anyone with certain kinds of visual impairments and/or concentration issues because the ex-spouse of the author which turned out to be a terrible person had such.

(sarcasm try 2)

It's revenge against anyone using JS on the net with the author trying to subtle hint that JS is bad.

(realistic)

It's probably on of:

- the website is a static view of some collaborative tool which has that functionality build in by default

- some form of well intended but not well working functionality add to the site as it was some form of school/study project, in that case I'm worried about the author suffering unexpected very much higher cost due to it ending up on HN ...

tonsky · 2 years ago

Hi, author here. In case you really want to know: no, it’s custom-made and works exactly as intended. There are two main reasons:

1. Fun. Modern internet is boring, most blog posts are just black text on white background. Hard to remember where you read what. And you can’t really have fun without breaking some expectations.

2. Sense of community. Internet is a lonely place, and I don’t necessarily like that. I like the feeling of “right now, someone else reading the same thing as I do”. It’s human presence transferred over the network.

I understand not everybody might like it. Some people just like when things are “normal” and everything is the same. Others might not like feeling of human presence. For those, I’m not hiding my content, reader mode is one click away, I make sure it works very well.

As for “unexpectedly ended up on HN”, it’s not at all unexpected. Practically every one of my general topic articles ends up here. It’s so predictable I rely on HN to be my comment section.

wonger_ · 2 years ago

The author has several other writeups:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

The cursors will only be a problem during front page HN traffic. And the opt-out for people who care is reader mode / disable js / static mirror. Not sure if there's any better way to appease the fun-havers and the plain content preferrers at the same time. Maybe a "hide cursors" button on screen? I, for one, had a delightful moment poking other cursors.

Luctct · 2 years ago

I don't know what you people are talking about. I'm just glad I always browse with Javascript turned off. If you didn't see the writing on the wall and permanently turn Javascript off around 2006, you have no right to complain about anything.

Meanwhile, ironic irony is ironic: "Hey, idiots! Learn to use Unicode already! Usability and stuff! Oh, btw, here is some extremely annoying Javascript pollution on your screen because we are all still children, right? Har har! Pranks are so kewl!!!1!"

zbtaylor1 · 2 years ago

Are you alright?

yrro · 2 years ago

I got a good laugh out of it.

> The only modern language that gets it right is Swift:

I disagree.

What is the "right" things is use-case dependent.

For UI it's glyph bases, kinda, more precise some good enough abstraction over render width. For which glyphs are not always good enough but also the best you can get without adding a ton of complexity.

But for pretty much every other use-case you want storage byte size.

I mean in the UI you care about the length of a string because there is limited width to render a strings.

But everywhere else you care about it because of (memory) resource limitations and costs in various ways. Weather that is for bandwidth cost, storage cost, number of network packages, efficient index-ability, etc. etc. In rare cases being able to type it, but then it's often us-ascii only, too.

patrickas · 2 years ago

That is why I like the way Raku handles it.

It has distinct .chars .codes and .bytes that you can specify depending on the use case. And if you try to use .length is complains asking you to use one of the other options to clarify your intent.

ssokolow · 2 years ago

*nod*

Rust was given as one of the examples and Rust's .len() behaviour is chosen based on three very reasonable concerns:

1. They want the String type to be available to embedded use-cases, where it's not reasonable to require the embedding of the quite large unicode tables needed to identify grapheme boundaries. (String is defined in the `alloc` module, which you can use in addition to `core` if your target has a heap allocator. It's just re-exported via `std`.)

2. They have a policy of not baking stuff that is defined by politics/fiat (eg. unicode codepoint assignments) into stuff that requires a compiler update to change. (Which is also why the standard library has no timezone handling.)

3. People need a convenient way to know how much memory/disk space to allocate to store a string verbatim. (Rust's `String` is just a newtype wrapper around `Vec<u8>` with restricted construction and added helper functions.)

That's why .len() counts bytes in Rust.

Just like with timezone definitions, Rust has a de facto standard place to find a grapheme-wise iterator... the unicode-segmentation crate.

Swift made an effort to handle grapheme clusters but severely over-complicated strings by exposing performance details to users. Look at the complex SO answers to what should be simple questions, like finding a substring: https://news.ycombinator.com/item?id=32325511 , many of which changed several times between Swift versions

I was working on an app in Swift that needed full emoji support once. Team ended up writing our own string lib that stores things as an array of single-character Swift strings.

marcellus23 · 2 years ago

> many of which changed several times between Swift versions

This was true while Swift was developing but it's been stable now for several years. At some point that complaint is no longer valid.

Also, realized "needed full emoji support" sounds silly. It needed to do a lot of string manipulation, with extended grapheme clusters in mind, mainly for the purpose of emojis.

Arguably, you don’t need any (default) length at all, just different views or iterators. When designing a string type today, I wouldn’t add any single distinguished length method.

galad87 · 2 years ago

Swift string type has got many different views, like UTF-8, UTF-16, Unicode Scalar, etc… so if you want to count the bytes or cut over a specific byte you still can.

that's not the issue

defaults matter

as in they should things you can just use by-default without thinking about it

as swift is deeply rooted in UI design having a default of glyphs make sense

and as rust is deeply rooted in unix server and system programming utf-8 bytes make a lot of sense

through the moment your language becomes more general purpose you could argue having a default in any way is wrong and it should have multiple more explicit methods.