UTF-8 is so complicated, because it wants to be backwards compatible with ASCII.
- requires less memory for most strings, particular ones that are largely limited to ASCII like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.
Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.
This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.
Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?
It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), and if you're dealing with anything beyond ASCII (which you really should, since East Asian users alone number in the billions).
Same goes to the "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
And before that the only thing the relative rarity did for you was that bugs with code working on UTF-8 bytes got fixed while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.