This is a bug in lsp-mode, a third-party Emacs package that acts as an LSP client. As of Emacs 29, the standard distribution includes a different LSP client package, eglot, which does not seem to have this bug:
Ha, as soon as he mentioned LSP I knew it was going to be about the insane fact that it sends all text as UTF-8 (and Rust uses UTF-8) but uses UTF-16 code units for column indexes. (I assume that is what it is anyway; I did not make it to the end, sorry.)
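For anyone who hasn't hit this before, here's a tiny sketch (my own throwaway string, nothing from the article) of why the unit you count in matters the moment you leave ASCII:

```rust
fn main() {
    // The same line of text, measured three ways. They only agree for
    // pure ASCII, which is why "column 9" is meaningless unless you say
    // which unit you're counting in.
    let line = "let s = \"🐈\";";
    println!("UTF-8 bytes:       {}", line.len());                  // 15
    println!("UTF-16 code units: {}", line.encode_utf16().count()); // 13
    println!("Unicode scalars:   {}", line.chars().count());        // 12
}
```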
They've had an open bug about it for years but I think most people just ignore the issue because to handle it correctly you have to convert to UTF-16 and back (again after VSCode has already done it) and it's just not really worth dealing with until Microsoft fixes their end.
Actually I just went back and double checked and Microsoft did actually fix it last year! You can now negotiate to get positions in Unicode code points instead. Rejoice!
Except, as you'll find people in these very comments arguing, if you don't implement UTF-16 anyway then you're not a compliant LSP implementation.
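For anyone who hasn't read the 3.17 changelog: if I remember the spec correctly, the negotiation is just a pair of fields in the initialize exchange. Roughly, sketched in Rust with serde_json (field names from the spec, all the surrounding plumbing omitted):

```rust
use serde_json::json;

fn main() {
    // Client side: advertise which position encodings we can handle.
    // "utf-16" stays in the list because it's the mandatory fallback.
    let client_capabilities = json!({
        "general": { "positionEncodings": ["utf-8", "utf-32", "utf-16"] }
    });

    // Server side: pick one and announce it in the initialize result.
    let server_capabilities = json!({
        "positionEncoding": "utf-8"
    });

    println!("{client_capabilities}");
    println!("{server_capabilities}");
}
```

If the server never announces anything, both sides are still stuck with UTF-16 code units, which is the compliance argument above.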
Don't forget that UTF-16/UCS-2 is Embrace, Extend, Extinguish at its finest. Literally the whole point of UTF-16 - not necessarily the explicit one, but the point behind the motivations people had for choosing what they chose - was incompatibility with other software that supported Unicode back in the day.
So this all justifies my general stance of "find a Unicode expert" when questions of Unicode come up. But that's pretty wilfully blind (although practical).
I would like to understand why UTF-32 didn't catch on as The Standard Unicode for the modern world. It seems that - albeit memory wasteful - it would sidestep a lot of these issues.
The answer is in the question, really. If you've got a big pile of mostly-ASCII data, quadrupling memory/storage to encode it as UTF-32 is going to be a pretty tough sell.
How many "big piles of mostly-ascii data" are there though? (Does anyone want to write a script which searches /dev/mem and categorises pages of RAM into ascii-or-not-ascii so we can get some meaningful numbers? :P )
(If you’re doing number-crunching on giant CSVs, maybe I can see it being important, but all the ascii files on my desktop that I can think of are pretty trivial)
One of the root problems here is that a concept like "character" is extremely underdefined. What you want might be counting the memory a string will take up, indicating a particular spot in the string, mapping to the arrow keys or the backspace key in a visual program, knowing how far to indent to drop a caret in a visual error report, or knowing the width of the string in a visual (especially fixed-width) text display. For ASCII text, you can use the same value for all of these slightly different definitions, but for non-ASCII text you need different definitions, and there's no one true definition that solves all use cases (no, not even grapheme clusters).
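To make that concrete, here's one string counted four ways (Rust, with the unicode-segmentation crate pulled in for grapheme clusters; everything else is std):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "é" written as e + combining acute, followed by a flag made of two
    // regional-indicator code points.
    let s = "e\u{301}🇺🇳";
    println!("UTF-8 bytes:       {}", s.len());                   // 11
    println!("UTF-16 code units: {}", s.encode_utf16().count());  // 6
    println!("Unicode scalars:   {}", s.chars().count());         // 4
    println!("Grapheme clusters: {}", s.graphemes(true).count()); // 2
    // And none of these tell you how many columns it occupies on screen.
}
```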
That's not quite right. UTF-8 is not arbitrary length.
Officially, it's at most four bytes, of which 21 bits are usable for encoding codepoints - so that's an upper limit of 2^21 codepoints.
There is an initial byte encoding the length as a series of ones, so if you went ahead and extended the standard to simply allow more bytes, you could get up to 8 bytes, of which 42 bits would be usable (an all-ones lead byte carries no payload, and each of the 7 continuation bytes carries 6 bits).
I can see that a six-byte version with 31 data bits was previously standardised before they settled on four.
I guess you could extend it further by allowing more than one initial byte to encode the length; then it would be arbitrary length. But I'm not sure whether that would lose the self-synchronising property, and in any case it would be a different standard at that point.
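The arithmetic behind those bit counts, for anyone who wants to check it (this is just my extrapolation of the lead-byte scheme; nothing beyond six bytes was ever standardised):

```rust
// An n-byte sequence has a lead byte of n ones followed by a zero, leaving
// 7 - n payload bits there, plus 6 payload bits per continuation byte.
// At n = 8 the terminating zero no longer fits, so the all-ones lead byte
// carries no payload at all.
fn payload_bits(n: u32) -> u32 {
    match n {
        1 => 7,                          // plain ASCII: 0xxxxxxx
        2..=7 => (7 - n) + 6 * (n - 1),  // 11, 16, 21, 26, 31, 36
        8 => 6 * 7,                      // 42
        _ => panic!("not a UTF-8-style sequence length"),
    }
}

fn main() {
    for n in 1..=8 {
        println!("{n} bytes -> {} payload bits", payload_bits(n));
    }
}
```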
It's pointless to have char32_t if you still need to pull in several megabytes of ICU to normalize the string first in order to get rid of characters that span multiple codepoints. UTF-32 is arguably dangerous because of this: it's yet another attempt to replicate ASCII, but with Unicode. The only sane encoding out there is UTF-8, and that's it. If you always have to assume your string is not really splittable without a library, you won't do dangerous stuff such as assuming `wcslen(L"menù") == 4`.
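The menù example is easy to reproduce without ICU, for what it's worth; even counting code points rather than bytes, the answer depends on normalization, which is the point above:

```rust
fn main() {
    // Two strings that both render as "menù": precomposed ù (U+00F9)
    // versus u followed by a combining grave accent (U+0300).
    let nfc = "men\u{00F9}";
    let nfd = "menu\u{0300}";
    assert_ne!(nfc, nfd); // different bytes, same rendered text
    println!("{} vs {} code points", nfc.chars().count(), nfd.chars().count()); // 4 vs 5
    // So "menù is 4 characters long" is only true for one of the two
    // spellings, exactly the wcslen-style assumption being warned about.
}
```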
This all justifies my stance of "don't add complexity unless you absolutely need it."
People are surprised, confused, and sometimes even offended by the fact that I do almost all my work with a plain text editor and a terminal. I have the same sentiment towards those who insist upon large complex fragile stacks of tools and then wonder why they spend so much time chasing down bugs in those rather than working on what they actually intended to.
This justifies my stance of "find the most normal setup and use it".
This is a bug in the less popular third-party LSP package for Emacs, which is already quite unpopular.
I use VSCode, an enormously complex system. But so many other people use it that there tends not to be this sort of bug. And in the rare case there is one, I just wait a few days until someone else solves it.
Isn't the problem here that both UTF-8 and UTF-16 were being used at once, with incorrect conversion between offsets? I don't see how adding another encoding would help here.
Sure, if everyone used UTF-32 for everything then these problems would go away, but they would also go away if everyone used UTF-8, and most uncompressed files would be 4 times smaller.
The LSP protocol sends indexes. Insanely, those indexes are in terms of UTF-16 code units. Emacs's LSP client implementation here is sending the wrong index: 8 for the emoji's index, but 9 for the index of the next "character". But an emoji spans two UTF-16 code units, so the next index is 10.
Rust-analyzer simply crashes here, but it's been fed hot garbage by the editor. One might argue it shouldn't crash. TFA digs into the details around that, too, because Amos leaves no stone unturned.
I imagine this "UTF-16 code unit indexing" decision is just an artifact of the fact that that's how JavaScript works with strings, and LSP comes from VSCode.
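To make the off-by-one concrete, here's a stand-in line with the same shape as the article's scenario (the emoji starts at UTF-16 offset 8; this is not the actual source line from the post):

```rust
fn main() {
    let line = "let a = 🐈 + 1;";
    let units: Vec<u16> = line.encode_utf16().collect();

    // The cat is outside the Basic Multilingual Plane, so it takes a
    // surrogate pair: two UTF-16 code units, at offsets 8 and 9.
    assert_eq!(units[8], 0xD83D); // high surrogate
    assert_eq!(units[9], 0xDC08); // low surrogate
    assert_eq!(String::from_utf16(&units[8..10]).unwrap(), "🐈");

    // So a range of (start: 8, end: 9) slices the emoji in half; the next
    // character actually begins at offset 10, and a server that rejects
    // the bogus range isn't being entirely unreasonable.
}
```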
Nitpick: lsp-mode is not “Emacs's LSP client”. Emacs recently chose to include “eglot” as part of Emacs itself, and eglot must therefore be considered Emacs's official LSP client, not “lsp-mode”.
https://github.com/joaotavora/eglot/blob/e501275e06952889056...
Everything else has.
And then there is UTF-16, which has all the pains of UTF-8 with none of the advantages of UTF-32.
> High surrogates are D800-DB7F
Akshually, high surrogates extend all the way to DBFF.
https://unicode-table.com/en/blocks/high-surrogates/
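A few lines of arithmetic confirm that (my own check, not taken from the linked table): every code point above U+FFFF maps to one high and one low surrogate, and the highest one lands exactly on DBFF/DFFF.

```rust
fn main() {
    // Encode the largest code point, U+10FFFF, as a surrogate pair by hand.
    let v = 0x10FFFF_u32 - 0x10000;
    let high = 0xD800 + (v >> 10);  // high surrogates: D800..=DBFF
    let low = 0xDC00 + (v & 0x3FF); // low surrogates:  DC00..=DFFF
    assert_eq!(high, 0xDBFF);
    assert_eq!(low, 0xDFFF);
    println!("U+10FFFF -> {high:04X} {low:04X}");
}
```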