csande17 commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
MyOutfitIsVague · a day ago
You're somewhat mistaken in saying that "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]." You're right that the encoding is mechanically capable of this at the raw level, but it is actually forbidden in Unicode: surrogates are invalid code points in any encoding form.

Using those code points makes for invalid Unicode, not just invalid UTF-16. Rust, which uses UTF-8 for its String type, also forbids unpaired surrogates: `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.

It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.

edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146

csande17 · a day ago
The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.

csande17 commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
layer8 · a day ago
The parent’s point was that “potentially ill-formed UTF-16" and "WTF-8" are inherently different encodings (16-bit word sequence vs. byte sequence), and thus not “aka”.
csande17 · a day ago
Although they're different encodings, the thing that they are encoding is exactly the same. I kinda wish I could edit "string representation" to "modeling valid strings" or something in my original comment for clarity...
csande17 commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
alright2565 · a day ago
> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.

csande17 · a day ago
I could be mistaken, but I think Python cares about making sure strings don't include any surrogate code points that can't be represented in UTF-16 -- even if you're encoding/decoding the string using some other encoding. (Possibly it still lets you construct such a string in memory, though? So there might be a philosophical dispute there.)
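
That hedge can be checked directly in CPython 3: a str will happily hold a lone surrogate in memory, but the standard codecs refuse to encode it. A minimal sketch:

```python
# CPython 3.x: str is a sequence of code points, surrogates included,
# so constructing a lone surrogate in memory succeeds...
s = "\ud800"
assert len(s) == 1

# ...but encoding it -- with UTF-8 here, and likewise UTF-16/UTF-32 --
# raises UnicodeEncodeError, enforcing the "no surrogates" restriction.
try:
    s.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False
assert not encoded_ok
```

So the philosophical dispute is real: the in-memory type is looser than what can leave the process through a standard codec.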

Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.
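
A quick CPython sketch of that distinction: the two sequences are distinct at the code-point level, but UTF-16 maps them to the same bytes.

```python
pair = "\ud83d\ude00"    # two surrogate code points
emoji = "\U0001F600"     # one scalar value, U+1F600 (grinning face)

# Distinct as code points: different lengths, unequal strings
assert len(pair) == 2 and len(emoji) == 1 and pair != emoji

# But UTF-16 encodes the emoji *as* that surrogate pair...
words = emoji.encode("utf-16-be")
assert words == b"\xd8\x3d\xde\x00"

# ...and encoding the pair itself (surrogatepass lets it through) yields
# the very same bytes, so the two sequences collapse on the wire.
assert pair.encode("utf-16-be", "surrogatepass") == words
```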

csande17 commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
dcrazy · a day ago
Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type?”

Either way, I think the bitter lesson is a parser really can’t rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by same parser).

csande17 · a day ago
IMO if you care about surrogate code points being invalid, you're in "designing the system around UTF-16" territory conceptually -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 have the same underlying model for what a string is.
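
For what it's worth, Python's `surrogatepass` error handler produces the same byte sequences WTF-8 uses for lone surrogates, so that shared model is easy to poke at. A sketch, not a full WTF-8 implementation (real WTF-8 additionally forbids encoding *paired* surrogates separately):

```python
lone = "\ud83d"                        # unpaired high surrogate

# WTF-8 encodes a lone surrogate in the "generalized UTF-8" 3-byte form
wtf8 = lone.encode("utf-8", "surrogatepass")
assert wtf8 == b"\xed\xa0\xbd"

# ...and the round trip restores the ill-formed-UTF-16 string exactly
assert wtf8.decode("utf-8", "surrogatepass") == lone
```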
csande17 commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
integralid · a day ago
I'm not certain... On one hand I agree that some characters are problematic (or invalid) - like unpaired surrogates. But the worst case scenario is imo when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped.

In the example, username validation is the job of another layer. For example, I want to make sure the username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail at a completely different layer, before validation even runs.

And for usernames, some classes are obviously bad, as explained. But what if I send text files that actually use those weird tabs? I expect things that work in my language's UTF-8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.
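
That split between layers is easy to demonstrate: the JSON layer round-trips an escaped NUL without complaint, and rejecting it stays an application decision. (`validate_username` below is hypothetical, just to mark the layer boundary.)

```python
import json

# The parsing layer: \u0000 is legal JSON and survives the round trip
doc = json.loads('{"username": "ab\\u0000cd"}')
assert doc["username"] == "ab\x00cd"

# The validation layer: app-specific rules live here, not in the parser
def validate_username(name):
    return 0 < len(name) < 60 and "\x00" not in name

assert not validate_username(doc["username"])   # rejected by *this* layer
assert validate_username("alice")
```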

On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone creating their own mini standard. So I think I like the idea, just don't buy the argumentation or examples in the blog post.

csande17 · a day ago
Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are:

- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"

- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were
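
The first two models can be sketched as predicates over code-point sequences (the third accepts arbitrary bytes, so there's nothing to check); the function names here are mine, mirroring the list above:

```python
SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF

def is_unicode_scalars(codepoints):
    # Model 1: every code point in range, and no surrogates at all
    return all(cp <= 0x10FFFF and not (SURROGATE_LO <= cp <= SURROGATE_HI)
               for cp in codepoints)

def is_potentially_ill_formed_utf16(codepoints):
    # Model 2: lone surrogates are fine; only the range limit applies
    return all(cp <= 0x10FFFF for cp in codepoints)

assert is_unicode_scalars([0x1F600])                # the emoji scalar
assert not is_unicode_scalars([0xD83D, 0xDE00])    # its surrogate pair
assert is_potentially_ill_formed_utf16([0xD83D])   # WTF-8 territory
```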

csande17 commented on RFC 9839 and Bad Unicode   tbray.org/ongoing/When/20... · Posted by u/Bogdanp
Waterluvian · a day ago
I’m frustrated by things like Unicode where it’s “good” except… you need to know to exclude some of them. Unicode feels like a wild jungle of complexity. An understandable consequence of trying to formalize so many ways to write language. But it really sucks to have to reason about some characters being special compared to others.

The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.
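
That caution is well placed even for plain equality: Unicode has multiple code-point spellings of the same visible text, so data equality and semantic equality genuinely diverge. A small illustration:

```python
import unicodedata

precomposed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"      # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Both render as "é", but data comparison says they differ
assert precomposed != combining

# Only after NFC normalization do they compare equal "semantically"
assert unicodedata.normalize("NFC", combining) == precomposed
```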

csande17 · a day ago
Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the Explicit Directional Overrides characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns".
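
As a sketch of what guarding against those would look like, here is a hypothetical check that flags the explicit directional formatting characters UAX #9 warns about:

```python
# The LRE/RLE/PDF/LRO/RLO controls plus the directional isolates
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def has_bidi_controls(text):
    # Hypothetical validator: flag any explicit directional override char
    return any(ch in BIDI_CONTROLS for ch in text)

# RLO (U+202E) makes the tail display reversed: "exe.jpg" vs "gpj.exe"
assert has_bidi_controls("user\u202egpj.exe")
assert not has_bidi_controls("user.txt")
```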
csande17 commented on Hyprland – An independent, dynamic tiling Wayland compositor   hypr.land/... · Posted by u/AbuAssar
bcye · 14 days ago
TIL 37signals officially (?) develops a distro
csande17 · 14 days ago
Two distros, actually: https://omakub.org/
csande17 commented on Try and   ygdp.yale.edu/phenomena/t... · Posted by u/treetalker
mikepurvis · 14 days ago
I disagree. GP is laying out reasonable scenarios that are a few dropped/implied words away from the otherwise incoherent ones. For my part, this one is very grating to my ears:

"Try and tell the truth"

Since it clearly should be "try to tell the truth"

However this one, while similar in construction, doesn't actually sound nearly as bad:

"Try and finish the assignment"

It can be fixed the same way ("try to finish"), but it also accepts GP's form, which would be "try (to work hard) and (see if you can) finish the assignment". As I say, for whatever reason this second example sounds much more reasonable to me -- I think at least in part because my brain is much more accepting of a word that feels dropped than one that's misused.

csande17 · 14 days ago
There's a more standard, general rule in English grammar that web searches tell me is called "delayed right constituent coordination". It lets you read sentences like "He washed and dried the clothes" as "He washed [the clothes] and dried the clothes." The same object gets applied to both verbs.

I suspect that's what you're applying to these sentences. "Try and finish the assignment" makes some sense under this rule if you read it as "Try [the assignment,] and finish the assignment" -- an "assignment" is a thing that makes sense to "try". ("He tried [sushi,] and liked sushi" works for the same reason.) But "Try [the truth,] and tell the truth" doesn't work -- it doesn't make sense to interpret "trying" the truth as some separate action you're taking before you "tell" it.

So probably you just don't have the article's special try-and "pseudo-coordination" rule in your dialect.

csande17 commented on Try and   ygdp.yale.edu/phenomena/t... · Posted by u/treetalker
SoftTalker · 14 days ago
What about "try ya" as in "try ya some of that there salad." Something my mother-in-law would say.
csande17 · 14 days ago
That sounds like a "personal dative" to me: https://ygdp.yale.edu/phenomena/personal-datives
