Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.
Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.
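Concretely, here's a rough sketch of what I mean, in Python (assuming CPython 3, where str is a sequence of code points, so lone surrogates can even be constructed; the byte values just follow the standard UTF-8 layout):

```python
# Two surrogate code points vs. the single astral code point they would "pair" into.
pair = "\ud83d\ude00"     # [U+D83D, U+DE00] -- two code points
scalar = "\U0001F600"     # [U+1F600] -- one scalar value

print(len(pair), len(scalar))   # 2 1
print(pair == scalar)           # False: distinct sequences

# The strict UTF-8 codec refuses the surrogates...
try:
    pair.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)           # surrogates not allowed

# ...but the underlying code-point -> bytes scheme handles them fine:
print(pair.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\xbd\xed\xb8\x80'
print(scalar.encode("utf-8"))                  # b'\xf0\x9f\x98\x80'
```

So the encoding mechanism itself can represent both sequences; it's the "scalar values only" rule that makes the first one illegal.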
Either way, I think the bitter lesson is that a parser really can't rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by that same parser).
In the example, username validation is the job of another layer. For example, I want to make sure the username is shorter than 60 characters, has no emojis or zalgo text, and, yes, no null bytes, and to return a proper error from the API. I don't want my JSON parsing to fail at a completely different layer, before validation even runs.
And for usernames some classes are obviously bad, as explained. But what if I send text files that actually use those weird tabs? I expect anything that works in my language's UTF-8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.
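For what it's worth, stock parsers agree: "\u0000" is legal JSON and parses fine, so rejecting it has to be an extra policy layer anyway. A rough Python sketch of the layering I have in mind (the limits and error messages are hypothetical, just to show the separation; the emoji/zalgo checks would live in the same place):

```python
import json

# Parsing succeeds: "\u0000" is legal JSON, and the parser doesn't care.
payload = json.loads('{"username": "alice\\u0000"}')
print(repr(payload["username"]))    # 'alice\x00'

# Policy lives in its own layer and returns a proper API error.
def validate_username(name: str) -> list[str]:
    errors = []
    if len(name) >= 60:
        errors.append("username must be shorter than 60 characters")
    if any(ord(c) < 0x20 or ord(c) == 0x7F for c in name):
        errors.append("username must not contain control characters")
    return errors

print(validate_username(payload["username"]))
# ['username must not contain control characters']
```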
On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful: better than everyone creating their own mini-standard. So I think I like the idea; I just don't buy the argumentation or examples in the blog post.
- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"
- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"
- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"
- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were
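If it helps, the buckets are easy to state as code. A rough Python sketch of checks for the first, third, and fourth buckets (the second accepts anything a JS string can hold, so there's nothing to check from this side; the helper names are mine):

```python
def is_unicode_scalars(s: str) -> bool:
    # Bucket 1: every code point is a scalar value, i.e. no surrogates.
    return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)

def is_wellformed_utf8(data: bytes) -> bool:
    # Bucket 3 accepts any bytes at all; this tells you whether a given
    # byte string also happens to decode cleanly to bucket 1.
    # (Strict UTF-8 rejects surrogate encodings and malformed sequences.)
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def has_no_nul(s: str) -> bool:
    # The extra restriction from the last bullet.
    return "\x00" not in s
```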
The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.
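Even "equal" there only means code-point-for-code-point; anything more semantic drags in normalization. A small Python illustration of what I mean (NFC vs NFD is just the classic example):

```python
import unicodedata

# Two different code point sequences that both render as "é".
nfc = "\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"      # 'e' + U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                                  # False: data equality only
print(unicodedata.normalize("NFC", nfd) == nfc)    # True, once you normalize
print(len(nfc), len(nfd))                          # 1 2 -- even "length" is fuzzy
```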
"Try and tell the truth"
That one sounds wrong to me, since it clearly should be "try to tell the truth".
However this one, while similar in construction, doesn't actually sound nearly as bad:
"Try and finish the assignment"
It can be fixed the same way ("try to finish"), but it also accepts the GP's form, which would be "try (to work hard) and (see if you can) finish the assignment". As I say, for whatever reason this second example sounds much more reasonable to me; I think that's at least in part because my brain is much more accepting of a word that feels dropped than one that's misused.
I suspect that's what you're applying to these sentences. "Try and finish the assignment" makes some sense under this rule if you read it as "Try [the assignment,] and finish the assignment" -- an "assignment" is a thing that makes sense to "try". ("He tried [sushi,] and liked sushi" works for the same reason.) But "Try [the truth,] and tell the truth" doesn't work -- it doesn't make sense to interpret "trying" the truth as some separate action you're taking before you "tell" it.
So probably you just don't have the article's special try-and "pseudo-coordination" rule in your dialect.
Using those code points makes for invalid Unicode, not just invalid UTF-16. Rust, which uses UTF-8 for its String type, also forbids unpaired surrogates: `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.
It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.
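Concretely, the whole code space tops out at U+10FFFF only because that's as far as a surrogate pair can reach. A quick Python check (assuming CPython's chr bounds and error message):

```python
# A surrogate pair reaches at most 0x10000 + (0x3FF << 10 | 0x3FF) = 0x10FFFF.
high, low = 0xDBFF, 0xDFFF
print(hex(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)))   # 0x10ffff

print(ascii(chr(0x10FFFF)))    # '\U0010ffff' -- the last code point Unicode allows
try:
    chr(0x110000)              # UTF-8's byte scheme could go higher, but UTF-16 can't
except ValueError as exc:
    print(exc)                 # chr() arg not in range(0x110000)
```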
edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146