account42 (u/account42)

account42 commented on The attr() function in CSS now supports types amitmerchant.com/attr-fun... · Posted by u/speckx

account42 · 2 days ago

I can't wait for someone to make a CSS style sheet that lets you specify all properties through HTML attributes as if CSS was never invented and we were still using good old <font size=2> just with all the styling additions since then.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

eru · 2 days ago

You could use a standard that always uses eg 4 bytes per character, that is much easier to parse than UTF-8.

UTF-8 is so complicated, because it wants to be backwards compatible with ASCII.

account42 · 2 days ago

ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also

- requires less memory for most strings, particular ones that are largely limited to ASCII like structured text-based formats often are.

- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.

- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

baq · 2 days ago

when I'm reading text on a screen, I very much am not reading bytes. this is obvious when you actually think what 'text encoding' means.

account42 · 2 days ago

You're not reading unicode code points either though. Your computer uses bytes, you read glyphs which roughly correspond to unicode extended grapheme clusters - anything between might look like the correct solution at first but is the wrong abstraction for almost everything.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

josephg · 2 days ago

It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.

Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.

account42 · 2 days ago

> Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.

You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.

This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

simonask · 2 days ago

This is American imperialism at its worst. I'm serious.

Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.

Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?

It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.

account42 · 2 days ago

Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change that most libraries will use English and and most documentation will be in English and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also due to the dominance of English and ASCII in computing history, most languages already have ASCII-alternatives for their writing so even if you need to refer to non-English names you can do that using only ASCII.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

bluecalm · 2 days ago

What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?

account42 · 2 days ago

With UTF-8 you can implement them on top of bytes.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

thrdbndndn · 2 days ago

I see where you're coming from, but I disagree on some specifics, especially regarding bytes.

Most people care about the length of a string in terms of the number of characters.

Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), and if you're dealing with anything beyond ASCII (which you really should, since East Asian users alone number in the billions).

Same goes to the "string width".

Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.

account42 · 2 days ago

It's not rare at all - multi-code point emojis are pretty standard these days.

And before that the only thing the relative rarity did for you was that bugs with code working on UTF-8 bytes got fixed while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

xg15 · 2 days ago

It gets more complicated if you do substring operations.

If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.

account42 · 2 days ago

> s.charAt(x) or s.codePointAt(x)

Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

afiori · 2 days ago

I would like an utf-8 optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether is it valid utf-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate) then utf-8 then utf-8 encoding/decoding becomes a noop and utf-8 specific apis can check quickly is the string is malformed or not.

account42 · 2 days ago

But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string. Keep in mind that UTF-8 is self-synchronizing so even if you encode strings into a larger text-based format without verifying them it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes many programs can support arbitrary encodings or at least arbitrary ASCII-supersets without any additional effort.

account42 commented on It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019) hsivonen.fi/string-length... · Posted by u/program

xigoi · 2 days ago

I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

account42 · 2 days ago

Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.