It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019)

I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:

- Number of bytes this will be stored as in the DB

- Number of monospaced font character blocks this string will take up on the screen

- Number of bytes that are actually being stored in memory

"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.

arcticbull · 6 months ago

Taking this one step further -- there's no such thing as the context-free length of a string.

Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.

Refining your list, the things you usually want are:

- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).

- Number of code points when parsing.

- Number of grapheme clusters for advancing the cursor back and forth when editing.

- Bounding box in pixels or points for display with a given font.

Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.

It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?

ramses0 · 6 months ago

"Unicode is JPG for ASCII" is an incredibly great metaphor.

size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?

account42 · 6 months ago

> Number of code points when parsing.

You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.

baq · 6 months ago

ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is all the tutorials, especially low level ones, assume ASCII so 1) you can print something to the console and 2) to avoid mentioning that strings are hard so folks don’t get discouraged.

Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.

craftkiller · 6 months ago

> Notably Rust did the correct thing

In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:

  String.len() == number of bytes
  String.bytes().count() == number of bytes
  String.chars().count() == number of unicode scalar values
  String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
  String.lines().count() == number of lines

Really my only complaint is I don't think String.len() should exist, it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.

account42 · 6 months ago

> in the global international connected computing world it doesn’t fit at all.

I disagree. Not all text is human prose. For example, there is nothing wrong with an programming language that only allows ASCII in the source code and many downsides to allowing non-ASCII characters outside string constants or comments.

flohofwoe · 6 months ago

ASCII is totally fine as encoding for the lower 127 UNICODE code points. If you need to go above those 127 code points, use a different encoding like UTF-8.

Just never ever use Extended ASCII (8-bits with codepages).

bigstrat2003 · 6 months ago

> in the global international connected computing world it doesn’t fit at all.

Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.

Deleted Comment

eru · 6 months ago

Python 3 deals with this reasonable sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.

xelxebar · 6 months ago

> Number of monospaced font character blocks this string will take up on the screen

Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.

But I agree with the general sentiment. What we really about how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.

xg15 · 6 months ago

It gets more complicated if you do substring operations.

If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.

arcticbull · 6 months ago

Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.

If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.

account42 · 6 months ago

> s.charAt(x) or s.codePointAt(x)

Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.

mseepgood · 6 months ago

The values for x and y should't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.

jlarocco · 6 months ago

It's definitely worth thinking about the real problem, but I wouldn't say it's never helpful.

The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.

perching_aix · 6 months ago

> Number of monospaced font character blocks this string will take up on the screen

To predict the pixel width of a given text, right?

One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.

I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.

oefrha · 6 months ago

(Some but not all) terminal emulators are capable of rendering CJK perfectly aligned with Latin even when mixing fonts. Browsers are fundamentally incapable of that because aligning characters in different fonts wasn’t a goal at all. VS Code being a webview under the hood means it inherited this fundamental incapability.* Therefore, don’t hold your breath.

* I'm talking about the DOM route, not <canvas> obviously. VS Code is powere by Monaco, which is DOM-based, not canvas-based. You can "Developer: Toggle Developer Tools" to see the DOM structure under the hood.

** I should further qualify my statement as browsers are fundamentally incapable of this if you use native text node rendering. I have built a perfectly monospace mixed CJK and Latin interface myself by wrapping each full width character in a separate span. Not exactly a performance-oriented solution. Also IIRC Safari doesn’t handle lengths in fractional pixels very well.

guappa · 6 months ago

What if you need to find 5 letter words to play wordle? Why do you care how many bytes they occupy or how large they are on screen?

xigoi · 6 months ago

In the case of Wordle, you know the exact set of letters you’re going to be using, which easily determines how to compute length.

taneq · 6 months ago

If you're playing at this level, you need to define:

- letter

- word

- 5 :P

BobbyTables2 · 6 months ago

Very true. Rust’s handling of strings was an eye opener for me.

Seemed awkward but I eventually realized I rarely cared about number of characters. Even when dealing with substrings, I really only cared a means to describe “stuff” before/after not literal indices.

Counting Unicode characters is actually a disservice.

Semaphor · 6 months ago

FWIW, I frequently want the string length. Not for anything complicated, but our authors have ranges of characters they are supposed to stay in. Luckily no one uses emojis or weird unicode symbols, so in practice there’s no problem getting the right number by simply ignoring all the complexities.

tomsmeding · 6 months ago

It's not unlikely that what you would ideally use here is the number of grapheme clusters. What is the length of "ë"? Either 1 or 2 codepoints depending on the encoding (combining [1] or single codepoint [2]), and either 1 byte (Latin-1), 2 bytes (UTF-8 single-codepoint) or 3 bytes (UTF-8 combining).

The metrics you care about are likely number of letters from a human perspective (1) or the number of bytes of storage (depends), possibly both.

[1]: https://tomsmeding.com/unicode#U+65%20U+308 [2]: https://tomsmeding.com/unicode#U+EB

TZubiri · 6 months ago

How about for iterating every character in a string in order to find a specific character combination? I need (or the iterator needs) to know the length of the string and what the boundaries of each characters are.

bluecalm · 6 months ago

What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?

account42 · 6 months ago

With UTF-8 you can implement them on top of bytes.

capitainenemo · 6 months ago

FWIW, the cheap lazy way to get "number of bytes in DB" from JS, is unescape(encodeURIComponent("ə̀")).length

zwnow · 6 months ago

I actually want string length. Just give me the length of a word. My human brain wants a human way to think about problems. While programming I never think about bytes.

dwb · 6 months ago

The whole point is that string length doesn’t necessarily give you the “length” of a “word”, and both of those terms are not well enough defined.

jibal · 6 months ago

The point is that those terms are ambiguous ... and if you mean the length in grapheme clusters, it can be quite expensive to calculate it, and isn't the right number if you're dealing with strings as objects that are chunks of memory.

int_19h · 6 months ago

Humans speak many different languages. Not all of them are English, and not all of them have writing systems which make it meaningful to talk about "string length" without disambiguating further.

bigstrat2003 · 6 months ago

I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.

zahlman · 6 months ago

> I have, on the other hand, always wanted the string length.

In an environment that supports advanced Unicode features, what exactly do you do with the string length?

wredcoll · 6 months ago

Which length? Bytes? Code points? Graphemes? Pixels?

justsomehnguy · 6 months ago

Guessing from the other comments you missed the byte length for the codepoints.

When I'm comparing the human-readable strings I want the letgth. In all other cases I want sizeof(string) and it's... quite a variable thing.

thrdbndndn · 6 months ago

I see where you're coming from, but I disagree on some specifics, especially regarding bytes.

Most people care about the length of a string in terms of the number of characters.

Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), and if you're dealing with anything beyond ASCII (which you really should, since East Asian users alone number in the billions).

Same goes to the "string width".

Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.

account42 · 6 months ago

It's not rare at all - multi-code point emojis are pretty standard these days.

And before that the only thing the relative rarity did for you was that bugs with code working on UTF-8 bytes got fixed while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.

sigmoid10 · 6 months ago

I have wanted string length many times in production systems for language processing. And it is perfectly fine as long as whatever you are using is consistent. I rarely care how many bytes an emoji actually is unless I'm worried about extreme efficiency in storage or how many monospace characters it uses unless I do very specific UI things. This blog is more of a cautionary tale what can happen if you unconsciously mix standards e.g. by using one in the backend and another in the frontend. But this is not a problem of string lengths per se, they are just one instance where modern implementations are all over the place.

There's an awful lot of text in here but I'm not seeing a coherent argument that Python's approach is the worst, despite the author's assertion. It especially makes no sense to me that counting the characters the implementation actually uses should be worse than counting UTF-16 code units, for an implementation that doesn't use surrogate pairs (and in fact only uses those code units to store out-of-band data via the "surrogateescape" error handler, or explicitly requested characters. N.B.: Lone surrogates are still valid characters, even though a sequence containing them is not a valid string.) JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.

deathanatos · 6 months ago

> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.

Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.

You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might like "Oh, perhaps that takes more than one byte, is that it's size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.

zahlman · 6 months ago

> or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem.

The unit is perfectly meaningful.

It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)

Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.

The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.

I don't understand what you mean by "USV count".

> but what is a character?

It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.

> …but "5" or "7"? Where do those even come from?

From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.

> Again: "character in the implementation" is a meaningless concept.

"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.

perching_aix · 6 months ago

As the other comment says, Python considers strings to be a sequence of codepoints, hence the length of a string will be the number of codepoints in that string.

I just relied on this fact yesterday, so it's kind of a funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was what Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was to iterate through all the codepoints in the file, tally it all up by Unicode block, and print the results. Something this design was perfectly suited for.

Now of course:

- it coming in handy once for my specific random workload doesn't mean it's good design

- my specific workload may not be rational (am a dingus sometimes)

- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome

- I am well and fully aware that iterating through data a few bytes at a time is abject terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.

But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting scalars only is what would be weird in my view, you're "randomly" doing skips over the data potentially.