Readit News logoReadit News
unshavedyak · a year ago
On the note of small strings, Compact String[1] was i believed released after this article and has a nifty trick. Where Smol and Smart can fit 22 and 23 bytes, CompactStr can fit 24! Which is kinda nutty imo, that's the full size of the normal String on the stack.. but packed with actual string data.

There's a nice explanation on their readme[2]. Love tricks like this.

[1]: https://github.com/ParkMyCar/compact_str

[2]: https://github.com/ParkMyCar/compact_str?tab=readme-ov-file#...

parkmycar · a year ago
Hey I'm the author of compact_str, thanks for the kind words!

Fun fact, it was this fasterthanlime post that originally inspired me to play around with small strings and led to the creation of compact_str.

nextaccountic · a year ago
Do you think it is possible to integrate this with an existing string interner in the Rust ecosystem? Or one would need to roll their own?

The goal would be to compare two strings using only the 24-byte values, without touching the allocated part (if it points into a string larger than 24 bytes)

The first thing that makes me think this is impossible is that you only define an owned variant for your string type. But your can cleverly wrap a &'static str rather than allocate on heap, so if the interner can give a &'static str I think this could work, with some boilerplate. Namely: for strings smaller than 24 bytes don't intern and build a CompactString inline, for larger strings, intern, get a &'static str, then build a CompactString out of it. Then, two check if two strings are equal, you just compare its 24 bytes, and don't need to touch the interner for this at all.

However this only works if the interner actually leaks memory. If it returns you a &'a str that isn't 'static, then you can't store it on CompactString, unless your lib also provided a borrowed variant.

Also, to think about it, since interned strings are immutable (they are not a "string builder" like String is), you don't really need the capacity value for them, just the length. So it suggests a 16 bytes size, not 24. (but one could imagine an application where it's optimal to store 24 bytes inline rather than 16 anyway)

I think that this could be achieved with an 16 bytes &str wrapper, maybe something like &CompactStr, that works just like your CompactString, but wraps a &str instead (and offers no owned variant). Maybe such a thing could conceivably be included in the compact_string crate?

(Maybe make it 24 bytes even if it "wastes" some bytes, just to make it bitwise compatible with CompactString, so that borrowing CompactString into &CompactStr is zero cost - and then just zero out the remaining 8 bytes when a &CompactStr is stored on the heap)

[0] I was reading this post https://dev.to/cad97/string-interners-in-rust-797 that was written in response to this fasterthanlime post, but it contrasts interner with small string optimization, when you actually could have both!

tomcam · a year ago
I've been wondering something for ages. Where did you get the 24 byte number, and how does it compare in Unicode terms? That is, did you analyze a large corpus and determine that 24 bytes was right for the largest number of strings? And does it come out to, say, 10 Unicode characters? Whenever I think about designing a new language, this very issue pops up.
hvenev · a year ago
Is 26 characters next? I think the number of bytestrings of length <= 26 that are also valid UTF-8 is only 0.021 * 256^24.
conaclos · a year ago
compact_str is a fantastic crate, used by many projects. Do you know the byteyarn crate [0]? This could be nice to add this to the `Similar Crates` section if it makes sense.

[0] https://docs.rs/byteyarn/latest/byteyarn/

fasterthanlime · a year ago
That’s amazing! I’m so glad you did :)
dataflow · a year ago
Note for C++ developers: Their trick is only possible because the strings are UTF-8 and not null-terminated. It wouldn't work as a drop-in for standard strings in C++.
neonsunset · a year ago
Null-terminated strings really are a mistake. Make vectorized algorithms problematic by forcing them to account for page size and paged memory in general as well as always scan for NUL, cannot be easily sliced without re-allocation, are opposite to how languages with better string primitives define them and in general don't save much by passing a single pointer over ptr + length.
kloop · a year ago
std::string isn't null terminated (or at least it isn't guaranteed to be, I don't think it's forbidden for an implementation to do that).

That's why the c_str method exists, so you can get a pointer to a null terminated character array

mananaysiempre · a year ago
More of a knock against C than C++, seeing as today’s C++ tends to prefer `string_view`s or at worst iterator pairs, neither of which use null-termination. (Saying that as someone who prefers C to C++ and avoids Rust exactly because of “better C++” vibes—I don’t want a better version of C++, if anything I want a much much simpler one that’s worse at some things.)

That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero. For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.

To be clear: none of this makes null termination not dumb.

silisili · a year ago
This is interesting, thanks.

> 0b11111110 - All 1s with a trailing 0, indicates heap allocated

> 0b11XXXXXX - Two leading 1s, indicates inline, with the trailing 6 bits used to store the length

I stared at this for too long, as it allows collision. Then I realized you'd never set the third bit, it should probably have been written 0b110XXXXX and recorded that 5 bits are used for length. Right or did I understand it wrong?

tialaramex · a year ago
Probably this isn't helpful anyway - what's actually going on is more complicated and is explained later at a high level or I'll try now:

Rust has "niches" - bit patterns which are never used by that type and thus can be occupied by something else in a sum type (Rust's enum) which adds to that type. But stable Rust doesn't allow third parties to promise arbitrary niches exist for a type they made.

However, if you make a simple enumeration of N possibilities that automatically has a niche of all the M-N bit patterns which weren't needed by your enumeration in the M value machine integer that was chosen to store this enumerated type (M will typically be 256 or 65536 depending on how many things you enumerated)

So, CompactString has a custom enum type LastUtf8Char which it uses for the last byte in its data structure - this has values V0 through V191 corresponding to the 192 possible last bytes of a UTF-8 string. That leaves 64 bit patterns unused. Then L0 through L23 represent lengths - inline strings of length 0 to 23 inclusive which didn't need this last byte (if it was 24 then that's V0 through V191). Now we've got 40 bit patterns left.

Then one bit pattern (the pattern equivalent to the unsigned integer 216) signifies that this string data lives on the heap, the rest should be interpreted accordingly, and another (217) signifies that it's a weird static allocation (I do not know why you'd do this)

That leaves 38 bit patterns unused when the type is finished using any it wanted so there's still a niche for Option<CompactString> or MyCustomType<CompactString>

shikon7 · a year ago
Maybe you need it if pointers are 16 bytes?
mastax · a year ago
It’s nice that rustc’s niche optimization lets smolstr be implemented with a simple enum, rather than having to do some unsafe union bit packing[0]. The only concession that had to be made to the compiler is using an enum for the InlineSize value to show that the last 3 bits of that aren’t used.

[0]: https://github.com/rust-analyzer/smol_str/blob/fde86a5c0cb8f...

conaclos · a year ago
Readers who like this article may also like a more recent one [0]. It designs a compact string with extra capabilities. The crate was released under the name byteyarn [1]

[0] https://mcyoung.xyz/2023/08/09/yarns/

[1] https://docs.rs/byteyarn/latest/byteyarn/

masklinn · a year ago
[2020]. I wouldn't be surprised if the field had changed a fair bit since.
brigadier132 · a year ago
Ok, I dived into the rabbit hole a bit and found this:

https://github.com/rosetta-rs/string-rosetta-rs?tab=readme-o...

Click the charts for actually readable data.

From a quick scan, it seems like hipstr is the current fastest string implementation and is actively maintained.

epage · a year ago
Also, smol_str was removed from the comparison because matklad, the author of smol_str, suggests ecow

> I’d rather say the opposite: users of those crates should switch to ecow. It is exactly what smol_str would have been, if it were a proper crate with a stable API, rather than an implementation detail of rust-analyzer. > > It’s a drop-in replacement for String, with O(1) clone and SSO, and I believe this is all you need. Other crates either have needlessly restricted API (no mutation), questionable implementation choices, or a bunch of ad hoc traits in the API.

https://www.reddit.com/r/rust/comments/117ksvr/ecow_compact_...

skavi · a year ago
Those benchmarks aren’t great.

A real world-ish workload made generic over all these types would provide much more interesting results. Especially if heap usage was also tracked.

IshKebab · a year ago
And has the best name!
abhorrence · a year ago
Sadly it seems like some of the images have broken since it was originally posted. :(
01HNNWZ0MV43FF · a year ago
This capture looks a little better. Even the CSS looks off on the live site. Maybe someone should ping him? I think he's on Mastodon https://web.archive.org/web/20201002084042/https://fastertha...
cute_boi · a year ago
Actually previous design was better to my eyes.
pronoiac · a year ago
I emailed him.
fasterthanlime · a year ago
Thanks for the heads up!

I fixed the SVG graphs and made a couple layout updates based on the feedback here and some earlier feedback from my subreddit, see the before-after here.[1]

I've been mostly focused on function lately, the redesign is, let's say, a work in progress.

(Oh also, I use they/them pronouns these days [2])

[1] https://imgur.com/a/before-after-2024-08-24-PAmeHFX [2] https://fasterthanli.me/articles/state-of-the-fasterthanlime...

weinzierl · a year ago
Here [1] is a nice talk that discusses various options and trade-offs for small string and small vector optimization in Rust.

[1] https://m.youtube.com/watch?time_continue=2658&v=tLX_nvWD738...

dathinab · a year ago
on interesting thing to realize is that some small string types go beyond just the basic small len storage optimization

- compact_str e.g. depends on the string being valid utf-8, and in turn has larger short strings

- smol_str e.g. is a a enum over `[u8; CAP] | &'static str | Arc<str>` this means it avoids any allocations for static strings and has very fast clones, leading to similar perf. characteristics as string internalization in some use cases (like the use-cases it was designed for). But at the cost of it being immutable only and the heap allocation being slightly larger for the Rc.

Other interesting differences can be the handling of shrinking mutable Strings, do you re-inline it or not? What is better here is highly use-case dependent.

In the end there are many design decisions where there is no clear winner but it's a question of trade off with use-case specific preferences.

codedokode · a year ago
I wonder, is there a "string-like interface" in Rust or one has to rewrite all the code when changing string implementation? Also if you want to change implementation in half of the code, is there automatic convertion between implementations?
duped · a year ago
This is what the `AsRef` trait is great for

    impl AsRef<str> for MyCoolString {
        fn as_ref(&self) -> &str {
            ...
        }
    }
To be totally defensive, other library authors can write their APIs like

    fn my_cool_function(s: impl AsRef<str>) {
        let s: &str = s.as_ref();
    }
and now `s` is treated like any other &str. But even if library authors don't program defensively and use `s: &str` in the argument type, callers can always do this:

    my_cool_function(my_cool_string.as_ref());

dubi_steinkek · a year ago
It's possible but I would prefer to just pass &str at that point, leading to less monomorphizations and a simpler API
nextaccountic · a year ago
For read-only strings you pass around &str usually if you can borrow at all, and all UTF-8 string types should be compatible with it. AsRef is just a trait for "all types that can be converted into &str as needed" so if you use AsRef you need to have immutable strings too

But the thing about String is that it allows you to push data at the end (like Java's StringBuilder). I think there is no commonly used trait for that, no.

Note that not all owned string types can be appended at the end. Box<str> for example is just like String but without a .push_str() method. (This saves some bytes because it doesn't need a spare capacity)

If your string is not UTF-8 then usually you convert it at API boundaries, to deal with UTF-8 stuff internally (or at least WTF-8)