That documentation talks about all the benefits and says it "can mostly be used as a drop in replacement for String", but what are the tradeoffs? When can it not be used?
It looks like there's a bunch of "garbage collection"-type activity where you may have some bytes which were once part of a string but are no longer used, and you're always paying the overhead of this optimisation even if it's useless for your problem.
Suppose you work only with 500-4000 byte strings; maybe they're short reviews, and each ends with a rating in star emoji: ***** is the best, * is the worst. [[HN ate my star emoji of course]]
So your reviews never fit in the "optimised" inline slot, but also the prefix is just the opening words of a review, which in some review styles will be the start of a seemingly unrelated anecdote. "My grandfather used to tell me" will get to the review eventually, and you'll see why they're connected, but the part that's actually useful is the suffix, and that isn't stored inline in a "German string" data structure.
Or maybe you have a high turnover of somewhat related medium-size strings, in which case that garbage-collection step costs quite a lot of overhead.
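For context, here is a rough sketch of the 16-byte "German string" view layout being discussed; the field names are mine, not from any particular implementation:

    // Rough 16-byte "German string" view (illustrative, not any library's real type).
    #[repr(C)]
    struct GermanStr {
        len: u32,         // length in bytes
        prefix: [u8; 4],  // first 4 bytes of the string, kept inline
        rest: u64,        // strings of <= 12 bytes: remaining bytes inline;
                          // longer strings: pointer/offset to the full data
    }

A 500-4000 byte review never fits the inline case, and the rating emoji sits at the very end, so reading it always chases the pointer into the data buffer; the inline prefix buys nothing for that access pattern.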
Any code built closely around String's power-of-2 reallocation pattern may have to be reworked. I don't think there's any case when it cannot be used as a String replacement at all, except maybe when interfacing with an API that expects a &mut String as an output parameter.
My uninformed guess is that at the very least it will cost you a branch, because you need to check whether the string is inlined or not, and you pay that for every string access. Branch prediction is likely very good for this case, though.
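Roughly the branch in question; this is a made-up small-string enum just for illustration, not how CompactString actually packs its bits:

    // Made-up small-string type; real implementations pack the discriminant
    // into spare bits rather than paying for a whole enum tag.
    enum SmallStr {
        Inline { len: u8, buf: [u8; 22] },
        Heap(String),
    }

    impl SmallStr {
        fn as_str(&self) -> &str {
            // Every access pays this check; in a hot loop over mostly one
            // variant, branch prediction should make it close to free.
            match self {
                SmallStr::Inline { len, buf } => {
                    std::str::from_utf8(&buf[..*len as usize]).unwrap()
                }
                SmallStr::Heap(s) => s.as_str(),
            }
        }
    }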
But yeah, it's pretty ignorant to assume Rust can't do this since the best available examples (as with many things) are in Rust. CompactString is really nice. On a typical modern (64-bit) computer CompactString takes 24 bytes and holds up to 24 bytes of UTF-8 text inline, while also having a niche.
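Those numbers are easy to check, assuming a 64-bit target and the compact_str crate added as a dependency:

    use compact_str::CompactString;
    use std::mem::size_of;

    fn main() {
        // Same size as String on a 64-bit target (pointer + length + capacity).
        assert_eq!(size_of::<CompactString>(), 24);
        // The niche means Option<CompactString> costs no extra space.
        assert_eq!(size_of::<Option<CompactString>>(), 24);
        // Up to 24 bytes of UTF-8 stay inline, with no heap allocation.
        let s = CompactString::from("twenty-three bytes here");
        assert!(!s.is_heap_allocated());
    }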
I guess the confusion arises because C++ people tend to assume that anywhere Rust differs from the practice in the C++ community it's a mistake, even though that's often because C++ made the wrong choice? Rust's &str is "just" &[u8] plus a rule about the meaning of these bytes, and Rust's String is correspondingly "just" Vec<u8> plus a rule about the meaning of those bytes. C++ couldn't have done the former because it only belatedly got the fat pointer slice reference (as the troubled std::span) years after having a string data type.
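Concretely, std makes that relationship explicit; the only real work in these conversions is the UTF-8 validity check in one direction:

    fn main() {
        // &str -> &[u8] is free; it's the same fat pointer.
        let s: &str = "héllo";
        let bytes: &[u8] = s.as_bytes();

        // &[u8] -> &str just checks the UTF-8 rule, no copy.
        let back: &str = std::str::from_utf8(bytes).unwrap();
        assert_eq!(s, back);

        // String <-> Vec<u8> are likewise free apart from the UTF-8 check.
        let owned: String = String::from("héllo");
        let v: Vec<u8> = owned.into_bytes();
        let owned_again: String = String::from_utf8(v).unwrap();
        assert_eq!(owned_again, "héllo");
    }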
Rust didn't do this in the stdlib, not because it's impossible, but because it's a trade-off and they wanted the provided stdlib type to be straightforward. If you need, or even just want, the trade-off, you can just cargo add compact_str.
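And since CompactString derefs to str, code that only reads the string doesn't care which type it got. A tiny sketch, assuming the crate has been added:

    use compact_str::CompactString;

    // Functions taking &str work unchanged, whether callers hold a String
    // or a CompactString.
    fn shout(s: &str) -> String {
        s.to_uppercase()
    }

    fn main() {
        let a: String = "hello".to_string();
        let b: CompactString = CompactString::from("hello");
        assert_eq!(shout(&a), shout(&b)); // &CompactString derefs to &str
    }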
It is not. It is not implemented in std::string::String, but (as pointed out elsewhere in this thread) there are other string implementations that have it.
It was explicitly decided against for the standard library, because not every optimization is universally good, and keeping String as a thin wrapper over Vec<u8> is a good default.
> When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid.
To my mind this reads exactly like "if you're a security practitioner, worry about this bit here."
Don't you lose the in-memory interop with other libraries by doing this? I'm thinking that DuckDB will no longer be able to read polars data that has been loaded into memory, as it currently can, since DuckDB supports Arrow. Isn't the benefit of Arrow that it's supported by many languages and libraries as a standard?
Will there be an option to use the "compatible" string format?
From the article:
> As luck would have it, the Arrow spec was also finally making progress with adding the long anticipated German Style string types to the specification. Which, spoiler alert, is the type we implemented.
DuckDB has their own string type (quite similar to this) that deviates from Arrow (large)-string type, so it had to do a copy anyway. Nothing has changed on that front.
I wonder if something like Arrow's custom extension types mechanism would streamline building novel data representations without full forks? It might also highlight gaps in the extension mechanism.
For similar reasons, we've been curious about new compression modes on indexes.
> Having the 4-byte prefix directly accessible (without indirection through an offset into a separate data buffer) can substantially improve the performance of comparisons returning false. This prefix can be encoded with multi-column hash keys to accelerate aggregations, joins. Sorts would likely also be significantly faster with this representation (experiments would tell for certain)
> Certain algorithms (for example “prefix of string” or “suffix of string” — e.g. PREFIX(“foobar”, 3) -> “bar”) can execute by manipulating StringView values only and not requiring any memory copying of large strings.
This document was an early proposal for adding what is now called the StringView (and ByteView) types to the Arrow format itself.
The first n bytes are likely by far the most often accessed in practice, specifically for sorting & filtering, etc. Storing them inline is likely a huge optimization for little cost.
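A minimal sketch of that short-circuit, with made-up field names rather than Arrow's actual StringView layout (the real type also stores strings of up to 12 bytes fully inline):

    // Made-up view: length, 4-byte inline prefix, and an offset into a shared
    // data buffer. Both views are assumed to point into the same buffer here.
    struct View {
        len: u32,
        prefix: [u8; 4],
        offset: u32,
    }

    fn eq(a: &View, b: &View, data: &[u8]) -> bool {
        // Cheap rejections first: length, then the inline prefix bytes.
        if a.len != b.len || a.prefix != b.prefix {
            return false;
        }
        // Only now pay the indirection into the data buffer.
        let (ao, bo) = (a.offset as usize, b.offset as usize);
        data[ao..ao + a.len as usize] == data[bo..bo + b.len as usize]
    }

Most comparisons between unequal strings never reach the last line, which is where the cache misses live.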
> As I mentioned above already pre-allocating the required size of data is hard. This leads to many reallocations and memcopy’s during the building of this type.
Well. Reallocations have to happen mostly because the virtual memory space is flat, so you can't just grow your allocations without the possibility of accidentally bumping into some other object. But a non-flat virtual memory space is really inconvenient (Segment selectors! CHERI! And what about muh address arithmetic?) for other reasons, so here we are.
I toyed with the idea of a specialized memory allocator for incrementally growing, potentially very large buffers. It would space allocations apart by, say, 16 GiB, and there would be a "finalize" operation that hands the buffer's contents over to malloc: ask malloc to allocate the exact final buffer size (rounded up to the page size), and then, instead of memcpy-ing the data, persuade the OS to remap the physical pages of the existing allocation into the virtual address returned by malloc. The original buffer's virtual addresses would then become unmapped and could be reused.
Unfortunately, I couldn't quite persuade the OS to do that with the user-accessible memory-management APIs, so it all came to nothing. I believe there was similar research in the early 90s, and it failed because it too required custom OS modifications.
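For what it's worth, the "space allocations apart by 16 GiB" half works with stock mmap: reserve the address space up front and commit pages lazily, so the buffer grows in place with no memcpy. It's only the final hand-off to malloc via page remapping that needs OS cooperation. A rough Linux-only, 64-bit-only sketch of just the growable part, using the libc crate; everything here is illustrative, not the commenter's actual design:

    use std::ptr;

    const RESERVE: usize = 16 << 30; // 16 GiB of address space per buffer

    struct GrowBuf {
        base: *mut u8,
        len: usize,
    }

    impl GrowBuf {
        fn new() -> std::io::Result<Self> {
            // Reserve the range with PROT_NONE: no physical memory is committed yet.
            let base = unsafe {
                libc::mmap(
                    ptr::null_mut(),
                    RESERVE,
                    libc::PROT_NONE,
                    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
                    -1,
                    0,
                )
            };
            if base == libc::MAP_FAILED {
                return Err(std::io::Error::last_os_error());
            }
            Ok(GrowBuf { base: base as *mut u8, len: 0 })
        }

        fn append(&mut self, data: &[u8]) -> std::io::Result<()> {
            let new_len = self.len + data.len();
            assert!(new_len <= RESERVE);
            // Commit the used range read/write; the kernel backs it with
            // physical pages lazily, and the buffer never moves.
            let rc = unsafe {
                libc::mprotect(
                    self.base as *mut libc::c_void,
                    new_len,
                    libc::PROT_READ | libc::PROT_WRITE,
                )
            };
            if rc != 0 {
                return Err(std::io::Error::last_os_error());
            }
            unsafe { ptr::copy_nonoverlapping(data.as_ptr(), self.base.add(self.len), data.len()) };
            self.len = new_len;
            Ok(())
        }
        // A real version would munmap in Drop and decommit when shrinking.
    }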
Given that you're writing a custom allocator, why were you trying to hand allocations over to malloc()? Why not entirely replace malloc() for the process?
(If you still need libc malloc for smaller non-growable allocations under the hood, you should be able to privately access it via dlopen()/dlsym() in your code, shouldn't you?)
> why were you trying to hand allocations over to malloc()?
So that they could be released with free(). For historical reasons, on Linux, most libraries don't generally provide foo_free() functions for freeing objects returned from those libraries; everyone is supposed to use free(), under the tacit assumption that there is only one version of libc loaded in the process, which everyone will use. The Windows world has a somewhat better culture in this regard.
> Why not entirely replace malloc() for the process?
Now that's just rude.
Author is not aware of https://docs.rs/compact_str/latest/compact_str/ or https://github.com/bodil/smartstring
https://github.com/pola-rs/polars/blob/32a2325b55f9bce81d019...
https://github.com/JuliaString/ShortStrings.jl