That documentation talks about all the benefits and says it "can mostly be used as a drop in replacement for String", but what are the tradeoffs? When can it not be used?
It looks like there's a bunch of "garbage collection"-type activity where you may have some bytes which were once part of a string but are no longer used, and you're always paying the overhead of this optimisation even if it's useless for your problem.
Suppose you work only with 500-4000 byte strings; maybe they're short reviews, and each ends with a rating in star emoji: ***** is the best, * is the worst. [[HN ate my star emoji of course]]
So your reviews never fit in the "optimised" inline slot, but also the prefix is just the opening words of a review, which in some review styles will be the start of a seemingly unrelated anecdote. "My grandfather used to tell me" will get to the review eventually, and you'll see why they're connected, but the part that's actually useful is the suffix, and that isn't stored inline in a "German string" data structure.
Or maybe you have a high turnover of somewhat related medium-size strings, in which case that garbage-collection step costs quite a lot of overhead.
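For context, here is a rough sketch of the 16-byte "German string" view layout being discussed; the field names are mine, not from any particular implementation:

    // Rough 16-byte "German string" view (illustrative, not any library's real type).
    #[repr(C)]
    struct GermanStr {
        len: u32,         // length in bytes
        prefix: [u8; 4],  // first 4 bytes of the string, kept inline
        rest: u64,        // strings of <= 12 bytes: remaining bytes inline;
                          // longer strings: pointer/offset to the full data
    }

A 500-4000 byte review never fits the inline case, and the rating emoji sits at the very end, so reading it always chases the pointer into the data buffer; the inline prefix buys nothing for that access pattern.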
Any code built closely around String's power-of-2 reallocation pattern may have to be reworked. I don't think there's any case when it cannot be used as a String replacement at all, except maybe when interfacing with an API that expects a &mut String as an output parameter.
My uninformed guess is that at the very least it will cost you a branch, because you need to check whether the string is inlined or not, and you pay that for every string access. Branch prediction is likely very good for this case, though.
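Roughly the branch in question; this is a made-up small-string enum just for illustration, not how CompactString actually packs its bits:

    // Made-up small-string type; real implementations pack the discriminant
    // into spare bits rather than paying for a whole enum tag.
    enum SmallStr {
        Inline { len: u8, buf: [u8; 22] },
        Heap(String),
    }

    impl SmallStr {
        fn as_str(&self) -> &str {
            // Every access pays this check; in a hot loop over mostly one
            // variant, branch prediction should make it close to free.
            match self {
                SmallStr::Inline { len, buf } => {
                    std::str::from_utf8(&buf[..*len as usize]).unwrap()
                }
                SmallStr::Heap(s) => s.as_str(),
            }
        }
    }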
But yeah, it's pretty ignorant to assume Rust can't do this since the best available examples (as with many things) are in Rust. CompactString is really nice. On a typical modern (64-bit) computer CompactString takes 24 bytes and holds up to 24 bytes of UTF-8 text inline, while also having a niche.
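Those numbers are easy to check, assuming a 64-bit target and the compact_str crate added as a dependency:

    use compact_str::CompactString;
    use std::mem::size_of;

    fn main() {
        // Same size as String on a 64-bit target (pointer + length + capacity).
        assert_eq!(size_of::<CompactString>(), 24);
        // The niche means Option<CompactString> costs no extra space.
        assert_eq!(size_of::<Option<CompactString>>(), 24);
        // Up to 24 bytes of UTF-8 stay inline, with no heap allocation.
        let s = CompactString::from("twenty-three bytes here");
        assert!(!s.is_heap_allocated());
    }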
I guess the confusion arises because C++ people tend to assume that anywhere Rust differs from the practice in the C++ community it's a mistake, even though that's often because C++ made the wrong choice? Rust's &str is "just" &[u8] plus a rule about the meaning of these bytes, and Rust's String is correspondingly "just" Vec<u8> plus a rule about the meaning of those bytes. C++ couldn't have done the former because it only belatedly got the fat pointer slice reference (as the troubled std::span) years after having a string data type.
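Concretely, std makes that relationship explicit; the only real work in these conversions is the UTF-8 validity check in one direction:

    fn main() {
        // &str -> &[u8] is free; it's the same fat pointer.
        let s: &str = "héllo";
        let bytes: &[u8] = s.as_bytes();

        // &[u8] -> &str just checks the UTF-8 rule, no copy.
        let back: &str = std::str::from_utf8(bytes).unwrap();
        assert_eq!(s, back);

        // String <-> Vec<u8> are likewise free apart from the UTF-8 check.
        let owned: String = String::from("héllo");
        let v: Vec<u8> = owned.into_bytes();
        let owned_again: String = String::from_utf8(v).unwrap();
        assert_eq!(owned_again, "héllo");
    }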
Rust didn't do this in the stdlib, not because it's impossible, but because it's a trade-off and they wanted the provided stdlib type to be straightforward. If you need, or even just want, the trade-off, you can just cargo add compact_str.
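And since CompactString derefs to str, code that only reads the string doesn't care which type it got. A tiny sketch, assuming the crate has been added:

    use compact_str::CompactString;

    // Functions taking &str work unchanged, whether callers hold a String
    // or a CompactString.
    fn shout(s: &str) -> String {
        s.to_uppercase()
    }

    fn main() {
        let a: String = "hello".to_string();
        let b: CompactString = CompactString::from("hello");
        assert_eq!(shout(&a), shout(&b)); // &CompactString derefs to &str
    }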
It is not. It is not implemented in std::string::String, but (as pointed out elsewhere in this thread) there are other string implementations that have it.
It was explicitly decided against for the standard library, because not every optimization is universally good, and keeping String as a thin wrapper over Vec<u8> is a good default.
> When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid.
To my mind this reads exactly like "if you're a security practitioner, worry about this bit here."
Don't you lose the in-memory interop with other libraries by doing this? I'm thinking that DuckDB will no longer be able to read polars data that has been loaded into memory, as it currently can, since DuckDB supports Arrow. Isn't the benefit of Arrow that it's supported by many languages and libraries as a standard?
Will there be an option to use the "compatible" string format?
From the article:
> As luck would have it, the Arrow spec was also finally making progress with adding the long anticipated German Style string types to the specification. Which, spoiler alert, is the type we implemented.
DuckDB has their own string type (quite similar to this) that deviates from Arrow (large)-string type, so it had to do a copy anyway. Nothing has changed on that front.
I wonder if something like Arrow's custom extension types mechanism would streamline building novel data representations without full forks? It might also highlight gaps in the extension mechanism.
For similar reasons, we've been curious about new compression modes on indexes.
> Having the 4-byte prefix directly accessible (without indirection through an offset into a separate data buffer) can substantially improve the performance of comparisons returning false. This prefix can be encoded with multi-column hash keys to accelerate aggregations, joins. Sorts would likely also be significantly faster with this representation (experiments would tell for certain)
> Certain algorithms (for example “prefix of string” or “suffix of string” — e.g. PREFIX(“foobar”, 3) -> “bar”) can execute by manipulating StringView values only and not requiring any memory copying of large strings.
This document was an early proposal for adding what is now called the StringView (and ByteView) types to the Arrow format itself.
The first n bytes are likely by far the most often accessed in practice, specifically for sorting & filtering, etc. Storing them inline is likely a huge optimization for little cost.
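A minimal sketch of that short-circuit, with made-up field names rather than Arrow's actual StringView layout (the real type also stores strings of up to 12 bytes fully inline):

    // Made-up view: length, 4-byte inline prefix, and an offset into a shared
    // data buffer. Both views are assumed to point into the same buffer here.
    struct View {
        len: u32,
        prefix: [u8; 4],
        offset: u32,
    }

    fn eq(a: &View, b: &View, data: &[u8]) -> bool {
        // Cheap rejections first: length, then the inline prefix bytes.
        if a.len != b.len || a.prefix != b.prefix {
            return false;
        }
        // Only now pay the indirection into the data buffer.
        let (ao, bo) = (a.offset as usize, b.offset as usize);
        data[ao..ao + a.len as usize] == data[bo..bo + b.len as usize]
    }

Most comparisons between unequal strings never reach the last line, which is where the cache misses live.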
> As I mentioned above already pre-allocating the required size of data is hard. This leads to many reallocations and memcopy’s during the building of this type.
Well. Reallocations have to happen mostly because the virtual memory space is flat, so you can't just grow your allocations without the possibility of accidentally bumping into some other object. But a non-flat virtual memory space is really inconvenient (Segment selectors! CHERI! And what about muh address arithmetic?) for other reasons, so here we are.
I toyed with the idea of a specialized memory allocator for incrementally growing, potentially very large buffers. It would space allocations apart by, say, 16 GiB, and there would be a "finalize" operation that hands the buffer's contents over to malloc: ask malloc to allocate the exact final buffer size (rounded up to the page size), and then, instead of memcpy-ing the data, persuade the OS to remap the physical pages of the existing allocation into the virtual address returned by malloc. The original buffer's virtual addresses would then become unmapped and could be reused.
Unfortunately, I couldn't quite persuade the OS to do that with the user-accessible memory-management APIs, so it all came to nothing. I believe there was similar research in the early 90s, and it failed because it too required custom OS modifications.
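For what it's worth, the "space allocations apart by 16 GiB" half works with stock mmap: reserve the address space up front and commit pages lazily, so the buffer grows in place with no memcpy. It's only the final hand-off to malloc via page remapping that needs OS cooperation. A rough Linux-only, 64-bit-only sketch of just the growable part, using the libc crate; everything here is illustrative, not the commenter's actual design:

    use std::ptr;

    const RESERVE: usize = 16 << 30; // 16 GiB of address space per buffer

    struct GrowBuf {
        base: *mut u8,
        len: usize,
    }

    impl GrowBuf {
        fn new() -> std::io::Result<Self> {
            // Reserve the range with PROT_NONE: no physical memory is committed yet.
            let base = unsafe {
                libc::mmap(
                    ptr::null_mut(),
                    RESERVE,
                    libc::PROT_NONE,
                    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
                    -1,
                    0,
                )
            };
            if base == libc::MAP_FAILED {
                return Err(std::io::Error::last_os_error());
            }
            Ok(GrowBuf { base: base as *mut u8, len: 0 })
        }

        fn append(&mut self, data: &[u8]) -> std::io::Result<()> {
            let new_len = self.len + data.len();
            assert!(new_len <= RESERVE);
            // Commit the used range read/write; the kernel backs it with
            // physical pages lazily, and the buffer never moves.
            let rc = unsafe {
                libc::mprotect(
                    self.base as *mut libc::c_void,
                    new_len,
                    libc::PROT_READ | libc::PROT_WRITE,
                )
            };
            if rc != 0 {
                return Err(std::io::Error::last_os_error());
            }
            unsafe { ptr::copy_nonoverlapping(data.as_ptr(), self.base.add(self.len), data.len()) };
            self.len = new_len;
            Ok(())
        }
        // A real version would munmap in Drop and decommit when shrinking.
    }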
Given that you're writing a custom allocator, why were you trying to hand allocations over to malloc()? Why not entirely replace malloc() for the process?
(If you still need libc malloc for smaller non-growable allocations under the hood, you should be able to privately access it via dlopen()/dlsym() in your code, shouldn't you?)
> why were you trying to hand allocations over to malloc()?
So that they could be released with free(). For historical reasons, on Linux, most libraries don't generally provide foo_free() functions for freeing objects returned from those libraries; everyone is supposed to use free(), under the tacit assumption that there is only one version of libc loaded in the process, which everyone will use. The Windows world has a somewhat better culture in this regard.
> Why not entirely replace malloc() for the process?
Now that's just rude.
Author is not aware of https://docs.rs/compact_str/latest/compact_str/ or https://github.com/bodil/smartstring
https://github.com/pola-rs/polars/blob/32a2325b55f9bce81d019...
https://github.com/JuliaString/ShortStrings.jl