Interning strings saves a ton of space. I wish more programmers would use it.
Back in the 32-bit days I was working on a large (multi-GB) distributed Oracle database system, but I couldn't use transaction log shipping (don't ask). Database A was the "main" system and database B was the "replica". To keep them in sync I needed a program that would compare the two databases and then generate an update script that would make B look like A.
Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.
The first attempt by a junior dev was implemented in C#. Not only was it terribly slow, but it also ran out of memory.
I wrote a new version in C that interned all strings using a hash table and a custom bump-allocator. I also exported every field in the database as a string, so I didn't have to deal with native types. Using this technique meant that a database record could be represented as a plain array of pointers to the interned strings.
Since each string was recorded only once, and every field was a pointer to a string, two database records with the same values must by definition point to the same strings. Comparing database rows was as easy as doing a memcmp() on the two pointer arrays, one being a record from database A and the other the corresponding record from database B.
Not only was the system incredibly fast, but it never took more than 150MB of memory to run.
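To make the layout concrete, here's a minimal sketch of the same idea (in Rust rather than the original C, with names of my own invention; Box::leak stands in for the never-freed bump-allocator arena): each field value is interned once, a row is just an array of pointers to interned strings, and comparing rows reduces to comparing pointers rather than string contents, the moral equivalent of the memcmp() above.

```rust
use std::collections::HashSet;

#[derive(Default)]
struct Interner {
    // Interned strings are leaked and live for the rest of the program,
    // standing in for the hash table + bump allocator described above.
    pool: HashSet<&'static str>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> &'static str {
        if let Some(&existing) = self.pool.get(s) {
            return existing; // already recorded once, hand back the same pointer
        }
        let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
        self.pool.insert(leaked);
        leaked
    }
}

// A database row is just an array of pointers into the intern pool.
type Row = Vec<&'static str>;

// Equal rows must point at the same interned strings, so comparing the
// pointers is enough; no string contents are ever touched.
fn rows_equal(a: &Row, b: &Row) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| std::ptr::eq(*x, *y))
}

fn main() {
    let mut interner = Interner::default();
    let row_a: Row = vec![interner.intern("42"), interner.intern("3.14"), interner.intern("N/A")];
    let row_b: Row = vec![interner.intern("42"), interner.intern("3.14"), interner.intern("N/A")];
    assert!(rows_equal(&row_a, &row_b));
}
```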
This is really the main reason interning gets used: to avoid long string comparisons, rather than to save memory as such.
Interned strings tend not to have a good cleanup mechanism in systems that churn through a lot of them, so they often end up using more memory as data patterns evolve.
I use the same trick when parsing JSON, where a large set of rows tends to have its keys repeated, and the conversion to columnar form is easier if the keys are interned.
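Roughly like this, as a sketch (hypothetical names, using Rc<str> as the interned key handle): the keys are interned once, and the interned handle then doubles as the column identity when pivoting rows into columns.

```rust
use std::collections::HashMap;
use std::rc::Rc;

type Row = HashMap<String, String>;

fn to_columnar(rows: &[Row]) -> HashMap<Rc<str>, Vec<Option<String>>> {
    // Intern every key once; the same key repeated across thousands of rows
    // maps to a single Rc<str> allocation.
    let mut interned: HashMap<&str, Rc<str>> = HashMap::new();
    for row in rows {
        for key in row.keys() {
            interned
                .entry(key.as_str())
                .or_insert_with(|| Rc::from(key.as_str()));
        }
    }

    // The interned key is the column identity; every column gets one slot per row.
    let mut columns: HashMap<Rc<str>, Vec<Option<String>>> = HashMap::new();
    for (key, ikey) in &interned {
        let col: Vec<Option<String>> = rows.iter().map(|row| row.get(*key).cloned()).collect();
        columns.insert(Rc::clone(ikey), col);
    }
    columns
}
```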
If your language supports good strong/weak references and containers thereof, cleaning up dead interned strings isn't hard. I'm not aware of any language that provides this out-of-the-box, unfortunately.
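For example, in a language with Rc/Weak-style references and ordinary hash containers (Rust here), a workable version is only a handful of lines. This is a sketch with made-up names, not a built-in facility: the table holds only weak references, so an interned string is freed once the last handle given out to callers is dropped, and purge() sweeps the dead table entries.

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

#[derive(Default)]
struct WeakInterner {
    pool: HashMap<String, Weak<str>>,
}

impl WeakInterner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.pool.get(s).and_then(Weak::upgrade) {
            return existing; // still alive somewhere, reuse it
        }
        let rc: Rc<str> = Rc::from(s);
        self.pool.insert(s.to_owned(), Rc::downgrade(&rc));
        rc
    }

    // Drop table entries whose strings no longer have any live users.
    fn purge(&mut self) {
        self.pool.retain(|_, weak| weak.strong_count() > 0);
    }
}
```

The catch, as the parent says, is that you have to assemble and maintain this yourself (and remember to call purge(), or wrap things so entries are removed eagerly) rather than getting it out of the box.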
Why do so many languages make weak references such second-class citizens? Why do all containers suck so much that you have to implement your own (and that's hoping the language is actually efficient enough to let you?)
In Lisp, interning is not only used for saving on string comparisons. It's the basis of the symbol abstraction. Or perhaps not the basis, but interning is the only way symbols can correspond to a printed representation. Without interning, two occurrences of A in source text wouldn't read back as the same object.
A symbol being an object with a durable identity is important because it can have properties other than just the string which gives it a name.
Here's another perspective: I was once asked to improve a custom data caching system that took too much memory even though it had string interning. (At that time, eviction had to be based on factors other than memory used.) String interning certainly helped with memory use but it wasn't enough. Eventually my solution was to compress each record of the dataset in memory. I found that this saved more memory than interning individual strings within each record. At that time I picked Google's snappy compression algorithm since it was already a dependency of the project, but these days I might have picked zstd with a negative compression level or lz4.
This just goes to show that if your primary goal is to save memory, you should consider storing the data compressed and decompressing it on the fly when it's used. Modern compression algorithms are good and fast on moderately sized textual data. You might be surprised to learn how fast decompression is. There are of course other benefits to interning, like using pointer equality as string equality, but those weren't a factor in my project.
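To illustrate the shape of "store compressed, decompress on use" (a sketch only; I'm using the flate2 crate here as a stand-in, where the comment above used snappy and mentions zstd/lz4, which would slot in the same way):

```rust
use std::io::{Read, Write};

use flate2::read::ZlibDecoder;
use flate2::write::ZlibEncoder;
use flate2::Compression;

// A cache entry holds only the compressed bytes of a whole serialized record...
struct CachedRecord {
    compressed: Vec<u8>,
}

impl CachedRecord {
    fn new(serialized_record: &[u8]) -> std::io::Result<Self> {
        let mut enc = ZlibEncoder::new(Vec::new(), Compression::fast());
        enc.write_all(serialized_record)?;
        Ok(Self { compressed: enc.finish()? })
    }

    // ...and decompresses on the fly whenever the record is actually read.
    fn get(&self) -> std::io::Result<Vec<u8>> {
        let mut out = Vec::new();
        ZlibDecoder::new(self.compressed.as_slice()).read_to_end(&mut out)?;
        Ok(out)
    }
}
```

Compressing whole records also squeezes out repetition between fields within a record, which is part of why it can beat interning individual strings.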
We did a very similar thing in a previous project. Used protostuff to serialize each object, and snappy at the time to compress it. And a small proxy class to access it that decompressed on demand. Costs a few microseconds on read, but saved 4x the space (20GiB -> 5GiB IIRC). This was after interning etc.
I'm sure you can save a lot more space if you compress batches of objects and store them column-oriented, but it will take a bit longer to decompress individual objects.
Many compression algorithms are effectively string interning that works on general-purpose binary data: they adaptively pick the most repeated substrings and assign them the smallest bit representations. That's why formats like XML and JSON compress so well: all those repeated string keys get stored once and then become sub-byte entries in a lookup table.
So, a fun story about how interning numbers can go wrong: when compiling Apple's UI designs from the XML-based xib format to a binary nib, their compiler uses lots of interning. It's pretty cool for avoiding 20 copies of an empty string, for example. But then they messed up and did the same thing with numbers while ignoring the type... which means if you have the value 5.0 as a size somewhere, then your sequentially assigned ids will be: 1, 2, 3, 4, 5.0, 6, ...
> Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.
Floating point weirdness between systems is a commonly encountered quirk in multiplayer gamedev. It's the source of many recommendations that you do physics in fixed point.
Apologies if this side-tracks the conversation, but why do we as an industry complicate the naming of such techniques? I don't remember what "string interning" is off the top of my head, but a "string lookup table" is immediately obvious. Maybe I'm wrong though and this is a commonly understood name. I guess interning captures both creating the lookup table + using the index in the related data structures. We used this technique at KAYAK to help efficiently construct the list of possible flight combinations for N x M depart/return legs.
"String interning" is discussed in the Java Language Spec, going back to the 1990s, and wasn't novel there. It goes back at least to Lisp implementations from the 1970s. Probably even earlier than that.
"pointerizing" might be another name. It would be utterly unsurprising if the database/sql world had more than one other name as well. Usually, ideas that go back to the 1950s at the dawn of data representation have many names.
Maybe worth noting, you only need the dictionary in one direction (converting an expanded string to a pointer by looking it up). In the other direction, you just need to indirect the pointer and most people don't call an address space indexed by integers a "dictionary" (though, I guess the OG Javascript did merge these two ideas).
Also notable: how wide a pointer you need (8-bit..64-bit) and how often you do each direction just depends upon the workload. This one-direction point kind of matters since sometimes your data set is so static that you can just convert to pointers up-front. Then your data analysis doesn't even need the expanded->pointer dictionary. With larger pointers (like 16..64-bit offsets into a file), you just need the file (and whatever data and/or files reference it via the pointers).
Personally, I've run into many "enum-scale" scenarios where 8-bits was enough address space. E.g., there are <256 countries in the world, last I checked. You might need another layer of indirection if you cannot bound the length of per-country data, though.
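A concrete sketch of that enum-scale case (illustrative names; Rust): the string->index dictionary is only needed in one direction, at ingest time, and afterwards each record carries a one-byte index instead of a pointer or a string.

```rust
use std::collections::HashMap;

struct SmallInterner {
    names: Vec<String>,          // index -> string: just an array lookup
    lookup: HashMap<String, u8>, // string -> index: only needed while ingesting
}

impl SmallInterner {
    fn new() -> Self {
        Self { names: Vec::new(), lookup: HashMap::new() }
    }

    fn intern(&mut self, name: &str) -> u8 {
        if let Some(&idx) = self.lookup.get(name) {
            return idx;
        }
        let idx = u8::try_from(self.names.len())
            .expect("more than 256 distinct values; use a wider index");
        self.names.push(name.to_owned());
        self.lookup.insert(name.to_owned(), idx);
        idx
    }

    fn resolve(&self, idx: u8) -> &str {
        &self.names[idx as usize]
    }
}

// Each per-record field is now a single byte instead of a pointer or a string.
struct Itinerary {
    origin_country: u8,
    dest_country: u8,
}
```

Once the data set is frozen, the lookup map can be dropped entirely; only the names array (or nothing, if you never need the strings back) has to travel with the data.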
As mentioned elsewhere in the thread (at least here by nostrademons: https://news.ycombinator.com/item?id=43245421 but probably elsewhere), data compression algorithms just generalize this from 8..64 bits to "'best' bits per token", but they need to measure up front what 'best' might mean. That domain of research/impl may also have its own terminology. I don't know what to do about "nobody talking to each other", either in this eentsy microcosm or more generally. Society can be tough. :-)
I have a Rust crate (ijson) which implements string interning and a generally much more efficient in-memory JSON representation. It probably doesn't compete with the author's specialized implementation, but if you just want a plug and play solution it's pretty good.
The thing JSON is best at is human readability. I would say human composability as well, but YAML and others are better for that. As a machine data exchange and storage format, JSON is extremely inefficient.
I really dislike writing YAML. It might be a skill issue (as the kids say), but it just never landed. I'd rather write JSON, even with all its verbosity.
It never ceases to amaze me how good computer people are at solving problems which are not problems (readability of serialized representation) at the expense of multiple orders of magnitude more compute and storage.
A problem with explicit interning is that libraries, when creating objects, cannot make an informed decision about whether to trade some runtime performance for an expected decrease in memory usage.
And it is even less than an expectation: many interning libraries never free objects they created, so they can keep lots of objects around that are never needed again, and thus increase memory usage.
I think the ideal API would somehow a) be simple and b) have the program communicate with the system about what it desires from the system. As a first step, the system has to know how much memory the program is willing to use, how much longer it expects to run, and for how long the program plans to retain the data it is creating now.
Simple examples:
- if the program is shutting down, and there’s a megabyte of memory available, it’s likely detrimental to intern any new data.
- in a “json grep” tool, interning json keys likely isn’t worth it, as most data will be discarded soon.
That's probably at least a dozen PhDs' worth of research, and likely not attainable, though.
I did something similar to reduce the memory overhead of our Rust backend at a recent job; I noticed that when dealing with certain requests, we were using quite a lot of memory on duplicates of the same strings when processing things, so I made something I called an "alias cache" type that could be used to obtain an alias of an existing string if needed, and then introduced those to several places in the codebase where it was most helpful. This strategy also allowed some flexibility on how long strings were "cached" for, which let me tweak which layers of processing should share a cache versus creating their own (and letting the old one drop). Our use case definitely didn't have room for interning to use 2000x less memory, but I think in some of the pathological cases it reduced the max memory usage to around 50% of what it was before without significantly affecting runtime, which I was pretty pleased with!
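For readers who haven't seen the pattern, here's a guess at the rough shape of such an "alias cache" (the name AliasCache and its API are mine, not the commenter's actual code): each processing layer owns a cache, alias() hands back a shared Arc<str> for the first copy it has seen, and dropping the cache at the end of a layer releases its references while any aliases still in use keep their strings alive.

```rust
use std::collections::HashSet;
use std::sync::Arc;

#[derive(Default)]
struct AliasCache {
    seen: HashSet<Arc<str>>,
}

impl AliasCache {
    // Return a shared alias of `s`, reusing the first copy this cache has seen.
    fn alias(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.seen.get(s) {
            return Arc::clone(existing);
        }
        let arc: Arc<str> = Arc::from(s);
        self.seen.insert(Arc::clone(&arc));
        arc
    }
}

fn main() {
    let mut cache = AliasCache::default();
    let a = cache.alias("us-east-1");
    let b = cache.alias("us-east-1");
    assert!(Arc::ptr_eq(&a, &b)); // one allocation, two aliases
    drop(cache);                  // end of this processing layer
    assert_eq!(&*a, "us-east-1"); // live aliases keep the string alive
}
```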
Very interesting, I particularly enjoyed the handling of memory reduction with the Interned<T> comparison. How do you go about finding open source projects which expose fascinating data such as this one?
Very fun read, Guillaume! I never find the courage to work on my "this seems interesting" weekend ideas due to my fear of the time commitment, but I often find these HN write-ups somehow fulfilling.
> Floating point weirdness between systems is a commonly encountered quirk in multiplayer gamedev. It's the source of many recommendations that you do physics in fixed point.
Probably something to do with ANSI history.
It is even more painful than IEEE 754 ambiguity.
For instance, people writing blogs to explain Lisp, who then present some Javascript crud in which symbols are just character strings.
https://learn.microsoft.com/en-us/dotnet/api/system.string.i...