During this time I got really excited about succinct data structures and wrote a Rust library implementing many bitvector types and a wavelet matrix. [2]
My interest came from a data visualization perspective -- I was curious if space-efficient data structures could fundamentally improve the interactive exploration of large datasets on the client side. Happy to chat about that if anyone's curious.
[1] Paper: https://archive.yuri.is/pdfing/weighted_range_quantile_queri... though it's pretty hard to understand without some background context. I've been meaning to write a blog post explaining the core contribution, which is a simple tweak to one of Navarro's textbook data structures.
These are the days I really love HN. Despite having been in this field for 30 some odd years, I'd never heard of "succinct data structures" until now. And had I not seen this post, maybe I never would have.
Is that important? Well, maybe. As I started digging in and looking for libraries that implement some of this stuff, I found that a popular graph processing library (JGraphT) uses a succinct data structure library (Sux4J) for working with large graphs. And working with graphs is something I've been doing a deep dive into lately. So yeah, if these data structures represent something that has practical applications in graph processing, then maybe finding this is important.
Certainly I'm glad I stumbled across this in either case; the topic strikes me as fascinating.
Note that succinct data structures may not be faster than conventional structures if your dataset fits in memory: http://www.cs.cmu.edu/~huanche1/slides/FST.pdf . Of course, for large datasets where storage access times dominate, succinct data structures win all around. In any case, succinct trees are works of art (if I recall, https://arxiv.org/pdf/1805.11255 was a good exposition; just look at how that RMQ works!)
As an application grows, the features start to interact. We tend not to pay much attention to how the old code 'fits into memory' while we are writing new code that also has to fit into memory.
Once you have an entire system written, the benefits of having four interacting features that each fit into a third of memory may be bigger than you think. I don't know how well Intel's hardware-level profiling reveals that, but I know for sure that the profiling tools that ship with most commercially viable programming languages never do. They blame issues on whoever touched the system last, and if that is an epicycle or a metastable situation, then the apparent guilty party may be a frame-up while the real culprit goes unpunished.
As an example: if your workflow has a step that eventually needs to use 40% of available memory to solve a problem, then GC will almost always trigger within that step of the process, rather than in the other 60% of memory being systematically wasted by heaps of inefficient code in the leadup to this step, because the top of the memory sawtooth will almost always occur within that part of the call graph. But because the 40% is your intrinsic complexity, people's brains shut off the moment a cost is attributed to unavoidable work, instead of the avoidable work that really caused the majority of the problem.
True, but it depends on what you mean by "fitting in memory".
Succinct data structures are used in genomics (e.g. BWA, the MEGAHIT assembler) because N is so large that you're actually hitting asymptotic behavior.
For memory latency it can help, by making your memory footprint fit in the LLC or on a single NUMA node; cross-node NUMA latencies are typically enough to absolutely tank performance.
It can theoretically also help in massively parallel access situations where bandwidth becomes the limiting concern. Although I suspect we would need near-memory hardware to perform the rank+select; otherwise the latency of the multiple dependent accesses will kill your performance again, cf. the previous point.
With a lot of parallel accesses, bandwidth could also be an issue in conventional structures.
There's fits in memory, and there's fits in memory.
I have used various bit-packing schemes in order to keep data in memory. If you can keep your entire dataset in memory, it opens up different ways to tackle it. Succinct data structures look like a way to enable this.
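For anyone unfamiliar, here is a minimal sketch of the bit-packing idea (not the commenter's actual scheme): values that each fit in `width` bits are stored contiguously in one big integer, rather than one machine word per value.

```python
def pack(values, width):
    """Pack each value into `width` bits of a single big integer."""
    acc = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width), "value does not fit in width bits"
        acc |= v << (i * width)
    return acc

def unpack(acc, index, width):
    """Read back the value stored at `index`."""
    return (acc >> (index * width)) & ((1 << width) - 1)

data = [5, 1023, 42, 7]          # each value fits in 10 bits
packed = pack(data, 10)          # 40 bits total instead of 4 x 64
assert [unpack(packed, i, 10) for i in range(4)] == data
```

A real implementation would pack into a flat word array rather than a Python bignum, but the indexing arithmetic is the same.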
It's not storage access times that kill you. You need to rearrange all your processing to work in batches or windows or use some kind of query API to do anything. Everything becomes much more painful.
I think it depends more on the ratio between access time and how often you use the data. Adding two arrays that fit in L1 is already limited by access time. On Zen3, we can add two 32-byte vectors per cycle, but only store one of these per cycle. For matrix multiplication, we can do the two additions per cycle (or really c <- a * b + c) because we have to do multiple operations once we have loaded the data into registers.
I can see it being useful for data sets of a few dozen MBs as well.
"Intuitively, a succinct data structure is one whose space usage equals the space needed to write out the data, plus something that grows more slowly than that. If you're familiar with little-o notation, a succinct data structure is one whose space usage is X + o(X), where X is the number of bits needed to write out the data itself."
Brings to mind COBS encoding, which does this for streams bytes containing arbitrary length "packets" or similar.
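For context, a small sketch of COBS (Consistent Overhead Byte Stuffing): it rewrites a byte stream so it contains no zero bytes, letting zeros serve as unambiguous packet delimiters, at a bounded overhead of roughly one byte per 254 bytes of data.

```python
def cobs_encode(data: bytes) -> bytes:
    out = bytearray([0])        # placeholder for the first code byte
    code_idx, code = 0, 1
    for byte in data:
        if byte == 0:
            out[code_idx] = code        # code = distance to this zero
            code_idx = len(out)
            out.append(0)               # new placeholder
            code = 1
        else:
            out.append(byte)
            code += 1
            if code == 0xFF:            # max run: flush with no implicit zero
                out[code_idx] = code
                code_idx = len(out)
                out.append(0)
                code = 1
    out[code_idx] = code
    return bytes(out)

def cobs_decode(encoded: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        out += encoded[i + 1 : i + code]
        i += code
        if code < 0xFF and i < len(encoded):
            out.append(0)       # short blocks imply a zero byte
    return bytes(out)

print(cobs_encode(b"\x11\x22\x00\x33").hex())  # 0311220233
```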
This is great! I love both how far you can push this and get meaningful improvements, and how it's totally overkill for anything we'll ever be able to implement on a physical computer. The hardware-focused approach is to use popcnt for a base size of 512 (since cache architecture will make you fetch that much memory if you touch the original array anyway). We then store one UInt16 prefix sum per 512 bits (n/32 bits overall), and if we have more than 2^16 bits in total, we store UInt64 prefixes every 2^16 bits (n/1024 bits overall).
Theoretically, this approach uses O(n log n) bits as opposed to o(n) for the theoretical approach, but in practice, for < 2^64 bools, the actual storage ends up being n/32 + n/1024, which is pretty hard to beat. The theoretical approach gets its wins from making extremely clever use of the difference between O(log log n) and O(1), but unfortunately, for the foreseeable future log n < 64 and log log n < 6, so all the subtlety gets swallowed up into the base case of a single popcnt instruction.
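The two-level layout described above can be sketched in a few lines; plain Python lists stand in for the packed UInt16/UInt64 arrays a real implementation would use.

```python
BLOCK = 512        # popcnt granularity
SUPER = 1 << 16    # relative block counts fit in a UInt16 below this

class RankBitvector:
    def __init__(self, bits):
        self.bits = bits
        self.supers = []   # absolute 1-count at the start of each superblock
        self.blocks = []   # 1-count relative to the enclosing superblock
        total = rel = 0
        for i in range(0, len(bits), BLOCK):
            if i % SUPER == 0:
                self.supers.append(total)
                rel = 0
            self.blocks.append(rel)
            ones = sum(bits[i:i + BLOCK])
            total += ones
            rel += ones

    def rank1(self, pos):
        """Number of 1s in bits[0:pos], for 0 <= pos < len(bits)."""
        b = pos // BLOCK
        head = self.supers[pos // SUPER] + self.blocks[b]
        # scanning the partial block is a single popcnt on real hardware
        return head + sum(self.bits[b * BLOCK:pos])

bits = [1, 0, 1, 1, 0, 0, 1] + [0] * 505
print(RankBitvector(bits).rank1(4))  # 3
```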
templatetypedef answers are the best answers. I know going in that (1) there are going to be surprising insights, and (2) I'm going to understand them fully.
I first heard of the concept of succinct data structures from Edward Kmett, a famous Haskeller behind many popular Haskell libraries. He gave a talk on succinct data structures a long time ago: http://youtu.be/uA0Z7_4J7u8
Succinct data structures are very fun! If anyone is interested, I've implemented some of this in Zig: https://github.com/judofyr/zini. The main thing this implements is a minimal perfect hash function which uses less than 4 bits per element (and can respond to queries in ~50 ns). As part of that I ended up implementing one of these indexes for constant-time select(n): https://github.com/judofyr/zini/blob/main/src/darray.zig.
It feels kinda magic implementing these algorithms because everything becomes so tiny!
For Java, C++, and Rust there is also https://sux.di.unimi.it/ maintained by Sebastiano Vigna, a professor from Italy.
Together with his student (I also helped a bit), he wrote a paper about RecSplit, a minimal perfect hash function (MPHF) algorithm I invented: https://arxiv.org/abs/1910.06416 - that algorithm uses around 1.56 bits per key, but it is quite slow. In January 2020 I presented the paper at a conference, right before the pandemic.
The algorithm with the least memory usage (and much faster as well) is now Consensus-RecSplit: https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key, which is right at the theoretical minimum (meaning, there is no good way to shrink it further). At least a small part of my original invention is still used there, nice. The fastest current MPHF is probably PtrHash https://arxiv.org/abs/2502.15539 - both papers were published last month (February 2025) by the way.
I'm working on making pthash faster and more practical. I can compile the data and code to C++, and efficiently store the keys as well, to be able to eliminate false positives.
Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced parentheses tree representing a full node tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?
The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.
This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.
When I’ve dealt with huge files in .NET, the usual approach is to stream the file such that only some of it is in memory at once. This way you can process files hundreds of GBs. Of course, if you truly need them all in memory at once for some reason I genuinely can’t think of, then you’d need something else.
Does your language have the concept of streaming files?
If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.
But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.
Nothing prevents streaming in theory, it's just far more complicated to write that library.
The low-hanging fruit in this area is to do partial unmarshalling or to use a SAX parser on a stream. It's likely you'll have to do this anyway to retrieve the data and put it into whatever succinct or otherwise efficient data structure.
In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.
I'd also like to mention that XSLT is an often underappreciated approach.
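As a concrete sketch of the streaming approach (in Python rather than Java, for brevity): `ElementTree.iterparse` visits elements as they complete, and clearing each one keeps memory bounded regardless of file size. The tag names here are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a multi-gigabyte file on disk: a tiny in-memory document
# with hypothetical <record>/<value> elements.
doc = io.BytesIO(
    b"<dump>"
    + b"".join(
        b'<record id="%d"><value>%d</value></record>' % (i, i * i)
        for i in range(5)
    )
    + b"</dump>"
)

total = 0
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "record":
        total += int(elem.find("value").text)
        elem.clear()        # drop the finished subtree so memory stays flat

print(total)  # 0 + 1 + 4 + 9 + 16 = 30
```

The same pattern (handle a completed record, then discard it) is what StAX/SAX code ends up doing in Java.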
I'm pretty sure these were used to store the built-in dictionaries on early mobile phones, especially for the implementation of T9 and similar predictive-text programs.
This might be a little over my head, but I'm not understanding how the balanced parentheses convey anything other than the topology of the tree structure. Are we not accounting for the bits required for a pointer in memory to an object? Or simply the bits required to guarantee the uniqueness of a node in the tree?
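Right: the parentheses encode only the topology, at 2 bits per node, and there are no pointers at all. A node's "identity" is simply its position in the traversal (e.g. its preorder number), and any per-node data lives in separate arrays indexed by that number. A small sketch:

```python
def encode(tree):
    """tree = (label, [children]); returns (bits, labels in preorder)."""
    bits, labels = [], []
    def walk(node):
        label, children = node
        labels.append(label)        # satellite data, stored outside the bits
        bits.append(1)              # '(' : node opens
        for child in children:
            walk(child)
        bits.append(0)              # ')' : node closes
    walk(tree)
    return bits, labels

tree = ("a", [("b", []), ("c", [("d", [])])])
bits, labels = encode(tree)
print(bits)    # [1, 1, 0, 1, 1, 0, 0, 0]  i.e. (()(()))
print(labels)  # ['a', 'b', 'c', 'd']
```

Navigation (parent, first child, subtree size, etc.) is then answered with rank/select-style queries over the bit sequence instead of by chasing pointers.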
Another paper of his that I really like combines a few elegant ideas into a simple implementation of bitvector rank/select: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
[2] The rust version is here: https://github.com/yurivish/made-of-bits/tree/main/rust-play... and an earlier pure-JS implementation is here: https://github.com/yurivish/made-of-bits/tree/main
https://dblp.org/pid/n/GonzaloNavarro.html
There is also HaskellWorks packages like https://hackage.haskell.org/package/hw-xml
Thank you for sharing this link!
https://github.com/rurban/pthash
Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/