yurivish · a year ago
I also emailed Gonzalo Navarro once to ask a question, and we had a great discussion and ended up writing a paper together about the answer. [1]

Another paper of his that I really like combines a few elegant ideas into a simple implementation of bitvector rank/select: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf

During this time I got really excited about succinct data structures and wrote a Rust library implementing many bitvector types and a wavelet matrix. [2]

My interest came from a data visualization perspective -- I was curious if space-efficient data structures could fundamentally improve the interactive exploration of large datasets on the client side. Happy to chat about that if anyone's curious.

[1] Paper: https://archive.yuri.is/pdfing/weighted_range_quantile_queri... though it's pretty hard to understand without some background context. I've been meaning to write a blog post explaining the core contribution, which is a simple tweak to one of Navarro's textbook data structures.

[2] The rust version is here: https://github.com/yurivish/made-of-bits/tree/main/rust-play... and an earlier pure-JS implementation is here: https://github.com/yurivish/made-of-bits/tree/main

sitkack · a year ago
Reading a Gonzalo Navarro paper is like going for a walk, taking a shower and having a wonderful coffee. It literally sets the mind on fire.

https://dblp.org/pid/n/GonzaloNavarro.html

SoftTalker · a year ago
Well not literally.
__tidu · a year ago
the "technical note" link in the RLE bit vector section of the rust repo is broken (https://yuri.is/pdfing/weighted_range_quantile_queries.pdf 404s)
__tidu · a year ago
oh wait nvm just realised you linked a working archive link in your post... still worth updating the link in the repo for people who stumble upon it
mindcrime · a year ago
These are the days I really love HN. Despite having been in this field for 30 some odd years, I'd never heard of "succinct data structures" until now. And had I not seen this post, maybe I never would have.

Is that important? Well, maybe. As I started digging in and looking for libraries that implement some of this stuff, I found that a popular graph processing library (JGraphT) uses a succinct data structure library (Sux4J) for working with large graphs. And working with graphs is something I've been doing a deep dive into lately. So yeah, if these data structures represent something that has practical applications in graph processing, then maybe finding this is important.

Certainly I'm glad I stumbled across this in either case; the topic strikes me as fascinating.

qazxcvbnm · a year ago
Note that succinct data structures may not be faster than conventional structures if your dataset fits in memory http://www.cs.cmu.edu/~huanche1/slides/FST.pdf . Of course, for large datasets where storage access times dominate, succinct data structures win all around. In any case, succinct trees are works of art (If I recall https://arxiv.org/pdf/1805.11255 was a good exposition) (just look at how that RMQ works)!
hinkley · a year ago
As an application grows the features start to interact. We tend to not be paying much attention to how the old code 'fits into memory' while we are writing new code that also has to fit into memory.

Once you have an entire system written the benefits of having four interacting features that each fit into a third of memory may be bigger than you think. And I don't know how well Intel's hardware level profiling information reveals that but I know for sure that the profiling tools that ship with most commercially viable programming languages never do. They blame issues on whoever touched the system last, and if that is an epicycle or a metastable situation then the apparent guilty party may be a frame-up, while the real culprit goes unpunished.

As an example: if your workflow has a step that eventually needs to use 40% of available memory to solve a problem, then GC will almost always trigger within that step of the process, rather than in the other 60% of memory being systematically wasted by heaps of inefficient code in the leadup to this step, because the top of the memory sawtooth will almost always occur within that part of the call graph. But because the 40% is your intrinsic complexity, people's brains shut off the moment a cost is attributed to unavoidable work, instead of the avoidable work that really caused the majority of the problem.

yvdriess · a year ago
True, but it depends on what you mean with fitting in memory.

Succinct data structures are used in genomics (e.g. BWA, the MEGAHIT assembler) because N is so large that you're actually hitting asymptotic behavior.

For memory latency, it can help by making your memory footprint fit in LLC or a single NUMA node; cross-node NUMA latencies are typically enough to absolutely tank performance.

It can theoretically also help in massively parallel access situations where bandwidth becomes a limiting concern. Although I suspect we would need near-memory hardware to perform the rank+select; otherwise the latency of the multiple dependent accesses will kill your performance again, cf. the previous point.

With a lot of parallel accesses, bandwidth could also be an issue in conventional structures.

barrkel · a year ago
There's fits in memory, and there's fits in memory.

I have used various bit-packing schemes in order to keep data in memory. If you can keep your entire dataset in memory, it opens up different ways to tackle it. Succinct data structures look like a way to enable this.

When data doesn't fit, it's not just storage access times that kill you. You need to rearrange all your processing to work in batches or windows, or use some kind of query API to do anything. Everything becomes much more painful.

thomaskoopman · a year ago
I think it depends more on the ratio between access time and how often you use the data. Adding two arrays that fit in L1 is already limited by access time. On Zen3, we can add two 32-byte vectors per cycle, but only store one of these per cycle. For matrix multiplication, we can do the two additions per cycle (or really c <- a * b + c) because we have to do multiple operations once we have loaded the data into registers.

I can see it being useful for data sets of a few dozen MBs as well.

fegu · a year ago
Memory is expensive. In the cloud especially. Using a succinct structure could enable cheaper computing for specific tasks. This benefits everyone.
topspin · a year ago
Yes, that's great. This part:

"Intuitively, a succinct data structure is one whose space usage equals the space needed to write out the data, plus something that grows more slowly than that. If you're familiar with little-o notation, a succinct data structure is one whose space usage is X + o(X), where X is the number of bits needed to write out the data itself."

Brings to mind COBS encoding, which does this for byte streams containing arbitrary-length "packets" or similar.

adgjlsfhk1 · a year ago
This is great! I love both how far you can push this and get meaningful improvements, and how it's totally overkill for anything we'll ever be able to implement on a physical computer. The hardware-focused approach is to use popcnt for a base block size of 512 bits (since the cache architecture will make you fetch that much memory if you touch the original array anyway). We can then store one UInt16 prefix sum per 512 bits (n/32 bits overall), and if we have more than 2^16 bits in total, we can store UInt64 prefixes every 2^16 bits (n/1024 bits overall).

Theoretically, this approach uses O(n log n) bits as opposed to o(n) for the theoretical approach, but in practice, for <2^64 bools, the actual storage ends up being n/32 + n/1024, which is pretty hard to beat. The theoretical approach gets its wins from making extremely clever use of the difference between O(log log n) and O(1), but unfortunately, for the foreseeable future, log n < 64 and log log n < 6, so all the subtlety gets swallowed up into the base case of a single popcnt instruction.
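A minimal Rust sketch of the two-level scheme described above (type and field names are illustrative, not from any particular library): a u16 count per 512-bit block, relative to its enclosing 2^16-bit superblock; a u64 absolute count per superblock; and popcnt for the tail.

```rust
/// Bitvector with a two-level rank index: u16 relative counts per
/// 512-bit block, u64 absolute counts per 2^16-bit superblock.
struct RankBits {
    words: Vec<u64>, // the raw bits, 64 per word
    blocks: Vec<u16>, // ones before each 512-bit block, relative to its superblock
    supers: Vec<u64>, // ones before each 2^16-bit superblock
}

impl RankBits {
    fn new(words: Vec<u64>) -> Self {
        let mut blocks = Vec::new();
        let mut supers = Vec::new();
        let (mut total, mut rel) = (0u64, 0u16);
        for (i, w) in words.iter().enumerate() {
            if i % 1024 == 0 {
                // 1024 words = 2^16 bits: start a new superblock
                supers.push(total);
                rel = 0;
            }
            if i % 8 == 0 {
                // 8 words = 512 bits: record the relative count
                blocks.push(rel);
            }
            rel += w.count_ones() as u16;
            total += w.count_ones() as u64;
        }
        RankBits { words, blocks, supers }
    }

    /// Number of 1-bits strictly before position `pos` (pos must be in range).
    fn rank1(&self, pos: usize) -> u64 {
        let word = pos / 64;
        // superblock count + relative block count ...
        let mut r = self.supers[word / 1024] + self.blocks[word / 8] as u64;
        // ... plus popcounts of the whole words inside this block ...
        for w in (word / 8) * 8..word {
            r += self.words[w].count_ones() as u64;
        }
        // ... plus the partial final word
        r + (self.words[word] & ((1u64 << (pos % 64)) - 1)).count_ones() as u64
    }
}
```

On x86-64, `count_ones()` compiles down to the `popcnt` instruction the comment above refers to, so a query touches at most one superblock counter, one block counter, and one 512-bit block.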

jbreckmckye · a year ago
That IS excellent - thank you
atiedebee · a year ago
This is great. A very understandable explanation, thanks for sharing!
akoboldfrying · a year ago
templatetypedef answers are the best answers. I know going in that (1) there are going to be surprising insights, and (2) I'm going to understand them fully.
kccqzy · a year ago
I first heard of the concept of succinct data structures from Edward Kmett, a famous Haskeller behind many popular Haskell libraries. He gave a talk on succinct data structures a long time ago: http://youtu.be/uA0Z7_4J7u8
eru · a year ago
Ed is great. He's not just a Haskeller, but has also done interesting work in the likes of C++ and 6502 assembly amongst others.
kreyenborgi · a year ago
His code on this seems to be https://hackage.haskell.org/package/structures

There is also HaskellWorks packages like https://hackage.haskell.org/package/hw-xml

mrkeen · a year ago
My Haskell attempt at Wavelet Matrices: https://github.com/jahaynes/waveletto
johnpmayer · a year ago
Whoa - I was there! This was at a small elm-lang meetup; Prezi hosted this while Evan was working there.

Thank you for sharing this link!

judofyr · a year ago
Succinct data structures are very fun! If anyone is interested, I've implemented some of this in Zig: https://github.com/judofyr/zini. The main thing this implements is a minimal perfect hash function which uses less than 4 bits per element (and can respond to queries in ~50 ns). As a part of that I ended up implementing one of these indexes for constant-time select(n): https://github.com/judofyr/zini/blob/main/src/darray.zig.

It feels kinda magic implementing these algorithms because everything becomes so tiny!
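zini's darray gets constant time with a cleverer index; as a much-simplified Rust illustration of the general position-sampling idea behind select indexes (not darray itself, and all names here are hypothetical), one can record the position of every 64th one-bit and scan forward with popcount from the nearest sample:

```rust
/// Bitvector with a sampled select index: the position of every
/// 64th one-bit is stored, and queries scan forward from a sample.
struct SelectBits {
    words: Vec<u64>,
    samples: Vec<usize>, // position of the (64*i)-th one-bit (0-indexed)
}

impl SelectBits {
    fn new(words: Vec<u64>) -> Self {
        let mut samples = Vec::new();
        let mut ones = 0usize;
        for (wi, w) in words.iter().enumerate() {
            let mut w = *w;
            while w != 0 {
                let bit = w.trailing_zeros() as usize;
                if ones % 64 == 0 {
                    samples.push(wi * 64 + bit);
                }
                ones += 1;
                w &= w - 1; // clear the lowest set bit
            }
        }
        SelectBits { words, samples }
    }

    /// Position of the k-th one-bit, 0-indexed (k must be in range).
    fn select1(&self, k: usize) -> usize {
        let start = self.samples[k / 64]; // jump to a nearby sample
        let mut remaining = k % 64;
        let mut wi = start / 64;
        // mask off the bits before the sampled position in its word
        let mut w = self.words[wi] & (!0u64 << (start % 64));
        loop {
            let ones = w.count_ones() as usize;
            if remaining < ones {
                // find the `remaining`-th set bit in this word
                for _ in 0..remaining {
                    w &= w - 1;
                }
                return wi * 64 + w.trailing_zeros() as usize;
            }
            remaining -= ones;
            wi += 1;
            w = self.words[wi];
        }
    }
}
```

The scan between samples is what darray-style structures eliminate to make the worst case constant; with dense bitvectors the scan here is short in practice anyway.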

thomasmg · a year ago
For Java, C++, and Rust there is also https://sux.di.unimi.it/ maintained by Sebastiano Vigna, a professor from Italy.

Together with his student (I also helped a bit), he wrote a paper about RecSplit, a minimal perfect hash function (MPHF) algorithm I invented: https://arxiv.org/abs/1910.06416 - that algorithm uses around 1.56 bits per key, but it is quite slow. In January 2020 I presented the paper at a conference, right before the pandemic.

The algorithm with the least memory usage (and much faster as well) is now Consensus-RecSplit: https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key, which is right at the theoretical minimum (meaning, there is no good way to shrink it further). At least a small part of my original invention is still used there, nice. The fastest current MPHF is probably PtrHash https://arxiv.org/abs/2502.15539 - both papers were published last month (February 2025) by the way.

rurban · a year ago
I'm working on making PTHash faster and more practical. I can compile the data and code to C++, and efficiently store the keys as well, in order to be able to eliminate false positives.

https://github.com/rurban/pthash

cxie · a year ago
Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced parentheses tree representing a full node tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?

The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.

This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.

MortyWaves · a year ago
When I’ve dealt with huge files in .NET, the usual approach is to stream the file such that only some of it is in memory at once. This way you can process files of hundreds of GBs. Of course, if you truly need them all in memory at once for some reason I genuinely can’t think of, then you’d need something else.

Does your language have the concept of streaming files?

crazygringo · a year ago
If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.

But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.

Nothing prevents streaming in theory, it's just far more complicated to write that library.

cess11 · a year ago
The low-hanging fruit in this area is to do partial unmarshalling or use a SAX parser on a stream. It's likely you'll have to do this to retrieve data and put it in whatever succinct or otherwise efficient data structure.

In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.

I'd also like to mention that XSLT is an often underappreciated approach.

leafmeal · a year ago
There's a great blog post from the creator of RhymeBrain that talks about succinct tries: https://stevehanov.ca/blog/index.php?id=120

I'm pretty sure these were used to store the built-in dictionaries on early mobile phones, especially for implementations of T9 and similar predictive-text programs.

CrimsonCape · a year ago
This might be a little over my head, but I'm not understanding how the balanced parentheses convey anything other than the topology of the tree structure. Are we not accounting for the bits required for a pointer in memory to an object? Or simply the bits required to guarantee the uniqueness of a node in the tree?
senderista · a year ago
You store the structural information separately from the data. The data can be stored sequentially in some traversal order.
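A toy Rust sketch of that separation (hypothetical names, assuming a preorder layout): the topology goes into a bit sequence of balanced parentheses, two bits per node, while the labels go into a parallel array in the same DFS order.

```rust
// A pointer-based tree, for comparison: each child link costs a
// full machine word, versus 2 bits per node in the encoding below.
struct Node {
    label: char,
    children: Vec<Node>,
}

/// Append the balanced-parentheses bits (true = '(', false = ')')
/// and, separately, the node labels in preorder.
fn encode(node: &Node, bits: &mut Vec<bool>, labels: &mut Vec<char>) {
    bits.push(true); // open paren: enter the node
    labels.push(node.label); // data array runs parallel to the opens
    for c in &node.children {
        encode(c, bits, labels);
    }
    bits.push(false); // close paren: leave the node
}
```

For a root 'a' with children 'b' and 'c' this yields the parenthesis string "(()())" as bits plus the label array ['a', 'b', 'c']; no pointers or node identities are stored at all, which is why the topology needs only 2n bits.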
neuroelectron · a year ago
He touches on indexes but doesn't really mention the implementation. This is about the primitives.
sujayakar · a year ago
I really love this space: Navarro's book is an excellent survey.

Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/