Posted by u/billziss 2 years ago
Show HN: Integer Map Data Structure (github.com/billziss-gh/im...)
This project presents a new data structure for storing ordered integer maps. The proposed data structure is a compressive, cache-friendly, radix tree that has performance comparable to an unordered map (`std::unordered_map`) and is an order of magnitude faster than an ordered map (`std::map`).
bts · 2 years ago
FWIW, there is prior art here; e.g. see IntMap in Haskell: https://hackage.haskell.org/package/containers-0.7/docs/Data...
ww520 · 2 years ago
It's similar but not quite. IntMap in Haskell uses a bit as the prefix unit, with a span factor of 2. This uses a nibble as the prefix unit, with a span factor of 16. Also, IntMap uses a bitmask for the range of prefix units in a node, while this uses a subnet-mask style to get the prefix of a node.
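To make the contrast concrete, here's a hedged sketch (illustrative only, not the project's actual code) of the branch index a nibble-based radix tree extracts at each level. Each level consumes 4 bits of a 64-bit key, so the tree is at most 16 levels deep, versus up to 64 levels for a bit-based trie with span factor 2:

```c
#include <stdint.h>

/* Illustrative sketch: extract the 4-bit "digit" (nibble) that a
 * nibble-based radix tree branches on at a given level.  Level 0 is
 * the most significant nibble.  A 64-bit key has 16 nibbles, so the
 * tree is at most 16 levels deep; a bit-based trie (span factor 2,
 * as in Haskell's IntMap) may need up to 64 levels for the same key. */
static unsigned nibble_at(uint64_t key, unsigned level)
{
    unsigned shift = (15u - level) * 4u;  /* 60, 56, ..., 4, 0 */
    return (unsigned)((key >> shift) & 0xF);
}
```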
kccqzy · 2 years ago
It hardly seems comparable given that this Haskell data structure must necessarily be persistent.
zogrodea · 2 years ago
I'm not sure what you mean by "must necessarily be persistent" here, but if you mean that only persistent implementations of this data structure are possible, that's not true.

The IntMap paper by Chris Okasaki acknowledges that the data structure is not a new one (it's a PATRICIA tree), but one that was around 30 years old at the time his paper was published and deserved to be better known. The functional/persistent implementation is what's novel about Okasaki's paper.

Edit: The data structure this submission is about looks like great work! Excited to see it and will probably attempt my own ports of it to other languages.

jemfinch · 2 years ago
This really doesn't seem to be comparing to comparable data structures. For int map specializations like this, the optimized alternatives are things like Judy (which is looking quite aged these days) or roaring bitmaps, not to mention that any C++ developer using "ordinary" maps will be using absl's SwissTable (flat_hash_map) or folly's F14 (F14FastMap) or perhaps absl::btree_map if order is important. Comparisons to std::map and std::unordered_map are simply too naive to make the case for this data structure.
ww520 · 2 years ago
Oh come on. Don't be too harsh. This is an ordered map. Most of the mentioned ones are unordered maps. They might be fast but they are unordered. The only comparables are absl::btree_map and Judy Array. Without benchmarking, my gut feeling is this will beat absl::btree_map. Trie usually beats BTree.
jemfinch · 2 years ago
I've written a lot of high performance/scale C++ code using a lot of data structures over the years, and ordered iteration has been very rarely needed; unordered data structures still rule the day in performance the vast majority of the time, and their lower constant factors very frequently outperform more specialized data structures. They're absolutely worth benchmarking against if the goal is actual uptake in the actual world.

In my experience, practically every single time I've used absl::btree_map, I've ended up reverting for performance reasons to either a flat hash map or, in some relatively rare cases, a sorted vector map (despite its O(n) insert/erase) because the constant factors are _so doggone low_. The experience remains: btree_map (or SkipLists, or whatever) has (in my experience) essentially never, in over a half million lines of C++, actually remained in the code.

Also, I presume (based on the implementation details, not based on actual use) that roaring bitmaps have some reasonable iteration API that would make them relevant even to the ordered comparison.

The important thing here is that if anyone wants to contend that someone should use their new data structure because it's better or faster or more optimal for some particular use case, it's important for them to demonstrate that they've thoroughly investigated the data structure space around their proposal. Comparison against std::map and std::unordered_map simply doesn't demonstrate the kind of comprehensive knowledge that I would expect from someone who claims that I should use their optimized integer map in my own code.

lichtenberger · 2 years ago
We're using a similar trie structure as the main document (node) index in SirixDB[1]. Lately I got some inspiration from ART and HAMT for using different page sizes, basically for the rightmost inner pages: as the node IDs are generated by a simple sequence generator, all inner pages (we call them IndirectPage) except the rightmost are fully occupied, and the tree height adapts dynamically to the size of the stored data. Currently 1024 references to indirect child pages are always stored, but I'll experiment with smaller sizes, as the inner pages are simply copied for each new revision, whereas the leaf pages storing the actual data are versioned themselves with a novel sliding-snapshot algorithm.

From the unique 64-bit nodeId assigned to each data record, you can simply compute the page and reference to traverse at each level of the trie through some bit shifting.

[1] https://github.com/sirixdb/sirix
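A hedged sketch of the arithmetic described above, assuming 1024 references per IndirectPage (i.e. 10 bits of the nodeId per level; this is my reading of the comment, not SirixDB's actual code):

```c
#include <stdint.h>

#define REFS_PER_PAGE  1024u   /* references per IndirectPage */
#define BITS_PER_LEVEL 10u     /* log2(1024) */

/* Illustrative only: which reference slot to follow at a given trie
 * level, counting level 0 as the level closest to the leaf pages.
 * Masking with REFS_PER_PAGE - 1 keeps the low 10 bits after the shift. */
static unsigned slot_at_level(uint64_t node_id, unsigned level)
{
    return (unsigned)((node_id >> (level * BITS_PER_LEVEL)) & (REFS_PER_PAGE - 1u));
}
```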

NWoodsman · 2 years ago
Also, to throw into the mix, in C#:

https://julesjacobs.com/2014/11/11/immutable-vectors-csharp....

His implementation uses buffers of capacity 32, generics, and bit shifting to do lookups.
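The linked post is C#, but the lookup arithmetic for a 32-way branching persistent vector is the same in any language; a hedged sketch in C (the function name is made up for illustration):

```c
#include <stdint.h>

/* Sketch of the lookup arithmetic for a persistent vector with
 * buffers of capacity 32: each tree level consumes 5 bits of the
 * element index, since 2^5 = 32. */
static unsigned child_slot(uint32_t index, unsigned level)
{
    return (index >> (5u * level)) & 31u;
}
```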

winrid · 2 years ago
Neat, thank you! I'd love to see how it compares to the libgdx IntMap[0].

[0] https://github.com/libgdx/libgdx/blob/master/gdx/src/com/bad...

repsilat · 2 years ago
This is a radix tree (ordered, does more allocations), whereas that is a hash table. Also, TFA is C/C++; libgdx looks like Java.
winrid · 2 years ago
yeah, just thought it'd be fun to compare :) ordered is a big difference.
AaronFriel · 2 years ago
Interesting! Reminds me a great deal of Judy Arrays: https://en.m.wikipedia.org/wiki/Judy_array

Judy Arrays are a radix trie with branching and a few node types designed to be optimized for cache-line width.

repsilat · 2 years ago
Looks rad. I was going to look into b-trees for a use case where I need an ordered map of integer-like keys, and this might be better.

I couldn't immediately see: is there any mention of whether insertions invalidate iterators? Maybe not strictly needed for my use case, but good to know.

ww520 · 2 years ago
This looks very good. The idea of using a subnet-mask style to compute the prefix of a node is pretty novel; I haven't seen anything like it. The choice of a span factor of 16 is a good compromise between node size and tree depth. The node slot packing is amazing. Actually, if you relax the 64-byte node restriction to 128 bytes, you get 64 bits per slot and a much higher limit on the item count. Newer CPUs are starting to support 128-byte cache lines.
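The subnet-mask idea can be illustrated with a hedged sketch (struct and field names are made up; this is not the project's actual code): a node stores a prefix plus a mask covering everything above its branch position, and a key belongs under the node iff masking it reproduces the prefix, exactly like testing whether an IP address falls in a subnet.

```c
#include <stdint.h>

/* Illustrative subnet-mask style prefix check for a radix-tree node. */
struct node {
    uint64_t prefix;  /* the bits all keys under this node share */
    uint64_t mask;    /* 1-bits cover the shared prefix portion  */
};

/* Returns 1 if key belongs in this node's subtree, 0 otherwise. */
static int key_in_node(const struct node *n, uint64_t key)
{
    return (key & n->mask) == n->prefix;
}
```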
mrazomor · 2 years ago
If I understood it correctly, it's not much different from what `absl::flat_hash_map` does.

Look for "Metadata information"/control bits at https://abseil.io/about/design/swisstables

ww520 · 2 years ago
I believe it's quite different. If I'm not mistaken, flat_hash_map is a typical hash table, with a hash function to map the key to a bucket. Its novel part is using the 7 lower bits of the hashed value as a bit mask against the control bits to check the presence of the item. It's like a mini Bloom filter, where a negative query means the item is definitely not in the bucket, while a positive query means the item might be in the bucket. A subsequent search through the bucket can confirm the existence of the item.

This int map is a trie with an integer-based key.
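To make the flat_hash_map comparison concrete, here's a hedged sketch of the control-byte check described above (a simplification for illustration, not Abseil's actual code):

```c
#include <stdint.h>

/* SwissTable-style hash split, simplified: H1 selects the bucket/group,
 * H2 is the low 7 bits of the hash kept in a per-slot control byte. */
static uint64_t h1(uint64_t hash) { return hash >> 7; }
static uint8_t  h2(uint64_t hash) { return (uint8_t)(hash & 0x7F); }

/* Comparing H2 against a slot's control byte gives a cheap
 * "definitely absent" / "maybe present" answer before any full key
 * comparison -- the mini-Bloom-filter behavior described above.
 * Returns 1 if the slot might hold the item, 0 if it definitely doesn't. */
static int ctrl_maybe_match(uint8_t ctrl_byte, uint64_t hash)
{
    return ctrl_byte == h2(hash);
}
```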