Readit News logoReadit News
mizmar commented on SIMD within a register: How I doubled hash table lookup performance   maltsev.space/blog/012-si... · Posted by u/axeluser
teo_zero · 6 months ago
> Later version allowed to scan from arbitrary position by mirroring first bucket as last.

I don't think this would help. The real issue with arbitrary position is that you can't load 16 bye to a 128-bit SIMD register if the memory is not aligned. The solution I found is to unroll the first iteration and mask out the results found before the initial offset.

mizmar · 6 months ago
It's one of the improvements they claimed in the 2019 presentation. https://youtu.be/JZE3_0qvrMg?feature=shared&t=1054 Reporting 10% speedup on find, but 15% slowdown on insert. The speedup probably comes from using 4 more bits of hash, which leads to fewer collisions. And slowdown from more complicated code for the mirroring.

I'm still confused on the SIMD alignment. There are load instructs with alignment requirements (_mm_load_si128) and without (_mm_loadu_si128). Both claim the same latency and throughput. Somewhere I've heard the slowdown of unaligned access comes from using more bandwidth to load two aligned 128-bit lines to compose the unaligned value. But no idea if this affects multiple loads of continuous memory.

mizmar commented on SIMD within a register: How I doubled hash table lookup performance   maltsev.space/blog/012-si... · Posted by u/axeluser
mizmar · 6 months ago
Something similar is used in swiss tables - metadata bucket entries are 1 bit occupancy marker and 7 MSB bits of hash (don't remember how tombstones are represented). Metadata table is scanned first, the upper part of hash should discard filter out most colliding entries and the "false positives" lead to probe in entry table and key comparison (possibly optimized with full-hash comparison).

Buckets use 16 bytes because sse2 and arm neon SIMD are basically guaranteed.

I was shocked to read on how swiss tables - proclaiming to be the fastest hash tables - work. It's just open-addressing linear probe with no technique to deal with collisions and clustering. Plus the initial version rounded hashes%capacity to bucket size, thus using 4 less bits of hash, leading to even more collisions and clustering. Yet the super fast probe with apparently made it not an issue? Mind-boggling. (Later version allowed to scan from arbitrary position by mirroring first bucket as last.)

u/mizmar

KarmaCake day10July 4, 2025View Original