Vectorized Bloom Filters for Advanced SIMD Processors (2014) [pdf]

This is a nice piece of work and still relevant. The source is available:

http://www.cs.columbia.edu/~orestis/vbf.c

It has very high throughput for L2-resident filters, as long as most queries return "false" and you can use a bulk API. It was, IIRC, about 4x faster than a hand-made "horizontal" SIMD Bloom filter, and 20x faster than cuckoo filters.

By "horizontal" SIMD, I am using the language of a follow-up paper by the same team at Columbia, "Rethinking SIMD Vectorization for In-Memory Databases", http://www.cs.columbia.edu/~orestis/sigmod15.pdf, http://www.cs.columbia.edu/~orestis/sigmod15source.zip. In that paper, they call "vertical" SIMD for hash-based containers "process[ing] a different input key per vector lane". "Horizontal" SIMD is putting the same key in each lane.

I suspect the results in this paper could be improved with more modern gather techniques on newer x86-64 processors.

psi-squared · 8 years ago

> I suspect the results in this paper could be improved with more modern gather techniques on newer x86-64 processors.

By that, do you mean the AVX2 'gather' type instructions? If not, I'd be interested to know what those techniques are.

As for AVX2 gathers, I had to look this up recently and it sounds like they're about as fast as manually unpacking the vector and performing scalar loads. That is to say, they're decidedly not fast. On the other hand, it sounds like (as of Skylake) they're bottlenecked on accesses to the L1 cache, so they're about as fast as they reasonably could be.

Source: https://stackoverflow.com/questions/21774454/how-are-the-gat...

Not sure about performance on Zen, but I would imagine it's similar?

jbapple · 8 years ago

> do you mean the AVX2 'gather' type instructions?

Yes, and AVX512, with two caveats:

1. This is processor-dependent. Different processors have different ratios of instruction latency.

2. Sometimes a related instruction is faster than the instruction specifically for that purpose. For instance, on the glibc on my machine, memcmp and strncmp are NOT implemented using the sse4.2 instructions for string comparison, but instead use ptest and pcmpeq, respectively, because it is faster to do so. The same phenomenon could be true of gathers as well.