Can't we use this again? I mean:
1. Partition the data so that some batches have up to 1 missing element.
2. Recover the elements where possible with the XOR trick.
3. Pick another hash function, then repeat finding more missing elements.
4. Repeat until no more missing elements.
1. Find a batch with 1 missing element
2. Delete that element from its other assigned partitions
3. Repeat, as the modified batches may now be recoverable
This iterative process (surprisingly!) succeeds with very high probability as long as the number of partitions is at least about 1.22x the number of missing elements, with k=3 hash functions.
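A minimal Python sketch of that peeling loop, assuming the simpler "missing elements" setting (the received list is a subset of the original); bucket_indices and find_missing are illustrative names, not from any particular library:

    import hashlib

    def bucket_indices(x, k, m):
        # Illustrative helper: derive up to k distinct partition indices for x
        # from a strong hash (duplicates are dropped so XOR toggling stays clean).
        digest = hashlib.sha256(str(x).encode()).digest()
        return {int.from_bytes(digest[4*i:4*i+4], "big") % m for i in range(k)}

    def find_missing(original, received, k=3, m=16):
        # m should be at least ~1.22x the number of missing elements for k=3.
        count = [0] * m          # unmatched elements per partition
        acc = [0] * m            # XOR of unmatched elements per partition

        def toggle(x, sign):
            for i in bucket_indices(x, k, m):
                count[i] += sign
                acc[i] ^= x

        for x in original:
            toggle(x, +1)
        for x in received:
            toggle(x, -1)        # received elements cancel their original copies

        missing, progress = [], True
        while progress:
            progress = False
            for i in range(m):
                if count[i] == 1:        # partition with exactly 1 missing element
                    x = acc[i]
                    missing.append(x)
                    toggle(x, -1)        # delete it from its other partitions too
                    progress = True
        return missing

    print(find_missing([1, 2, 3, 4, 5, 6, 7, 8], [2, 3, 5, 7]))  # -> 1, 4, 6, 8 in some order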
The "bloom filter" name is misleading in regard to this.
The problem is that in vector sets, the HNSW graph has the invariant that each node has bidirectional links to a set of N nodes. If A links to B, then B links to A. This is unlike most other HNSW implementations. In mine, links are required to be reciprocal; otherwise you get a crash.
Now, combine this with another fact: for speed reasons, Redis vector sets are not serialized as
element -> vector
And then reloaded and added back to the HNSW: that would be slow. Instead, what I do is serialize the graph itself, each node with its unique ID and all its links. But when I load the graph back, I must be sure it is "sane" and will not crash my systems, and reciprocal links are one of the things to check.

Checking that all the links are reciprocal could be done with a hash table (as in the post's problem), but that would be slower and memory consuming, so how do we use XOR instead? Each time I see a link A -> B, I normalize it by swapping A and B in case A > B. So if links are reciprocal, I'll see A -> B two times; if I use a register to accumulate them with XOR, at the end, if the register is NOT null, I have issues: some link may not be reciprocal.

However, in this specific case, there is a problem: collisions. The register may be 0 even if there are non-reciprocal links, in case they are "fancy", that is, the non-reciprocal links are few and they happen to XOR to 0. So, to fix this part, I use a strong (and large) hash function that makes a collision extremely unlikely.
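In code, the check looks roughly like this (a Python sketch; the link representation and strong_hash are illustrative, not the actual vector sets implementation):

    import hashlib

    def strong_hash(a, b):
        # Stand-in for the "strong (and large) hash function": map the normalized
        # pair to a wide integer so a few non-reciprocal links are astronomically
        # unlikely to XOR to zero by accident.
        digest = hashlib.sha256(f"{a}:{b}".encode()).digest()
        return int.from_bytes(digest, "big")

    def links_look_reciprocal(links):
        # links: iterable of (a, b) pairs, one entry per directed link in the graph.
        acc = 0
        for a, b in links:
            if a > b:                      # normalize so A->B and B->A hash the same
                a, b = b, a
            acc ^= strong_hash(a, b)       # reciprocal pairs cancel out
        return acc == 0                    # non-zero register => some link is not reciprocal

    # A reciprocal pair passes, a dangling link does not:
    print(links_look_reciprocal([(1, 2), (2, 1)]))          # True
    print(links_look_reciprocal([(1, 2), (2, 1), (3, 4)]))  # False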
It is nice to see this post now, since I was not aware of this algorithm when I used it a few weeks ago. Sure, at this point I'm old enough to never pretend I invented something, so I was sure this had already been used in the past; but in case it was not used for reciprocal-link testing before, this is a new interview question you may want to use for advanced candidates.
For every normalized link id x:
y = (x << k) | h(x) # append a k-bit hash to the id
acc ^= y
If acc is zero, all links are reciprocal (same guarantee as before). If acc is non-zero, split it back into (x', h'):
* Re-compute h(x').
* If it equals h', exactly one link is unpaired and x' tells you which one (or an astronomically unlikely collision). Otherwise there are >= 2 problems.
This has collision-resistance like the parent comment and adds the ability to pinpoint a single offending link without a second pass or a hash table.
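A small Python sketch of this variant, with K, h and check_links as illustrative names (and link IDs assumed to fit in 64 bits):

    import hashlib

    K = 32                                   # width of the appended hash, in bits
    MASK = (1 << K) - 1

    def h(x):
        # Illustrative K-bit hash of a normalized link id (assumed < 2**64).
        d = hashlib.blake2b(x.to_bytes(8, "big"), digest_size=4).digest()
        return int.from_bytes(d, "big") & MASK

    def check_links(normalized_ids):
        acc = 0
        for x in normalized_ids:
            acc ^= (x << K) | h(x)           # append a K-bit hash to the id
        if acc == 0:
            return "all links reciprocal"
        x_p, h_p = acc >> K, acc & MASK      # split the register back into (x', h')
        if h(x_p) == h_p:
            # One unpaired link (or an astronomically unlikely collision).
            return f"exactly one unpaired link: {x_p}"
        return "two or more problems"

    print(check_links([5, 9, 5, 9]))      # all links reciprocal
    print(check_links([5, 9, 5, 9, 7]))   # exactly one unpaired link: 7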
To be pedantic, not guaranteed. The XOR of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode. You can make the probability of this as low as you like by using a larger checksum, but the initial HN example was IIRC a list of 32-bit integers, so using a (say) 128-bit checksum to make false decodes 'cryptographically unlikely' would come at a rather high cost, since that size is added to each bucket.
If your set members are multi-kilobyte contact database entries or something, then the overhead required to make false decodes effectively impossible would be insignificant.
This limitation also applies somewhat to the alternative algebraic approach in my comments (an overfull sketch could be falsely decoded), except that the added size needed to make a false decode cryptographically unlikely is very small, and it goes down relative to the size of the sketch as the sketch grows instead of being linear in the sketch's size.
I haven't looked at your implementation, but it can be useful to have at least a 1-bit counter, or just to make the LSB of your checksum always 1. Doing so prevents falsely decoding an overfull bucket with an even number of members in it, and since the distribution of members to buckets is binomial, 2 is an extremely common count for overfull buckets. You can use a counter bigger than 1 bit (and combine it with addition in its ring rather than XOR), but the tradeoff vs. just having more checksum bits is less obvious.
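For example, forcing the checksum's low bit to 1 is a one-line change (a Python sketch; the hash choice is just a placeholder):

    import hashlib

    def checksum(x):
        # Per-member checksum whose least significant bit is always 1: XORing an
        # even number of members leaves the LSB at 0, so an overfull bucket
        # holding 2 (or any even number of) members can never pass as a valid
        # single-member decode.
        raw = int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=4).digest(), "big")
        return (raw << 1) | 1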
It's probably an interesting open question whether there exist checksums such that the XOR of 2..N valid codewords is unlikely to be a valid codeword... the "always emit 1" function has perfect performance for even values, but are there schemes that still contribute distance even in cases where the N isn't completely precluded?
The false decodes can be detected. During peeling, deleting a false decode effectively inserts a new element with the opposite count sign. Later, you decode this second false element and end up with the same element in both the A / B and B / A result sets (as long as the decode completes without encountering a cycle).
So, after decode, check for any elements present in both A / B and B / A result sets and remove them.
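As a sketch, that post-decode cleanup is just a set intersection (only_in_a and only_in_b stand for whatever the peeling produced for A / B and B / A):

    def drop_false_decodes(only_in_a, only_in_b):
        # An element peeled as a false decode re-enters the sketch with the
        # opposite count sign and is later peeled again on the other side, so it
        # shows up in both result sets; genuine differences appear in only one.
        both = set(only_in_a) & set(only_in_b)
        return set(only_in_a) - both, set(only_in_b) - both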
--
Beyond that, you can also use the cell position for additional checksum bits in the decode process, without increasing the data structure's bit size: if we attempt to decode an element x from a cell at position m, then one of the h_i(x) hash functions for computing indices should return m.
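A sketch of that extra check (index_hashes is an illustrative stand-in for the h_i functions used to place elements):

    def position_is_consistent(x, m, index_hashes, num_cells):
        # A genuine element can only have been added to cells at its own h_i(x)
        # positions, so a candidate decoded from cell m passes this for free,
        # while a random false decode fails with probability about 1 - k/num_cells.
        return m in {h(x) % num_cells for h in index_hashes}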
There's even a paper about a variant of IBFs that has no checksum field at all: https://arxiv.org/abs/2211.03683. It uses the cell position among other techniques.