Something similar is used in swiss tables - each metadata byte is a 1-bit occupancy marker plus the top 7 bits of the hash (tombstones are, I believe, a reserved control-byte value with the high bit set - 0b1111_1110 in Abseil).
The metadata table is scanned first; the 7-bit hash fragment should filter out most colliding entries, and the "false positives" lead to a probe into the entry table and a key comparison (possibly optimized with a full-hash comparison first).
Buckets use 16 bytes because SSE2 and ARM NEON SIMD are basically guaranteed.
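For illustration, a minimal sketch of that metadata scan with SSE2 intrinsics (the constants follow Abseil's control-byte encoding as I remember it; group_match is a made-up name):

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* One control byte per slot: high bit set = empty/tombstone,
       high bit clear = occupied, low 7 bits = hash fragment (h2). */
    #define CTRL_EMPTY   ((int8_t)0x80)  /* 0b1000'0000 */
    #define CTRL_DELETED ((int8_t)0xFE)  /* 0b1111'1110 (tombstone) */

    /* 16-bit mask of slots in a 16-byte group whose control byte
       equals the 7-bit hash fragment h2; each set bit is a candidate
       that still needs a key comparison. */
    static inline uint16_t group_match(const int8_t *ctrl, uint8_t h2) {
        __m128i group = _mm_loadu_si128((const __m128i *)ctrl);
        __m128i match = _mm_cmpeq_epi8(group, _mm_set1_epi8((int8_t)h2));
        return (uint16_t)_mm_movemask_epi8(match);
    }

With 7 hash bits per slot, a random colliding entry matches with probability 1/128, so almost all mismatches are rejected without ever touching the entry table.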
I was shocked to read how swiss tables - proclaimed to be the fastest hash tables - work. It's just open-addressing linear probing with no special technique to deal with collisions and clustering. Plus the initial version rounded hash % capacity down to the group size, discarding 4 more bits of hash and leading to even more collisions and clustering. Yet the super-fast SIMD probe apparently made it a non-issue? Mind-boggling. (A later version allowed scanning from an arbitrary position by mirroring the first group's metadata after the last.)
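If I understood that right, the difference is just in how the probe start is derived from the hash. A hypothetical sketch, assuming power-of-two capacity and 16-slot groups (both function names are made up):

    #include <stdint.h>
    #include <stddef.h>

    /* Initial version: round the start down to a group boundary,
       throwing away the low 4 bits of (hash & (capacity - 1)). */
    size_t probe_start_old(uint64_t hash, size_t capacity) {
        return (hash & (capacity - 1)) & ~(size_t)15;
    }

    /* Later version: start at any slot; the first group's metadata is
       mirrored past the end of the table, so a 16-byte load starting
       at an arbitrary slot still reads valid control bytes when it
       wraps around. */
    size_t probe_start_new(uint64_t hash, size_t capacity) {
        return hash & (capacity - 1);
    }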
I don't think this would help. The real issue with an arbitrary start position is that you can't load 16 bytes into a 128-bit SIMD register with an aligned load if the memory is not 16-byte aligned. The solution I found is to unroll the first iteration: do an aligned load of the group containing the start position and mask out the match bits before the initial offset.
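Roughly like this, assuming the control array itself is 16-byte aligned (first_group_match is a made-up name):

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* First (unrolled) iteration: align the start position down to a
       group boundary so the load is aligned, then clear the match
       bits for slots before the requested position. */
    static inline uint16_t first_group_match(const int8_t *ctrl,
                                             size_t pos, uint8_t h2) {
        size_t base = pos & ~(size_t)15;  /* align down to a group */
        __m128i group = _mm_load_si128((const __m128i *)(ctrl + base));
        uint16_t mask = (uint16_t)_mm_movemask_epi8(
            _mm_cmpeq_epi8(group, _mm_set1_epi8((int8_t)h2)));
        return mask & (uint16_t)(0xFFFFu << (pos - base));
    }

Every later iteration then advances in whole groups from base + 16, so all subsequent loads stay aligned.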
I'm still confused about SIMD alignment. There are load instructions with alignment requirements (_mm_load_si128) and without (_mm_loadu_si128), and both claim the same latency and throughput. Somewhere I've heard that the slowdown of unaligned access comes from using more bandwidth to read two aligned cache lines and compose the unaligned value, but I have no idea whether this affects multiple loads of contiguous memory.
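As far as I know, on recent x86 cores an unaligned load is only penalized when it actually crosses a cache-line (or page) boundary; within a single 64-byte line, _mm_loadu_si128 costs the same as the aligned load. For a 16-byte load, the condition is easy to check (splits_cache_line is a made-up name, assuming 64-byte lines):

    #include <stdint.h>

    /* A 16-byte load starting at p touches two cache lines iff its
       offset within a 64-byte line is greater than 48. */
    int splits_cache_line(const void *p) {
        return ((uintptr_t)p & 63) > 48;
    }

So for a stream of 16-byte loads stepping through contiguous memory from a misaligned base, exactly one load in four crosses a line boundary; the other three behave like aligned loads.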