anonova · 4 years ago
A detailed analysis of Meow Hash: https://peter.website/meow-hash-cryptanalysis

It's not the highest of quality hash functions (see the SMHasher benchmarks), but it is fast. A great alternative is XXH3 (https://cyan4973.github.io/xxHash/), which has seen far more usage in practice.

thomasahle · 4 years ago
XXH3 has issues too. There are certain inputs, independent of the seed, on which it collides much more often than it should.

On the other hand a hash like UMash guarantees low collisions on any input: https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...

jdcarter · 4 years ago
I'm using XXHash3 for verifying data integrity of large (10+MB) blobs. Very fast and appears to work very well in my testing--it's never missed any bit error I've thrown at it.

Aside: when storing hashes, be sure to store the hash type as well so that you can change it later if needed, e.g. "xxh3-[hash value]". RFC-6920 also has things to say about storing hash types and values, although I haven't seen its format in common use.
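
A minimal sketch of what that tagging could look like (hypothetical Python; assumes the third-party xxhash package, and the helper names are purely illustrative):

    import xxhash

    def make_tagged_digest(data: bytes) -> str:
        # Prefix the algorithm name so the scheme can be changed later.
        return "xxh3-" + xxhash.xxh3_128(data).hexdigest()

    def verify_tagged_digest(data: bytes, tagged: str) -> bool:
        algo, _, expected = tagged.partition("-")
        if algo == "xxh3":
            return xxhash.xxh3_128(data).hexdigest() == expected
        raise ValueError("unknown hash type: " + algo)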

njt · 4 years ago
> be sure to store the hash type as well so that you can change it later if needed

Thanks for sharing this. I'd already been doing this for my own stuff (e.g. foo.txt-xxh32-ea79e094), but it's good to know someone else has thought it through.

I once ran into files someone had named foo-fb490c or similar without any annotation, and when something went wrong it took a while to figure out they were using truncated sha256 hashes.

DeathArrow · 4 years ago
Useless analysis, since the author says it's not a cryptographic hash but useful as a fast hash for change detection.

"we wanted a fast, non-cryptographic hash for use in change detection and deduplication"

>A great alternative is XXH3

Meow Hash is twice as fast.

MauranKilom · 4 years ago
If you had made it one section into the analysis, you would have seen that at the time MeowHash made certain cryptographic claims that the author set out to disprove.

The readme has since been updated. I didn't check whether any algorithmic changes were made on top, but the discussion of the analysis on github didn't point to a lot of low-hanging fruit.

LeoPanthera · 4 years ago
It's not useless analysis, because even for non-cryptographic hashes you want the likelihood of any arbitrary hash to be roughly equal. A hash function which "prefers" certain outputs has a far higher probability of collision.
IncRnd · 4 years ago
Don't you think asset planting is an attack against a game's pipeline?

The page the article analyzes claims the hash is not cryptographic, yet goes on to make security claims about it. People who do not understand cryptography should be careful about making such claims. The author appears to understand this better than your comment demonstrates.

For example, a claim about change detection is a cryptographic claim of detecting preimage attacks. In a threat model, a security professional would determine whether a first preimage or a second preimage attack is what should be guarded in attack scenarios. Then, the professional would help with analysis, determining mitigations, defense in depth, and prioritization of fixing the vulnerabilities exposed by how the hash is used.

A hash cannot be considered standalone. It is the architecture and use-case where the hash's security properties are used to determine what security properties of the application are fulfilled.

So, if the author is correct, which seems to be the case, then meowhash should not be used in a production environment outside of the most simplistic checks. It seems faster for its intended use case to simply check for a single bit difference between two images - no hash required.

jrochkind1 · 4 years ago
What determines whether a hash is "cryptographic"? What would make it suitable for change-detection but not be "cryptographic"? Is the claim here that it would not be suitable for detecting "malicious" changes, but is still suitable for detecting "natural" changes?
ncann · 4 years ago
Saying it's twice as fast is rather misleading? They can both hash as fast as RAM speed allows anyway. And if it's something in cache I doubt one is significantly better than the other.
iratewizard · 4 years ago
Meow hash is also written by people who initially thought SHA-1 was acceptable for large scale change hashing.
ilitirit · 4 years ago
I can vouch for xxHash (it does what it says on the can), but I'm really curious to hear from those who have experience with Meow Hash.
arduinomancer · 4 years ago
A collision was reported a couple months ago

https://github.com/cmuratori/meow_hash/issues/80

Accacin · 4 years ago
The back and forth in that thread is incredibly interesting too.
cb321 · 4 years ago
I have measured (on a Skylake core) the hash from Daniel Lemire, Owen Kaser, "Faster 64-bit universal hashing using carry-less multiplications, Journal of Cryptographic Engineering (to appear)" at t_cycles = (0.00741 +- 0.00098 cycles/byte) * bytes + (37.43 +- 0.48) which is roughly 135 B/cycle (675 GB/s at 5 GHz) or over 2x faster than the claim here. Admittedly, that 37 cycle fixed cost is probably quite a bit higher, and I don't know how well the bit mixing compares.

Parenthetically, a time equation (like t_cycles = coef * bytes + fixedCost) is a more convenient way to summarize performance of this kind of thing than the constant short string/long string mumble-mumble. Yes, loop unrolling/etc. can create step-ish functions and a linear regression is pretty approximate (but my stderrs are pretty small above), but even so..much less tripping over words.
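
For illustration, a rough sketch of fitting such a time equation (hypothetical Python, using numpy and the xxhash package as stand-ins and nanoseconds rather than cycles; not the harness used for the numbers above):

    import time
    import numpy as np
    import xxhash

    def min_time_ns(data, reps=2000):
        # Minimum of many runs: the best-case, warmed-up path.
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter_ns()
            xxhash.xxh3_64(data).intdigest()
            best = min(best, time.perf_counter_ns() - t0)
        return best

    sizes = [16, 32, 64, 128, 256, 512, 1024]   # short strings that stay in L1
    times = [min_time_ns(b"x" * n) for n in sizes]
    coef, fixed = np.polyfit(sizes, times, 1)   # least-squares fit of t = coef*bytes + fixedCost
    print("t_ns ~= %.4f * bytes + %.1f" % (coef, fixed))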

pkhuong · 4 years ago
umash (https://github.com/backtrace-labs/umash) has a similar PH block structure (and similar universality guarantees), but was designed for decent bit mixing (enough to satisfy smhasher, unlike CLHASH, which needs an additional finalizer) with a lower fixed time cost: 22 cycles for a one-byte hash on a 2.5 GHz Xeon 8175M.

I'm not sure how one would use that linear regression. What kind of hardware offers 675 GB/s of memory bandwidth? 140 bytes/cycle is easily more than twice the L2 read bandwidth offered by any COTS chip I'm aware of. There are also warm up effects past the fixed cost of setup and finalizers that slow down hashing for short input. For what range of input sizes (and hot/cold cache state) would you say the regression is a useful model?

cb321 · 4 years ago
I was only doing short < 1024 byte strings in L1 (like URI lengths-ish). I'm well aware no memory system on typical HW can deliver that (EDIT: "at large space scale"). The article itself mentions a similar calculation (but at their slower scale) which is why I included it.

And yes, warm up absolutely matters. My numbers are for the best case path which is among the only easy things to even define, IMO. Worst and average are sooooo much harder/assumption-riddled. Then you argue a lot about said assumptions.

I do the minimum times of many runs in an inner loop and the regression on those min-time results. So, the min-loops will warm up everything like branch predictors, etc. I'd suspect the regression model would hold up to usable L1 cache (e.g. 30 KiB or so). Beyond that, it's mostly just measuring the memory system not the hash function..and I'm not sure what the point of that is. Separation of concerns would suggest L1 perf is most relevant to the "hash calculation part". (IMO, many benchmarks confuse this.)

And, yeah..modern memory systems suck. Simple back of the envelope would suggest some DIMM-oriented workload would be waiting on the DIMMs like 95% of the time, usually much higher on Xeons with their low single core BW (maybe 98%!). The Meow article sort of alludes to that as well with less detail. Single core Xeon BW is always a huge disappointment. Forces you to go (very) parallel just to saturate DIMMs (EDIT: which in turn cannot saturate even a single L1!).

BTW, it sounds like you know all of this (and maybe more) and I don't mean to suggest otherwise, but I thought some numbers/details might assist other passersby.

{ EDIT: Also, the stderr on that slope is ~13%. So, it's possible 135 is really 128 (1.05x lower), the 2 cache line/cycle load limit or something like that. This doesn't mean the 640 GB/s rate number is wrong - it just underscores how mismatched memory systems are at their various scales. }

{ EDIT: and yes, 128/16 = 8x faster than meow, not 2x. Oops. }

pbsd · 4 years ago
1/135 cycles per byte on Skylake is just plain impossible, even if the hash consisted of simply one xor per 32 bytes of input.

The lower bound for CLHASH would be the cost of one carryless multiplication per 16 bytes of input, or in other words 1/16 ~ 0.0625 cycles per byte, since you can only issue one such multiplication per cycle.

cb321 · 4 years ago
Feel free to measure it yourself instead of just speculating. (And as mentioned elsewhere, it is probably 1/128.) { EDIT: and I never measured meow - it could be faster than 1/16 cycle/byte but poorly assessed; just be sure to use a min loop and small data. }
kazinator · 4 years ago
> Because we have to process hundreds of gigabytes of art assets to build game packages, we wanted a fast, non-cryptographic hash for use in change detection and deduplication. We had been using a cryptographic hash (SHA-1), but it was unnecessarily slowing things down.

That's because you were doing a potentially silly thing: hashing every byte.

What does a hash tell you? If two hashes are different, it confirms that the objects are different. If two hashes are the same, they could be the same object or they could be different.

For that purpose, you don't have to hash everything: just hash the areas where objects are likely to differ.

If two 15 gigabyte files are likely to have differences in their first kilobyte, then hashing just first kilobyte (negligibly fast) will work just as well as hashing 15 gigabytes.

If two files are of different length, then they are not the same. How about:

hash(<length> + <first kilobyte> + <last kilobyte>)
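
For illustration, a rough Python sketch of that idea (hashlib.blake2b just stands in for whatever fast hash you actually prefer):

    import hashlib
    import os

    def quick_fingerprint(path, chunk=1024):
        size = os.path.getsize(path)
        h = hashlib.blake2b(str(size).encode())    # fold in the file length
        with open(path, "rb") as f:
            h.update(f.read(chunk))                # first kilobyte
            if size > chunk:
                f.seek(max(size - chunk, chunk))   # avoid re-reading the start
                h.update(f.read(chunk))            # last kilobyte
        return h.hexdigest()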

For deduplication, a 32 bit CRC could probably be reasonably efficient. No matter what hash you use, you have to verify the collisions. If CRC32 is enough to make false positives rare over your type of data set, it's good enough.

andruc · 4 years ago
"just hash the areas where objects are likely to differ"

Correct me if I'm wrong but given these are art assets, wouldn't "determining where the objects are likely to differ" itself be an expensive/hard problem?

daemin · 4 years ago
Once you need to load any parts of a file into memory you might as well compute the hash for all the bytes that you've loaded into memory. The act of loading the bytes from permanent storage (even NVME SSD) is likely to be many orders of magnitude slower than actually computing the hash. So it makes no sense to just "hash parts of the file" when in all likelihood the entire file will be getting loaded into memory even if you actively read only a few bytes from each 4k block.
cb321 · 4 years ago
It really depends. The early bytes probably contain things like image dimensions which might discriminate a lot - or not at all. They might contain palette-style color information which might discriminate tons - or not at all. Most things with hash functions depend upon distributions of inputs relative to the function, though.

I think it would be accurate to say "Determining where they are likely to differ is not necessarily hard, but it could be." If it happens to be easy then you can radically change the scale of the problem.

File length alone is a weak hash actively maintained by any file system and so available "for free" (or one stat call). And it can work surprisingly well, as the grandparent alludes to.

EDIT: You actually do not need to even open(2) files that are not part of "same size clusters", never mind hash their data. This alone can really squash the problem size over a more naive approach even if you did feel the need to hash the whole file.

EDIT: And in terms of performance, if this data has to actually be read off a drive then the hash need not go any faster than the drive IO which is probably much, much slower than most candidate hash functions. Even PCIe5.0 NVMes are only like 14 GB/s, SATA buses like 0.75 GB/s, etc.
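
A minimal sketch of that size-clustering idea (hypothetical Python; assumes the third-party xxhash package): only files that share a size ever get opened and hashed.

    import os
    import xxhash
    from collections import defaultdict

    def find_dup_candidates(paths):
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)  # the "free" weak hash: file length
        groups = defaultdict(list)
        for size, cluster in by_size.items():
            if len(cluster) < 2:
                continue                           # a unique size cannot have a duplicate
            for p in cluster:
                with open(p, "rb") as f:
                    groups[(size, xxhash.xxh3_128(f.read()).hexdigest())].append(p)
        return [g for g in groups.values() if len(g) > 1]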

pepoluan · 4 years ago
CRC32 is actually very slow in modern CPUs. So if you can use an algorithm that is both (1) faster and (2) better quality, why not use that algo instead?
midnightclubbed · 4 years ago
I wouldn't categorize it as 'very slow'; at worst it's 50% slower than the 128-bit XXH3 hash (and faster than the 32-bit XXH hash).

Modern intel processors have the CRC32C instructions and ARM have both CRC32 and CRC32C. There is a reluctance in some of the hash comparisons to use the hardware CRC implementations for compatibility reasons, but most modern CPUs have them.

The main argument for using a hardware CRC32 (and to some extent a software one too) is simplicity. If your use-case needs decent (but not blinding) performance, then there is something to be said for a hash implementation that can be written in ~10 lines of code. You still have to handle the possibility of collisions but any 32bit hash has that restriction, which is fine for many uses.
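
For illustration, a bitwise CRC-32 really does fit in roughly that many lines (a Python sketch, written for clarity rather than speed; the hardware CRC32C instruction computes a different polynomial):

    def crc32(data, crc=0xFFFFFFFF):
        # Reflected CRC-32, polynomial 0xEDB88320 (same result as zlib.crc32).
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0xEDB88320 * (crc & 1))
        return crc ^ 0xFFFFFFFF

    assert crc32(b"123456789") == 0xCBF43926       # the standard CRC-32 check value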

kazinator · 4 years ago
> CRC32 is actually very slow in modern CPUs

Did you read my whole comment? If we are hashing just a few kilobytes of a multi-gigabyte file, it almost certainly doesn't matter.

cb321 · 4 years ago
This is roughly how https://github.com/c-blake/cligen/blob/master/examples/dups.... can work. There is little question that changing the scale of the problem very much lessens performance sensitivity to the hash.
Cilvic · 4 years ago
I know +1 is against the rules, but I think this deserves a positive comment. I enjoyed reading it, thanks for taking the time and giving back to the ecosystem.
wolf550e · 4 years ago
How does it compare to XXH3?

https://github.com/Cyan4973/xxHash

https://github.com/rurban/smhasher#readme says XXH3 is 16GB/s while meow is 17GB/s.

caeril · 4 years ago
> How does it compare to XXH3?

XXH3 is a slightly better hash function; Meow is slightly faster.

Take your pick. Life is about tradeoffs.

pepoluan · 4 years ago
Well, the page for MeowHash was written in 2018.

The xxHash performance comparison page was last updated in 2021, and on that page XXH3 is faster than MeowHash:

https://github.com/Cyan4973/xxHash/wiki/Performance-comparis...

NelsonMinar · 4 years ago
I wish there was a consensus non-cryptographic hash algorithm. Something that hashes arbitrary data to 128 or 256 bit keys with good randomness, few collisions on typical input data, and universal implementation.

Most programmers I know reach for SHA-1 (or if we're being honest, MD5). But a cryptohash is really wasteful if you don't need the cryptographic properties.

Last I looked at this Murmurhash was the closest to widely used with some folks saying Metrohash was better. But that was years ago. Judging by this discussion xxHash is the new hotness?

chronogram · 4 years ago
You want to have XXH128 for that right? 128 bit, portable, virtually impossible to collide, and only slightly slower than XXH3 while still way faster than most options.
pepoluan · 4 years ago
xxHash is not really "new", it's been available since 2014. At least, that's the oldest Python bindings available on PyPI:

https://pypi.org/project/xxhash/0.0.1/

Sure, as it developed, xxHash variants appeared. The latest in the family is XXH3, and it has been in use since at least 2019:

http://fastcompression.blogspot.com/2019/03/presenting-xxh3....

tromp · 4 years ago
> I wish there was a consensus non-cryptographic hash algorithm

I think Siphash serves that role pretty well. It's one of the most if not the most secure among non-cryptographic hashes, and quite speedy, especially with SIMD implementations. One need only consider alternatives if one wants to trade off a bit of security for even higher speed.

NelsonMinar · 4 years ago
SipHash is more like a crypto hash, right? According to smhasher, its performance is somewhere between a non-crypto hash like xxHash or Meow and something like SHA-1.
bmn__ · 4 years ago
> good randomness, few collisions on typical input data, and universal implementation

The relevant standards body, NIST, issued SHA-3 in 2015.

Spread the word that programmers should not reach for outdated hash algos. <https://valerieaurora.org/hash.html>

tptacek · 4 years ago
The chart on that page has always squicked me out a bit, because SHA-2 is certainly still "considered strong" by cryptography engineers.
throwaway984393 · 4 years ago
Which do you want? Something non-wasteful, or something good with consensus? Something that everybody agrees on isn't necessarily fast or new or hot.
pspeter3 · 4 years ago
I’m curious too about the differences between Murmurhash and xxHash. I’ve been using Murmur in my side projects.
pkhuong · 4 years ago
The production of seed-independent collisions for various versions of murmurhash (e.g., http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash...) motivated siphash. In general, when there's no positive proof of collision bounds, I would assume (un)lucky inputs can collide much more often than usual, even when you change the seed.

The difference between murmurhash and xxhash in that regard is that anyone can download code to generate murmurhash collisions, while code to make xxhash cry is still unknown/secret.

XXH3 is also a lot faster for long strings, and I believe comparable to murmurhash for short inputs.

buildbot · 4 years ago
This could be super useful for deduplicating large photo datasets!

It might be worth changing this link to the GitHub repo; there's a lot more info on the hash there.

bityard · 4 years ago
> This could be super useful for deduplicating large photo datasets!

That is indeed the use case mentioned in the article. A hash like this is useful for deduplicating large _file_ datasets but for deduplicating images in particular, you really want a perceptual hash. Because two images files can _look_ identical but have slightly (or wildly) different byte streams. The trade-off is that perceptual hashes are notorious for false positives in large datasets.
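
For illustration, a minimal average-hash ("aHash") sketch of the perceptual-hash idea (hypothetical Python; assumes the Pillow package, though real projects usually reach for a library like imagehash):

    from PIL import Image

    def average_hash(path, hash_size=8):
        # Shrink to a tiny grayscale image, then take one bit per pixel:
        # 1 if the pixel is brighter than the mean, 0 otherwise.
        img = Image.open(path).convert("L").resize((hash_size, hash_size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p > mean)
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")   # small distance means "looks similar"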

kayamon · 4 years ago
I once saw a guy dedupe a folder of images by batch-converting them to low-quality JPEG thumbnails and then sorting by size.
buildbot · 4 years ago
Yep, just happy someone was looking into it, I currently use xxHash for this.

Perceptual hashing for photos isn't great in my opinion, for example I have many raw photos that look very similar, the only difference may be a slightly higher shutter speed, or a correction of the camera angle. I'm guessing many perceptual hashes would mark these as duplicate. Maybe that's what many want, it's not what I want personally.

NKosmatos · 4 years ago
Came here to say the same thing :-) Anyone know of free(ish) software based on a very fast hashing algorithm for checking a large number of images? My intention is to check for bit-rot and other unforeseen data corruption between backup disks; I assume many others must have a similar use case.
radicality · 4 years ago
I haven't used it extensively, but this is free and you can pick from a number of hash functions - https://github.com/qarmin/czkawka