Content-addressable storage is always neat. Does anyone know if using truncated md5 like this is somehow more robust than using a non-crypto hash like SipHash, which already produces 64-bit hashes?
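For concreteness, here's a minimal sketch of what truncated-md5 naming looks like, assuming the tool keeps the first 16 hex characters (64 bits, the same width SipHash would give); the function name is hypothetical:

```python
import hashlib
from pathlib import Path

def content_name(path: Path, hex_chars: int = 16) -> str:
    # 16 hex characters = 64 bits, the truncation discussed here.
    # read_bytes loads the whole file into memory; fine for a sketch.
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return digest[:hex_chars] + path.suffix
```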
Why rename them at all? There are already good tools for duplicate detection. An example is fdupes [1], which is smart enough to rule out dupes with tricks like partial hashes, so it can avoid fully hashing many of the files.
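A partial hash is just a hash over a small prefix of the file, used as a cheap pre-filter before a full hash. A sketch of the idea, assuming a 4 KiB prefix (the actual size fdupes uses may differ):

```python
import hashlib
from pathlib import Path

def partial_md5(path: Path, prefix_bytes: int = 4096) -> str:
    # Hash only the first few KiB; only files whose partial hashes
    # match need to be read and hashed in full.
    with path.open("rb") as f:
        return hashlib.md5(f.read(prefix_bytes)).hexdigest()
```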
Using md5 is only a problem here if someone has actually gained access to your files and then gone to the trouble of secretly adding new files and calculating/brute-forcing the correct 'chosen-prefixes' to ensure a clash. It would be a pretty weird attack to mount, that's for sure.
md5 is fine for deduplicating. It's extremely improbable you'd 'organically' get an md5 hash clash for two different files.
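A back-of-envelope birthday bound makes "extremely improbable" concrete: for n files and a b-bit hash, the chance of any accidental collision is roughly n^2 / 2^(b+1).

```python
# Birthday bound: for n random b-bit hashes, P(any collision) ~ n^2 / 2^(b+1).
n = 1_000_000  # one million files
for bits in (128, 64):  # full md5 vs. the 64-bit truncation
    print(f"{bits}-bit hash: P ~ {n**2 / 2**(bits + 1):.1e}")
# prints roughly 1.5e-27 for 128 bits and 2.7e-08 for 64 bits
```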
Why not just have a program that iterates through all of the files, hashes them, stores the hashes in a map/dict, and then reports any duplicates? Seems easier than renaming everything multiple times.
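A minimal sketch of that approach (the starting directory is a placeholder):

```python
import hashlib
from pathlib import Path

def report_dupes(root: Path) -> None:
    seen: dict[str, Path] = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            print(f"duplicate: {path} == {seen[digest]}")
        else:
            seen[digest] = path

report_dupes(Path("."))
```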
That's basically fdupes. Also, you only have to hash files with the same length: if they aren't the same length, they can't be the same file.
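Building on the sketch above, that optimization is a single grouping pass by file size before any hashing:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def report_dupes_by_size(root: Path) -> None:
    by_size: defaultdict[int, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size can't have a duplicate, so skip hashing
        by_hash: defaultdict[str, list[Path]] = defaultdict(list)
        for path in paths:
            by_hash[hashlib.md5(path.read_bytes()).hexdigest()].append(path)
        for dupes in by_hash.values():
            if len(dupes) > 1:
                print("duplicates:", *dupes)
```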
Even such a simple optimization can make a huge difference on a large directory of images or MP3s.
http://duff.dreda.org/
Very cool!
[1] https://github.com/adrianlopezroche/fdupes
Edit: just noticed that it's using md5, which is broken [2], and truncated md5 hashes at that!
[2] https://natmchugh.blogspot.ca/2015/02/create-your-own-md5-co...