sgentle · 9 years ago
Fun! This is similar to how git stores files internally. You can do some neat tricks like this:

  $ ls
  01.jpg      03.jpg      03_copy.jpg 04.jpg      05.jpg

  $ git init
  Initialized empty Git repository in /tmp/test/.git/

  $ git hash-object -w *
  82f7d50fc89d2fd47150aff539ea4acf45ec1589
  0080672bc4f248c400d569cce1a2a3d743eb1331
  0080672bc4f248c400d569cce1a2a3d743eb1331
  58db57b10c219b9b71f0223e58a6dc0d51cfe207
  05dcde743807bddaf55ad1231572c1365d4db4af

  $ find .git/objects -type f
  .git/objects/00/80672bc4f248c400d569cce1a2a3d743eb1331
  .git/objects/05/dcde743807bddaf55ad1231572c1365d4db4af
  .git/objects/58/db57b10c219b9b71f0223e58a6dc0d51cfe207
  .git/objects/82/f7d50fc89d2fd47150aff539ea4acf45ec1589
If you're curious, you can read more about how it works here: https://git-scm.com/book/en/v1/Git-Internals-Git-Objects
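
The object ID git prints is just the SHA-1 of the file contents with a small header in front, which is why the two identical jpgs above collapse into one object. A minimal sketch in Python (the file names are just the ones from the listing above):

  import hashlib

  def git_blob_hash(path):
      # git hashes "blob <size>\0" followed by the raw file bytes
      data = open(path, "rb").read()
      return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

  # identical contents always produce the same object ID
  print(git_blob_hash("03.jpg") == git_blob_hash("03_copy.jpg"))  # True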

cperciva · 9 years ago
This is also how FreeBSD Update and Portsnap store files. This technique has been around for a long time.
sliken · 9 years ago
Be warned that this (by default) only looks at part of the file. Seems like a poor default.
askvictor · 9 years ago
Don't modern filesystems allow you to store metadata like this separately from the filename or file data?
m0atz · 9 years ago
NirSoft's 'HashMyFiles' already has this functionality built in, known as duplicate search mode. Works extremely well. http://www.nirsoft.net/utils/hash_my_files.html
zokier · 9 years ago
Content-addressable storage is always neat. Does anyone know if using truncated md5 like this is somehow more robust than using a non-crypto hash like SipHash, which already produces 64-bit hashes?
zbuf · 9 years ago
duff (duplicate file finder) is another useful tool for this, with flags to act on duplicates once they are found:

http://duff.dreda.org/

tucaz · 9 years ago
It would be nice to turn this into a program that stores the previous names so the files can be renamed back after deduplicating.

Very cool!

kardos · 9 years ago
Why rename them at all? There are already good tools for duplicate detection. An example is fdupes [1], which is smart enough to rule out non-duplicates with cheap tricks like partial hashes (sketched below), so you can avoid hashing some files in full.

[1] https://github.com/adrianlopezroche/fdupes
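
The partial-hash shortcut is worth spelling out: compare a cheap hash of each file's first block, and only hash whole files when those prefixes agree. A rough sketch (the 4 KB prefix size and sha256 are just illustrative choices; fdupes itself differs in detail):

  import hashlib

  PREFIX = 4096  # illustrative prefix size in bytes

  def prefix_hash(path):
      # cheap first pass: hash only the first few KB
      with open(path, "rb") as f:
          return hashlib.sha256(f.read(PREFIX)).hexdigest()

  def full_hash(path):
      # expensive second pass, streamed so large files fit in memory
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest()

  def same_contents(a, b):
      # full hash only when the cheap prefix hashes collide
      return prefix_hash(a) == prefix_hash(b) and full_hash(a) == full_hash(b)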

Edit: just noticed that it's using md5, which is broken [2], and that it's using truncated md5 hashes...!

[2] https://natmchugh.blogspot.ca/2015/02/create-your-own-md5-co...

spangry · 9 years ago
Using md5 is only a problem here if someone has actually gained access to your files and then gone to the trouble of secretly adding new files and calculating/brute-forcing the correct 'chosen-prefixes' to ensure a clash. It would be a pretty weird attack to mount, that's for sure.

md5 is fine for deduplicating. It's extremely improbable you'd 'organically' get an md5 hash clash for two different files.
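
A back-of-the-envelope birthday bound makes that concrete (the file count and bit widths below are just example numbers): with n files and a b-bit hash, the chance of any accidental collision is roughly n^2 / 2^(b+1).

  # birthday bound: n files, b-bit hash -> roughly n^2 / 2^(b+1)
  def collision_odds(n, bits):
      return n * n / 2 ** (bits + 1)

  print(collision_odds(10**6, 128))  # full md5, a million files: ~1.5e-27
  print(collision_odds(10**6, 64))   # truncated to 64 bits:      ~2.7e-08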

d4l3k · 9 years ago
Why not just have a program that iterates through all of the files, hashes them, stores them in a map/dict and then reports if there's a duplicate? Seems easier than renaming everything multiple times.
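
A minimal sketch of that map-based approach (walking a directory tree and using sha256 are just illustrative choices):

  import hashlib, os, sys
  from collections import defaultdict

  def find_dupes(root):
      # map each content hash to every file that produced it
      seen = defaultdict(list)
      for dirpath, _, names in os.walk(root):
          for name in names:
              path = os.path.join(dirpath, name)
              with open(path, "rb") as f:
                  seen[hashlib.sha256(f.read()).hexdigest()].append(path)
      # any hash shared by more than one path is a duplicate set
      return [paths for paths in seen.values() if len(paths) > 1]

  for group in find_dupes(sys.argv[1]):
      print(group)
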
sliken · 9 years ago
That's basically fdupes. Also, you only need to hash files of the same length; if two files differ in length they can't possibly be identical.

Even such a simple optimization can make a huge difference on a large directory of images or MP3s.
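
A sketch of that size-first pass (again with illustrative names; real tools like fdupes add more shortcuts): group files by size for free, then hash only within groups of two or more.

  import hashlib, os, sys
  from collections import defaultdict

  def find_dupes_by_size(root):
      # first pass: group files by size, which needs no reads at all
      by_size = defaultdict(list)
      for dirpath, _, names in os.walk(root):
          for name in names:
              path = os.path.join(dirpath, name)
              by_size[os.path.getsize(path)].append(path)
      # second pass: hash only files whose size matched another file's
      dupes = []
      for paths in by_size.values():
          if len(paths) < 2:
              continue
          by_hash = defaultdict(list)
          for path in paths:
              with open(path, "rb") as f:
                  by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
          dupes += [g for g in by_hash.values() if len(g) > 1]
      return dupes

  for group in find_dupes_by_size(sys.argv[1]):
      print(group)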

BuuQu9hu · 9 years ago
Or just use git-annex