If, like me, you are in the habit of saving a local copy of the (scientific) articles you read, you will run into the problem of managing a large number of files with many duplicates. I had been thinking of writing a program that gives me a list of duplicates so I can do some spring cleaning every once in a while. While working with MD5 hash signatures for a different project I got an idea, and this small hack was born.
Here is a short shell script that, when run, prints a list of clustered file names, where each cluster is a group of identical files.
```
#!/bin/sh
# Print clusters of identical files (same MD5 hash) under the given paths.
find "$@" -type f -exec md5sum {} \; | \
awk '{
    # $1 is the MD5 hash; the rest of the line is the file name.
    count[$1] += 1
    u = $1; $1 = ""
    res[u] = $0 "\n" res[u]
}
END {
    # Only hashes seen more than once correspond to duplicates.
    for (i in res)
        if (count[i] > 1)
            print res[i] "\n"
}'
```
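If you save it as, say, `dupes.sh` (the name and the directories below are just assumptions for illustration), you point it at one or more directories and each group of identical files is printed together, separated by blank lines:

```
# Scan two directories for duplicate files.
sh dupes.sh ~/articles ~/papers
```

Quoting `"$@"` keeps directory names with spaces intact, and the `END` block only prints hashes that were seen more than once, so unique files never show up in the output.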