The Kitchen Sink and Other Oddities

Atabey Kaygun

Listing duplicate files

If, like me, you are in the habit of saving a local copy of the (scientific) articles you read, you will run into the problem of managing a large number of files with many duplicates. I have been thinking of writing a program that gives me a list of duplicates so that I can do some spring cleaning every once in a while. While working with MD5 hash signatures for a different project I got an idea, and this small hack was born.

Here is a short shell script which, when run, prints the file names grouped into clusters, where each cluster consists of identical files.

#!/bin/sh

find "$@" -type f -exec md5sum {} \; |
awk '{
       count[$1] += 1            # number of files sharing this hash
       u = $1; $1 = ""           # keep the hash in u, drop it from the record
       res[u] = $0 "\n" res[u]   # accumulate file names per hash
     }
     END {
       for (i in res)
         if (count[i] > 1)       # a hash seen more than once means duplicates
           print res[i]          # print the cluster, followed by a blank line
     }'
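Assuming the script is saved under a name like listdups.sh (the file name is my choice here, not part of the script), it can be pointed at one or more directories:

sh listdups.sh ~/articles ~/papers

Each cluster of identical files comes out separated by a blank line, so the output can be scanned by eye or fed into further processing.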
