What is the fastest way to find duplicate pictures?

Thursday, February 19, 2015 - 3 mins

I needed to clean up duplicate photos from my personal library. Because I could not choose which duplicate finder to try, I decided to test them all. Among the free tools that correctly identified all duplicates in my test, dupd was the fastest. Other Python- and Perl-based solutions also did very well, often better than their C/C++ counterparts.

Why a duplicate file finder?

Since I became a parent, my picture collection has mushroomed in size. I do try to save everything in one place and keep backups, but it is difficult to keep track of it all. Too afraid of losing family memories, I often end up downloading pictures from my phone multiple times, “just in case”. Now that I have reached the limits of my hard drive space, I figured the best thing to do was to remove some duplicate pictures.

One could try with iPhoto or some other graphical interface, but the approach is simply too slow for a large library (over 100GB). The obvious choice was to search for a command line tool.

I quickly realized that far too many tools have been written for this task, and that there is no easy way to find out which one is best. So I decided to compare the speed of most of them.

A speed comparison

Wikipedia has a pretty complete list of duplicate file finders. From there I downloaded and installed all free/open source command line tools.

I have excluded from this comparison those tools which could not be readily installed on my MacBook (e.g. the Makefile would have needed fixing) or which appear to have very limited support (fewer than 10 stars on GitHub, slow or stagnant development, etc.). I also did not take into consideration those tools that lack a “dry run” option or a simple listing of duplicates, and instead attempt to delete or hardlink the duplicate files. I find this behaviour too aggressive for most users, and definitely too risky to run on the folder containing the pictures of my kids.

Finally, I excluded those tools that failed to find all correct duplicates in my test folder. The folder was designed to contain 1195 duplicates in 325 clusters. Here are the results, tested on a 7GB mixture of pictures, videos, symlinks, small files and recursively nested files:

Since I keep my pictures on a separate NAS, it is also useful to know how much each of these methods depends on memory or CPU:
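A rough way to collect numbers of this kind for any one tool is to run it as a child process and read back its resource usage afterwards. The Python sketch below is only an illustration under that assumption, not the method used for this comparison; the `measure` helper is a hypothetical name, and the command in the usage section is a placeholder to be replaced by the tool's own command line.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd (a list of arguments) and report wall time, CPU time and peak memory.

    Hypothetical helper for illustration; works on UNIX-like systems only,
    because it relies on the standard `resource` module.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    # Discard the tool's output so that printing does not dominate the timing.
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time (user + system) consumed by the child that just finished.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)

    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    # It is a high-water mark over all children waited for so far, so use a
    # fresh Python process per tool to get a per-tool figure.
    return {"wall_s": wall, "cpu_s": cpu, "max_rss": after.ru_maxrss}


if __name__ == "__main__":
    # Placeholder command: replace with the duplicate finder to be measured.
    print(measure([sys.executable, "-c", "pass"]))
```

For a fair comparison, each tool would need to be run several times on the same folder, since the file system cache makes the first run slower than the following ones.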

Conclusion

Dupd was the clear speed winner, and its memory footprint was acceptable. liten and fastdupes came a close second and may be slightly more portable, as they do not need to be compiled. Compiling the C/C++ tools tends to be a little fragile outside the main UNIX distros, which is a problem when working on a NAS. I ended up using fastdupes.

It is interesting to see that the (arguably) best known solution, fdupes, was also the slowest, though it remains one of the few tools that can do a byte-by-byte comparison. Both the fastest and the second fastest tools rely on SQLite databases and allow you to explore duplicates interactively after they have run.
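To make the trade-off between hashing and byte-by-byte comparison concrete, here is a minimal Python sketch of a duplicate finder. It is not how any of the tools above is implemented; it only assumes a common approach: group files by size, hash the candidates, and optionally confirm matches byte by byte. The names `file_digest` and `find_duplicates` and the `verify` parameter are made up for this illustration. Like a dry run, it only lists duplicates and never deletes anything.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, verify=False):
    """Return a list of clusters; each cluster is a sorted list of identical files."""
    # Step 1: group by size. Files of different sizes cannot be identical,
    # so most files are ruled out without ever being read.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            by_size[os.path.getsize(path)].append(path)

    # Step 2: hash only the files that share their size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    # Step 3 (optional): confirm byte by byte against the first file of each
    # group. Simplification: a file that differs from the first one is dropped.
    clusters = []
    for paths in by_digest.values():
        if len(paths) < 2:
            continue
        if verify:
            first = paths[0]
            paths = [first] + [p for p in paths[1:] if filecmp.cmp(first, p, shallow=False)]
        if len(paths) > 1:
            clusters.append(sorted(paths))
    return clusters


if __name__ == "__main__":
    for cluster in find_duplicates("."):
        print("\n".join(cluster), end="\n\n")
```

With `verify=True` every candidate is read a second time, which is one plausible reason why a byte-by-byte tool would be slower than a hash-only one.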

Please let me know if I forgot any other tool which should have been in this list. The commands included in the analysis were:

duff (C)
dupseek (Perl)
fastdupes.py (Python)
fdf (Perl)
fdupe (Perl)
fdupes (C)
fdupes-jody (C), a speedier fork of fdupes
liten (Python)
liten2 (Python)
rdfind (C)
ssdeep (C), also does partial matches
Python ActiveState recipe (Python)