Re: How to search several directories for duplicate files?



On Sun, 08 Oct 2006 22:06:54 -0400, Dan Espen wrote:

Andy Axnot <andy1@xxxxxxxxxxxxxxxxx> writes:

I wish to search several directories for duplicate files. This could
involve several thousands of files.
...
It works by first finding identical sized files and then running md5sum
on those of the same filesize. I have no idea how 'samefile' works.

Does anyone have any experience with these or other utilities or
scripts?

Not here.

Any thoughts on the likelihood of errors using size and md5sum vs cmp or
something similar?

The odds of md5 giving a false positive are very low. After finding the
dups with md5, running cmp to verify can't hurt.
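
For what it's worth, the "verify with cmp" step could look roughly like
the sketch below. It is only an illustration, not how 'samefile' or any
other tool in this thread works: the function name verified_duplicates
is made up, and it assumes you already have a list of candidate paths
(for example, files that share a size and an md5). Python's filecmp.cmp
with shallow=False does a byte-by-byte comparison, much like cmp.

```python
import filecmp


def verified_duplicates(candidates):
    """Group candidate paths by actual content (hypothetical helper).

    `candidates` is assumed to be a list of paths already suspected to be
    duplicates, e.g. because they share a size and an MD5 digest.
    """
    groups = []
    for path in candidates:
        for group in groups:
            # shallow=False forces a comparison of the file contents,
            # not just the os.stat() signature.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched, so this file starts a new one.
            groups.append([path])
    # Only groups with two or more members are confirmed duplicates.
    return [group for group in groups if len(group) > 1]
```

Each file is compared against one representative per group, so when the
candidates really are identical this costs about one extra read per file.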

Any info or advice on time required with large files or large numbers of
files?

The time it takes would depend on the number of same-sized files. Doing
the size comparisons would be very fast. The md5 is going to require a
read of the whole file, but then it can be compared very quickly to
other files. If you tried to cmp each file to every other file of the
same size, that could be very slow.
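
A minimal sketch of the size-first, md5-second approach described above,
in Python. The names md5_of and find_duplicates and the 1 MiB chunk size
are my own assumptions, not taken from any of the utilities mentioned.
Files with a unique size are never opened; each remaining file is read
exactly once to compute its digest.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks (assumed 1 MiB)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directories):
    """Return lists of paths that share both a size and an MD5 digest."""
    # Pass 1: group by size. This only needs stat(), so it is cheap.
    by_size = defaultdict(list)
    for top in directories:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files whose size is shared with another file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[(size, md5_of(path))].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Something like find_duplicates(["/some/dir", "/another/dir"]) would then
return one list per set of suspected duplicates; those paths are
placeholders only.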

Is a script too slow for something like this?

All the time is in reading the files to compute the md5 sums. A script
isn't going to slow it down.

OK, thanks much for your input. I'll run some tests on increasingly
larger test samples to see if times are reasonable. Whatever reasonable
is :-)

Andy
