If files are exactly the same, then their md5sums will be exactly the same, so you can use:
find A/ B/ -type f -exec md5sum {} + | sort | uniq -w32 -D
An md5sum is always exactly 128 bits (or 16 bytes or 32 hex digits) long, and the md5sum program prints it in hex digits. So we use the -w32 option on the uniq command to compare only the first 32 characters on each line.
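To check the 32-character claim yourself, here is a quick sketch that hashes empty input and measures the digest. It assumes GNU md5sum; the variable name digest is just for illustration.

```shell
# Hash empty input and keep only the digest field (first space-separated field).
digest=$(printf '' | md5sum | cut -d' ' -f1)

printf '%s\n' "$digest"   # d41d8cd98f00b204e9800998ecf8427e
printf '%s\n' "${#digest}"   # 32
```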
This will print all files with a non-unique md5sum, i.e. duplicates.
NOTE: this will detect duplicate files no matter where they are in A/ or B/, so if A/subdir1/file and A/subdir2/otherfile are the same, they will still be printed. If a file has more than one duplicate, every copy will be printed.
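If you want to try this out safely first, something like the following sketch builds a throwaway tree and runs the same pipeline on it. The file names and the variables tmp and dups are made up for the example, and uniq -w and -D are GNU extensions, so this assumes GNU coreutils.

```shell
# Scratch directory with two trees, A/ and B/ (all names are illustrative).
tmp=$(mktemp -d)
mkdir -p "$tmp/A/subdir1" "$tmp/A/subdir2" "$tmp/B"
printf 'same content\n' > "$tmp/A/subdir1/file"
printf 'same content\n' > "$tmp/A/subdir2/otherfile"
printf 'same content\n' > "$tmp/B/copy"
printf 'different\n'    > "$tmp/B/unique"

# Same pipeline as in the answer, run from inside the scratch directory.
dups=$(cd "$tmp" && find A/ B/ -type f -exec md5sum {} + | sort | uniq -w32 -D)
printf '%s\n' "$dups"
```

The three identical files should be listed (two of them inside A/ alone), while B/unique should not appear.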
You can remove the md5sums from the output by piping into, e.g., awk '{print $2}' (which keeps only the first whitespace-delimited word of each filename, so it breaks on names containing spaces), or with cut or sed etc. I've left them in the output because they're a useful key for an associative array (aka a 'hash') in awk or perl etc. for further processing.
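As a rough sketch of that kind of further processing, the awk below uses the md5sum as an array key and prints each hash once, followed by the files that share it. group_dups is a hypothetical helper name, the scratch files are made up, and it assumes GNU md5sum's usual layout (32 hex digits, a 2-character separator, then the name) and filenames without newlines or backslashes.

```shell
# group_dups: read "md5sum  filename" lines on stdin and print each hash
# once, followed by the indented names of the files that share it.
group_dups() {
    awk '{
        hash = substr($0, 1, 32)
        name = substr($0, 35)   # skip 32 hex digits plus the 2-character separator
        if (!(hash in files)) order[++n] = hash   # remember first-seen order
        files[hash] = files[hash] "    " name "\n"
    }
    END {
        for (i = 1; i <= n; i++) printf "%s\n%s", order[i], files[order[i]]
    }'
}

# Example run in a scratch directory (file names are illustrative).
tmp=$(mktemp -d)
mkdir -p "$tmp/A" "$tmp/B"
printf 'hello\n' > "$tmp/A/one file"
printf 'hello\n' > "$tmp/B/two"
printf 'other\n' > "$tmp/B/three"
grouped=$(cd "$tmp" && find A/ B/ -type f -exec md5sum {} + | sort | uniq -w32 -D | group_dups)
printf '%s\n' "$grouped"
```

Taking the name with substr rather than $2 keeps filenames with spaces intact.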
stat values, e.g. the file size and modification date. "More similar" sounds like you would need to open the files and compare the contents somehow. – thrig May 17 '16 at 18:25