1

Say I have a file called foo.pdf. How can I find out whether my machine contains another file, say bar.pdf, that has exactly the same content but a different name?

4 Answers

2

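Tell me whether this works (it won't be fast). Sorting puts identical checksums next to each other, and uniq -w32 -D (GNU uniq) then prints every line whose first 32 characters, the MD5 hash, occur more than once:

find /home/user -type f -name "*.pdf" -exec md5sum {} + 2> /dev/null | sort | uniq -w32 -D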
1

fdupes sounds quite smart, but it matches all the files against each other. If you already have a single file that you want to match, you can apply a couple of the same techniques more efficiently.

You could start by getting the file size of foo.pdf, and constructing a find command that matches the exact size only. That should be a cheap shortlist.

Then you could cut the first few bytes (a few hundred) from each of those files, and compare those bytes with cmp -s. That should eliminate some more.

For files that are still possible duplicates, you can cksum or md5sum them.

You probably want to check the inode numbers are different from your original, in case you find a hard-linked copy.
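The steps above can be sketched roughly as the shell function below. The name find_copies and its arguments are made up for illustration, and it assumes a find that supports -samefile (GNU and BSD find do). It also swaps one technique: instead of comparing a prefix and then checksumming, it runs cmp -s over the whole file, since cmp stops at the first differing byte and so is cheap for non-matches. -samefile compares device and inode together, so the original and its hard links are skipped.

```shell
# find_copies TARGET [ROOT]
# Hypothetical helper: print every file under ROOT (default: current
# directory) whose content is byte-for-byte identical to TARGET.
find_copies() {
    target=$1
    root=${2:-.}

    # Size of the target in bytes; the arithmetic expansion strips any
    # padding that some wc implementations add.
    size=$(( $(wc -c < "$target") )) || return 1

    # -size Nc          keeps only files of exactly N bytes (cheap shortlist)
    # ! -samefile       drops the target itself and any hard links to it
    # -exec cmp -s ...  acts as a test: true only if the contents match
    find "$root" -type f -size "${size}c" ! -samefile "$target" \
        -exec cmp -s "$target" {} \; -print 2> /dev/null
}
```

For example, find_copies foo.pdf /home/user would list the files under /home/user with the same content as foo.pdf.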

Paul_Pedant
  • 8,679
  • 1
    Instead of the first few bytes (which may be the same in many files because of boilerplate headers), start comparing from the middle of the file ;-) –  Feb 22 '20 at 03:39
  • Also, you should compare the device:inode tuples, not just the inode numbers. –  Feb 22 '20 at 03:41
  • My bad: the inode numbers are only unique within a given file system (partition, which looks like a device). Although, I'm suggesting using inode to detect hard links, and IIRC hard links are not supported cross-device anyway. – Paul_Pedant Feb 22 '20 at 11:49
0

You could use fdupes to search for duplicate files in different directories. The default is to list duplicate files as blocks separated by a blank line.

If both files are in one directory dir1:

fdupes dir1

For a recursive search, add the -r / --recurse option:

fdupes -r dir1

You can search multiple directories and set the recurse option for specific dirs:

fdupes dir1 dir2 --recurse: dir3
Freddy
  • 25,565
0
rmlint -r

rmlint is an extremely fast tool to find duplicates and optionally remove them if you want.

Features

Finds…

  • …Duplicate Files and duplicate directories.
  • …Nonstripped binaries (i.e. binaries with debug symbols)
  • …Broken symbolic links.
  • …Empty files and directories.
  • …Files with broken user or/and group ID.

Differences to other duplicate finders:

  • Extremely fast (no exaggeration, we promise!)
  • Paranoia mode for those who do not trust hashsums.
  • Many output formats.
  • No interactivity.
  • Search for files only newer than a certain mtime.
  • Many ways to handle duplicates.
  • Caching and replaying.
  • btrfs support.

This tutorial will help and guide you gently ;)

Munzir Taha
  • 1,490