How to determine whether a file is duplicated but has a different name?

Question

Say I have a file called foo.pdf. How can I find out whether my machine contains another file, say bar.pdf, that has exactly the same content but a different name?

I would say just md5sum each file, the output should be the same. — schrodingerscatcuriosity, Feb 21 '20 at 18:34
@guillermochamorro I meant how can I search for this file. I don't know whether bar.pdf exists. — Paul Razvan Berg, Feb 21 '20 at 18:50
Do you mean same file size, byte-by-byte comparison, checksums or same content like after converting PDF to PDF? — frostschutz, Feb 21 '20 at 19:11

score 2 · Answer 1 · answered Feb 21 '20 at 19:17

Tell me if this works (it won't be fast):

find /home/user -type f -name "*.pdf" -exec md5sum {} + 2> /dev/null | uniq -f2 -D

schrodingerscatcuriosity

guillermo chamorro You beat me to the answer. I was going to suggest the same thing. :D – Mark Stewart Feb 21 '20 at 20:30
Shouldn't it be … | sort | uniq -w32 -D to print the lines with the same hash? – Freddy Feb 21 '20 at 21:49
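
Following up on the comment above: uniq only compares adjacent lines, so the hashes have to be sorted first, and -w32 limits the comparison to the 32 hex characters of the MD5 digest. A minimal sketch, assuming GNU coreutils (md5sum, uniq -w); the function name list_duplicate_pdfs is made up for illustration:

```shell
#!/usr/bin/env bash
# list_duplicate_pdfs [SEARCH_ROOT]  (hypothetical helper name)
# Prints "hash  path" lines for every *.pdf whose MD5 digest occurs more than once.
list_duplicate_pdfs() {
    local root=${1:-.}
    find "$root" -type f -name '*.pdf' -exec md5sum {} + 2>/dev/null |
        sort |          # uniq only detects adjacent repeats, so sort by hash first
        uniq -w32 -D    # compare the first 32 chars (the digest) and print all repeats
}
```

Something like list_duplicate_pdfs /home/user would then list only the PDFs that share a hash with at least one other PDF.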

score 1 · Answer 2 · answered Feb 21 '20 at 20:35

fdupes sounds quite smart, but it looks for duplicates among all the files at once. If you already have a single file that you want to match, you can apply a couple of the same techniques more efficiently.

You could start by getting the file size of foo.pdf, and constructing a find command that matches the exact size only. That should be a cheap shortlist.

Then you could cut the first few bytes (a few hundred) from each of those files, and compare those bytes with cmp -s. That should eliminate some more.

For files that are still possible duplicates, you can cksum or md5sum them.

You probably also want to check that the inode numbers differ from your original's, in case you find a hard-linked copy.
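
The steps above could be strung together roughly like this. It is only a sketch under a few assumptions: a find that supports -samefile (GNU and BSD find do), and a made-up function name find_copies. It also takes a shortcut compared with the text: since there is only one reference file, cmp -s is run on the whole candidate rather than on a prefix followed by a checksum, because cmp already stops at the first differing byte.

```shell
#!/usr/bin/env bash
# find_copies TARGET [SEARCH_ROOT]  (hypothetical helper name)
# Prints every file under SEARCH_ROOT whose content is byte-identical to TARGET,
# skipping TARGET itself and any hard links to it.
find_copies() {
    local target=$1 root=${2:-.} size
    [ -f "$target" ] || { echo "not a regular file: $target" >&2; return 1; }
    # wc -c is portable; the arithmetic expansion strips padding some wc versions add.
    size=$(( $(wc -c < "$target") ))
    # -size Nc    exact size in bytes: a cheap shortlist, only metadata is read
    # ! -samefile drop TARGET and its hard links (same device and inode)
    # cmp -s      silent byte-by-byte comparison, stops at the first difference
    find "$root" -type f -size "${size}c" ! -samefile "$target" \
        -exec cmp -s "$target" {} \; -print 2>/dev/null
}
```

For example, find_copies foo.pdf "$HOME" would print bar.pdf if it exists anywhere under the home directory with the same content.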

Paul_Pedant

Instead of the first few bytes (which may be the same in many files because of boilerplate headers), start comparing from the middle of the file ;-) – Feb 22 '20 at 03:39
Also, you should compare the device:inode tuples, not just the inode numbers. – Feb 22 '20 at 03:41
My bad: the inode numbers are only unique within a given file system (partition, which looks like a device). Although, I'm suggesting using inode to detect hard links, and IIRC hard links are not supported cross-device anyway. – Paul_Pedant Feb 22 '20 at 11:49
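
Both comment suggestions can be sketched in a few lines of shell. Assumptions: a shell whose test builtin supports -ef (bash does; it is true when two paths resolve to the same device and inode), and GNU cmp for the -i (skip) and -n (limit) options. The function names and the 512-byte window are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# same_inode A B: succeeds if A and B are the same file (same device AND inode),
# for example hard links to each other. Hypothetical helper name.
same_inode() {
    [ "$1" -ef "$2" ]
}

# middle_matches A B: succeeds if up to 512 bytes starting at the midpoint of A
# are identical in B. Meant as a cheap pre-filter for files of equal size.
middle_matches() {
    local size skip
    size=$(( $(wc -c < "$1") ))
    skip=$(( size / 2 ))
    # -i SKIP: ignore the first SKIP bytes of both files; -n 512: compare at most 512 bytes.
    cmp -s -i "$skip" -n 512 "$1" "$2"
}
```

A match from middle_matches only means the file is still a candidate; a full cmp or a checksum is needed to confirm it.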

score 0 · Answer 3 · answered Feb 21 '20 at 20:05

You could use fdupes to search for duplicate files in different directories. The default is to list duplicate files as blocks separated by a blank line.

If both files are in one directory dir1:

fdupes dir1

For a recursive search, add the -r / --recurse option:

fdupes -r dir1

You can search multiple directories and set the recurse option for specific dirs:

fdupes dir1 dir2 --recurse: dir3
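
Since fdupes reports every duplicate set it finds, one way to answer the original question is to filter its output down to the block that mentions foo.pdf. A small sketch, assuming the default output format described above (one path per line, blocks separated by a blank line); dupes_of is a hypothetical helper name, and the path has to be spelled exactly as fdupes prints it:

```shell
#!/usr/bin/env bash
# dupes_of PATH: reads fdupes-style output on stdin and prints only the block
# of duplicates that contains PATH (hypothetical helper, not part of fdupes).
dupes_of() {
    target=$1 awk '
        BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per block, one field per line
        {
            for (i = 1; i <= NF; i++)
                if ($i == ENVIRON["target"]) { print; next }
        }'
}
```

Usage would look like: fdupes -r dir1 | dupes_of dir1/foo.pdf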

Munzir Taha · Answer 4 · 2020-02-21T20:35:09.770

rmlint -r

rmlint is an extremely fast tool to find duplicates and optionally remove them if you want.

Features

Finds…

…Duplicate Files and duplicate directories.
…Nonstripped binaries (i.e. binaries with debug symbols)
…Broken symbolic links.
…Empty files and directories.
…Files with broken user or/and group ID.

Differences to other duplicate finders:

Extremely fast (no exaggeration, we promise!)
Paranoia mode for those who do not trust hashsums.
Many output formats.
No interactivity.
Search for files only newer than a certain mtime.
Many ways to handle duplicates.
Caching and replaying.
btrfs support.

This tutorial will help and guide you gently ;)
