
I know of programs such as rdfind or fdupes. They solve a similar but more complicated problem.

Given a file path and a directory, I want to search that directory recursively for every copy of that file, even if the copy has a different name, permissions or ownership.

If for instance I did rdfind needle.file haystack/, it would also find files other than needle.file that are duplicated only within haystack.

I could filter the output of rdfind, but if haystack/ is large, this will do a lot of unnecessary work.

It should be a command-line application, since I intend to use it in a script/cron job.

con-f-use

4 Answers


A simple approach:

  • Get the md5sum of your target file, store it in a variable
  • Get the size of your file, store in a variable
  • Use find to run md5sum on all same-sized files
  • grep the output of find for our target MD5 hash
target_hash=$(md5sum needle.file | awk '{ print $1 }')
target_size=$(du -b needle.file | awk '{ print $1 }')
find haystack/ -type f -size "$target_size"c -exec md5sum {} \; | grep "$target_hash"
Panki
  • Without testing it in detail, it looks fine to me. One could probably extend it a bit and write a full shell script that finds duplicates this way. – allo Aug 27 '22 at 14:47
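
Following up on that comment, a minimal sketch of such a script might look like this (the name find_copies.sh, the argument handling and the use of stat for the size are assumptions on top of the answer, not part of it):

#!/bin/bash
# Usage: ./find_copies.sh needle.file haystack/
# Prints every file under haystack/ that has the same size and MD5 hash as needle.file.
needle=$1
haystack=$2

target_hash=$(md5sum "$needle" | awk '{ print $1 }')
target_size=$(stat -c '%s' "$needle")

# Only hash files whose size already matches, then keep lines starting with the target hash.
find "$haystack" -type f -size "${target_size}c" -exec md5sum {} \; | grep "^$target_hash"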

You could use Czkawka (Flathub and GitHub).

It's a nice GUI tool with advanced features such as checksum-based matching.

telometto
  • Yeah, thanks, but it seems there's no way to do that with Czkawka, it has the same "problem" as in my rdfind example. – con-f-use Aug 27 '22 at 14:06

If the number of files is not large (fewer than 1000, for example), a bash script can be suitable. Otherwise, executing binaries (md5sum, stat) in a loop creates noticeable overhead.

File sizes matter too: with 1000 files of 1 GB each, the overhead of launching the binaries is relatively small and can be neglected, but with 1,000,000 files of 1 KB each it is another story.
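
A rough way to see this per-process overhead (the directory here is arbitrary; timings will vary by machine):

# One stat process per file ...
time for f in /usr/bin/*; do stat -c"%s" "$f" > /dev/null; done

# ... versus a single stat process for all of them.
time stat -c"%s" /usr/bin/* > /dev/null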

Variant №1

md5sum usage.

find_dups_by_md5.sh

#!/bin/bash

# Print a file's size in bytes.
get_size() {
    stat -c"%s" "$1"
}

# Print a file's MD5 hash.
get_hash() {
    md5sum "$1" | cut -d' ' -f1
}

needle=$1
needle_size=$(get_size "$needle")
needle_hash=$(get_hash "$needle")

shopt -s globstar
GLOBIGNORE=$needle    # exclude the needle file itself from the glob

for f in **; do
    # Skip directories and other non-regular files.
    [ -f "$f" ] || continue

    # Cheap check first: a different size means it cannot be a duplicate.
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    cur_file_hash=$(get_hash "$f")
    if [[ "$needle_hash" != "$cur_file_hash" ]]; then
        continue
    fi

    echo -e "duplicate:\t${f}"
done

Variant №2

cmp usage.

A simple byte-by-byte comparison is even better: less code, the same result, and it is even a little faster. Hash calculation is redundant here because each hash is used only once. We would compute an md5sum for every file (the needle file included), and md5sum always processes the whole file. So if we have, for example, 100 files of 1 gigabyte each, md5sum will read all 100 GB even if the files already differ in the first kilobyte.

So, when each file is compared against the target only once, a byte-by-byte comparison takes the same time in the worst case (all files have the same content) and is faster when the files differ (assuming MD5 hash calculation takes about as long as a byte-by-byte comparison).
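
A quick way to observe this early-exit behaviour (the file names and the 100 MB size are arbitrary; timings will vary):

# Two files of equal size that differ at the very first byte.
head -c 100M /dev/zero > big_a
{ printf 'x'; head -c $((100*1024*1024 - 1)) /dev/zero; } > big_b

time cmp -s big_a big_b      # stops at the first differing byte
time md5sum big_a big_b      # reads both files completely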

find_dups_by_cmp.sh

#!/bin/bash

# Print a file's size in bytes.
get_size() {
    stat -c"%s" "$1"
}

needle=$1
needle_size=$(get_size "$needle")

shopt -s globstar
GLOBIGNORE=$needle    # exclude the needle file itself from the glob

for f in **; do
    # Skip directories and other non-regular files.
    [ -f "$f" ] || continue

    # Cheap check first: a different size means it cannot be a duplicate.
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    # Byte-by-byte comparison; cmp -s is silent and stops at the first difference.
    if ! cmp -s "$needle" "$f"; then
        continue
    fi

    echo -e "duplicate:\t${f}"
done

Testing

Test files generation

###Generate test files
echo_random_bytes () {
    openssl rand -base64 100000;
}

shopt -s globstar

mkdir -p {a..d}/{e..g}/{m..o}

#Fill directories by some files with random content
touch {a..d}/{e..g}/{m..o}/file_{1..5}.txt
for f in **; do
    [ -f "$f" ] && echo_random_bytes > "$f"
done

#Creation of duplicates
same_string=$(echo_random_bytes)

touch {a..d}/{e..g}/m/dup_file.txt
for f in {a..d}/{e..g}/m/dup_file.txt; do
    echo "$same_string" > "$f"
done

#Target file creation
echo "$same_string" > needle_file.txt

Search of duplicates

$ ./find_dups_by_md5.sh needle_file.txt
duplicate:  a/e/m/dup_file.txt
duplicate:  a/f/m/dup_file.txt
duplicate:  a/g/m/dup_file.txt
duplicate:  b/e/m/dup_file.txt
duplicate:  b/f/m/dup_file.txt
duplicate:  b/g/m/dup_file.txt
duplicate:  c/e/m/dup_file.txt
duplicate:  c/f/m/dup_file.txt
duplicate:  c/g/m/dup_file.txt
duplicate:  d/e/m/dup_file.txt
duplicate:  d/f/m/dup_file.txt
duplicate:  d/g/m/dup_file.txt

Performance comparison

$ time ./find_dups_by_md5.sh needle_file.txt > /dev/null

real    0m0,761s
user    0m0,809s
sys     0m0,169s

$ time ./find_dups_by_cmp.sh needle_file.txt > /dev/null

real    0m0,645s
user    0m0,526s
sys     0m0,162s

MiniMax

Based on Panki's answer, this should have fewer invocations of md5sum, which would improve performance if thousands of files are to be checked.

target_hash="$(md5sum needle.file | awk '{ print $1 }')"
target_size="$(du -b needle.file | awk '{ print $1 }')"
find haystack/ -type f -size "$target_size"c -print0 | xargs -0 md5sum | grep "^$target_hash"

Caveat: As with the original, there may be display issues if file names contain newline characters.
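
A minor variation, not part of the original answer: find's -exec ... + form batches arguments the same way xargs does, so the xargs step can be dropped:

find haystack/ -type f -size "$target_size"c -exec md5sum {} + | grep "^$target_hash"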

bxm