
I know of programs such as rdfind or fdupes. They solve a similar but more complicated problem.

Given a file path and a directory, I want to search that directory recursively for every copy of that file, even if the copy has a different name, permissions or ownership.

If for instance I did rdfind needle.file haystack/, it would also find files other than needle.file that are duplicated only within haystack.

I could filter the output of rdfind, but if haystack/ is large, this will do a lot of unnecessary work.

It should be a command-line application, since I intend to use it in a script/cron job.

con-f-use

4 Answers


A simple approach:

  • Get the md5sum of your target file, store it in a variable
  • Get the size of your file, store in a variable
  • Use find to run md5sum on all same-sized files
  • grep the output of find for our target MD5 hash
target_hash=$(md5sum needle.file | awk '{ print $1 }')
target_size=$(du -b needle.file | awk '{ print $1 }')
find haystack/ -type f -size "$target_size"c -exec md5sum {} \; | grep "$target_hash"
Panki
  • Without testing it in detail, it looks fine to me. One could probably extend it a bit and write a full shell script that finds duplicates this way. – allo Aug 27 '22 at 14:47
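
Following up on that comment, a minimal sketch of such a script might look like this (the name find_copies.sh, the argument handling and the use of stat for the size are assumptions on top of the answer, not part of it):

#!/bin/bash
# Usage: ./find_copies.sh needle.file haystack/
# Prints every file under haystack/ that has the same size and MD5 hash as needle.file.
needle=$1
haystack=$2

target_hash=$(md5sum "$needle" | awk '{ print $1 }')
target_size=$(stat -c '%s' "$needle")

# Only hash files whose size already matches, then keep lines starting with the target hash.
find "$haystack" -type f -size "${target_size}c" -exec md5sum {} \; | grep "^$target_hash"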

You could use Czkawka (Flathub and GitHub).

It's a nice GUI tool with advanced features such as checksum-based matching.

telometto
  • Yeah, thanks, but it seems there's no way to do that with Czkawka, it has the same "problem" as in my rdfind example. – con-f-use Aug 27 '22 at 14:06

If the number of files is not large (fewer than 1000, for example), a bash script can be suitable. Otherwise, executing binaries (md5sum, stat) in a loop creates noticeable overhead.

File sizes matter too: with 1000 files of 1 GB each, the overhead of launching the binaries is relatively small and can be neglected, but with 1,000,000 files of 1 KB each it is another story.
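
A rough way to see this per-process overhead (the directory here is arbitrary; timings will vary by machine):

# One stat process per file ...
time for f in /usr/bin/*; do stat -c"%s" "$f" > /dev/null; done

# ... versus a single stat process for all of them.
time stat -c"%s" /usr/bin/* > /dev/null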

Variant №1

md5sum usage.

find_dups_by_md5.sh

#!/bin/bash

# Print a file's size in bytes.
get_size() {
    stat -c"%s" "$1"
}

# Print a file's MD5 hash.
get_hash() {
    md5sum "$1" | cut -d' ' -f1
}

needle=$1
needle_size=$(get_size "$needle")
needle_hash=$(get_hash "$needle")

shopt -s globstar
GLOBIGNORE=$needle    # exclude the needle file itself from the glob

for f in **; do
    # Skip directories and other non-regular files.
    [ -f "$f" ] || continue

    # Cheap check first: a different size means it cannot be a duplicate.
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    cur_file_hash=$(get_hash "$f")
    if [[ "$needle_hash" != "$cur_file_hash" ]]; then
        continue
    fi

    echo -e "duplicate:\t${f}"
done

Variant №2

cmp usage.

A simple byte-by-byte comparison is even better: less code, the same result, and it is even a little faster. Hash calculation is redundant here because each hash is used only once. We would compute an md5sum for every file (the needle file included), and md5sum always processes the whole file. So if we have, for example, 100 files of 1 gigabyte each, md5sum will read all 100 GB even if the files already differ in the first kilobyte.

So, when each file is compared against the target only once, a byte-by-byte comparison takes the same time in the worst case (all files have the same content) and is faster when the files differ (assuming MD5 hash calculation takes about as long as a byte-by-byte comparison).
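
A quick way to observe this early-exit behaviour (the file names and the 100 MB size are arbitrary; timings will vary):

# Two files of equal size that differ at the very first byte.
head -c 100M /dev/zero > big_a
{ printf 'x'; head -c $((100*1024*1024 - 1)) /dev/zero; } > big_b

time cmp -s big_a big_b      # stops at the first differing byte
time md5sum big_a big_b      # reads both files completely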

find_dups_by_cmp.sh

#!/bin/bash

# Print a file's size in bytes.
get_size() {
    stat -c"%s" "$1"
}

needle=$1
needle_size=$(get_size "$needle")

shopt -s globstar
GLOBIGNORE=$needle    # exclude the needle file itself from the glob

for f in **; do
    # Skip directories and other non-regular files.
    [ -f "$f" ] || continue

    # Cheap check first: a different size means it cannot be a duplicate.
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    # Byte-by-byte comparison; cmp -s is silent and stops at the first difference.
    if ! cmp -s "$needle" "$f"; then
        continue
    fi

    echo -e "duplicate:\t${f}"
done

Testing

Test files generation

###Generate test files
echo_random_bytes () {
    openssl rand -base64 100000;
}

shopt -s globstar

mkdir -p {a..d}/{e..g}/{m..o}

#Fill directories by some files with random content
touch {a..d}/{e..g}/{m..o}/file_{1..5}.txt
for f in **; do
    [ -f "$f" ] && echo_random_bytes > "$f"
done

#Creation of duplicates
same_string=$(echo_random_bytes)

touch {a..d}/{e..g}/m/dup_file.txt
for f in {a..d}/{e..g}/m/dup_file.txt; do
    echo "$same_string" > "$f"
done

#Target file creation
echo "$same_string" > needle_file.txt

Search of duplicates

$ ./find_dups_by_md5.sh needle_file.txt
duplicate:  a/e/m/dup_file.txt
duplicate:  a/f/m/dup_file.txt
duplicate:  a/g/m/dup_file.txt
duplicate:  b/e/m/dup_file.txt
duplicate:  b/f/m/dup_file.txt
duplicate:  b/g/m/dup_file.txt
duplicate:  c/e/m/dup_file.txt
duplicate:  c/f/m/dup_file.txt
duplicate:  c/g/m/dup_file.txt
duplicate:  d/e/m/dup_file.txt
duplicate:  d/f/m/dup_file.txt
duplicate:  d/g/m/dup_file.txt

Performance comparison

$ time ./find_dups_by_md5.sh needle_file.txt > /dev/null

real    0m0,761s
user    0m0,809s
sys     0m0,169s

$ time ./find_dups_by_cmp.sh needle_file.txt > /dev/null

real    0m0,645s
user    0m0,526s
sys     0m0,162s

MiniMax

Based on Panki's answer, this should have fewer invocations of md5sum, which would improve performance if thousands of files are to be checked.

target_hash="$(md5sum needle.file | awk '{ print $1 }')"
target_size="$(du -b needle.file | awk '{ print $1 }')"
find haystack/ -type f -size "$target_size"c -print0 | xargs -0 md5sum | grep "^$target_hash"

Caveat: As with the original, there may be display issues if file names contain newline characters.
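
A minor variation, not part of the original answer: find's -exec ... + form batches arguments the same way xargs does, so the xargs step can be dropped:

find haystack/ -type f -size "$target_size"c -exec md5sum {} + | grep "^$target_hash"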

bxm