If the number of files is not big (fewer than 1000, say), a bash script can be suitable. Otherwise, executing binaries (`md5sum`, `stat`) in a loop creates noticeable overhead.
File sizes matter too: with 1000 files of 1 GB each, the process-startup overhead is negligible by comparison. But with 1,000,000 files of 1 KB each, it is another story.
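As a rough way to see that overhead (the path and the iteration count below are arbitrary), one can compare a loop that forks one `stat` process per iteration against a loop of shell builtins; the fork/exec cost dominates long before any real I/O happens:

```bash
# Rough illustration of per-process overhead (path and count are arbitrary):
# each stat call forks and execs a new process, while : is a shell builtin.
time for i in {1..1000}; do stat -c '%s' /etc/hostname > /dev/null; done
time for i in {1..1000}; do :; done
```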
Variant №1: `md5sum` usage

`find_dups_by_md5.sh`:
```bash
#!/bin/bash

# Print file size in bytes.
get_size() {
    stat -c '%s' "$1"
}

# Print the MD5 hash of a file.
get_hash() {
    md5sum "$1" | cut -d' ' -f1
}

needle=$1
needle_size=$(get_size "$needle")
needle_hash=$(get_hash "$needle")

shopt -s globstar       # make ** match recursively
GLOBIGNORE=$needle      # exclude the needle itself from the glob

for f in **; do
    [[ -f "$f" ]] || continue       # skip directories
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi
    cur_file_hash=$(get_hash "$f")
    if [[ "$needle_hash" != "$cur_file_hash" ]]; then
        continue
    fi
    echo -e "duplicate:\t${f}"
done
```
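If process startup ever becomes the bottleneck, a variant along these lines can help. This is only a sketch, assuming GNU find and coreutils (`-samefile` and `\t` in sed are GNU extensions; `cut -c35-` relies on md5sum's fixed 32-character hash plus two-space separator), and the script name is made up:

```bash
#!/bin/bash
# find_dups_batched.sh -- hypothetical batched variant
needle=$1
needle_hash=$(md5sum "$needle" | cut -d' ' -f1)

# -size ...c matches an exact byte count; ! -samefile skips the needle itself.
# -exec md5sum {} + hashes many files per md5sum process instead of one each.
find . -type f -size "$(stat -c '%s' "$needle")c" ! -samefile "$needle" \
     -exec md5sum {} + \
  | grep "^$needle_hash " \
  | cut -c35- \
  | sed 's/^/duplicate:\t/'
```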
Variant №2: `cmp` usage
A simple byte-by-byte comparison is even better: less code, the same result, and a little faster. Hash calculation is redundant here because each hash is used only once. `md5sum` reads every file in full by definition (the needle file included), so with, for example, 100 files of 1 GB each, `md5sum` processes all 100 GB even if the files differ in the first kilobyte. `cmp`, by contrast, stops at the first differing byte.
So, when each file is compared against the target just once, byte-by-byte comparison takes the same time in the worst case (all files have identical content) or less time when files differ (assuming MD5 hashing and byte-by-byte comparison run at similar speeds).
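The early-exit behaviour is easy to observe directly; here is a quick sketch (the file names and the 1 GB size are arbitrary, and `head -c 1G` assumes GNU head). The actual script follows below.

```bash
# Two large files that differ at the very first byte (hypothetical names).
head -c 1G /dev/zero > big_a
{ printf 'x'; head -c 1G /dev/zero; } > big_b

time cmp -s big_a big_b     # returns almost immediately: first bytes differ
time md5sum big_a big_b     # reads both files in full regardless
```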
`find_dups_by_cmp.sh`:

```bash
#!/bin/bash

# Print file size in bytes.
get_size() {
    stat -c '%s' "$1"
}

needle=$1
needle_size=$(get_size "$needle")

shopt -s globstar       # make ** match recursively
GLOBIGNORE=$needle      # exclude the needle itself from the glob

for f in **; do
    [[ -f "$f" ]] || continue       # skip directories
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi
    if ! cmp -s "$needle" "$f"; then    # -s: silent, exit status only
        continue
    fi
    echo -e "duplicate:\t${f}"
done
```
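As with the md5sum variant, the per-file `stat` calls can be pushed into `find` itself. A minimal sketch, assuming GNU find (`-samefile` and `-printf` are GNU extensions):

```bash
#!/bin/bash
# One find process does the size pre-filtering; cmp runs only for exact
# size matches, and -printf fires only when cmp exits with status 0.
needle=$1
find . -type f -size "$(stat -c '%s' "$needle")c" ! -samefile "$needle" \
     -exec cmp -s "$needle" {} \; -printf 'duplicate:\t%p\n'
```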
Testing
Test file generation

```bash
# Generate test files
echo_random_bytes () {
    openssl rand -base64 100000
}

shopt -s globstar

mkdir -p {a..d}/{e..g}/{m..o}

# Fill the directories with files containing random content
touch {a..d}/{e..g}/{m..o}/file_{1..5}.txt
for f in **; do
    [ -f "$f" ] && echo_random_bytes > "$f"
done

# Create the duplicates
same_string=$(echo_random_bytes)
touch {a..d}/{e..g}/m/dup_file.txt
for f in {a..d}/{e..g}/m/dup_file.txt; do
    echo "$same_string" > "$f"
done

# Create the target (needle) file
echo "$same_string" > needle_file.txt
```
Searching for duplicates
```
$ ./find_dups_by_md5.sh needle_file.txt
duplicate: a/e/m/dup_file.txt
duplicate: a/f/m/dup_file.txt
duplicate: a/g/m/dup_file.txt
duplicate: b/e/m/dup_file.txt
duplicate: b/f/m/dup_file.txt
duplicate: b/g/m/dup_file.txt
duplicate: c/e/m/dup_file.txt
duplicate: c/f/m/dup_file.txt
duplicate: c/g/m/dup_file.txt
duplicate: d/e/m/dup_file.txt
duplicate: d/f/m/dup_file.txt
duplicate: d/g/m/dup_file.txt
```
Performance comparison
```
$ time ./find_dups_by_md5.sh needle_file.txt > /dev/null

real    0m0,761s
user    0m0,809s
sys     0m0,169s

$ time ./find_dups_by_cmp.sh needle_file.txt > /dev/null

real    0m0,645s
user    0m0,526s
sys     0m0,162s
```
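The difference is small on this toy data set, and single `time` runs are noisy; a benchmarking tool such as hyperfine, if it is available, averages over many runs:

```
$ hyperfine './find_dups_by_md5.sh needle_file.txt' \
            './find_dups_by_cmp.sh needle_file.txt'
```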