Shellscript implementation of the OP's (@vume's) idea
Background with the example rsync
Have a look at rsync. It has several levels of checking whether files are identical. The manual, man rsync, is very detailed, and there you can identify what I describe, and probably also some other interesting alternatives.
The most rigorous check is to compare each byte, but as you write, it takes a lot of time when there is a lot of data, for example a whole backup.
-c, --checksum skip based on checksum, not mod-time & size
-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
The standard check is size and other file attributes (for example time stamps). It is often considered good enough.
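As a rough illustration (the paths here are made up), a dry run shows the two levels; without -c rsync skips files based on size and modification time, with -c it reads and checksums all the data:

rsync -nai /data/ /backup/data/     # quick check: size + modification time
rsync -naic /data/ /backup/data/    # full check: checksum every file (slow on big trees)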
Your idea, @vume, means something between these two checking levels. I have not seen any tool like that, but I would be very interested in such a tool.
Edit 1: The shellscript vumer
The following shellscript, vumer, uses dd to do what I think you want, @vume.
#!/bin/bash

# vumer: sample 'chunks' pieces of 'chnksiz' bytes spread evenly over the file
# and md5sum that sample; files smaller than 512 bytes are md5summed in full.

chunks=16
chnksiz=16

function usage {
  echo "Usage: ${0##*/} <file>"
}

if ! test -f "$1"
then
  usage
  exit
fi

size=$(stat --format='%s' "$1")
step=$(( size/(chunks-1) ))
#echo "step=$step"
tmpfil=$(mktemp)

if [ $size -lt 512 ]
then
  chksum=$(md5sum "$1" | sed 's/ .*//')
else
  for (( i=0;i<chunks;i++ ))
  do
    if [ $i -eq $((chunks-1)) ]
    then
      pos=$((size-chnksiz))
    else
      pos=$((i*step))
    fi
#   echo "$i: $pos"
    dd if="$1" bs=1 skip=$pos count=16 >> "$tmpfil" 2> /dev/null
  done
  chksum=$(md5sum "$tmpfil" | sed 's/ .*//')
fi

modif=$(stat --format='%y' "$1")
modif=${modif//\ /_}

echo "size=$size modif=$modif checksum=$chksum file=\"$1\""
#less "$tmpfil"
rm "$tmpfil"
On my Lenovo C30 workstation (old but quite powerful) I tested vumer with the Ubuntu Desktop 22.04 LTS iso file and compared the time used with md5sum,
$ time vumer ubuntu-22.04-desktop-amd64.iso
size=3654957056 modif=2022-04-19_10:25:02.000000000_+0200 checksum=ec3483153bfb965745753c4b1b92bf2e file="ubuntu-22.04-desktop-amd64.iso"
real 0m0,024s
user 0m0,018s
sys 0m0,008s
$ time md5sum ubuntu-22.04-desktop-amd64.iso
7621da10af45a031ea9a0d1d7fea9643 ubuntu-22.04-desktop-amd64.iso
real 0m19,919s
user 0m5,331s
sys 0m0,988s
So for big files it is really a lot faster than md5sum, which today is considered a [too] simple checksum tool. sha256sum is even slower.
I also checked with a Debian iso file that was converted to replace a couple of boot options, quiet splash, with persistence, and compared it with its original file. vumer had bad luck and did not check the few locations that were modified, so here we must fall back on the classic timestamp to tell the difference. Of course md5sum can tell the difference.
$ vumer debian-live-10.0.0-amd64-standard.iso
size=865075200 modif=2019-09-05_10:14:31.000000000_+0200 checksum=b8af03a946fb400ca66f2bdb2a6bb628 file="debian-live-10.0.0-amd64-standard.iso"
$ vumer persistent-debian-live-10.0.0-amd64-standard.iso
size=865075200 modif=2019-09-12_18:01:55.000000000_+0200 checksum=b8af03a946fb400ca66f2bdb2a6bb628 file="persistent-debian-live-10.0.0-amd64-standard.iso"
$ md5sum *debian-live-10.0.0-amd64-standard.iso
a64ae643520ca0edcbcc769bae9498f3 debian-live-10.0.0-amd64-standard.iso
574ac1f29a6c86d16353a54f6aa8ea1c persistent-debian-live-10.0.0-amd64-standard.iso
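The same blind spot is easy to reproduce synthetically. The file names and the modified offset below are made up for illustration; offset 123456 happens to fall between the chunks vumer samples in a 100 MiB file, so vumer reports the same checksum while md5sum does not:

dd if=/dev/zero of=a.img bs=1M count=100 2> /dev/null
cp a.img b.img
printf 'persistence' | dd of=b.img bs=1 seek=123456 conv=notrunc 2> /dev/null
vumer a.img ; vumer b.img    # same vumer checksum: no sampled chunk covers the change
md5sum a.img b.img           # different md5sums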
So whether vumer and similar tools are useful depends on what kind of files you have and how they might be modified.
Edit 2: A 'oneliner' to scan a directory tree
This is a 'oneliner' to scan a directory tree:
md5sum $(for i in $(find -type f -ls|sed 's/^ *//'|tr -s ' ' '\t '| \
cut -f 7,11|sort -n|sed 's/\t/\t /'|uniq -Dw10|sed 's/\t */\t/'| \
cut -f 2 );do vumer "$i";done|sed -e 's/.*checksum=//'|sort|rev| \
uniq -f 1 -D|rev|tee /dev/stderr|sed -e 's/.*file=//' -e 's/"//g')|uniq -Dw32
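Roughly, the pipeline groups files by size, runs vumer on the remaining candidates, and hands only files with colliding vumer checksums to md5sum. The following is a more readable sketch of the same idea (not byte-for-byte equivalent to the 'oneliner' above, and assuming GNU tools and file names without tabs or newlines):

#!/bin/bash
# keep only paths whose size occurs more than once
find . -type f -printf '%s\t%p\n' \
| awk -F'\t' '{cnt[$1]++; line[NR]=$0}
              END{for(i=1;i<=NR;i++){split(line[i],a,"\t"); if(cnt[a[1]]>1) print a[2]}}' \
| while IFS= read -r f
  do
    # prefix each candidate with its vumer checksum
    printf '%s\t%s\n' "$(vumer "$f" | sed 's/.*checksum=\([0-9a-f]*\).*/\1/')" "$f"
  done \
| sort \
| awk -F'\t' '{cnt[$1]++; line[NR]=$0}
              END{for(i=1;i<=NR;i++){split(line[i],a,"\t"); if(cnt[a[1]]>1) print a[2]}}' \
| xargs -r -d '\n' md5sum | sort | uniq -Dw32    # final full check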
- There were 418 files in a directory tree (with iso files and some other files)
- checking the size identified 72 files (potentially 36 pairs)
- vumer identified 30 files (15 pairs) with the same vumer checksum
- md5sum identified 18 files (9 pairs) with the same md5sum checksum
This means that vumer saved a lot of time; md5sum needed to check only 30 of the 418 files.
Edit 3: The shellscript scan4dblt
I replaced the 'oneliner' with a script, scan4dblt, that I also tested in a few directory trees, and I made some edits to the 'doer' script, vumer.
#!/bin/bash

# scan4dblt: scan the current directory tree for duplicate files in stages:
# same size -> same vumer checksum -> same inode (hard links) -> same md5sum.

function mkfil {
  tmpf0=$(mktemp)
  tmpf1=$(mktemp)
  tmpf2=$(mktemp)
  tmpf3=$(mktemp)
  tmpf4=$(mktemp)
  tmpf5=$(mktemp)
}

function rmfil {
  rm "$tmpf0" "$tmpf1" "$tmpf2" "$tmpf3" "$tmpf4" "$tmpf5"
}

# main

echo -n "${0##*/}: "
mkfil

# same size

find -type f -printf "%s %p\n" \
 |sed 's/ /\t/'|sort -n|uniq -Dw11|tee "$tmpf0" |tr -s '\t' ' ' \
 |cut -d ' ' -f 2- |sed -e 's/&/\\&/g' -e 's/(/\\(/g' -e 's/)/\\)/g' > "$tmpf1"

res=$(wc -l "$tmpf0"|sed 's/ .*//')
if [ $res -gt 1 ]
then
  echo "same size ($res):"
  cat "$tmpf0"
else
  echo "no files with the same size"
  rmfil
  exit 1
fi

# vumer

while read fnam
do
  echo -n '.'
  vumer "$fnam" >> "$tmpf2"
# echo -e "$fnam: $(vumer "$fnam")" >> "$tmpf2"
done < "$tmpf1"
echo ''
sed "$tmpf2" -e 's/.*checksum=//'|sort|uniq -Dw33|tee "$tmpf1" |sed -e 's/.*file=//' -e 's/"//g' > "$tmpf3"

res=$(wc -l "$tmpf1"|sed 's/ .*//')
if [ $res -gt 1 ]
then
  echo "vumer: ($res)"
  cat "$tmpf1"
else
  echo "vumer: no files with the same checksum"
  rmfil
  exit 1
fi

# inode

while read fnam
do
  echo -n '.'
  size=$(stat --format='%s' "$fnam")
  inod=$(stat --format='%i' "$fnam")
  printf "%12d %10d %s\n" $size $inod "$fnam" >> "$tmpf4"
done < "$tmpf3"
echo ''
#cat "$tmpf4"
#cat "$tmpf4" |sort -k3|uniq -f2 -Dw32 |sort -n > "$tmpf5"
cat "$tmpf4" |sort -k2|uniq -f1 -Dw11 |sort -n > "$tmpf5"

> "$tmpf2"
while read fnam
do
  if ! grep "$fnam" "$tmpf5" 2>&1 > /dev/null
  then
    echo "$fnam" >> "$tmpf2"
  fi
done < "$tmpf3"

res=$(wc -l "$tmpf5"|sed 's/ .*//')
if [ $res -gt 1 ]
then
  echo "inode: ($res)"
  echo "        size      inode file-name"
  cat "$tmpf5"
else
  echo "inode: no files with the same inode"
fi

# md5sum

> "$tmpf4"
while read fnam
do
  echo -n '.'
  size=$(stat --format='%s' "$fnam")
  inod=$(stat --format='%i' "$fnam")
  printf "%12d %10d " $size $inod >> "$tmpf4"
  md5sum "$fnam" >> "$tmpf4"
done < "$tmpf2"
echo ''
#echo "4: ";cat "$tmpf4"
cat "$tmpf4" |sort -k3|uniq -f2 -Dw33 |sort -n > "$tmpf5"

res=$(wc -l "$tmpf5"|sed 's/ .*//')
if [ $res -gt 1 ]
then
  echo "md5sum: ($res)"
  echo "        size      inode md5sum                           file-name"
  cat "$tmpf5"
else
  echo "md5sum: no files with the same checksum"
  rmfil
  exit 1
fi

rmfil
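As with the 'oneliner', scan4dblt works on the current directory and calls vumer by name, so both scripts must be executable and reachable via PATH. A possible way to set that up (the install location is just an example):

chmod +x vumer scan4dblt
sudo cp vumer scan4dblt /usr/local/bin   # example location; any directory in PATH works
cd /directory/tree/to/check              # hypothetical directory to scan
scan4dblt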
Edit 4: Improved shellscript scan4dblt plus example (output file)
The shellscript scan4dblt has been developed further and tested with a few directory trees including big iso files, pictures, video clips and documents. Several bugs have been fixed (and the current version replaces the original one here).
Example:
The following example shows the output file produced by
scan4dblt | tee /tmp/scan4dblt.out
Even though a small minority of the files were fully checked by md5sum, the full checks used most of the execution time. The proportion of time spent in md5sum will depend on the file sizes.
Particularly when there are lots of relatively small files, this implementation via shellscripts will be inefficient; a compiled program would be much better. But with huge files, for example iso files and video clips, shellscripts might do a good job.
Edit 5: Additional comment about the shellscripts
If I were to do this exercise again, I would start by separately saving doublets that are due to hard links, and keep one remaining [hardlinked] file in the list to check whether it matches one more file in later comparisons.
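A minimal sketch of that refinement (assuming GNU find; the device+inode pair identifies the underlying data, so every path after the first one for a given pair is a hard-linked doublet that can be set aside immediately):

find . -type f -printf '%D %i %p\n' \
| awk '{ key = $1 " " $2                         # device + inode
         path = $0; sub(/^[^ ]+ [^ ]+ /, "", path)
         if (seen[key]++) print "hard-linked doublet:", path }'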
It would also be interesting to test how big the chunks of data should be in order [for the tool called vumer here] to do a good job. This must probably be tailor-made for the kind of files to be checked for duplicates.
I would also test at which file size the intermediate check [by vumer] becomes useful.
Final(?) comments
I'm glad to notice how much attention this question has received, both answers and comments. As Peter Cordes writes in his answer (as well as in comments), the quick testing tool (in my case vumer) can be improved in several ways depending on the kind of files it is made to test.
In my answer I have only implemented the original idea of @vume, and can show that it is good enough in many cases when combined with other quick sorting methods in order to minimize the need for full checksum testing.
Comments:
- jdupes. – Kamil Maciorowski Jul 12 '22 at 10:07
- fdupes is surprisingly efficient. For example it doesn't try to compare files with different sizes, so if there's only one file of size 12345 bytes it will never get scanned for either partial or complete checksums. – Chris Davies Jul 12 '22 at 18:44
- fdupes compare file sizes first (stored in the inode). So they only compare contents for identically sized files. 2) Comparing files from the start until you find a difference seems the best idea: a) if files differ at all they are probably going to differ within the first block anyway, and b) it's not really feasible to pick and read blocks within arbitrary files without inadvertently reading and discarding a lot more, as there can be a lot of indirection (e.g. NFS on ext4 on LVM on iSCSI on Ceph RADOS). – reinierpost Jul 15 '22 at 12:12