17

Tools like fdupes are ridiculous overkill when dealing with jpg or h264 compressed files. Two such files having the exact same filesize is already a pretty good indication of them being identical.

If, say, in addition to that, 16 equidistant chunks of 16 bytes are extracted and compared, and they are the same as well, that would be plenty of evidence for me to assume that they are identical. Is there something like that?

(By the way I am aware that filesize alone can be a rather unreliable indicator since there are options to compress to certain target sizes, like 1MB or 1 CD/DVD. If the same target size is used on many files, it is quite reasonable that some different files will have the exact same size.)

vume
  • 173
  • 4
  • The closest options in jdupes would be -Q -T. However, although -T compares hashes of 4096-byte blocks, they are sequential and at the start of the file. And some media files are padded, so different files might be identical in the first 1-2 blocks. There does not seem to be an option to increase this to, say, 8 blocks. Even then, videos with the same intro and the same compression might reasonably be identical in those blocks. – vume Jul 12 '22 at 11:43
  • 6
    fdupes is surprisingly efficient. For example it doesn't try to compare files with different sizes, so if there's only one file of size 12345 bytes it will never get scanned for either partial or complete checksums – Chris Davies Jul 12 '22 at 18:44
  • 6
    Keep in mind that bit-rot is possible; so are incomplete torrents. I wouldn't delete a "duplicate" without checking the whole file. I had written longer comments but turned them into an answer. – Peter Cordes Jul 13 '22 at 03:51
  • 1
    I wonder if your premise is correct. 1) Duplication detection tools such as fdupes compare file sizes first (stored in the inode). So they only compare contents for identically sized files. 2) Comparing files from the start until you find a difference seems the best idea: a) if files differ at all they are probably going to differ within the first block anyway, and b) it's not really feasible to pick and read blocks within arbitrary files without inadvertently reading and discarding a lot more, as there can be a lot of indirection (e.g. NFS on ext4 on LVM on iSCSI on Ceph RADOS). – reinierpost Jul 15 '22 at 12:12
  • Trivially, you'll essentially always (in any real filesystem) have a nonzero probability of false positives with any method that doesn't compare until a difference is found. I suppose it depends on your application whether this is acceptable. – Darren Jul 15 '22 at 12:46
  • @vume's idea is that comparing blocks out of sequence may be faster than comparing them in sequence. I don't think that is true. In certain specific situations it may be. – reinierpost Jul 15 '22 at 15:39
  • Apart from the question of whether it is efficient to compare file blocks in order or out of order, there is the question of hashes. Comparing hashes is more efficient, e.g. where less network traffic may be required for hashes that are small compared to the amount of data hashed, with a high probability of difference. Comparing hashes may be less efficient when all files are local. Some disk controllers or file appliances may have primitives that return hashes instead of transferring data. Some file systems may store hashes. – Krazy Glew Jul 15 '22 at 23:06

8 Answers

12

czkawka is an open source tool which was created to find duplicate files (and images, videos or music) and present them through command-line or graphical interfaces, with an emphasis on speed. This part from the documentation may interest you:

Faster scanning for big number of duplicates

By default, for all files grouped by the same size, a partial hash is computed (a hash of only the first 2 KB of each file). Such a hash is usually computed very fast, especially on SSDs and fast multicore processors. But when scanning hundreds of thousands or millions of files on an HDD or a slow processor, this step can typically take a long time.

With the GUI version, hashes will be stored in a cache so that searching for duplicates later will be way faster.


Examples:

Create some test files:

We generate random images, then copy a.jpg to b.jpg in order to have a duplicate.

$ convert -size 1000x1000 plasma:fractal a.jpg
$ cp -v a.jpg b.jpg
'a.jpg' -> 'b.jpg'
$ convert -size 1000x1000 plasma:fractal c.jpg
$ convert -size 1000x1000 plasma:fractal d.jpg
$ ls --size
total 1456
364 a.jpg  364 b.jpg  364 c.jpg  364 d.jpg

Check only the size:

$ linux_czkawka_cli dup --directories /run/shm/test/ --search-method size
Found 2 files in 1 groups with same size(may have different content) which took 361.76 KiB:
Size - 361.76 KiB (370442) - 2 files 
/run/shm/test/b.jpg
/run/shm/test/a.jpg

Check files by their hashes:

$ linux_czkawka_cli dup --directories /run/shm/test/ --search-method hash
Found 2 duplicated files in 1 groups with same content which took 361.76 KiB:
Size - 361.76 KiB (370442) - 2 files 
/run/shm/test/b.jpg
/run/shm/test/a.jpg

Check files by analyzing them as images:

$ linux_czkawka_cli image --directories /run/shm/test/
Found 1 images which have similar friends
/run/shm/test/a.jpg - 1000x1000 - 361.76 KiB - Very High
/run/shm/test/b.jpg - 1000x1000 - 361.76 KiB - Very High
A.L
  • 1,586
  • 4
    Sounds interesting, but if you're looking for CLI tool, perhaps even more interesting is fclones. A comparison against GNU's cmp (the selected answer) would also be interesting. – Seamus Jul 13 '22 at 00:33
  • @Seamus I don't know how to use cmp to compare several files. I added an example with 4 files. – A.L Jul 13 '22 at 11:37
  • is which took 361.76 KiB supposed to reflect elapsed time instead of size? If size, "which took" is unusual wording. – shoover Jul 14 '22 at 15:37
  • 1
    @shoover I guess it means that the duplicate files, if they indeed turn out to be duplicates, are wasting a total of 361.76 KiB, which could be recovered by deleting them. – Fabio says Reinstate Monica Jul 14 '22 at 17:09
11

You'd probably want to make sure you do a full compare (or hash) of the first and last 1 MiB or so, where metadata can live that might be edited without introducing offsets into the compressed data. Also, read granularity from storage is typically at least 512 bytes, not 16, so you might as well compare that much (aligned at a 512-byte boundary); the tiny bit of extra CPU time to compare more data is trivial.

(A write sector size of at least 4096B is typical, but a logical sector size of 512 might allow a SATA disk to only send the requested 512B over the wire, if the kernel doesn't widen the request to a full page itself. Which it probably would; the pagecache is managed in whole pages.)
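
As a rough sketch of that idea (the file names, the 1 MiB window and the use of cmp here are my own placeholder choices, not a specific tool's behaviour), a pre-check of two same-sized candidates could look like:

# Spot-check the head and tail of two same-sized files before a full compare.
a=copy1.mp4 b=copy2.mp4        # hypothetical candidate files of identical size
chunk=$((1024*1024))           # 1 MiB, well above the 512-byte read granularity

if cmp -s -n "$chunk" "$a" "$b" &&
   cmp -s <(tail -c "$chunk" "$a") <(tail -c "$chunk" "$b")
then
    echo "head and tail match; still run a full cmp before deleting anything"
else
    echo "files differ"
fi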


Keep in mind that bit-rot is possible, especially if files have been stored on DVD-R or other optical media. I wouldn't delete a "duplicate" without checking for bitwise identical (or at least identical hashes). Ruling out duplicates quickly based on a hash signature of an early part of a file is useful, but you'd still want to do a full check before declaring two files duplicates for most purposes.


If two files are almost the same but have a few bit-differences, use ffmpeg -i foo.mp4 -f null - to find glitches, decoding but doing nothing with the output.

If you do find a bitwise difference but neither file has errors a decoder notices, use

ffmpeg -i foo.mp4 -f framecrc  foo.fcrc

or -f framemd5 to see which frame has a difference that wasn't an invalid h.264 stream. Then seek to that point and visually inspect which copy is corrupt.
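
For example (a sketch; the file names are placeholders), dump per-frame hashes of both copies and diff them to locate the first differing frame:

ffmpeg -i copy1.mp4 -f framemd5 copy1.framemd5
ffmpeg -i copy2.mp4 -f framemd5 copy2.framemd5
# the first differing line identifies the frame to seek to and inspect
diff copy1.framemd5 copy2.framemd5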

Your method could be good for detecting files that are corrupt (or metadata-edited) copies of each other, something that normal duplicate-finders won't do easily. Comments under the question point out that jdupes can use hashes of the first N megabytes of a file after a size compare, so that's a step in the right direction.


For other use-cases, maybe you'd be ok with less stringent checking, but given that duplicate file finders exist that only compare or hash when there are files of identical size, you can just let one of those run (overnight or while you're going out), and come back to a fully checked list.

Some like fslint have the option to hard-link duplicates to each other (or symlink), so next time you look for duplicates, they'll already be the same file. So in my experience, duplicate file finding is not something where I've felt a need to take a faster but risky approach.

(fslint never got updated for Python3, apparently czkawka is a modern clone in Rust, according to an askubuntu answer.)

Peter Cordes
  • 6,466
10

Does GNU cmp help you?

  • You can use the -s option to suppress output and only use the return value
  • It checks the file size first, so it skips the comparison entirely when the sizes differ
  • With options -i (skip initial) and -n (number of bytes to compare) you can additionally define a range of bytes you want to compare

If the number of files is too big to cmp each pair of them, you may want to first sort all files by their size and only compare groups with the same size (uniq -D with -w).
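
A minimal sketch of that approach (GNU tools assumed; fileA and fileB stand for two files from the same size group, and the offset and length are arbitrary values chosen for illustration):

# Group files by size: pad the size to a fixed width so uniq can compare on it.
find . -type f -printf '%12s %p\n' | sort -n | uniq -Dw12

# Spot-check one pair from a group: compare 4096 bytes starting 1 MiB into each
# file (pick an offset smaller than the file size).
cmp -s -i 1048576 -n 4096 fileA fileB && echo "spot check passed"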

Philippos
  • 13,453
  • 6
    Could you please share an example of the usage to find duplicates? It looks like it compares only 2 files at a time, so how do you find duplicates if a directory contains 50 files? – A.L Jul 13 '22 at 09:22
  • 1
    cmp is not at all optimized for speed in the way that fdupes and similar are. (Comparing only files with the same sizes is part of what fdupes does out-of-the-box; so is comparing only the beginning and ending chunks and only continuing further if there are matches that can't be disambiguated any other way). – Charles Duffy Jul 14 '22 at 17:53
  • 1
    @CharlesDuffy The OP says fdupes is overkill, because only a small segment of the files needs to be compared for this purpose. What cmp does can't be optimized further for that given goal: compare the size, compare the given bits, and never continue beyond that. That's the request here. What you call an optimization is actually the problem here. – Philippos Jul 15 '22 at 06:01
  • 1
    @Philippos, the problem is that cmp doesn't span out to more than two files at a time. If you want to know if you have any duplicates among 5,000 files, it's crazy to even consider comparing them pairwise; you want to narrow down to the ones with the same sizes, and then toss out the ones where the first bytes don't match, and only then look at whether the bytes at the tail match (or do other spot-checking). There are other answers suggesting tools that do this; cmp does not, and is consequently unfit for the task at hand. – Charles Duffy Jul 15 '22 at 13:53
  • So what the OP wants is a subset of fdupes behavior, without the "compare the whole body if everything else looks like a match" at the end. It's like fdupes does 20 things, and the OP wants the first 19, whereas cmp does only one of them. It's manifestly unsuited. Sure, you can combine several cmp calls to get close, but then you're instructing the OP to write their own replacement for 95% of fdupes; it would make more sense just to take fdupes and patch it to make the one step that isn't desired optional. – Charles Duffy Jul 15 '22 at 13:55
  • That's what I outlined in the last paragraph: do a simple filesize sort with uniq -D and only compare groups of identically sized files pairwise. – Philippos Jul 15 '22 at 14:48
5

Shellscript implementation of the OP's (@vume's) idea

Background, with rsync as an example

Have a look at rsync. It has several levels of checking whether files are identical. The manual (man rsync) is very detailed; in it you can identify what I describe here, and probably also some other interesting alternatives.

  • The most rigorous check is to compare each byte, but as you write, it takes a lot of time when there is a lot of data, for example a whole backup.

    -c, --checksum              skip based on checksum, not mod-time & size
    -a, --archive               archive mode; equals -rlptgoD (no -H,-A,-X)
    
  • The standard check is size and other file attributes (for example time stamps). It is often considered good enough.

  • Your idea, @vume, means something between these two checking levels. I have not seen any tool like that, but I would be very interested in such a tool.
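
For illustration, the checksum level from the first bullet can be exercised in a dry run to list which files in one tree differ in content from another (a sketch; the directory names are placeholders):

# -n dry run, -r recursive, -c full-content checksums, -i itemize differences.
# Nothing is transferred; rsync only reports the files whose content differs.
rsync -nrci dir1/ dir2/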

Edit 1: The shellscript vumer

The following shellscript vumer uses dd to do what I think you want, @vume.

#!/bin/bash

chunks=16
chnksiz=16

function usage { echo "Usage: ${0##*/} <file>"; }

if ! test -f "$1"; then
 usage
 exit
fi

size=$(stat --format='%s' "$1")
step=$(( size/(chunks-1) ))
#echo "step=$step"
tmpfil=$(mktemp)

if [ $size -lt 512 ]; then
 chksum=$(md5sum "$1" | sed 's/ .*//')
else
 for (( i=0; i<chunks; i++ ))
 do
  if [ $i -eq $((chunks-1)) ]; then
   pos=$((size-chnksiz))
  else
   pos=$((i*step))
  fi
#  echo "$i: $pos"
  dd if="$1" bs=1 skip=$pos count=16 >> "$tmpfil" 2> /dev/null
 done
 chksum=$(md5sum "$tmpfil" | sed 's/ .*//')
fi

modif=$(stat --format='%y' "$1")
modif=${modif//\ /_}
echo "size=$size modif=$modif checksum=$chksum file=\"$1\""
#less "$tmpfil"
rm "$tmpfil"

On my Lenovo C30 workstation (old but quite powerful) I tested vumer with the Ubuntu Desktop 22.04 LTS iso file and compared the time used with that of md5sum:

$ time vumer ubuntu-22.04-desktop-amd64.iso
size=3654957056 modif=2022-04-19_10:25:02.000000000_+0200 checksum=ec3483153bfb965745753c4b1b92bf2e file="ubuntu-22.04-desktop-amd64.iso"

real    0m0,024s
user    0m0,018s
sys     0m0,008s

$ time md5sum ubuntu-22.04-desktop-amd64.iso
7621da10af45a031ea9a0d1d7fea9643  ubuntu-22.04-desktop-amd64.iso

real    0m19,919s
user    0m5,331s
sys     0m0,988s

So for big files it is really a lot faster than md5sum, which today is considered a [too] simple checksum tool. sha256sum is even slower.

I also checked with a Debian iso file that had been modified to replace a couple of boot options, quiet splash, with persistence, and compared it with its original file. vumer had bad luck and did not check the few locations that were modified. So here we must fall back on the classic timestamp to tell the difference. Of course md5sum can tell the difference.

$ vumer debian-live-10.0.0-amd64-standard.iso
size=865075200 modif=2019-09-05_10:14:31.000000000_+0200 checksum=b8af03a946fb400ca66f2bdb2a6bb628 file="debian-live-10.0.0-amd64-standard.iso"

$ vumer persistent-debian-live-10.0.0-amd64-standard.iso
size=865075200 modif=2019-09-12_18:01:55.000000000_+0200 checksum=b8af03a946fb400ca66f2bdb2a6bb628 file="persistent-debian-live-10.0.0-amd64-standard.iso"

$ md5sum *debian-live-10.0.0-amd64-standard.iso
a64ae643520ca0edcbcc769bae9498f3  debian-live-10.0.0-amd64-standard.iso
574ac1f29a6c86d16353a54f6aa8ea1c  persistent-debian-live-10.0.0-amd64-standard.iso

So whether vumer and similar tools are useful depends on what kind of files you have, and how they might be modified.

Edit 2: A 'oneliner' to scan a directory tree

This is a 'oneliner' to scan a directory tree:

md5sum $(for i in $(find -type f -ls|sed 's/^ *//'|tr -s ' ' '\t     '| \
cut -f 7,11|sort -n|sed 's/\t/\t     /'|uniq -Dw10|sed 's/\t */\t/'| \
cut -f 2 );do vumer "$i";done|sed -e 's/.*checksum=//'|sort|rev| \
uniq -f 1 -D|rev|tee /dev/stderr|sed -e 's/.*file=//' -e 's/"//g')|uniq -Dw32
  • There were 418 files in a directory tree (with iso files and some other files)
  • checking the size identified 72 files (potentially 36 pairs)
  • vumer identified 30 files (15 pairs) with the same vumer checksum
  • md5sum identified 18 files (9 pairs) with the same md5sum checksum

This means that vumer saved a lot of time; md5sum needed to check only 30 of the 418 files.

Edit 3: The shellscript scan4dblt

I replaced the 'oneliner' with a script, scan4dblt, that I also tested in a few directory trees, and made some edits to the 'doer' script, vumer.

#!/bin/bash

function mkfil {
 tmpf0=$(mktemp)
 tmpf1=$(mktemp)
 tmpf2=$(mktemp)
 tmpf3=$(mktemp)
 tmpf4=$(mktemp)
 tmpf5=$(mktemp)
}

function rmfil {
 rm "$tmpf0" "$tmpf1" "$tmpf2" "$tmpf3" "$tmpf4" "$tmpf5"
}

# main

echo -n "${0##*/}: "
mkfil

# same size

find -type f -printf "%s %p\n" \
 |sed 's/ /\t/'|sort -n|uniq -Dw11|tee "$tmpf0" |tr -s '\t' ' ' \
 |cut -d ' ' -f 2- |sed -e 's/&/\&/g' -e 's/(/\(/g' -e 's/)/\)/g' > "$tmpf1"

res=$(wc -l "$tmpf0"|sed 's/ .*//')
if [ $res -gt 1 ]; then
 echo "same size ($res):"
 cat "$tmpf0"
else
 echo "no files with the same size"
 rmfil; exit 1
fi

# vumer

while read fnam; do
 echo -n '.'
 vumer "$fnam" >> "$tmpf2"
# echo -e "$fnam: $(vumer "$fnam")" >> "$tmpf2"
done < "$tmpf1"
echo ''
sed "$tmpf2" -e 's/.*checksum=//'|sort|uniq -Dw33|tee "$tmpf1" \
 |sed -e 's/.*file=//' -e 's/"//g' > "$tmpf3"

res=$(wc -l "$tmpf1"|sed 's/ .*//')
if [ $res -gt 1 ]; then
 echo "vumer: ($res)"
 cat "$tmpf1"
else
 echo "vumer: no files with the same checksum"
 rmfil; exit 1
fi

# inode

while read fnam; do
 echo -n '.'
 size=$(stat --format='%s' "$fnam")
 inod=$(stat --format='%i' "$fnam")
 printf "%12d %10d %s\n" $size $inod "$fnam" >> "$tmpf4"
done < "$tmpf3"
echo ''
#cat "$tmpf4"
#cat "$tmpf4" |sort -k3|uniq -f2 -Dw32 |sort -n > "$tmpf5"
cat "$tmpf4" |sort -k2|uniq -f1 -Dw11 |sort -n > "$tmpf5"

> "$tmpf2"
while read fnam; do
 if ! grep "$fnam" "$tmpf5" 2>&1 > /dev/null; then
  echo "$fnam" >> "$tmpf2"
 fi
done < "$tmpf3"

res=$(wc -l "$tmpf5"|sed 's/ .*//')
if [ $res -gt 1 ]; then
 echo "inode: ($res)"
 echo " size inode md5sum file-name"
 cat "$tmpf5"
else
 echo "inode: no files with the same inode"
fi

# md5sum

> "$tmpf4"
while read fnam; do
 echo -n '.'
 size=$(stat --format='%s' "$fnam")
 inod=$(stat --format='%i' "$fnam")
 printf "%12d %10d " $size $inod >> "$tmpf4"
 md5sum "$fnam" >> "$tmpf4"
done < "$tmpf2"
echo ''
#echo "4: ";cat "$tmpf4"
cat "$tmpf4" |sort -k3|uniq -f2 -Dw33 |sort -n > "$tmpf5"

res=$(wc -l "$tmpf5"|sed 's/ .*//')
if [ $res -gt 1 ]; then
 echo "md5sum: ($res)"
 echo " size inode md5sum file-name"
 cat "$tmpf5"
else
 echo "md5sum: no files with the same checksum"
 rmfil; exit 1
fi

rmfil

Edit 4: Improved shellscript scan4dblt plus example (output file)

The shellscript scan4dblt has been developed further and tested with a few directory trees including big iso files, pictures, video clips and documents. Several bugs have been fixed (and the current version replaces the original one here).

Example:

The following example shows the output file produced by

scan4dblt | tee /tmp/scan4dblt.out

Even though only a small minority of the files were fully checked by md5sum, those full checks used most of the execution time. The proportion of time taken by md5sum will depend on the file sizes.

Particularly when there are lots of relatively small files, this implementation via shellscripts will be inefficient; a compiled program would be much better. But with huge files, for example iso files and video clips, shellscripts might do a good job.

Edit 5: Additional comment about the shellscripts

If I were to do this exercise again, I would start by setting aside doublets that are due to hard links, and keep one remaining [hardlinked] file in the list to check whether it matches one more file in later comparisons.

It would also be interesting to test how big the chunks of data should be in order [for the tool called vumer here] to do a good job. This probably has to be tailored to the kind of files to be checked for duplicates.

I would also test from which file size the intermediate check [by vumer] is useful.

Final(?) comments

I'm glad to notice how much attention this question has received, both answers and comments. As Peter Cordes writes in his answer (as well as in comments), the quick testing tool (in my case vumer) can be improved in several ways depending on the kind of files it is made to test.

In my answer I have only implemented the original idea of @vume, and can show that it is good enough in many cases when combined with other quick sorting methods in order to minimize the need for full checksum testing.

sudodus
  • 6,421
  • 1
    For a multimedia file, you might want to make sure the blocks you check include the first 1 or 2 MiB, and the last. Tags / metadata are typically there (which could be edited without offsetting other bytes like your quiet / splash), not in the middle. Anything short of a full checksum won't detect bit-rot / corruption, though, so if I was going to delete one copy or the other I'd want to make sure they were truly identical. (And if not, ffmpeg -i foo.mp4 -f null - to see which decodes without errors.) – Peter Cordes Jul 13 '22 at 03:26
5

There is a tool called imosum that works similarly to e.g. sha256sum, except that it only uses three 16 kB blocks. The samples are taken from the beginning, middle and end of the file, and the file size is included in the hash as well.

Example usage for finding duplicates:

pip install imohash
find .../your_path -type f -exec imosum {} + > /tmp/hashes
sort /tmp/hashes | uniq -w 32 --all-repeated=separate

Output will have groups of duplicate files:

e8e7d502a5407e75dc1b856024f2c9aa  path/to/file1
e8e7d502a5407e75dc1b856024f2c9aa  other/path/to/duplicate1

e9c83ccdc726d0ec83c55c72ea151d48  path/to/file2
e9c83ccdc726d0ec83c55c72ea151d48  other/path/to/duplicate2

On my SSD, this took about 10 seconds to process 72 GB worth of digital photos (10k files).

jpa
  • 1,269
3

My usual approach when comparing files is to use a hash. For example:

sha1sum -- * |sort >output_file

will create the hashes and sort them so you can see the duplicates in the output file.
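
If you only want to see the duplicate groups rather than the whole sorted list, a small addition could be (assuming GNU uniq; 40 is the width of a SHA-1 hex digest):

# print only lines whose first 40 characters (the SHA-1 hash) repeat,
# with a blank line between groups of duplicates
sha1sum -- * | sort | uniq -w 40 --all-repeated=separate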

And this gives much higher confidence that the files are the same than checking just the first several bytes.

Romeo Ninov
  • 17,484
  • 1
    Not just the FIRST 16 bytes. 16*16 bytes spread all over the file. Ideally with an offset of 8k from the start and end. Hashing requires reading the entire file, which takes a tremendous amount of time for large files. – vume Jul 12 '22 at 11:43
  • @vume, my tests show times like 1 second for a 200 MB file. Is this unreasonable for you? – Romeo Ninov Jul 12 '22 at 11:44
  • 2
    I admit I never actually tested this and am pleasantly surprised by how fast my old laptop can compute hashes even from external drives. No doubt my concept would be much faster though and with similar reliability for compressed files. – vume Jul 12 '22 at 14:12
  • 1
    @vume Why would run times change depending on whether the file is compressed or not? A hash of a 200 MB file reads all the bytes, and it doesn't care what those bytes represent. Your solution may be faster, but designing for simplicity has advantages. – doneal24 Jul 12 '22 at 16:40
  • 4
    @doneal24: vume's point was that only reading a few small portions of a file is always much faster, and a compressed file is unlikely to have matches at some blocks without matching everywhere (unless it's a corrupt copy, that was originally the same but has some bytes zeroed or otherwise modified). Variable-length compression will at least offset later parts that are the same, and lossy compression like h.264 (with most profiles) simply doesn't end up with parts that are byte-for-byte identical. Even JPEG, where each 8x8 block is compressed separately, is still like this I think. – Peter Cordes Jul 13 '22 at 03:18
  • @RomeoNinov: If you were going to do this, you'd use fdupes or fslint-gui to avoid reading a file that didn't have any others the same size in your input set. Careful exact-duplicate file finders already exist, including I think some that trust SHA1 or MD5 without doing a byte-by-byte compare on each pair of files in a set that's all the same length. – Peter Cordes Jul 13 '22 at 03:20
  • 1
    @PeterCordes I understand vume's goal and reasoning. I just question where he's trying to put his optimization efforts. Writing a custom tool to read 128 bytes and then seek 8k further into the file, repeating until you reach EOF, seems like a lot of work to save a few seconds. As Romeo said, you can checksum 200MB+/second with standard tools. – doneal24 Jul 13 '22 at 12:17
  • 1
    @PeterCordes, right. But I can keep the file with hashes for future reference. And also check whether any file has changed in the meantime. – Romeo Ninov Jul 13 '22 at 12:48
  • 1
    @doneal24: If your files are on a NAS over 1G ethernet, you only have 100MB/s bandwidth. That can take a fair while if you have multiple files of 1 to 4 GiB (e.g. videos). Doing a fast first pass to rule out duplicates before hashing a whole file makes sense, but skipping a full hash or compare doesn't make a lot of sense to me, either. (See my answer.) – Peter Cordes Jul 13 '22 at 13:05
2

As the author of the disketo tool, I can recommend it: https://github.com/martlin2cz/disketo

Clone it, and run:

$ cd disketo
$ bin/run.pl scripts/files-with-duplicities.ds YOUR_DIRECTORY

It will print to stdout, on each line, the path to a file that has at least one duplicate (a file with the same name), followed by the paths to all of these duplicates (separated by TABs).

You can customize the search. Instead of the pre-installed "files-with-duplicities.ds", provide a custom disketo script. To compare not just the file name but also the size, use this ds file:

filter files having at-least-one of-the-same name-and-size
print files with files-of-the-same name-and-size

If you wish to compare based on something else (e.g. some 16-byte chunks of the file contents), use a custom sub:

group files by-custom-groupper sub {
        my ($file, $context) = @_;
        # TODO implement properly
        use File::Slurp;
        return read_file($file);
} as-meta "same-file-contents"
filter files having at-least-one of-the-same custom-group "same-file-contents"

print files with files-of-the-same custom-group "same-file-contents"

Or open an issue and I could add it.

martlin
  • 173
0

Rdfind is awesome as well as being included in most distros.

It checks file size, then start bytes, then end bytes, then an sha1 hash.
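
A typical session (the path is a placeholder) might be a dry run first to review the report, then an action such as hardlinking:

# dry run: only write the results.txt report, change nothing
rdfind -dryrun true /path/to/media

# if the report looks right, replace duplicates with hard links
# (or use -deleteduplicates true to remove them instead)
rdfind -makehardlinks true /path/to/media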