1

Problem: finding duplicate files should be fast and should not open and hash files unnecessarily. @taltman has written a beautifully fast script, here, which only uses MD5 when files of identical size have been found. The script only runs on CentOS, and it doesn't provide any output about the largest files.

Status: I have ported the script so that it also runs on Linux Mint (Cinnamon) - yay! It can even handle spaces in file names. It's here. The output currently looks like this:

MD5 30599d19eb93cfb45030a1b0270e698c:
        ./../abc.jpg
        ./../xyz.jpg
MD5 3d0bc4e9ec8c77f5a430d8455252ef58:
        ./../def.mp4
        ./../hij.mp4

I would like to sort the report blocks by size (largest at the bottom) and show the size of each group. The output as I would love it:

## 4.53MB (MD5 30599d19eb93cfb45030a1b0270e698c):
        ./../abc.jpg
        ./../xyz.jpg
## 1.76GB (MD5 3d0bc4e9ec8c77f5a430d8455252ef58):
        ./../def.mp4
        ./../hij.mp4

Asking for help: Is there anyone who really understands AWK? Willing to help?

roddyt
  • 11
  • 1
    Your source material seems flawed -- processing, as it is, the output of ls. There seems to be broad agreement that that is a bad idea. – bxm Jul 01 '23 at 21:50
  • If you change hashes[hash] to hashes[hash]=size, then you will also have access to the sizes in the for (hash in hashes) loop by printing hashes[hash]. You can then use GNU awk's array-sorting facilities to sort your hashes[] array on its values (see PROCINFO["sorted_in"] and its accepted values in man awk); a sketch along those lines is shown after these comments. Also, the && $5 > 35 part of the code skips files of 35 bytes or smaller. And you are not handling file names with spaces correctly either; a file name containing more than 6 whitespaces won't work. – αғsнιη Jul 02 '23 at 05:21
  • Have you tried fdupes or, even better, rdfind? Both use similar approaches: only resorting to hashing for identically sized files. – Chris Davies Jul 02 '23 at 07:36
  • [edit] your question to include concise, testable sample input and expected output and a description of what it is you're trying to do so we can help you. No links, no images, just your complete question stated here as text. – Ed Morton Jul 02 '23 at 19:56
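
To make the GNU awk suggestion from the comments concrete, here is a minimal sketch (not the original script; it assumes the duplicate candidates have already been reduced to tab-separated lines of the form md5<TAB>size<TAB>path, and it prints raw byte counts rather than human-readable sizes):

# sketch only: each input line is assumed to be "md5<TAB>size<TAB>path"
BEGIN { FS = "\t" }
{
    hash = $1; size = $2; path = $3
    hashes[hash] = size                      # store the size instead of a bare flag
    count[hash]++
    files[hash] = files[hash] "\n\t" path    # collect the paths belonging to each hash
}
END {
    # GNU awk only: traverse the array ordered by its values (the sizes), ascending,
    # so the largest duplicate group is printed last
    PROCINFO["sorted_in"] = "@val_num_asc"
    for (hash in hashes)
        if (count[hash] > 1)
            printf "## %d bytes (MD5 %s):%s\n", hashes[hash], hash, files[hash]
}

Converting the byte counts to MB/GB would still need a small helper, such as the convert_bytes function in the bash answer below.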

3 Answers

1

Maybe not an answer to your question, but this might be an easier implementation. I've written it in bash, and it is probably less complex than dealing with awk.

#!/usr/bin/env bash

die() {
  echo >&2 "$@"
  exit 1
}

usage() {
  echo >&2 "Usage: $0 path"
  die
}

checkdupes() {
  local path="$1"
  declare -A flist
  declare -a output_array

  while read -r sum fname; do
    if [[ ${flist[$sum]} ]]; then
      fsize=$(stat --printf="%s" "$fname")
      fsize_converted=$(convert_bytes "$fsize")
      output_array+=("$fsize_converted $(md5sum "$fname") and ${flist[$sum]} are identical")
    fi
    flist[$sum]+="$fname"
  done < <(find "$path" -type f -exec sha256sum {} +)

  IFS=$'\n' sorted_array=($(sort -h <<<"${output_array[*]}"))
  unset IFS
  for ((i=${#sorted_array[@]}-1; i>=0; i--)); do
    printf '%s\n' "${sorted_array[i]}"
  done
}

convert_bytes() {
  local bytes=$1
  local unit=""
  local value=""

  if ((bytes < 1024)); then
    unit="B"
    value=$bytes
  elif ((bytes < 1048576)); then
    unit="KB"
    value=$((bytes / 1024))
  elif ((bytes < 1073741824)); then
    unit="MB"
    value=$((bytes / 1048576))
  else
    unit="GB"
    value=$((bytes / 1073741824))
  fi

  printf '%d%s' "${value}" "${unit}"
}

if (($# < 1)); then
  usage
else
  checkdupes "$1"
fi

Throughout my scripts here at SE you'll see this part

die()
{
  echo >&2 "$@"
  exit 1
}

usage()
{
  echo >&2 "Usage: $0 path"
  die
}

This is actually just the error handling part. You can create a file called errorhandling and source it in the script.
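
For example (a minimal sketch; it assumes the errorhandling file sits next to the script):

#!/usr/bin/env bash
# pull in the shared die() and usage() helpers instead of redefining them
. "$(dirname "$0")/errorhandling"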

Usage:

./check_dupes [path]

Hope this helps!

1

Solution in TXR Lisp

I made several duplicate files in the subdirectory linenoise:

$ txr findup.tl linenoise/
---
969d22f22e167313   1c11d2 (linenoise/history.txt.link linenoise/history.txt)
---
c3211c8f2a6ac412   1c1e0d (linenoise/example.c)
c3211c8f2a6ac412   1cd21f (linenoise/example.c.dup)
---
e4cd0181a0e73fda   1cd20a (linenoise/LICENSE.lnk linenoise/LICENSE.dup)
e4cd0181a0e73fda   1c11d4 (linenoise/LICENSE)

When I run the program, it reports duplicate groups, each preceded by ---. Within each group, it prints a portion of the SHA256, followed by the inode number, followed by the path(s).

The program identifies files that are hard links to each other (same inode number) as well as duplicates by hash, or combinations of the two.

Various cases are shown above.

The files history.txt.link and history.txt are links to each other. No other duplicate exists, so only one line is shown for them.

The files example.c and example.c.dup are identical, but distinct objects.

Then we have a mixed situation: LICENSE.lnk and LICENSE.dup are links to the same object, which is a duplicate of LICENSE.

Code:

(let ((dir "."))
  (match-case *args*
    ((@where) (set dir where))
    (())
    (@else (put-line "bad arguments" *stderr*)
           (exit nil)))
  (flow (build (ftw dir (lambda (path type stat . rest)
                          (if (eql type ftw-f)
                            (add stat)))))
    (group-by .size)
    hash-values
    (keep-if cdr)
    (each ((group @1))
      (flow group
        (group-by .ino)
        hash-values
        (collect-each ((group @1))
          (let ((hash (with-stream (s (open-file (car group).path))
                        (sha256-stream s))))
            (cons hash group)))
        (sort-group @1 car)
        (each ((subgr @1))
          (when-match @(require ((@hash @stat . @more-stats) . @other-members)
                                (or other-members more-stats))
                      subgr
            (put-line "---")
            (each-match ((@nil . @stats) subgr)
              (format t "~x ~08x ~a\n"
                      [hash 0..8] (car stats).ino
                      (mapcar .path stats)))))))))

The ftw function is a wrapper for the POSIX nftw function. It gives our callback lambda information about each visited object, including its stat structure (the Lisp version of struct stat). The stat structure has slots like ino (inode number), size, and path (full relative path). We can do everything we need with these stat objects.

We first group the objects by size, and keep only groups that have two or more members. Files with unique sizes do not have duplicates, as the question notes.

The approach is first to find groups of paths that have the same size.

We iterate over these groups, subdividing each one into subgroups by inode number. We then hash the leader of each subgroup (each subgroup being a Lisp list of stat structures that share an inode), and tack the hash onto the subgroup as a head item.

Lastly, we do a sort-group operation on these groups, by hash. What that means is that the groups are sorted by hash, and runs of duplicate hashes are grouped together.

Then we just have to walk over these equal hash groups and dump them, where we are careful only to report groups that either have two or more members (duplicate objects) or else have multiple paths for one inode: (or other-members more-stats).

An improvement could be made to the code. In situations when all the files that have a given size are links to the same object (same inode), we do not need to calculate a hash for them; we know they are the same, and no copy exists in that subtree. We could substitute some fake value for the hash, like 0, and just go through the sort-group.

Also, the program neglects to do a full comparison to eliminate false positives. It reports files that have the same SHA256, not identical files.
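
(If that is ever a concern, the reported paths can be confirmed byte for byte afterwards, for instance with the standard cmp utility; this is just an illustration, not part of the program above:)

# exit status 0 from cmp -s means the two files are byte-identical
cmp -s linenoise/example.c linenoise/example.c.dup && echo identical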

Here is a possible way to elide the hash for pure hard link duplicates:

        (collect-each ((group @1))
          (let ((hash (if (cdr @1)
                        (with-stream (s (open-file (car group).path))
                          (sha256-stream s))
                        (load-time (make-buf 32)))))
            (cons hash group)))

The output then being:

---
0000000000000000   1c11d2 (linenoise/history.txt.link linenoise/history.txt)
---
c3211c8f2a6ac412   1c1e0d (linenoise/example.c)
c3211c8f2a6ac412   1cd21f (linenoise/example.c.dup)
---
e4cd0181a0e73fda   1cd20a (linenoise/LICENSE.lnk linenoise/LICENSE.dup)
e4cd0181a0e73fda   1c11d4 (linenoise/LICENSE)

I substituted a full all-zero hash for that case, which is clearly visible. (load-time (make-buf 32)) makes a 32 byte all-zero buffer, same length as a SHA256. load-time ensures that in compiled code, the calculation is carried out once when the code is loaded, not each time it is executed. The cdr function is a Lisp idiom for "does this list have more than one item"; it retrieves the rest of the list, excluding the first item. If that is empty, it returns nil, which is also Boolean false.

Kaz
  • 8,273
  • More caveats: if the directory tree searched spans multiple volumes, then the inode numbers won't necessarily be unique. It could happen that two files of exactly the same size on different volumes have the same inode number, and we falsely conclude they are the same. We need to include the device number. A size cutoff would be useful, and sorting by size. – Kaz Jul 03 '23 at 00:20
0

I THINK what you're trying to do might be something like the following using GNU tools (untested):

while IFS= read -r -d $'\0' currName; do
    # hash via stdin so the comparison string is just "<md5>  -" with no file name in it
    currSum=$(md5sum < "$currName")
    if [[ "$currSum" == "$prevSum" ]]; then
        printf 'Dups:\n'
        printf '%s\n' "$prevName"  # end with \0 if necessary
        printf '%s\n' "$currName"
    fi
    prevSum="$currSum"
    prevName="$currName"
done < <(
    find . -type f -printf '%s\t%p\0' |
    # sort numerically on the size field so same-size files end up adjacent,
    # which is what the awk script below relies on
    sort -z -k1,1n |
    awk '
        BEGIN { RS=ORS="\0" }
        {
            currName = $0
            sub(/[^\t]+\t/,"",currName)
        }
        $1 == prevSize {
            print prevName currName
            prevName = ""
            next
        }
        {
            prevSize = $1
            prevName = currName ORS
        }
    '
)
Ed Morton
  • 31,617