I am trying to read through a directory and identify any files which may be duplicates in their contents.

A simplified version of this:

#!/bin/bash

IFS=$'\n'

for FILEA in $(find "$1" -type f -print)
do
    echo "$FILEA"
    for FILEB in $(find "$1" -type f -print)
    do
        echo cmp \"${FILEA}\" \"${FILEB}\"
        cmp \"${FILEA}\" \"${FILEB}\"
    done
done

However, each instance of cmp complains No such file or directory.

If I grab the echo'd version of the cmp command, it executes exactly as expected.

Why can't cmp see the files when executed in the script context?

muru
jtb

1 Answer

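Your main issue is that you add literal double quotes to the filenames when calling cmp, while the expansions themselves remain unquoted. cmp therefore looks for files whose names include the quote characters, and no such files exist. The echoed command works when pasted back into a shell because the shell then parses those quotes as syntax. You also loop over the output of find, which is fragile: the unquoted command substitution is split on newlines (so names containing newlines break), and the resulting words undergo filename globbing.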
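To make the quoting difference visible, here is a minimal sketch. The helper show_args and the file name are made up for illustration and are not part of the question's script; the sketch assumes the default IFS, whereas with IFS set to a newline only the literal quotes would be the problem.

```shell
#!/bin/bash
# Hypothetical helper: print each argument it receives inside angle brackets.
show_args() { printf '<%s>' "$@"; printf '\n'; }

FILEA='some file.txt'   # made-up name, used only for this demonstration

# Backslash-escaped quotes are ordinary characters, and the expansion itself
# is unquoted, so it is still split on $IFS (whitespace by default).
wrong=$(show_args \"${FILEA}\")

# Unescaped double quotes are shell syntax: one argument, no quote characters.
right=$(show_args "${FILEA}")

echo "escaped quotes: $wrong"
echo "proper quotes:  $right"
```

In the question's script, the corresponding fix would be to call cmp "${FILEA}" "${FILEB}" with unescaped quotes.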

If you really don't want to use fdupes, you could do the following to try to make your approach a bit more efficient (speed-wise):

#!/bin/bash

# enable the "**" globbing pattern,
# remove non-matched patterns rather than keeping them unexpanded, and
# allow the matching of hidden names:
shopt -s globstar nullglob dotglob

pathnames=("${1:-.}"/**)  # all pathnames beneath "$1" (or beneath "." if "$1" is empty)

# loop over the indexes of the list of pathnames
for i in "${!pathnames[@]}"; do
    this=${pathnames[i]}  # the current pathname

    # skip this if it's not a regular file (or a symbolic link to one)
    [[ ! -f "$this" ]] && continue

    # loop over the remainder of the list
    for that in "${pathnames[@]:i+1}"; do

        # skip that if it's not a regular file (or a symbolic link to one)
        [[ ! -f "$that" ]] && continue

        # compare and print if equal
        if [[ "$this" -ef "$that" ]] || cmp -s "$this" "$that"; then
            printf '"%s" and "%s" have identical contents\n' "$this" "$that"
        fi
    done
done

This avoids walking the whole directory structure once for each file (which you do in your inner loop), and also avoids comparing any pair more than once. It is still very slow, as it needs to run cmp on every pair of files in the whole directory hierarchy, so the number of comparisons grows quadratically with the number of files.
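As a small illustration of how the slice "${pathnames[@]:i+1}" visits each unordered pair exactly once, the following sketch uses four placeholder strings instead of real pathnames. The array names items and pairs are assumptions made for this example only.

```shell
#!/bin/bash
items=(a b c d)   # stand-ins for pathnames
pairs=()

for i in "${!items[@]}"; do
    # the slice starts just after index i, so earlier elements are never revisited
    for other in "${items[@]:i+1}"; do
        pairs+=("${items[i]}-$other")
    done
done

printf '%s\n' "${pairs[@]}"
echo "number of pairs: ${#pairs[@]}"
```

Four items give six pairs here, where two full nested loops over the same list would run the inner body sixteen times, including each item against itself.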

Instead, you may want to try a simpler approach:

#!/bin/bash

tmpfile=$(mktemp)

find "${1:-.}" -type f -exec md5sum {} + | sort -o "$tmpfile"
awk 'FNR == NR && seen[$1]++ { next } seen[$1] > 1' "$tmpfile" "$tmpfile"

rm -f "$tmpfile"

This calculates the MD5 checksum for all files, sorts that list and saves it to a temporary file. awk is then used to extract the md5sum output for all files that are duplicated.
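The awk step reads the same file twice, which can be easier to follow on a tiny input. The sketch below feeds it invented checksums and names (not real md5sum output) in place of the sorted list.

```shell
#!/bin/bash
tmpfile=$(mktemp)

# Made-up "checksum  name" lines standing in for sorted md5sum output.
printf '%s\n' \
    'aaa  ./one' \
    'aaa  ./two' \
    'bbb  ./three' \
    'ccc  ./four' \
    'ccc  ./five' \
    'ccc  ./six' > "$tmpfile"

# First pass (FNR == NR): count how often each checksum occurs.
# Second pass: print every line whose checksum was counted more than once.
dupes=$(awk 'FNR == NR && seen[$1]++ { next } seen[$1] > 1' "$tmpfile" "$tmpfile")

rm -f "$tmpfile"
printf '%s\n' "$dupes"
```

Only the aaa and ccc lines are printed; the bbb line is dropped because its checksum occurs once.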

The output will look something like

$ bash ~/script.sh
01b1688f97f94776baae85d77b06048b  ./QA/StackExchange/.git/hooks/pre-commit.sample
01b1688f97f94776baae85d77b06048b  ./Repositories/password-store.git/hooks/pre-commit.sample
036208b4a1ab4a235d75c181e685e5a3  ./QA/StackExchange/.git/info/exclude
036208b4a1ab4a235d75c181e685e5a3  ./Repositories/password-store.git/info/exclude
054f9ffb8bfe04a599751cc757226dda  ./QA/StackExchange/.git/hooks/pre-applypatch.sample
054f9ffb8bfe04a599751cc757226dda  ./Repositories/password-store.git/hooks/pre-applypatch.sample
2b7ea5cee3c49ff53d41e00785eb974c  ./QA/StackExchange/.git/hooks/post-update.sample
2b7ea5cee3c49ff53d41e00785eb974c  ./Repositories/password-store.git/hooks/post-update.sample
3c5989301dd4b949dfa1f43738a22819  ./QA/StackExchange/.git/hooks/pre-push.sample
3c5989301dd4b949dfa1f43738a22819  ./Repositories/password-store.git/hooks/pre-push.sample

In the output above, there happen to be a number of duplicated pairs of files.

If you have filenames that contain embedded newlines (or backslashes), GNU md5sum will prefix the output line with a \ character and escape those characters in the name:

$ touch $'hello\nworld'
$ md5sum *
\d41d8cd98f00b204e9800998ecf8427e  hello\nworld

To be able to cope with this properly (by removing the backslash at the start of the line), you may want to modify the first pipeline in the script into

find "${1:-.}" -type f -exec md5sum {} + | sed 's/^\\//' | sort -o "$tmpfile"
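A quick way to see what that sed expression does, without relying on a particular md5sum implementation, is to run it on the sample line shown earlier. The variable names line and cleaned are assumptions for this sketch.

```shell
#!/bin/bash
# The escaped line from the md5sum example, held as a literal string.
line='\d41d8cd98f00b204e9800998ecf8427e  hello\nworld'

# Strip only a backslash at the very start of the line.
cleaned=$(printf '%s\n' "$line" | sed 's/^\\//')

printf '%s\n' "$cleaned"
```

Note that only the leading backslash goes away: the filename part still contains the two-character sequence \n rather than an actual newline, so the checksum column becomes comparable while the name stays in its escaped form.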
Kusalananda