The following uses md5 to produce an MD5 digest for every file in the current directory or below:
find . -type f -exec md5 {} +
Replace md5 with md5sum --tag if you don't have the BSD md5 utility.
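If the same script has to run on systems with either utility, the choice can be made at run time. Here is a minimal sketch, assuming that md5sum is the GNU coreutils version (the one that has --tag); the variable names md5cmd, scratch and line are made up for this example:

```shell
# Pick whichever MD5 tool is installed; both print "MD5 (<path>) = <digest>".
if command -v md5 >/dev/null 2>&1; then
    md5cmd='md5'
else
    md5cmd='md5sum --tag'
fi

# Try it on an empty scratch file.
scratch=$(mktemp)
# $md5cmd is left unquoted on purpose so that it splits into command and option.
line=$($md5cmd "$scratch")
printf '%s\n' "$line"
rm -f "$scratch"
```

The same $md5cmd (unquoted) could then take the place of md5 in the find command above.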
Let's build a simple script that does this for two directories:
#!/bin/bash
tmpdir=${TMPDIR:-/tmp}
if (( $# != 2 )); then
    echo 'Expected two directories as arguments' >&2
    exit 1
fi
i=0
for dir in "$@"; do
    (( ++i ))
    find "$dir" -type f -exec md5 {} + | sort -t '=' -k2 -o "$tmpdir/md5.$i"
done
This takes two directories on the command line and produces two files called md5.1 and md5.2, one for each directory, in /tmp (or wherever $TMPDIR is pointing). These files are sorted on the MD5 digest.
The files will look like
MD5 (<path>) = <MD5 digest>
with one such line for each file.
Then, in the same script, compare the checksums in the two files:
join -t '=' -1 2 -2 2 "$tmpdir"/md5.[12]
This does a relational "join" operation between the two files, using the checksum as the join field. Any lines that have the same checksum in the two files will be merged and output.
If any checksum is the same in both files, this will output:
<space><MD5 digest>=MD5 (<path1>) =MD5 (<path2>)
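To see what join does here, it may help to try it on two tiny hand-written lists. This is only an illustration: the paths and the four-digit stand-ins for digests below are made up and are much shorter than real MD5 digests.

```shell
tmp=$(mktemp -d)

# Two checksum lists, already sorted on the part after the "=".
printf '%s\n' 'MD5 (a/one) = 1111' 'MD5 (a/two) = 2222' >"$tmp/md5.1"
printf '%s\n' 'MD5 (b/uno) = 1111' 'MD5 (b/tres) = 3333' >"$tmp/md5.2"

# Only the lines whose second "="-delimited field matches are joined.
joined=$(join -t '=' -1 2 -2 2 "$tmp/md5.1" "$tmp/md5.2")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Only the pair sharing the made-up digest 1111 is printed; a/two and b/tres have no partner in the other list and are dropped.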
This may be passed on to awk directly to parse out the two paths:
awk -F '[()]' 'BEGIN { OFS="\t" } { print $2, $4 }'
The -F '[()]' is just a way of saying that we'd like to divide each line into fields based on ( and ). Doing this leaves us with the paths in fields 2 and 4.
This would output
<path1><tab><path2>
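As a quick check of the field numbering, the awk command can be fed a single joined line by hand. The sample line below is made up (placeholder paths and a shortened stand-in for the digest):

```shell
# One line in the shape that join produces.
line=' 1111=MD5 (a/one) =MD5 (b/uno) '

# Splitting on "(" and ")" puts the two paths in fields 2 and 4.
paths=$(printf '%s\n' "$line" |
    awk -F '[()]' 'BEGIN { OFS="\t" } { print $2, $4 }')
printf '%s\n' "$paths"
```

This prints a/one and b/uno separated by a tab.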
Then it's just a matter of reading these pairs of tab-separated paths and issuing the correct commands to create the links:
while IFS=$'\t' read -r path1 path2; do
    echo ln -f "$path1" "$path2"
done
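Because only the tab is used as the field separator, spaces inside the paths survive the read. A small sketch with a made-up pair of paths shows this; it sets the tab through printf instead of $'\t' so that it also runs in shells other than bash, and it only prints what it would do:

```shell
tab=$(printf '\t')

# Feed one tab-separated pair (hypothetical paths containing spaces) to the loop.
pairs=$(
    printf 'dir one/file a\tdir two/file b\n' |
    while IFS=$tab read -r path1 path2; do
        printf 'would replace [%s] with a link to [%s]\n' "$path2" "$path1"
    done
)
printf '%s\n' "$pairs"
```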
In summary:
#!/bin/bash
tmpdir=${TMPDIR:-/tmp}
if (( $# != 2 )); then
    echo 'Expected two directories as arguments' >&2
    exit 1
fi
# Write one checksum list per directory, sorted on the digest.
i=0
for dir in "$@"; do
    (( ++i ))
    find "$dir" -type f -exec md5 {} + | sort -t '=' -k2 -o "$tmpdir/md5.$i"
done
# Pair up the paths that share a digest and print the link commands.
join -t '=' -1 2 -2 2 "$tmpdir"/md5.[12] |
awk -F '[()]' 'BEGIN { OFS="\t" } { print $2, $4 }' |
while IFS=$'\t' read -r path1 path2; do
    echo ln -f "$path1" "$path2"
done
# Clean up the temporary checksum lists.
rm -f "$tmpdir"/md5.[12]
The echo in the while loop is there for safety. Run the script once to see what would happen, then remove the echo and run it again if you feel confident that it's doing the right thing.
Remember that hard links can't span partitions (more precisely, file systems). This means that both directories need to live on the same file system. Files in the second directory that are found to be duplicates will be replaced by hard links to the matching files in the first directory. Keep a backup of the originals somewhere until you're happy with the result!
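After running the script for real, one way of checking the result is the -ef test, which bash (and most other shells) provide and which is true when two paths refer to the same device and inode, i.e. when they are hard links to the same file. A minimal sketch on scratch files (the file names are made up):

```shell
tmp=$(mktemp -d)
echo 'same content' >"$tmp/first"
echo 'same content' >"$tmp/second"

# Before linking, the two names are separate files.
if [ "$tmp/first" -ef "$tmp/second" ]; then before=linked; else before=separate; fi

# This is the kind of command that the loop in the script issues.
ln -f "$tmp/first" "$tmp/second"

# Afterwards, both names refer to the same inode.
if [ "$tmp/first" -ef "$tmp/second" ]; then after=linked; else after=separate; fi

printf 'before: %s, after: %s\n' "$before" "$after"
rm -rf "$tmp"
```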
Note that this solution will not work properly if any pathname contains (, ), =, a tab, or a newline, since these characters are used to delimit fields and lines in the pipeline.
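It may be worth looking for such names before running the script. The following is a rough sketch; the function name problem_names is made up, and it relies on the -path test of find, which matches against the whole pathname, so that directory names are checked too:

```shell
# Hypothetical helper: list the files under the given directories whose
# pathnames contain "(", ")", "=", a tab, or a newline.
problem_names () {
    tab=$(printf '\t')
    nl='
'
    find "$@" -type f \( -path '*[()=]*' -o -path "*$tab*" -o -path "*$nl*" \)
}
```

Called as problem_names dir1 dir2, it prints nothing if all names are safe. Anything that it does list would need to be renamed (or handled by hand) first.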