
I have a lot of music in a tree under one directory, saved in whatever format I initially got it in, for quality. I have a second directory tree which is similar in structure, but with all files in a lossy-compressed format playable by my phone, and with occasional metadata changes (e.g. removing embedded covers to save space).

It occurs to me that for a significant portion of the music, there is no difference between the two instances - generally when the distributed version was only available as mp3/ogg and didn't have embedded covers. Hard drive space may be cheap, but that's no reason to waste it. Is there a way to script:

  1. Check for identical files in two directories
  2. Whenever identical files are found, replace one with a hardlink to the other
  3. Without e.g. taking the time to get a full diff, in the interest of time
  4. But still without the risk of accidentally deleting a copy of two non-identical files, which is a remote but nonzero chance if I were to e.g. just compare hashes?
Vivian

2 Answers


The following uses md5 to produce an MD5 digest for all the files in the current directory or below:

find . -type f -exec md5 {} +

Replace md5 with md5sum --tag if you don't have the BSD md5 utility.
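
On a system with GNU coreutils, the equivalent invocation would be:

find . -type f -exec md5sum --tag {} +

The --tag option makes md5sum print the same BSD-style MD5 (<path>) = <digest> lines that the rest of the script expects.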

Let's build a simple script to do that on the directories:

#!/bin/bash

tmpdir=${TMPDIR:-/tmp}

if (( $# != 2 )); then
    echo 'Expected two directories as arguments' >&2
    exit 1
fi

i=0
for dir in "$@"; do
    (( ++i ))
    find "$dir" -type f -exec md5 {} + | sort -t '=' -k2 -o "$tmpdir/md5.$i"
done

This takes two directories on the command line and produces files called md5.1 and md5.2, one file for each directory, in /tmp (or wherever $TMPDIR is pointing). These files are sorted on the MD5 digest.

The files will look like

MD5 (<path>) = <MD5 digest>

with one such line for each file.

Then, in the same script, compare the checksum between the two files:

join -t '=' -1 2 -2 2 "$tmpdir"/md5.[12]

This does a relational "join" operation between the two files, using the checksum as the join field. Any lines that have the same checksum in the two files will be merged and output.

If any checksum is the same in both files, this will output:

<space><MD5 digest>=MD5 (<path1>) =MD5 (<path2>)

This may be passed on to awk directly to parse out the two paths:

awk -F '[()]' 'BEGIN { OFS="\t" } { print $2, $4 }'

The -F [()] is just a way of saying that we'd like to divide each line into fields based on ( and ). Doing this leaves us with the paths in fields 2 and 4.

This would output

<path1><tab><path2>

Then it's just a matter of reading these pairs of tab-separated paths and issuing the correct commands to create the links:

while IFS=$'\t' read -r path1 path2; do
    echo ln -f "$path1" "$path2"
done

In summary:

#!/bin/bash

tmpdir=${TMPDIR:-/tmp}

if (( $# != 2 )); then
    echo 'Expected two directories as arguments' >&2
    exit 1
fi

i=0
for dir in "$@"; do
    (( ++i ))
    find "$dir" -type f -exec md5 {} + | sort -t '=' -k2 -o "$tmpdir/md5.$i"
done

join -t '=' -1 2 -2 2 "$tmpdir"/md5.[12] |
awk -F '[()]' 'BEGIN { OFS="\t" } { print $2, $4 }' |
while IFS=$'\t' read -r path1 path2; do
    echo ln -f "$path1" "$path2"
done

rm -f "$tmpdir"/md5.[12]

The echo in the while loop is there for safety. Run the script once to see what would happen, then remove the echo and run it again once you're confident that it's doing the right thing.
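
For example, if the script is saved as link-dupes.sh (a name picked here purely for illustration, as are the paths), a dry run could look like:

bash link-dupes.sh ~/music/lossless ~/music/phone

This prints the ln -f commands that would be run, without changing anything.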

Remember that hard links can't span filesystems, so both directories need to live on the same filesystem (partition). The files in the second directory will be replaced by hard links to the corresponding files in the first directory whenever they are found to be duplicates. Keep a backup of the originals somewhere until you're happy with the result!
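
If you want the script to check this for you, something like the following could be added after the argument test; it is a sketch that assumes GNU stat, where %d prints the device number a file resides on:

if [[ "$(stat -c %d "$1")" != "$(stat -c %d "$2")" ]]; then
    echo 'The two directories are on different filesystems' >&2
    exit 1
fi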

Note that this solution will not work properly if any file name contains ( or ) or a tab character.
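
To check whether that is an issue for your collection before running the script, something like the following could be used (a sketch that relies on bash's $'...' quoting to embed the tab; the two paths are placeholders for your directories):

find /path/to/dir1 /path/to/dir2 -name $'*[()\t]*'

Any file names it prints would need to be renamed or handled separately first.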

Kusalananda
  • There's an existing wheel for the duplicate finding: fdupes. Pity it has a delete option, but no hardlink option. – Gilles 'SO- stop being evil' May 27 '17 at 23:01
  • Note that with MD5, there is no risk of accidental collisions, but deliberate collisions are possible. Use SHA-256 instead if deliberate collisions are a concern. – Gilles 'SO- stop being evil' May 27 '17 at 23:02
  • @Gilles - accidental and deliberate? What does this mean? –  May 28 '17 at 00:36
  • @DarkHeart Accidental: two files just happen to have the same hash, by coincidence. For MD5, the probability is negligible if you have fewer than about ten quintillion files (birthday problem). Deliberate: someone creates a pair of files, on purpose, that are not identical but have the same hash. This was first done for MD5 with massive calculations in 2004, and nowadays it is known how to do it instantly for MD5 and with massive calculations for SHA-1, but no one has a clue how to do it for SHA-256. – Gilles 'SO- stop being evil' May 28 '17 at 00:50

Unless you have a large collection of very similar files, computing and comparing hashes does not speed up the process of finding duplicates. The slowest operation is the disk read. Computing a hash means reading the whole file, and it is also a CPU-intensive task with modern cryptographically strong hashes.

We only have to compare data if the file lengths are the same. If there is only one file with a given length, there are obviously no duplicates. If there are two, simply comparing them is always more efficient than hashing. If there are three or more, the number of comparisons rises, but chances are they differ in the first bytes or blocks, so the disk I/O is still low and repeated reads are served from the cache.

That's why I would recommend making a recursive directory listing that prepares a length+pathname list, then sorting the list numerically, and finally dealing only with sets of files sharing the same length by comparing them pairwise. If two files match, one can be replaced by a hardlink.
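
A minimal sketch of that approach in bash might look like the following. It assumes GNU find (for -printf) and cmp for the byte-for-byte comparison, keeps the same echo-for-safety convention as the script in the other answer, and links duplicates wherever they occur in either tree; file names containing tabs or newlines would break it.

#!/bin/bash

if (( $# != 2 )); then
    echo 'Expected two directories as arguments' >&2
    exit 1
fi

group=()        # paths seen so far that share the current length
prev_size=

# Compare every pair in the current same-length group and print the
# ln command for each pair of identical files.
link_dupes() {
    local i j
    for (( i = 0; i < ${#group[@]}; i++ )); do
        for (( j = i + 1; j < ${#group[@]}; j++ )); do
            if cmp -s -- "${group[i]}" "${group[j]}"; then
                echo ln -f -- "${group[i]}" "${group[j]}"   # remove echo once verified
            fi
        done
    done
}

# Emit "<size><tab><path>" for every file, sorted numerically by size,
# so that files of equal length end up next to each other.
while IFS=$'\t' read -r size path; do
    if [[ $size != "$prev_size" ]]; then
        link_dupes              # finish the previous size group
        group=()
        prev_size=$size
    fi
    group+=("$path")
done < <(find "$1" "$2" -type f -printf '%s\t%p\n' | sort -n)
link_dupes                      # flush the last group

As with the other answer's script, run it with the echo in place first and only let it modify anything once the printed ln commands look right.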

VPfB