202

I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory.

Here's the situation: This is a file server which multiple people store audio files on, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I'd like to make it so they're hardlinks, to save hard drive space.

Josh
  • 25
    One problem you may run into with hardlinks is that if somebody decides to do something to one of their music files that you've hard-linked, they could inadvertently be affecting other people's access to their music. – Steven D Oct 13 '10 at 02:48
  • 4
    another problem is that two different files containing "Some Really Great Tune", even if taken from the same source with the same encoder will very likely not be bit-for-bit identical. – msw Oct 13 '10 at 02:57
  • 4
    A better solution might be to have a public music folder... – Stefan Oct 13 '10 at 07:08
  • Use symlinks then: Copy the MP3 Files to a central repo and symlink them into the user's folder, "deleting" a file would just mean deleting the symlink. (But Stefan's idea of the public music folder is better) – tante Oct 13 '10 at 07:52
  • 4
    related: http://superuser.com/questions/140819/ways-to-deduplicate-files – David Cary Mar 16 '11 at 23:59
  • 3
    @tante: Using symlinks solves no problem. When a user "deletes" a file, the number of links to it gets decremented; when the count reaches zero, the file actually gets deleted, that's all. So deletion is no problem with hardlinked files; the only problem is a user trying to edit the file (improbable indeed) or to overwrite it (quite possible if logged in). – maaartinus Mar 14 '12 at 03:56
  • 1
    The comment from @StevenD already notes this is a potential issue, but just moving or deleting a hardlinked file should not affect the other copy of it in most cases. I would think the bigger concern here is that changes such as retagging MP3s would affect both users. What if they prefer different schemes for their metadata? – Caleb Jun 24 '14 at 08:58
  • @Caleb in this specific case, we didn't. But yes that could be an issue. See my comment below about why ZFS dedupe was a better solution in the end. – Josh Jun 25 '14 at 06:19
  • Wikipedia has a full overview on which dedup tools support which platform: http://en.wikipedia.org/wiki/List_of_duplicate_file_finders – oligofren Jan 03 '15 at 13:49
  • 1
    Forget everything people have written and just use btrfs as filesystem. (or any other copy-on-write filesystem you like) – paladin Aug 30 '21 at 14:52
  • As @paladin says, you need a copy-on-write filesystem. This means that 1) you run a script to link the duplicate files together, and 2) when a user changes their file, a copy of it is created on disk, so the other users sharing the same file are not affected. Of the mainstream filesystems I only know of BTRFS doing this; ext4 does not. – Chameleon Nov 10 '21 at 13:45

20 Answers

161

rdfind does exactly what you ask for (and in the order that johny why lists). It lets you delete duplicates or replace them with either soft or hard links. When using symlinks, you can also choose whether they are absolute or relative. You can even pick the checksum algorithm (sha256, md5, or sha1).

Since it is compiled, it is faster than most scripted solutions: timing it on a 15 GiB folder with 2600 files on my 2009 Mac Mini returns this

9.99s user 3.61s system 66% cpu 20.543 total

(using md5).

Available in most package managers (e.g. MacPorts for Mac OS X).


Edit: I can add that rdfind is really easy to use and very pedagogical. Just use the -dryrun true flag first and it is very intuitive rather than scary (which, IMO, tools that delete files usually are).
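
A typical run might look like this (the paths are placeholders; as far as I remember, rdfind ranks the directory listed first highest, so those copies are kept and the later ones get replaced):

rdfind -dryrun true -makehardlinks true /srv/music/alice /srv/music/bob   # preview only
rdfind -makehardlinks true /srv/music/alice /srv/music/bob                # actually link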

d-b
  • 18
    +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint. – Daniel Trebbien Dec 29 '13 at 20:49
  • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks! – oligofren Jan 03 '15 at 13:38
  • Very smart and fast algorithm. – ndemou Oct 30 '15 at 12:53
  • 10
    I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary. – cdhowie May 31 '18 at 21:19
  • rdfind is very dependent on new OS and compiler. (will not run on CentOS 6.x without a near complete rebuild of the development tools) – Cosmo F Apr 08 '19 at 22:33
  • 8
    Beware that some versions of rdfind are affected by a pretty inconvenient bug : source files are deleted when hardlinks can't be created. – Skippy le Grand Gourou Jan 02 '20 at 17:01
  • @SkippyleGrandGourou That defect has been fixed for 1.5 years. Why the scaremongering? – d-b Mar 21 '20 at 12:10
  • @CosmoF What? Here is a script that installs rdfind on CentOS. Very simple https://gist.github.com/cuibonobo/868943676037cc009405 – d-b Mar 21 '20 at 12:11
  • 4
    @d-b I wonder what proportion of systems are less than 1.5 years up-to-date. I lost data, and other people will if they don't check their version. They deserve to be warned. – Skippy le Grand Gourou Mar 23 '20 at 10:08
  • 1
    @d-b -- I have to agree with Skippy; I am running the latest Mint with all updates and still have rdfind 1.3.5; rdfind 1.4.0 is where the fix is, according to the change log. I lost data (annoying, but easily recovered from backups). – Nathanael Jun 12 '20 at 06:03
55

Use the fdupes tool:

fdupes -r /path/to/folder gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:


filename1
filename2

filename3
filename4
filename5


with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
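
If your version of fdupes lacks the -L hard-linking option (see the comments below), a small shell loop over this output can do the linking. This is only a sketch: it assumes filenames without newlines, so try it on a copy of the data first.

fdupes -r /path/to/folder | while IFS= read -r file; do
  if [ -z "$file" ]; then
    first=""                    # blank line: the next line starts a new group
  elif [ -z "$first" ]; then
    first=$file                 # first file of the group becomes the link target
  else
    ln -f -- "$first" "$file"   # replace the duplicate with a hard link to it
  fi
done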

tante
  • 1
    Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet. – Stuart Axon Aug 28 '13 at 14:19
  • 11
    I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from https://github.com/tobiasschulz/fdupes was super easy. – neu242 Aug 30 '13 at 15:07
  • 3
    Try rdfind - like fdupes, but faster and available on OS X and Cygwin as well. – oligofren Jan 03 '15 at 13:43
  • Or if you just require Linux compatibility, install rmlint, which is blazingly fast and has lots of nice options. Truly a modern alternative. – oligofren Jan 03 '15 at 14:28
  • 13
    fdupes seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO. – Calimo Nov 08 '17 at 15:58
  • 6
    There's a similar tool called jdupes that's based on fdupes, but it can also replace the duplicate files with symlinks (-l), hardlinks (-L) or instruct btrfs to deduplicate the blocks on the filesystem level (-B, if you're using btrfs). – Marius Gedminas Aug 22 '18 at 16:36
45

There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:

Traverse all directories named on the command line, compute MD5 checksums and find files with identical MD5. If they are equal, do a real comparison. If they are really equal, replace the second of the two files with a hard link to the first one.
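
Assuming you download the script and save it as trimtrees.pl, the invocation should be along these lines (the paths are just placeholders):

perl trimtrees.pl /srv/music/alice /srv/music/bob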

fschmitt
  • Sounds perfect, thanks!! I'll try it and accept if it works as described! – Josh Oct 12 '10 at 20:09
  • 3
    This did exactly what I asked for. However, I believe that ZFS with dedup will eventually be the way to go, since I did find that the files had slight differences, so only a few could be hardlinked. – Josh Dec 08 '10 at 20:13
  • 15
    Upvoted this, but after researching some more, I kind of wish I hadn't. rdfind is available via the package managers for ALL major platforms (OS X, Linux, (cyg)win, Solaris), and works at blazing native speed. So do check out the answer below. – oligofren Jan 03 '15 at 13:42
  • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This script seems to be the only thing that handles that. – phunehehe Jun 26 '15 at 06:59
  • 8
    Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions). – Charles Duffy Feb 01 '16 at 16:56
32

I use hardlink from http://jak-linux.org/projects/hardlink/
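
The exact flags differ between the implementations mentioned in the comments below, but the basic invocation is just the directories to scan (check hardlink --help; most versions offer a -n/--dry-run switch). The path here is a placeholder:

hardlink -n /srv/music   # dry run: only report what would be linked
hardlink /srv/music      # replace duplicates with hard links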

waltinator
  • 1
    Nice hint; I am regularly using http://code.google.com/p/hardlinkpy/ but it has not been updated for a while... – meduz Apr 11 '12 at 19:09
  • 2
    This appears to be similar to the original hardlink on Fedora/RHEL/etc. –  Jun 21 '12 at 08:43
  • 3
    hardlink is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1.2M files (320 GB), it took just 200 seconds (linking roughly 10% of the files). – Marcel Waldvogel Feb 05 '17 at 19:13
  • FWIW, the above hardlink was created by Julian Andres Klode while the Fedora hardlink was created by Jakub Jelinek (source: https://pagure.io/hardlink - Fedora package name: hardlink) – maxschlepzig Jan 04 '19 at 17:52
19

This is one of the functions provided by "fslint" -- http://en.flossmanuals.net/FSlint/Introduction

Click the "Merge" button:

Screenshot

Flimm
  • 4
    The -m will hardlink duplicates together, -d will delete all but one, and -t will dry run, printing what it would do – Azendale Oct 29 '12 at 05:57
  • 1
    On Ubuntu here is what to do: sudo apt-get install fslint /usr/share/fslint/fslint/findup -m /your/directory/tree (directory /usr/share/fslint/fslint/ is not in $PATH by default) – Jocelyn Sep 08 '13 at 15:38
16

Since your main goal is to save disk space, there is another solution: de-duplication (and probably compression) at the filesystem level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.

ZFS has had dedup (block-level, not file-level) since pool version 23, and compression for much longer. If you are using Linux, you may try zfs-fuse; if you use BSD, it is natively supported.
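
With a pool in place, enabling both features is one command each (the dataset name is just an example); note that they only apply to data written after the properties are set:

zfs set dedup=on tank/music
zfs set compression=on tank/music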

Wei-Yin
  • This is probably the way I'll go eventually, however, does BSD's ZFS implementation do dedup? I thought it did not. – Josh Dec 08 '10 at 20:14
  • In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support. – hhaamu Jul 15 '12 at 17:48
  • I used zfs-fuse for a backup partition because of its deduplication and it took days, literally days to copy less than 1TB of data onto it. I can't begin to imagine how zfs-fuse would perform as a home partition or something. – Christian Jun 14 '13 at 15:04
  • 22
    ZFS dedup is the friend of nobody. Where ZFS recommends 1Gb ram per 1Tb usable disk space, you're friggin' nuts if you try to use dedup with less than 32Gb ram per 1Tb usable disk space. That means that for a 1Tb mirror, if you don't have 32 Gb ram, you are likely to encounter memory bomb conditions sooner or later that will halt the machine due to lack of ram. Been there, done that, still recovering from the PTSD. – killermist Sep 22 '14 at 18:51
  • 7
    To avoid the excessive RAM requirements with online deduplication (i.e., check on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary) https://btrfs.wiki.kernel.org/index.php/Deduplication – Marcel Waldvogel Feb 05 '17 at 19:18
  • 8
    Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored.) – Josh Sep 13 '17 at 13:54
  • ZFS doesn't support offline dedupe? – endolith Feb 14 '19 at 19:42
  • 1
    @endolith Nope! Which makes ZFS dedup completely useless for 99% of users. If you have enough RAM for online dedup, either your disks are tiny or your wallet is big enough that you should instead spend some engineer-hours implementing dedup in your application. – Navin Jan 20 '22 at 13:10
10

On modern Linux these days there's https://github.com/g2p/bedup, which de-duplicates on a btrfs filesystem. Compared with hard-linking, 1) there isn't as much scan overhead, and 2) files can easily diverge again afterwards.
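
If I read its README correctly, deduplicating a mounted btrfs volume is roughly (the mount point is a placeholder):

bedup dedup /mnt/btrfs-volume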

  • 2
    Background and more information is listed on https://btrfs.wiki.kernel.org/index.php/Deduplication (including reference to cp --reflink, see also below) – Marcel Waldvogel Feb 05 '17 at 19:22
9
apt show hardlink

Description: Hardlinks multiple copies of the same file
Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.

I also used jdupes recently with success.

9

jdupes has been mentioned in a comment but deserves its own answer, since it is probably available in most distributions and runs pretty fast (it just freed 2.7 GB of a 98%-full 158 GB SSD partition in about one minute):

jdupes -rL /foo/bar
  • 1
    jdupes really should be the answer. It's fast, and configurable, and it shows progress as you go. You can live dangerously (-T -T) when initially scanning without modifying (-M) to see what you might be in for. Then choose the default size / partial hash / full hash / bit by bit comparison mode, skip the hashes and just go straight to bits (-K) without any added risk, or roll the dice and do the full hash only while skipping the bit comparison (-Q). About the only thing missing is an interactive mode where you could y/n each suggestion. Did I mention that it's fast? – Chris Jun 12 '20 at 00:36
6

To find duplicate files you can use duff.

Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.

Simply run:

duff -r target-folder

To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.
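
For example, something along these lines (a sketch only: it assumes duff's default cluster-header format and filenames without newlines, so try it on a test directory first):

duff -r target-folder | while IFS= read -r line; do
  case "$line" in
    *' files in cluster '*)          # cluster header: a new group starts
      first="" ;;
    *)
      if [ -z "$first" ]; then
        first=$line                  # first file of the cluster is the link target
      else
        ln -f -- "$first" "$line"    # hard-link the duplicate to it
      fi ;;
  esac
done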

Stefan
5

It seems to me that checking the filename first could speed things up. If two files don't share the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:

  • filename
  • size
  • md5 checksum
  • byte contents

Do any tools do this? Look at duff, fdupes, rmlint, fslint, etc.

The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)

Can filename comparison be added as a first step, size as a second step?

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | \
  xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | \
  sort | uniq -w32 --all-repeated=separate
johny why
  • 3
    I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed to use with the other tools. – dubiousjim Sep 02 '15 at 06:32
  • 4
    In my practice, filename is the least reliable factor to look at, and I've completely removed it from any efforts I make at de-duping. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills timestamp reliability.) 1: size, 2: digest, 3: byte contents. – Chindraba Jan 28 '17 at 06:40
  • @GypsySpellweaver: (1) depends on personal use-case, wouldn't you agree? In my case, I have multiple restores from multiple backups, where files with the same name and content exist in different restore folders. (2) Your comment seems to assume comparing by filename only. I was not suggesting eliminating the other checks. – johny why Mar 08 '17 at 21:50
  • 1
    Don't use rmlint. It emits a shellscript that doesn't check for errors. It deletes the duplicate then attempts to hardlink which can fail (e.g. due to too many links) and then continues to work on the next duplicate which suffers the same fate. It eats your files. The correct way would be to create it under a temporary name and then replace the dupe via rename. – the8472 Oct 07 '20 at 10:01
4

I've used many of the hardlinking tools for Linux mentioned here. I too am stuck with the ext4 filesystem on Ubuntu, and have been using cp -l and -s for hard/soft-linking. But lately I noticed the lightweight copy option in the cp man page, which implies sparing the redundant disk space until one side gets modified:

   --reflink[=WHEN]
          control clone/CoW copies. See below

   When --reflink[=always] is specified, perform a lightweight copy, where the
   data blocks are copied only when modified.  If this is not possible the copy
   fails, or if --reflink=auto is specified, fall back to a standard copy.
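
For instance, on a filesystem that supports it (see the comment below about btrfs/OCFS2 versus ext4), making such a lightweight copy is just this (the file names are examples):

cp --reflink=auto Music/track.flac Backup/track.flac   # CoW clone where possible, normal copy otherwise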
Marcos
  • I think I will update my cp alias to always include the --reflink=auto parameter now – Marcos Mar 14 '12 at 14:08
  • 1
    Does ext4 really support --reflink? –  Jun 21 '12 at 08:42
  • 7
    This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots, makes you less scared to do mass operations on big trees of files. – clacke Jul 03 '12 at 18:57
4

Since I'm not a fan of Perl, here's a bash version:

#!/bin/bash

DIR="/path/to/big/files"

# Checksum every file, then sort so that identical checksums end up on adjacent lines
find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
 NEWSUM=`echo "$i" | sed 's/ .*//'`         # checksum: everything before the first space
 NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`   # filename: everything after the checksum
 if [ "$OLDSUM" == "$NEWSUM" ]; then
  # Same checksum as the previous line: print the ln command that would hard-link
  # the duplicate to the first file (remove the leading "echo" to actually do it)
  echo ln -f "$OLDFILE" "$NEWFILE"
 else
  OLDSUM="$NEWSUM"
  OLDFILE="$NEWFILE"
 fi
done

This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together. (As written it only prints the ln commands; remove the leading echo to actually apply them.)

This can be greatly optimized for repeated runs with additional find flags (e.g. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.

NOTE: As has been mentioned before, hardlinks only work as long as the files never need to be modified or moved across filesystems.

seren
  • How can I change your script so that, instead of hardlinking, it just deletes the duplicate files and adds an entry to a CSV file, deleted file -> linked file? – MR.GEWA Jan 12 '13 at 12:17
  • Sure. The hard link line is: echo ln -f "$OLDFILE" "$NEWFILE". It just replaces the duplicate file with a hard link, so you could change it to rm the $NEWFILE instead. – seren Jan 13 '13 at 04:15
  • And how, on the next line, would I write $OLDFILE -> $NEWFILE to some text file? – MR.GEWA Jan 13 '13 at 13:12
  • Ahh, right. Yes, add a line after the rm such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log – seren Jan 14 '13 at 19:28
  • 2
    Don't friggin' reinvent the wheel. There are more mature solutions available, like rdfind, which works at native speed and just requires brew install rdfind or apt-get install rdfind to install. – oligofren Jan 03 '15 at 13:46
1

The application FSlint (http://www.pixelbeat.org/fslint/) can find all identical files in any folder (comparing by content) and create hardlinks. Give it a try!

Jorge Sampaio

1

If you want to replace duplicates with hard links on a Mac or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/), which I am developing.

islam
1

I made a Perl script that does something similar to what you're talking about:

http://pastebin.com/U7mFHZU7

Basically, it just traverses a directory, calculates the SHA1 sum of each file in it, hashes the results and links matches together. It's come in handy on many, many occasions.

amphetamachine
0

If you make hard links, pay attention to the permissions on those files. Note that the owner, group, mode, extended attributes, timestamps and ACLs (if you use them) are stored in the inode. Only the file names differ, because they are stored in the directory structure and point to the inode's properties. As a consequence, all file names linked to the same inode have the same access rights. You should prevent modification of such a file, because any user can damage it for everybody else: it is enough for a user to overwrite the file in place. The inode number then stays the same, and the original content is destroyed (replaced) for all the hardlinked names.

A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or something similar. Look at this page: https://en.wikipedia.org/wiki/Comparison_of_file_systems , especially at the Features table and the "data deduplication" column. You can click it and sort :)

Look especially at the ZFS filesystem. It is available as FUSE, but that way it is very slow. If you want native support, look at http://zfsonlinux.org/ . You then have to patch the kernel and install the zfs tools for management. I don't understand why Linux doesn't support it as a driver; it works that way for many other operating systems / kernels.

File systems support deduplication in one of two ways: deduplicating files, or deduplicating blocks. ZFS deduplicates blocks, which means that the same content repeated within a single file can also be deduplicated. The other distinction is when the data gets deduplicated: this can be online (ZFS) or offline (BTRFS).

Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted with FUSE causes dramatically slow performance; it is described in the documentation. But you can switch deduplication on and off online on a volume. If you see that some data should be deduplicated, you simply switch deduplication on, rewrite the files to a temporary location and finally replace the originals; afterwards you can switch deduplication off and restore full performance. Of course, you can also add cache disks to the storage. These can be very fast rotating disks or SSDs, and of course they can be very small disks. In real work this is a replacement for RAM :)

Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots etc.; but if you set up the configuration and don't change it, everything works properly. Otherwise, you should switch to OpenSolaris, which supports ZFS natively :) What is very nice with ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you don't need LVM when you use ZFS. See the documentation if you want to know more.

Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately it has been very well supported; I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication: that preserves performance, and you run a deduplication tool periodically whenever the host has nothing else to do. And BTRFS was created natively for Linux. Maybe that makes it a better FS for you :)
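
One such offline tool for btrfs is duperemove; a typical run over a mounted subvolume looks roughly like this (the path is a placeholder):

duperemove -dr /mnt/data   # -r: scan recursively, -d: actually submit the dedupe requests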

Znik
  • 1
    I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: https://btrfs.wiki.kernel.org/index.php/Deduplication – Marcel Waldvogel Feb 05 '17 at 19:42
  • ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing. – KJ Seefried Mar 29 '18 at 19:07
0

The easiest way is to use the dedicated program dupeGuru.

dupeGuru Preferences Screenshot

As the documentation says:

Deletion Options

These options affect how duplicate deletion takes place. Most of the time, you don’t need to enable any of them.

Link deleted files:

The deleted files are replaced by a link to the reference file. You have a choice of replacing it either with a symlink or a hardlink. ... a symlink is a shortcut to the file’s path. If the original file is deleted or moved, the link is broken. A hardlink is a link to the file itself. That link is as good as a “real” file. Only when all hardlinks to a file are deleted is the file itself deleted.

On OSX and Linux, this feature is supported fully, but under Windows, it’s a bit complicated. Windows XP doesn’t support it, but Vista and up support it. However, for the feature to work, dupeGuru has to run with administrative privileges.

0

There is a new file level deduplication tool: https://gitlab.com/lasthere/dedup

On "BTRFS" and "XFS" it can can directly "reflink" identical files. While for non reflink supporting file systems, one first needs to produce a shell script via option "--print-only sh" and modify it such that it replaces the found identical files by a hardlink (cp --link).

It is especially useful if it is run regularly and file modifications (new dedupe candidates) are located in a separate place, so that only that directory needs a deeper check, while all the other locations are only scanned to find deduplication sources.

(I'm author of dedup)

-1

Hard links might not be the best idea; if one user changes the file, it affects both. However, deleting a hard link doesn't delete both files. Plus, I am not entirely sure whether hard links take up the same amount of space (on the hard disk, not the OS) as multiple copies of the same file; according to Windows (with the Link Shell Extension), they do. Granted, that's Windows, not Unix...

My solution would be to keep a "common" file in a hidden folder and replace the actual duplicates with symbolic links... then, the symbolic links could carry metadata or alternate file streams that record only how the two "files" differ from each other, e.g. if one person wants to change the filename or add custom album art or something else like that; it might even be useful outside of database applications, such as having multiple versions of the same game or software installed and testing them independently with even the smallest differences.
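
A rough sketch of that idea with plain symlinks (all paths are made up; the per-file metadata part would need something extra on top):

mkdir -p /srv/music/.common
mv /srv/music/alice/song.mp3 /srv/music/.common/song.mp3    # keep a single "master" copy
ln -s ../.common/song.mp3 /srv/music/alice/song.mp3         # alice gets a relative symlink
ln -sf ../.common/song.mp3 /srv/music/bob/song.mp3          # bob's duplicate is replaced by one too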