118

Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different file names?

Jeff Schaller
  • 67,283
  • 35
  • 116
  • 255
student
  • 18,305
  • As already mentioned, just finding duplicates and reporting them is not that hard and can be done with for instance fdupes or fslint. What is hard is to take action and clean up files based on that information. So say that the program reports that /home/yourname/vacation/london/img123.jpg and /home/yourname/camera_pictures/vacation/img123.jpg are identical. Which of those should you choose to keep and which one should you delete? To answer that question you need to consider all the other files in those two directories. – hlovdal Apr 05 '13 at 07:50
  • 1
    (continuing) Does .../camera_pictures/vacation contain all pictures from London and .../vacation/london was just a subset you showed to your neighbour? Or are all files in the london directory also present in the vacation directory? What I have really wanted for many years is a two pane file manager which could take file duplicate information as input to open the respective directories and show/mark which files are identical/different/unique. That would be a power tool. – hlovdal Apr 05 '13 at 07:58

13 Answers

132

fdupes can do this. From man fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.

To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.

As asked in the comments, you can get the largest duplicates by doing the following:

fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n

This will break if your filenames contain newlines.
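
If you also want the sizes to be human readable, as asked in the comments, a minimal variant of the same pipeline (assuming GNU du and sort, for their -h options) is:

fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du -h "$file"
    done
} | sort -h

The same caveat about newlines in file names applies.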

genpfault
  • 293
Chris Down
  • 125,559
  • 25
  • 270
  • 266
  • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable? – student Apr 05 '13 at 09:31
  • @student: use something along the lines of (make sure fdupes just outputs the filenames with no extra information, or use cut or sed to just keep that): fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs. – Olivier Dulac Apr 05 '13 at 12:27
  • 2
    @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives. – Chris Down Apr 05 '13 at 13:13
  • @student - Once you have the filenames, du piped to sort will tell you. – Chris Down Apr 05 '13 at 13:14
  • @ChrisDown: it's true it's a bad habit, and it can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them for a few months, and they're full of many useful infos) – Olivier Dulac Apr 05 '13 at 14:05
  • @ChrisDown Can you make a complete example how to use du and sort for this case? A naive fdupes -r -S . |du |sort doesn't work. – student Apr 06 '13 at 11:11
  • @student - See my edit. – Chris Down Apr 06 '13 at 11:32
  • Will this work even if you have volumes that have different block sizes or other characteristics that may make two identical files appear to be two different sizes? – Kurtis Dec 02 '14 at 22:06
  • the du is significantly slower than the fdupes part. I started this command as a kind of "defragmentation" command in my home folder an hour ago and it is still going. – phil294 May 05 '18 at 21:50
  • fdupes is too slow for me, since there is no way to turn off byte-for-byte check. But the idea of a size check first is a good one. – qwr Feb 26 '24 at 22:53
33

Another good tool is fslint:

fslint is a toolset to find various problems with filesystems, including duplicate files and problematic filenames etc.

Individual command-line tools are available in addition to the GUI; to access them, one can change to, or add to $PATH, the /usr/share/fslint/fslint directory on a standard install. Each of the commands in that directory has a --help option which further details its parameters.

   findup - find DUPlicate files

On Debian-based systems, you can install it with:

sudo apt-get install fslint

You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:

find / -type f -exec md5sum {} \; > md5sums
awk '{print $1}' md5sums | sort | uniq -d > dupes
while read -r d; do echo "---"; grep -- "$d" md5sums | cut -d ' ' -f 2-; done < dupes 

Sample output (the file names in this example are the same, but it will also work when they are different):

$ while read -r d; do echo "---"; grep -- "$d" md5sums | cut -d ' ' -f 2-; done < dupes 
---
 /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
 /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
 /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
 /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
 /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
 /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---

This will be much slower than the dedicated tools already mentioned, but it will work.
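
As suggested in the comments, this can be sped up considerably by comparing file sizes first and only checksumming files whose size occurs more than once. A rough sketch of that idea (assuming GNU find and xargs, and file names that contain no tabs or newlines):

# list "size<TAB>path" for every file, keep only the files whose size
# occurs more than once, and checksum just those
find / -type f -printf '%s\t%p\n' > sizes
awk -F'\t' 'NR==FNR { count[$1]++; next } count[$1] > 1 { print $2 }' sizes sizes |
    xargs -r -d '\n' md5sum > md5sums

The resulting md5sums file can then be processed with the same awk, uniq and while steps shown above.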

terdon
  • 242,166
  • 5
    It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size. – Chris Down Apr 04 '13 at 16:34
  • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer. – terdon Apr 04 '13 at 16:37
  • It can be run on macOS, but you should replace md5sum {} with md5 -q {} and gawk '{print $1}' with cat – Finesse Oct 24 '19 at 02:26
  • @Finesse The point of the gawk command is to print the first word of the line (i.e., the checksum) without the rest of it (the filename). cat alone does not accomplish this. You can just use awk on macOS or cut -d " " -f 1. – rgov Mar 01 '24 at 21:28
  • Along the lines of @ChrisDown's suggestion, you can use -exec sh -c 'echo "$(stat --format="%s" "$1"):$(dd if="$1" bs=4096 count=1 2>/dev/null | md5sum | cut -d" " -f1) $1"' sh {} \; to output {file size}:{md5 of first 4096 bytes} {file name}. This is faster but will have a higher collision rate. Post-process the matches for a full-file comparison to eliminate false positives. – rgov Mar 01 '24 at 21:29
12

I would like to add jdupes, a recent, enhanced fork of fdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):

jdupes . -rS -X size-:50m > myjdups.txt

This will recursively find duplicated files bigger than 50MB in the current directory and output the resulting list to myjdups.txt.

Note that the output is not sorted by size, and since sorting appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:

jdupes -r . -X size-:50m | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n > myjdups_sorted.txt
  • 3
    Note: the latest version of jdupes supports matching files with only a partial hash instead of waiting to hash the whole thing. Very useful. (You have to clone the git archive to get it.) Here are the options I'm using right now: jdupes -r -T -T --exclude=size-:50m --nohidden – SurpriseDog Jul 03 '19 at 17:48
  • And, for my purposes at least, --print-unique, which fdupes lacks, was wonderful. Donation on its way. – Flash Sheridan Feb 08 '24 at 22:27
  • 1
    New URL; the author was unhappy with GitHub’s 2FA mandate: https://codeberg.org/jbruchon/jdupes. – Flash Sheridan Feb 08 '24 at 23:11
  • Thanks for pointing out the new URL @FlashSheridan, I've changed it now. – Sebastian Müller Mar 04 '24 at 09:24
9

Short answer: yes.

Longer version: have a look at the Wikipedia fdupes entry; it sports quite a nice list of ready-made solutions. Of course you can write your own; it's not that difficult. Standard tools like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
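
For example, one possible take on such a one-liner (assuming GNU uniq, and file names without embedded newlines):

find . -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate

Here uniq -w64 compares only the 64 hex characters of the SHA-256 hash, and --all-repeated=separate prints each group of identical files separated by a blank line.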

peterph
  • 30,838
8

If you believe a hash function (here MD5) is collision-free on your domain:

find "$target" -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 \
 | cut --characters=35-

Want the files with identical checksums grouped together? Write a simple script, not_uniq.sh, to format the output:

#!/bin/bash

# Reads sorted md5sum output and prints groups of files that share a checksum,
# separated by "=======" lines.
last_checksum=0
while IFS= read -r line; do
    checksum=${line:0:32}     # first 32 chars: the MD5 hash
    filename=${line:34}       # the rest (after two spaces): the file name
    if [ "$checksum" == "$last_checksum" ]; then
        if [ "${last_filename:-0}" != '0' ]; then
            echo "$last_filename"
            unset last_filename
        fi
        echo "$filename"
    else
        if [ "${last_filename:-0}" == '0' ]; then
            echo "======="
        fi
        last_filename=$filename
    fi

    last_checksum=$checksum
done

Then change the find command to use your script:

chmod +x not_uniq.sh
find "$target" -type f -exec md5sum '{}' + | sort | ./not_uniq.sh

This is the basic idea. With the quoting above, file names containing spaces are handled; you will still need to change the approach if your file names contain newlines.

Wayne Werner
  • 11,713
reith
  • 394
  • 2
  • 10
4

Wikipedia once had an article with a list of available open source software for this task, but it's now been deleted.

I will add that the GUI version of fslint is very interesting: it allows you to use a mask to select which files to delete, which is very useful for cleaning up duplicated photos.

On Linux you can use:

  • FSLint: http://www.pixelbeat.org/fslint/
  • FDupes: https://en.wikipedia.org/wiki/Fdupes
  • DupeGuru: https://www.hardcoded.net/dupeguru/
  • Czkawka: https://qarmin.github.io/czkawka/

FDupes and DupeGuru work on many systems (Windows, Mac and Linux). I've not checked FSLint or Czkawka.

Toby Speight
  • 8,678
3

I had a situation where I was working in an environment where I couldn't install new software and had to scan >380 GB of JPG and MOV files for duplicates. I developed the following POSIX awk script, which processed all of the data in 72 seconds (as opposed to the find -exec md5sum approach, which took over 90 minutes to run):

https://github.com/taltman/scripts/blob/master/unix_utils/find-dupes.awk

You call it as follows:

ls -lTR | awk -f find-dupes.awk

It was developed in a FreeBSD shell environment, so it might need some tweaks to work optimally in a GNU/Linux shell environment.

taltman
  • 161
0

Here's my take on that:

find . -type f -size +3M -print0 | while IFS= read -r -d '' i; do
  echo -n '.'
  if grep -q "$i" md5-partial.txt; then echo -e "\n$i  ---- Already counted, skipping."; continue; fi
  MD5=$(dd bs=1M count=1 if="$i" status=noxfer | md5sum | cut -d' ' -f1)
  if grep "$MD5" md5-partial.txt; then echo -e "\n$i  ----   Possible duplicate"; fi
  echo "$MD5 $i" >> md5-partial.txt
done

It's different in that it only hashes up to the first 1 MB of each file.
This has a few issues / features:

  • There might be a difference after the first 1 MB, so the result is rather a candidate to check. I might fix that later.
  • Checking by file size first could speed this up.
  • It only considers files larger than 3 MB.

I use it to compare video clips so this is enough for me.

0

I realize this is necro, but it is highly relevant. I had asked a similar question on Find duplicate files based on first few characters of filename, and what was presented was a solution using an awk script.

I use it for mod conflict cleanup; it is useful in Forge packs 1.14.4+ because Forge now disables the older mods instead of FATAL-crashing and letting you know about the duplicate.

#!/bin/bash

declare -a names

xIFS="${IFS}" IFS="^M"

while true; do awk -F'[-_ ]' ' NR==FNR {seen[tolower($1)]++; next} seen[tolower($1)] > 1 ' <(printf "%s\n" .jar) <(printf "%s\n" .jar) > tmp.dat

    IDX=0
    names=()


    readarray names &lt; tmp.dat

    size=${#names[@]}

    clear
    printf '\nPossible Dupes\n'

    for (( i=0; i&lt;${size}; i++)); do
            printf '%s\t%s' ${i} ${names[i]}
    done

    printf '\nWhich dupe would you like to delete?\nEnter # to delete or q to quit\n'
    read n

    if [ $n == 'q' ]; then
            exit
    fi

    if [ $n -lt 0 ] || [ $n -gt $size ]; then
            read -p &quot;Invalid Option: present [ENTER] to try again&quot; dummyvar
            continue
    fi

    #clean the carriage return \n from the name
    IFS='^M'
    read -ra TARGET &lt;&lt;&lt; &quot;${names[$n]}&quot;
    unset IFS

    #now remove the first element from the filesystem
    rm &quot;${TARGET[0]}&quot; 
    echo &quot;removed ${TARGET[0]}&quot; &gt;&gt; rm.log

done

IFS="${xIFS}"

I recommend saving it as "dupes.sh" in your personal bin directory or in /usr/var/bin.
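
A hypothetical usage example (the mods path below is only a placeholder for wherever your Forge pack keeps its .jar files), assuming ~/bin is on your PATH:

chmod +x ~/bin/dupes.sh
cd ~/.minecraft/mods
dupes.sh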

0

On Linux, you can use the following tools to find duplicate files.

  • https://github.com/adrianlopezroche/fdupes
  • https://github.com/arsenetar/dupeguru
  • https://github.com/jbruchon/jdupes
  • https://github.com/pauldreik/rdfind
  • https://github.com/pixelb/fslint (Python 2.x)
  • https://github.com/sahib/rmlint
  • https://github.com/qarmin/czkawka (Rust)
Jay M
  • 125
Akhil
  • 1,290
0

For free open-source Linux duplicate file finders, there is a new kid on the block and it's even in a trendy language, Rust.

It is called Czkawka (which apparently means hiccup).

So it does have an unpronounceable name unless you speak Polish.

It is based very much on some of the ideas in FSlint (which can now be difficult to get working, as it is no longer maintained and uses the now-deprecated Python 2.x).

Czkawka has both GUI and CLI versions and is reported to be faster than FSlint and Fdupes.

There is also a GitHub repo, for those that want to fork it just to change the name.

Jay M
  • 125
0

With the find, GNU stat and awk commands:

Advantages:

  1. It handles file names with whitespace or any other special characters.
  2. It only calls the external md5sum command for files that have the same size, and so...
  3. It reports the duplicated files based on their md5sum checksums at the end, and finally...
  4. It generates NUL-delimited output of each duplicate's size in bytes, checksum and file path, so it can easily be post-processed if needed.
  5. It should be fast enough.

find . -type f -exec stat --printf='%s/%n\0' {} + |
awk '
BEGIN{
        FS = "/"
        RS = ORS = "\0"
        q = "\047"
        md5_cmd = "md5sum"
}

{
        #get the file path from the two-column, slash-delimited data
        #reported by the stat command
        filePath = substr($0, index($0, "/") +1)

        #record and group filePath having the same fileSize, NUL delimited
        sizes[$1] = ($1 in sizes? sizes[$1] : "") filePath ORS
}

END {
for (size in sizes) {

    #split the filesPath for each group of files to
    #calculate the check-sum for last confirmation to see if there are
    #any duplicate files among the same sized files
    filesNr = split(sizes[size], filesName, ORS)

    #call md5sum only if there are at least two files with the same size in that group.
    if (filesNr > 2) {
        for (i = 1; i < filesNr; i++) {
            if ((md5_cmd " " q filesName[i] q) | getline md5 >0) {

                #split to extract the hash of a file
                split(md5, hash, " ")

                #remove the leading back-slash from the hash if a fileName contains a
                #back-slash char in its name. see https://unix.stackexchange.com/q/424628/72456
                sub(/\\/, "", hash[1])

                #record all the same sized filesPath along with their hash, again NUL delimited
                hashes[hash[1]] = (hash[1] in hashes? hashes[hash[1]] : "") filesName[i] ORS

                #record also the size of the files, with the hash used as the key
                fileSize[hash[1]] = size
            }
        }
    }
}
for (fileName in hashes) {

    #process the hashes of the same sized filesPath to verify whether a hash
    #occurred for more than one file.
    #here hash is the key and filesName are values of the hashes[] array.
    filesNr = split(hashes[fileName], filesName, ORS)

    #OK, if a hash belongs to at least two files, then we found duplicates; print the size, hash and the paths.
    if (filesNr > 2) {
        print  fileSize[fileName] " bytes, MD5: " fileName
        for(i=1; i < filesNr; i++)
            print filesName[i]
    }
}
}'
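
Since the report is NUL-delimited (ORS is set to \0), one convenient way to inspect it interactively is to append a tr stage to the end of the pipeline above, for example:

... | tr '\0' '\n'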

αғsнιη
  • 41,407
-1

You can find duplicates with this command:

find . ! -empty -type f -print0 | xargs -0 -P"$(nproc)" -I{} md5sum "{}" | sort | uniq -w32 -dD

Here -w32 makes uniq compare only the first 32 characters of each line (the MD5 hash), and -d/-D restrict the output to the entries that are duplicated.