
I want to generate a list of files that have:

  • Same name
  • Different content

in a directory (including all children directories and content).

How can I do this? Bash, Perl, anything is fine.

So, two files with the same name and same content should not show up.
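
For a concrete (hypothetical) example, only b.txt should be reported in a tree like this:

mkdir -p dir1 dir2
printf 'same\n' > dir1/a.txt
printf 'same\n' > dir2/a.txt    # same name, same content: should not be listed
printf 'one\n'  > dir1/b.txt
printf 'two\n'  > dir2/b.txt    # same name, different content: should be listed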

5 Answers


Update: fixed a typo in the script: changed print $3 to print $NF; also tidied things up and added some comments.

Assuming file names do not contain \n, the following prints a sorted list that breaks (as in: section control breaks) at each unique file name and each unique md5sum, and shows the corresponding group of file paths.

#!/bin/bash

# Choose which script to use for the final awk step 
out_script=out_all

# Print all duplicated file names, even when md5sum is the same 
out_all='{ if( p1 != $1 ) { print nl $1; print I $2 }
      else if( p2 != $2 ) { print I $2 }
      print I I $3; p1=$1; p2=$2; nl="\n" }
   END { printf nl}'

# Print only duplicated file names which have multiple md5sums.
out_only='{ if( p1 != $1 ) { if( multi ) { print pend }
                             multi=0; pend=$1 "\n" I $2 "\n" }
       else if( p2 != $2 ) { multi++; pend=pend I $2 "\n" } 
       pend=pend I I $3 "\n"; p1=$1; p2=$2 } 
   END { if( multi ) print pend }'

# The main pipeline 
find "${1:-.}" -type f -name '*' |  # awk for duplicate names
awk -F/ '{ if( name[$NF] ) { dname[$NF]++ }
           name[$NF]=name[$NF] $0 "\n" } 
     END { for( d in dname ) { printf name[d] } 
   }' |                             # standard md5sum output 
xargs -d'\n' md5sum |               # " "==text, "*"==binary
sed 's/ [ *]/\x00/' |               # prefix with file name  
awk -F/ '{ print $NF "\x00" $0 }' | # sort by name, md5sum, path 
sort |                              # awk to print result
awk -F"\x00" -v"I=   " "${!out_script}"

Output showing only file names with multiple md5s

afile.html
   53232474d80cf50b606069a821374a0a
      ./test/afile.html
      ./test/dir.svn/afile.html
   6b1b4b5b7aa12cdbcc72a16215990417
      ./test/dir.svn/dir.show/afile.html

Output showing all files with the same name.

afile.html
   53232474d80cf50b606069a821374a0a
      ./test/afile.html
      ./test/dir.svn/afile.html
   6b1b4b5b7aa12cdbcc72a16215990417
      ./test/dir.svn/dir.show/afile.html

fi    le.html
   53232474d80cf50b606069a821374a0a
      ./test/dir.svn/dir.show/fi    le.html
      ./test/dir.svn/dir.svn/fi    le.html

file.html
   53232474d80cf50b606069a821374a0a
      ./test/dir.show/dir.show/file.html
      ./test/dir.show/dir.svn/file.html

file.svn
   53232474d80cf50b606069a821374a0a
      ./test/dir.show/dir.show/file.svn
      ./test/dir.show/dir.svn/file.svn
      ./test/dir.svn/dir.show/file.svn
      ./test/dir.svn/dir.svn/file.svn

file.txt
   53232474d80cf50b606069a821374a0a
      ./test/dir.show/dir.show/file.txt
      ./test/dir.show/dir.svn/file.txt
      ./test/dir.svn/dir.show/file.txt
      ./test/dir.svn/dir.svn/file.txt
Peter.O

For those who want to see only a list of filenames, here is the relevant part of Peter.O's answer:

find "${1:-.}" -type f -name '*' | 
awk -F/ '{ if( name[$NF] ) { dname[$NF]++ }
       name[$NF]=name[$NF] $0 "\n" } 
 END { for( d in dname ) { printf name[d] "\n" } 

}'

I don't need the md5sums because I run fslint-gui before the script to clear all duplicates.
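
If you literally only want the duplicated names themselves (no paths, no md5sums), here is a shorter sketch along the same lines, again assuming names contain no newlines:

find "${1:-.}" -type f |
awk -F/ '{ c[$NF]++ } END { for( n in c ) if( c[n] > 1 ) print n }'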

int_ua

Here's a Perl script. Run it in the directory at the top of the tree you want to search. The script depends on find and md5, but the latter can be replaced with sha1, sum or any other file hashing program that accepts input on stdin and outputs a hash on stdout.

use strict;

my %files;           # file name => list of full paths with that name
my %nfiles;          # file name => how many paths share that name
my $HASHER = 'md5';  # any hasher that reads stdin and prints a hash on stdout

sub print_array
{
    for my $x (@_) {
        print "$x\n";
    }
}

open FINDOUTPUT, "find . -type f -print|" or die "find";

while (defined (my $line = <FINDOUTPUT>)) {
    chomp $line;
    my @segments = split /\//, $line;
    my $shortname = pop @segments;
    push @{ $files{$shortname} }, $line;
    $nfiles{$shortname}++;
}

for my $shortname (keys %files) {
    # Names that occur only once are printed as-is.
    if ($nfiles{$shortname} < 2) {
        print_array @{ $files{$shortname} };
        next;
    }
    # Hash every file in this name group.
    my %nhashes;
    my %revhashes;
    for my $file (@{ $files{$shortname} }) {
        my $hash = `$HASHER < "$file"`;  # quoted so names with spaces survive the shell
        $revhashes{$hash} = $file;
        $nhashes{$hash}++;
    }
    # Print only the paths whose hash is unique within the group.
    for my $hash (keys %nhashes) {
        if ($nhashes{$hash} < 2) {
            my $file = $revhashes{$hash};
            print "$file\n";
        }
    }
}
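
To run it, assuming the script is saved as samename.pl (the name is arbitrary), change to the top of the tree first, since the embedded find starts from the current directory; on systems without an md5 command, edit $HASHER to md5sum or sha1sum instead:

cd /path/to/tree
perl samename.pl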
Kyle Jones

finddup can also help you list files with the same name or content.

user379997

Here's my one-liner solution:

find . -type f -exec basename {} \; | sort | uniq -d | xargs -n 1 -I {name} sh -c 'echo {name}; find . -type f -name {name} -exec md5sum {} \;; echo'

It prints a result set like the following, grouped by file name, with the paths of the duplicates listed under each name along with their md5 sums:

file1.pdf
1983af4bc5c5e3fff33fb87b59147e0e  ./folder1/file1.pdf
6d028226d0a08745c1d2993043e0baba  ./folder2/file1.pdf
5830a22229a843a0bcc70d8d59419f03  ./folder3/file1.pdf
51d1844aad6bfddc60e381090d504a71  ./folder4/file1.pdf

file2.pdf
bd2c5037621998abcf3d33eb826dbfa6  ./folder1/file2.pdf
bd2c5037621998abcf3d33eb826dbfa6  ./folder2/file2.pdf
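
Note that the one-liner splits on whitespace, so file names containing spaces will trip it up. Here is a sketch of the same idea that tolerates spaces (assuming GNU find and md5sum; names containing newlines or glob characters are still not handled):

find . -type f -printf '%f\n' | sort | uniq -d |
while IFS= read -r name; do
    printf '%s\n' "$name"                            # the duplicated file name
    find . -type f -name "$name" -exec md5sum {} +   # every path with that name, hashed
    echo
done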