
I want to generate a list of files that have:

  • Same name
  • Different content

in a directory (including all children directories and content).

How can I do this? Bash, Perl, anything is fine.

So, two files with the same name and same content should not show up.
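
For a concrete (hypothetical) example, only b.txt should be reported in a tree like this:

mkdir -p dir1 dir2
printf 'same\n' > dir1/a.txt
printf 'same\n' > dir2/a.txt    # same name, same content: should not be listed
printf 'one\n'  > dir1/b.txt
printf 'two\n'  > dir2/b.txt    # same name, different content: should be listed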

5 Answers


Update: fixed a typo in the script: changed print $3 to print $NF; also tidied things up and added some comments.

Assuming file names do not contain \n, the following prints a sorted list that breaks (as in: section control breaks) at each unique file name and each unique md5sum, and shows the corresponding group of file paths.

#!/bin/bash

# Choose which script to use for the final awk step 
out_script=out_all

# Print all duplicated file names, even when md5sum is the same 
out_all='{ if( p1 != $1 ) { print nl $1; print I $2 }
      else if( p2 != $2 ) { print I $2 }
      print I I $3; p1=$1; p2=$2; nl="\n" }
   END { printf nl}'

# Print only duplicated file names which have multiple md5sums.
out_only='{ if( p1 != $1 ) { if( multi ) { print pend }
                             multi=0; pend=$1 "\n" I $2 "\n" }
       else if( p2 != $2 ) { multi++; pend=pend I $2 "\n" } 
       pend=pend I I $3 "\n"; p1=$1; p2=$2 } 
   END { if( multi ) print pend }'

# The main pipeline 
find "${1:-.}" -type f -name '*' |  # awk for duplicate names
awk -F/ '{ if( name[$NF] ) { dname[$NF]++ }
           name[$NF]=name[$NF] $0 "\n" } 
     END { for( d in dname ) { printf name[d] } 
   }' |                             # standard md5sum output 
xargs -d'\n' md5sum |               # " "==text, "*"==binary
sed 's/ [ *]/\x00/' |               # prefix with file name  
awk -F/ '{ print $NF "\x00" $0 }' | # sort by name, md5sum, path 
sort |                              # awk to print result
awk -F"\x00" -v"I=   " "${!out_script}"

Output showing only file names with multiple md5s

afile.html
   53232474d80cf50b606069a821374a0a
      ./test/afile.html
      ./test/dir.svn/afile.html
   6b1b4b5b7aa12cdbcc72a16215990417
      ./test/dir.svn/dir.show/afile.html

Output showing all files with the same name.

afile.html
   53232474d80cf50b606069a821374a0a
      ./test/afile.html
      ./test/dir.svn/afile.html
   6b1b4b5b7aa12cdbcc72a16215990417
      ./test/dir.svn/dir.show/afile.html

fi    le.html
   53232474d80cf50b606069a821374a0a
      ./test/dir.svn/dir.show/fi    le.html
      ./test/dir.svn/dir.svn/fi    le.html

file.html
   53232474d80cf50b606069a821374a0a
      ./test/dir.show/dir.show/file.html
      ./test/dir.show/dir.svn/file.html

file.svn
   53232474d80cf50b606069a821374a0a
      ./test/dir.show/dir.show/file.svn
      ./test/dir.show/dir.svn/file.svn
      ./test/dir.svn/dir.show/file.svn
      ./test/dir.svn/dir.svn/file.svn

file.txt
   53232474d80cf50b606069a821374a0a
      ./test/dir.show/dir.show/file.txt
      ./test/dir.show/dir.svn/file.txt
      ./test/dir.svn/dir.show/file.txt
      ./test/dir.svn/dir.svn/file.txt
Peter.O

For those who want to see only a list of filenames, here is the relevant part of Peter.O's answer:

find "${1:-.}" -type f -name '*' | 
awk -F/ '{ if( name[$NF] ) { dname[$NF]++ }
       name[$NF]=name[$NF] $0 "\n" } 
 END { for( d in dname ) { printf name[d] "\n" } 

}'

I don't need the md5sums because I run fslint-gui before the script to clear all duplicates.
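
If you literally only want the duplicated names themselves (no paths, no md5sums), here is a shorter sketch along the same lines, again assuming names contain no newlines:

find "${1:-.}" -type f |
awk -F/ '{ c[$NF]++ } END { for( n in c ) if( c[n] > 1 ) print n }'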

int_ua

Here's a Perl script. Run it in the directory at the top of the tree you want to search. The script depends on find and md5, but the latter can be replaced with sha1, sum or any other file hashing program that accepts input on stdin and outputs a hash on stdout.

use strict;

my %files;           # file name => list of full paths with that name
my %nfiles;          # file name => how many paths share that name
my $HASHER = 'md5';  # any hasher that reads stdin and prints a hash on stdout

sub print_array
{
    for my $x (@_) {
        print "$x\n";
    }
}

open FINDOUTPUT, "find . -type f -print|" or die "find";

while (defined (my $line = <FINDOUTPUT>)) {
    chomp $line;
    my @segments = split /\//, $line;
    my $shortname = pop @segments;
    push @{ $files{$shortname} }, $line;
    $nfiles{$shortname}++;
}

for my $shortname (keys %files) {
    # Names that occur only once are printed as-is.
    if ($nfiles{$shortname} < 2) {
        print_array @{ $files{$shortname} };
        next;
    }
    # Hash every file in this name group.
    my %nhashes;
    my %revhashes;
    for my $file (@{ $files{$shortname} }) {
        my $hash = `$HASHER < "$file"`;  # quoted so names with spaces survive the shell
        $revhashes{$hash} = $file;
        $nhashes{$hash}++;
    }
    # Print only the paths whose hash is unique within the group.
    for my $hash (keys %nhashes) {
        if ($nhashes{$hash} < 2) {
            my $file = $revhashes{$hash};
            print "$file\n";
        }
    }
}
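
To run it, assuming the script is saved as samename.pl (the name is arbitrary), change to the top of the tree first, since the embedded find starts from the current directory; on systems without an md5 command, edit $HASHER to md5sum or sha1sum instead:

cd /path/to/tree
perl samename.pl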
Kyle Jones

finddup can also help you list files with the same name or content.

user379997

Here's my one-liner solution:

find . -type f -exec basename {} \; | sort | uniq -d | xargs -n 1 -I {name} sh -c 'echo {name}; find . -type f -name {name} -exec md5sum {} \;; echo'

It prints a result set like the following, grouped by file name, with the paths of the duplicates listed under each name along with their md5 sums:

file1.pdf
1983af4bc5c5e3fff33fb87b59147e0e  ./folder1/file1.pdf
6d028226d0a08745c1d2993043e0baba  ./folder2/file1.pdf
5830a22229a843a0bcc70d8d59419f03  ./folder3/file1.pdf
51d1844aad6bfddc60e381090d504a71  ./folder4/file1.pdf

file2.pdf
bd2c5037621998abcf3d33eb826dbfa6  ./folder1/file2.pdf
bd2c5037621998abcf3d33eb826dbfa6  ./folder2/file2.pdf
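
Note that the one-liner splits on whitespace, so file names containing spaces will trip it up. Here is a sketch of the same idea that tolerates spaces (assuming GNU find and md5sum; names containing newlines or glob characters are still not handled):

find . -type f -printf '%f\n' | sort | uniq -d |
while IFS= read -r name; do
    printf '%s\n' "$name"                            # the duplicated file name
    find . -type f -name "$name" -exec md5sum {} +   # every path with that name, hashed
    echo
done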