2

I have two directories, A and B; each one contains a lot of sub-directories

geom001, geom002, ... etc.

Each sub-directory contains a file named results. I want to compare, without opening any of them, each file in A with each file in B, and find out whether one or more files in A are similar to one or more files in B. How can I use a command like the following in a loop to search over all files?

cmp --silent  file1 file2  || echo "file1 and file2 are different"
Jeff Schaller
  • 8
    Erm, "without opening" a file the comparison options are limited to stat values, e.g. the file size and modification date. "More similar" sounds like you would need to open the files and compare the contents somehow. – thrig May 17 '16 at 18:25
  • 1
    FYI, you can't compare the contents of two files without opening them. That's axiomatic. – Wildcard May 18 '16 at 02:16
  • Also, what, exactly, do you mean by 'similar'? exactly the same? if not, how much and/or what kind of difference is allowed before a file is no longer considered to be similar? different only by white space? less than X lines/bytes different? or something else. – cas May 18 '16 at 05:32
  • exactly the same in everything even space – Mohsen El-Tahawy May 18 '16 at 18:32

3 Answers

2

If files are exactly the same, then their md5sums will be exactly the same, so you can use:

find A/ B/ -type f -exec md5sum {} + | sort | uniq -w32 -D

An md5sum is always exactly 128 bits (or 16 bytes or 32 hex digits) long, and the md5sum program output uses hex digits. So we use the -w32 option on the uniq command to compare only the first 32 characters on each line.

This will print all files with a non-unique md5sum, i.e. duplicates.

NOTE: this will detect duplicate files no matter where they are in A/ or B/, so if A/subdir1/file and A/subdir2/otherfile are the same, they will still be printed. If there are multiple duplicates, they will all be printed.

You can remove the md5sums from the output by piping into, e.g., awk '{print $2}' or with cut or sed etc. I've left them in the output because they're a useful key for an associative array (aka a 'hash') in awk or perl etc for further processing.
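As a rough sketch of that kind of further processing: the pipeline above also reports duplicates that live entirely inside A or entirely inside B, whereas the question only asks about matches between the two trees. The function below (the name cross_dups is made up for illustration) tags each checksum line with the tree it came from and uses an awk associative array keyed on the md5sum to print only A/B pairs. It assumes md5sum's usual "checksum, two characters, path" output and file names without newlines or backslashes, which md5sum would escape.

```shell
#!/bin/bash
# cross_dups DIR_A DIR_B (hypothetical helper name)
# Print "fileA == fileB" for every file under DIR_A whose MD5 checksum
# matches that of a file under DIR_B.
cross_dups() {
    {
        # Tag every checksum line with the tree it came from.
        find "$1" -type f -exec md5sum {} + | sed 's/^/A /'
        find "$2" -type f -exec md5sum {} + | sed 's/^/B /'
    } | awk '
        {
            side = $1; sum = $2
            # The path starts after "SIDE", a space, the checksum and two more characters.
            path = substr($0, length(side) + length(sum) + 4)
        }
        side == "A" {
            # Remember every A file under its checksum (newline-separated list).
            if (sum in a) a[sum] = a[sum] "\n" path
            else a[sum] = path
            next
        }
        side == "B" && (sum in a) {
            # A file in B shares its checksum with at least one file in A.
            n = split(a[sum], files, "\n")
            for (i = 1; i <= n; i++) print files[i] " == " path
        }'
}

# Usage: cross_dups A B
```

All lines from A reach awk before any line from B, so the array is complete by the time the B entries are checked. A matching checksum is a very strong hint rather than proof; running cmp on the reported pairs would confirm them.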

cas
1

The main challenge in this question is perhaps the recursion aspect.

Assuming that cmp is an adequate utility and that both directories 1 and 2 to be compared have the same structure (i.e. the same files and folders) and reside within the same root path, you can try something similar to:

#!/bin/bash
ROOT=$PWD ; # #// change to an absolute path without a trailing slash, e.g.: /home/aphorise/my_files
PWD1="1/*" ;
PWD2="2/*" ;

# #// Get the number of path components (split on the / separator)
IFS=/ read -a DP <<< ${ROOT} ;
PLEN1=${#DP[*]} ;
IFS=/ read -a DP <<< ${PWD1} ;
PLEN1=$(echo "${#DP[*]}" + $PLEN1 - 1 | bc) ;
IFS=/ read -a DP <<< ${PWD2} ;
PLEN2=${#DP[*]} ;

# #// Set absolute paths:
PWD1="${ROOT}/${PWD1}" ;
PWD2="${ROOT}/${PWD2}" ;
DIFFS=0 ;

function RECURSE()
{
    for A_FILE in $1 ; do
        if [ -d "$A_FILE" ] ; then
            RECURSE "$A_FILE/*" ;
        else
            IFS=/ read -a FP <<< ${A_FILE} ;
            B_FILE="${PWD2:0:${#PWD2}-${#PLEN2}}$( IFS=/ ; printf "%s" "${FP[*]:$PLEN1:512}"; )" ;
            if ! cmp "${A_FILE}" "${B_FILE}" 1>/dev/null ; then printf '%s --> %s <-- DIFFER.\n' "$A_FILE" "$B_FILE" ; ((++DIFFS)) ; fi ;
        fi ;
    done ;
}

printf 'Starting comparison on %s @ %s\n\n' "$PWD1" "$(date)" ;
RECURSE "${PWD1[*]}" ;
if ((DIFFS != 0)) ; then printf "\n= $DIFFS <= differences detected.\n" ; fi ;
printf "\nCompleted comparison @ $(date)\n" ;

UPDATE:

Following additional feedback, here is another script that unconditionally compares every file in directory 1 with every file in directory 2:

#!/bin/bash
PWD1="$PWD/1/*" ;
PWD2="$PWD/2/*" ;
DIFFS=0 ;
NODIFFS=0 ;

printf 'Starting comparison on %s @ %s\n\n' "$PWD1" "$(date)" ;

FILES_A=$(find ${PWD1} -type f) ;
FILES_B=$(find ${PWD2} -type f) ;

for A_FILE in ${FILES_A[*]} ; do
        for B_FILE in ${FILES_B[*]} ; do
                if ! cmp "${A_FILE}" "${B_FILE}" 1>/dev/null ; then
                        printf '%s & %s <- DIFFER.\n' "$A_FILE" "$B_FILE" ;
                        ((++DIFFS)) ;
                else
                        printf '\n-> SAME: %s & %s\n' "$A_FILE" "$B_FILE" ;
                        ((++NODIFFS)) ;
                fi ;
        done ;
done ;

printf "\n= $DIFFS <= differences detected - & - $NODIFFS <= exact matches.\n" ;
printf "\nCompleted comparison @ $(date)\n" ;
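The script above stores the find output in plain strings and relies on word splitting, so it only works for paths without spaces. A possible variant, sketched here under the assumption of bash 4.4 or later (for mapfile -d '') and a find that supports -print0, reads the file lists into arrays instead. The function name compare_all is hypothetical, and only the matching pairs are printed to keep the output short.

```shell
#!/bin/bash
# compare_all DIR1 DIR2 (hypothetical helper name)
# Compare every regular file under DIR1 with every regular file under DIR2.
compare_all() {
    local files_a files_b a b same=0 diffs=0
    # NUL-delimited lists keep names with spaces or newlines intact.
    mapfile -d '' files_a < <(find "$1" -type f -print0)
    mapfile -d '' files_b < <(find "$2" -type f -print0)
    for a in "${files_a[@]}"; do
        for b in "${files_b[@]}"; do
            # cmp -s prints nothing; its exit status is 0 only for identical files.
            if cmp -s "$a" "$b"; then
                printf 'SAME: %s & %s\n' "$a" "$b"
                same=$((same + 1))
            else
                diffs=$((diffs + 1))
            fi
        done
    done
    printf '%d exact matches, %d differing pairs\n' "$same" "$diffs"
}

# Usage: compare_all "$PWD/1" "$PWD/2"
```

Like the original, this performs one cmp per pair of files, so the run time grows with the product of the two file counts.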
aphorise
  • well, but it doesn't see the files in directory 2: /home/mohsin//1/geom001/results.dat --> /home/mohsin//2/ <-- DIFFER. Also, it just compares each file with the corresponding one in the second directory, e.g. it compares 1/geom001/results.dat with 2/geom001/results.dat and not with 1/geom00{2..n}/results.dat – Mohsen El-Tahawy May 18 '16 at 19:11
  • So if I'm understanding you correctly, you do not have identical file-tree / directory structures, and for each file in directory 1 you'd like a find & compare in 2 for all files that are named the same? – aphorise May 18 '16 at 19:29
  • yes, what I want is to compare each file in 1 with all files in 2, sorry 1/geom00{2..n}/results.dat should be 2/geom00{2..n}/results.dat in the previous comment – Mohsen El-Tahawy May 18 '16 at 19:42
  • Just those that are the same name or all files irrespective of their naming? - Ie for each file in 1 you'd want it compared to all files in path / sub-paths of 2? – aphorise May 18 '16 at 19:46
  • yes, irrespective of their naming; each file in the first with all files in the second directory. – Mohsen El-Tahawy May 18 '16 at 19:49
1

I think this will get you close. It runs cmp on every file named results in A against every file named results in B and lists the output; cmp prints a line only for pairs that differ.

find ./A -name results | xargs -I REPLACESTR find ./B -name results -exec cmp REPLACESTR {} \;
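Since the question is about finding files that are the same, one possible adaptation of the command above reports the identical pairs instead. This is only a sketch: the function name identical_results is invented for the example, and it assumes a find and xargs that support -print0 and -0 (as the GNU and BSD versions do). cmp -s stays silent and succeeds only when the two files are identical, so the second -exec runs only for matching pairs.

```shell
#!/bin/bash
# identical_results DIR_A DIR_B (hypothetical helper name)
# Print "fileA == fileB" for every pair of identical files named "results".
identical_results() {
    find "$1" -type f -name results -print0 |
        xargs -0 -I REPLACESTR \
            find "$2" -type f -name results \
                -exec cmp -s REPLACESTR {} \; \
                -exec echo REPLACESTR == {} \;
}

# Usage: identical_results ./A ./B
```

As with the original one-liner, xargs starts one find over B for each file in A, so this is simple rather than fast.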