2

I have two files, each containing a list. Column 1 holds userIds and column 2 holds the associated values.

# cat file1
e3001 75
n5244 30
w1453 500

# cat file2
d1128 30
w1453 515
n5244 30
e3001 55

Things to consider.

  1. userIds may not be in the same order in both files
  2. The number of userIds may differ between the files

REQUIRED

  • first, match each userId in file1:column1 against the userIds in file2:column1
  • next, compare the value in file1:column2 with the value in file2:column2
  • print the userIds whose values differ, and also any userIds that appear in only one file

OUTPUT:

e3001 has difference, file1 value: 75 & file2 value: 55
w1453 has difference, file1 value: 500 & file2 value: 515
d1128 is only present in filename: file1|file2

A solution with a one-liner awk or a bash loop is welcome.

I'm trying a loop, but it's spitting out garbage; I guess there's a logic error somewhere.

#!/usr/bin/env bash

# VARIABLES

FILE1=file1
FILE2=file2
USERID1=(`awk -F'\t' '{ print $1 }' ${FILE1}`)
USERID2=(`awk -F'\t' '{ print $1 }' ${FILE2}`)
USERDON1=(`awk -F'\t' '{ print $2 }' ${FILE1}`)
USERDON2=(`awk -F'\t' '{ print $2 }' ${FILE2}`)

for user in ${USERID1[@]}
do
    for (( i = 0; i < "${#USERID2[@]}"; i++ ))
    #for user in ${USERID2[@]}
    do
        if [[ ${USERID1[$user]} == ${USERID2[i]} ]]
        then
            echo ${USERID1[$user]} MATCHES BALANCE FROM ${FILE1}: ${USERDON1[$i]} WITH BALANCE FROM ${FILE2}: ${USERDON2[$i]}
        else
            echo ${USERID1[$user]}
        fi
    done
done

Below is the file copied straight from the Linux box. It's tab-separated, but awk handles tabs too, as far as I know.

#cat file1
e3001   55
n5244   30
w1453   515
Sollosa

5 Answers

3

Hmmm - your script takes the scenic route, so to speak. How about a simple awk approach? Like

awk '
NR==FNR         {ARR[$1] = $2
                 F1      = FILENAME
                 next
                }
($1 in ARR)     {if ($2 != ARR[$1]) print $1 " has difference, " \
                                          F1 " value: " ARR[$1] \
                                          " & " FILENAME " value: " $2 
                 delete ARR[$1]
                 next
                }
                {print $1 " is only present in filename: " FILENAME
                }
END             {for (a in ARR) print a " is only present in filename: " F1
                }
' file[12]
d1128 is only present in filename: file2
w1453 has difference, file1 value: 500 & file2 value: 515
e3001 has difference, file1 value: 75 & file2 value: 55

It reads all of file1 into an array, then, with every line in file2, checks $1 against the array indices, and, if present, prints the difference (or doesn't print if none), and deletes the array element (that delete may be missing in some awk implementations, BTW). If not present, print accordingly. In the END section, all remaining array elements are printed as they exist only in file1.
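To reproduce this run from scratch, here is a minimal sketch that builds the question's original sample data and applies the same logic. The wrapper name `compare_files` and the temporary directory are hypothetical, added only for illustration. The result is piped through `sort` because awk does not specify the order in which `for (a in arr)` visits the array, so the unsorted line order can differ between awk implementations.

```shell
#!/bin/sh
# compare_files is a hypothetical wrapper name, not part of the answer above.
compare_files() {
    awk '
    NR == FNR { arr[$1] = $2; f1 = FILENAME; next }   # first file: remember id -> value
    ($1 in arr) {                                      # id present in both files
        if ($2 != arr[$1])
            print $1 " has difference, " f1 " value: " arr[$1] " & " FILENAME " value: " $2
        delete arr[$1]                                 # mark id as handled
        next
    }
    { print $1 " is only present in filename: " FILENAME }              # id only in second file
    END { for (a in arr) print a " is only present in filename: " f1 }  # ids left over from first file
    ' "$1" "$2" | sort
}

# Demo with the original sample data from the question, in a throwaway directory.
tmpdir=$(mktemp -d)
printf 'e3001\t75\nn5244\t30\nw1453\t500\n' > "$tmpdir/file1"
printf 'd1128\t30\nw1453\t515\nn5244\t30\ne3001\t55\n' > "$tmpdir/file2"
(cd "$tmpdir" && compare_files file1 file2)
rm -rf "$tmpdir"
```

Running it against the later, edited file1 from the question (where the values already match file2) should leave only the `d1128` line, which is the behaviour discussed in the comments below.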

terdon
RudiC
  • In my case, it's only printing this line d1128 is only present in filename: file2 – Sollosa Apr 10 '22 at 15:08
  • What awk version do you use? Did you copy the script verbatim? – RudiC Apr 10 '22 at 15:41
  • I'm on RockyLinux8 and it has GNU awk version 4.2.1 – Sollosa Apr 10 '22 at 16:48
  • 2
    Sollosa: With your NEW file1 (that from the third edit), there's NO differences; just the d1128 is missing! That's why my as well as @terdon 's approach are outputting just that line! What if you run it with your initial file1? – RudiC Apr 10 '22 at 17:52
  • 1
    I really want to upvote this answer instead of posting my own that's the same approach but your insistence on using all upper case variable names is a real show-stopper. Since multiple people have already suggested you not use all upper case variable names and you keep doing it anyway there's no point commenting about that again so unfortunately I guess all I can do is post my own very similar answer.. – Ed Morton Apr 10 '22 at 19:28
2

The shell is a horrible tool for this sort of thing. Also, as a general rule, you should avoid CAPS for variables in your shell scripts. Since, by convention, global environment variables are capitalized, this can lead to naming collisions and hard-to-debug issues. Finally, your script runs awk four separate times(!), reading each file twice, before it even starts processing the data.
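The naming-collision point can be made concrete with a small sketch. It assumes a POSIX shell such as bash or dash and an installed `awk`; the function names `broken_lookup` and `working_lookup` and the path `/nonexistent/data` are hypothetical. `PATH` looks like a natural name for "the path to my data", but assigning to it replaces the command search path:

```shell
#!/bin/sh
# Hypothetical demo functions; each body runs in a subshell ( ... ) so the
# assignment cannot leak into the calling shell.
broken_lookup() (
    PATH=/nonexistent/data        # meant as a data path, but this is the command search path
    if command -v awk >/dev/null 2>&1; then echo "awk found"; else echo "awk not found"; fi
)

working_lookup() (
    data_path=/nonexistent/data   # lowercase name: no clash with the environment
    if command -v awk >/dev/null 2>&1; then echo "awk found"; else echo "awk not found"; fi
)

broken_lookup
working_lookup
```

On a system where awk is installed, the first call should report that awk can no longer be found and the second should find it as usual.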

With that said, here's another awk approach (frankly, RudiC's is better, but I'd already written this so I'm posting anyway):

$ awk '{
  if(NR==FNR) {
    fn1=FILENAME;
    f1[$1]=$2;
    next
  }
  f2[$1]=$2;
  if($1 in f1){
    if($2 != f1[$1]){
      printf "%s is different; %s value: %s & %s value: %s\n", \
             $1,fn1,f1[$1],FILENAME,$2
    }
  }
  else{
    print $1,"is only present in filename:", FILENAME
  }
}
END{
  for(id in f1){
    if( !(id in f2) ){print id,"is only present in filename:",fn1}
  }
}' file1 file2
d1128 is only present in filename: file2
w1453 is different; file1 value: 500 & file2 value: 515
e3001 is different; file1 value: 75 & file2 value: 55
terdon
  • Don't see c[...] used any further after being incremented? – RudiC Apr 10 '22 at 15:56
  • Thanks, @RudiC, that was a remnant from a different approach I'd tried. Fixed now. Sadly, I didn't think of del to clear the element from the array so your approach is much cleverer and more elegant. – terdon Apr 10 '22 at 16:00
  • Still printing only 1 line. – Sollosa Apr 10 '22 at 16:59
  • @Sollosa then your files are not as you show. Are these Windows text files, perhaps? Did you ever open them on a Windows machine? Do they have \r characters? Try sed -n '/\r/p' file1 file2, if that prints anything, you need to remove the \r characters. – terdon Apr 10 '22 at 17:01
  • @terdon no such characters, even files created on linux, I even tried to add FS as tab in begin section but no luck. I'll copy paste sample in below section of my question from my linux vm. – Sollosa Apr 10 '22 at 17:05
  • @Sollosa are you sure there are only two fields in the file? Could you have stray \t characters in there? You can investigate with od -c file1. – terdon Apr 10 '22 at 17:07
  • @terdon this is the output for command
    0000020   0  \n   w   1   4   5   3  \t   5   1   5  \n
    0000034
    – Sollosa Apr 10 '22 at 17:26
  • @Sollosa OK, the issue seems to be that some of this (FILENAME, I think) is GNU-awk specific. Do you have gawk? This is why you must always mention your operating system. I'm guessing this isn't Linux. This will work if you try with gawk. Sorry about that, answer clarified (stated that it requires gawk). – terdon Apr 10 '22 at 17:38
  • @terdon according to Sollosa comment on RudiC answer the OS is RockyLinux8 and awk is GNU awk – DanieleGrassini Apr 10 '22 at 17:53
  • There is nothing gawk-specific in your script, it'll behave the same with any awk. – Ed Morton Apr 10 '22 at 19:29
  • @Sollosa you showed the output of od -c file1 and that has no problems so the issue must be with file2 so show the output of od -c file2. – Ed Morton Apr 10 '22 at 19:40
  • @EdMorton I tried it with busybox awk and got the output the OP was describing. Of course, I now see that was because I was using the new file, which only has one difference, but I hadn't realized it at the time. Thanks! – terdon Apr 10 '22 at 22:44
  • @terdon fellows I'm not even using any variables, just created 2 files, and running your commands as is. – Sollosa Apr 11 '22 at 07:15
2

The comments are self-explanatory:

awk '
    BEGIN {file1 = ARGV[1]; file2 = ARGV[2]}
# Load all file1 contents
NR == FNR {map[$1] = $2; next}

# If $1 is not in map then this key is unique to file2
!($1 in map) {uniq[$1]; next}

# If $1 is in map and the value differs, there is a delta
# between the two files. Save it.
$1 in map && map[$1] != $2 {diff[$1] = $2; next}

# Same key and same value in both files: nothing to report.
{delete map[$1]}

END {
    # Anything in diff is in both files but
    # with different values
    for ( i in diff )
        print i, "has difference,", file1, "value:", map[i], "&", file2, "value:", diff[i]

    # Anything still in map (and not in diff) is only in file1
    for ( i in map )
        if (!(i in diff))
            print i, "is only present in filename :", file1

    # Anything in uniq is unique to file2
    for ( i in uniq )
        print i, "is only present in filename :", file2
}

' file1 file2

1
awk 'function printUniq(Id, fName){
         printf("%s is only present in filename: %s\n", Id, fName)
}

{ fileName[nxtinput+0]=FILENAME }
!nxtinput{ Ids[$1]=$2; next }

($1 in Ids){ if($2!=Ids[$1])
                 printf ("%s has difference, %s value: %s & %s value: %s\n",\
                 $1, fileName[0], Ids[$1], fileName[1], $2);
             delete Ids[$1];
             next
}
{ printUniq($1, fileName[1]) }
END{ for(id in Ids) printUniq(id, fileName[0]) }' file1 nxtinput=1 file2
αғsнιη
1

Essentially the same solution as posted by RudiC but without the all upper case variable names and with a couple of other minor improvements to clarity:

$ cat tst.awk
NR==FNR {
    file1[$1] = $2
    next
}
$1 in file1 {
    if ( $2 != file1[$1] ) {
        printf "%s has difference, %s value: %s & %s value: %s\n", $1, ARGV[1], file1[$1], FILENAME, $2
    }
    delete file1[$1]
    next
}
{
    print $1, "is only present in filename:", FILENAME
}
END {
    for ( id in file1 ) {
        print id, "is only present in filename:", ARGV[1]
    }
}

$ awk -f tst.awk file1 file2
d1128 is only present in filename: file2
w1453 has difference, file1 value: 500 & file2 value: 515
e3001 has difference, file1 value: 75 & file2 value: 55
Ed Morton