I found the awk command below, which finds the lines missing in 1.txt compared to 2.txt.

awk 'NR==FNR{b[$0]=1;next}!b[$0]' 1.txt 2.txt

I need to understand, step by step, how this awk construct finds the missing lines.

Kusalananda
  • 1
    Actually, it is not looking for missing lines in 1.txt; it prints lines that exist in 2.txt but not in 1.txt, since 1.txt can also have a line that 2.txt doesn't. – αғsнιη Sep 17 '17 at 07:31

1 Answer


The script outputs any line in the second file that does not occur in the first file.

awk 'NR==FNR { b[$0] = 1; next } !b[$0]' 1.txt 2.txt

The awk script starts off by comparing NR to FNR. NR is the number of records (lines) read so far, in total, including the current record. FNR is the number of records read so far from the current input file. If these two numbers are the same, then we are still reading the first input file. Note that this breaks down if the first file happens to be empty, as NR == FNR would then be true for the second file as well.
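The behaviour of the two counters, and the empty-file caveat, can be seen with a small sketch. The file names a.txt, b.txt and empty.txt are made up for the illustration, and the FILENAME == ARGV[1] test is only one possible workaround, not something the original command uses.

```shell
# Hypothetical sample files: a.txt has 2 lines, b.txt has 3, empty.txt has none.
tmpdir=$(mktemp -d)
printf 'x\ny\n'    > "$tmpdir/a.txt"
printf 'p\nq\nr\n' > "$tmpdir/b.txt"
: > "$tmpdir/empty.txt"

# Print the file name and both counters for every record read.
counters=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR }' a.txt b.txt)
printf '%s\n' "$counters"
# a.txt 1 1
# a.txt 2 2
# b.txt 3 1
# b.txt 4 2
# b.txt 5 3

# Empty first file: NR == FNR stays true while reading b.txt,
# so every line is stored and skipped, and nothing is printed.
empty_case=$(cd "$tmpdir" && awk 'NR==FNR { b[$0] = 1; next } !b[$0]' empty.txt b.txt)

# Possible workaround (an assumption, not part of the original command):
# identify the first file by name instead of by counters.
by_name=$(cd "$tmpdir" && awk 'FILENAME == ARGV[1] { b[$0] = 1; next } !b[$0]' empty.txt b.txt)
printf '%s\n' "$by_name"
# p
# q
# r

rm -r "$tmpdir"
```

NR == FNR only holds on the rows for a.txt; as soon as b.txt starts, FNR restarts at 1 while NR keeps counting. The name-based test would in turn misbehave if the same file were given twice on the command line.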

If we're reading the first of the input files (we will assume that it is non-empty), b[$0] = 1 will use the contents of the current record as a hash key and store the value 1 for that key in the array b (array indexes may be strings in awk). The script then executes next, which means that it skips the rest of the script, reads the next record, and starts again from the top.

If NR is not equal to FNR, then this means we're reading the second of the two input files, and !b[$0] is a test with the current input record (line) as a key into the array b that we populated earlier. If a 1 is stored for the current record in b, then we know that this was previously found in the first file. The ! negates the test.

If the test is true, i.e. if the current line from the second file was not previously seen in the first file, then the default action is performed. The default action for a test without a corresponding {...} block is to output the current line (i.e. it acts as if the code was !b[$0] { print }).
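A small worked run may make the whole flow concrete. The file contents below are invented for the illustration; the second awk call is the same program with the default action written out.

```shell
# Invented sample data: 1.txt is the reference, 2.txt is checked against it.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n'                 > "$tmpdir/1.txt"
printf 'banana\ndate\napple\nelderberry\ndate\n' > "$tmpdir/2.txt"

# The command from the question.
short_form=$(awk 'NR==FNR { b[$0] = 1; next } !b[$0]' "$tmpdir/1.txt" "$tmpdir/2.txt")

# The same logic with the implicit print made explicit.
spelled_out=$(awk '
    NR == FNR { b[$0] = 1; next }   # first file: remember the line, skip to next record
    !b[$0]    { print }             # second file: print lines that were never remembered
' "$tmpdir/1.txt" "$tmpdir/2.txt")

printf '%s\n' "$short_form"
# date
# elderberry
# date

rm -r "$tmpdir"
```

banana and apple are suppressed because they were stored while reading 1.txt; cherry never appears because only lines of 2.txt are ever candidates for printing.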


Since this awk script reads all (unique) lines from the first file into memory, it may not be a good idea to run it on very large files.

In those cases, it may be better to do something like

comm -13 <( sort -u file1 ) <( sort -u file2 )

(which requires a shell that knows about process substitutions), or just

comm -13 file1 file2

if the files are already sorted.

This does not generate exactly the same output as the awk script. The awk script outputs a line from file2 that occurs multiple times once for each occurrence, and in the original order of file2, whereas the comm command above outputs it only once if sort -u is used on the input, and in sorted order.
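The difference can be seen side by side on invented sample data. This sketch sorts into temporary files rather than using process substitution, purely so that it also runs in a plain POSIX sh; the file names are hypothetical.

```shell
# Invented sample data; "date" occurs twice in file2.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n'                 > "$tmpdir/file1"
printf 'banana\ndate\napple\nelderberry\ndate\n' > "$tmpdir/file2"

# awk: one output line per occurrence, in the order of file2.
awk_out=$(awk 'NR==FNR { b[$0] = 1; next } !b[$0]' "$tmpdir/file1" "$tmpdir/file2")

# comm needs sorted input; sort -u also removes duplicate lines.
sort -u "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort -u "$tmpdir/file2" > "$tmpdir/file2.sorted"

# -13 suppresses column 1 (only in file1) and column 3 (in both),
# leaving the lines found only in file2.
comm_out=$(comm -13 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

printf 'awk:\n%s\ncomm:\n%s\n' "$awk_out" "$comm_out"
# awk:
# date
# elderberry
# date
# comm:
# date
# elderberry

rm -r "$tmpdir"
```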

See the comm manual on your system for further information.


Addressing the questions in the comments:

  1. Yes, FNR is the number of records read from the current input file.
  2. NR and FNR don't "belong" to either file, they are just counters. The FNR counter is reset when the end of a file is reached, so it starts over for each new input file.
  3. Both NR and FNR are incremented upon reading a line from a file. The next command forces a jump to the beginning of the script which also causes the next line to be read. NR and FNR are incremented by this action since a new line is read.
  4. If NR != FNR then this means we have gone past the first file. FNR was reset to zero upon reaching the end of the previous file, but NR just keeps on counting.
  5. $0 is a variable holding the current line. It holds the complete line as read from the file. $1, $2 etc. hold the fields of the current line as split on the value of the variable FS (by default, runs of whitespace). If the current line is hello world, then $0 would have the value hello world while $1 has the value hello and $2 has the value world (since the line was split on the space). This script only uses $0 though and you may think of $0 as "the contents of the current input line".
  6. b[$0] = 1 is an assignment of a value to a particular location/index in the array b. The location is determined by the current line, $0, and the value assigned is 1. This makes the array b act like a "lookup table"; if b[i] is 1 for any particular index i, then this means that it was seen in the first input file.
  7. !b[$0] will be true if the value stored at index $0 of b is zero (or if it's uninitialized), i.e. if b[$0] was never assigned the value 1, i.e. the line just read from the second file was not previously seen in the first file. Since there is no action (no {...} block) corresponding to this test, the default action of printing $0 is performed. This has the effect of printing every line from the second file that is not present in the first file.
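Points 5 to 7 can be checked directly from the command line. This is a minimal sketch; the input strings and the array keys "seen" and "other" are made up for the illustration.

```shell
# $0 is the whole record; $1 and $2 are its whitespace-separated fields.
fields=$(printf 'hello world\n' | awk '{ print "$0=" $0; print "$1=" $1; print "$2=" $2 }')
printf '%s\n' "$fields"
# $0=hello world
# $1=hello
# $2=world

# The field separator can be changed with -F (which sets awk's FS variable).
second=$(printf 'hello:world\n' | awk -F: '{ print $2 }')
printf '%s\n' "$second"
# world

# A key that was assigned 1 tests as true; a key that was never assigned tests as false.
lookup=$(awk 'BEGIN {
    b["seen"] = 1
    if (b["seen"])   print "seen: found"
    if (!b["other"]) print "other: not found"
}')
printf '%s\n' "$lookup"
# seen: found
# other: not found
```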
Kusalananda
  • Thanks Kusalananda for the detail info.

    I still couldn't get the logic, so could you please explain the queries below?
    1. FNR is the number of records read from the current input file. Is the current input file here 2.txt or 1.txt, as both are input files to awk?
    2. Does NR belong to 2.txt and FNR to 1.txt?
    3. When next is performed, is NR incremented, or FNR?
    4. Could you explain more about the case NR != FNR?
    5. What is $0?
    6. Does b[$0] = 1 mean the line is present in 1.txt, when NR == FNR?
    7. Does !b[$0] show the missing lines, as it shows all the array elements that don't have 1 stored in them?

    – Prashant BJ Sep 17 '17 at 08:21
  • @PrashantBJ See updated answer. – Kusalananda Sep 17 '17 at 08:40
  • 1
    Thanks a lot for explaining with such a detail.
    Now I am able to get the logic.
    Once again really thanks for your patience in writing the detailed explanation to make understand the logic/construct.
    – Prashant BJ Sep 17 '17 at 10:23