I found the awk command below, which finds the lines that are missing in 1.txt compared to 2.txt.
awk 'NR==FNR{b[$0]=1;next}!b[$0]' 1.txt 2.txt
I need to understand, step by step, how this awk construct finds the missing lines.
The script outputs any line in the second file that does not occur in the first file.
awk 'NR==FNR { b[$0] = 1; next } !b[$0]' 1.txt 2.txt
The awk script starts off by comparing NR to FNR. NR is the number of records (lines) read so far, in total, including the current record. FNR is the number of records read from the current input file. If these two numbers are the same, then we are still reading the first input file. Note that this will break down if the first file happens to be empty, as NR == FNR would then be true for the second file as well.
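To make the two counters concrete, here is a small sketch that prints NR and FNR for every record, and then shows the empty-first-file caveat. The file names and their contents are made up for illustration and are created in a temporary directory.

```shell
# Hypothetical sample files (not from the original question).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/1.txt"
printf 'b\nc\n' > "$tmpdir/2.txt"
: > "$tmpdir/empty.txt"

# Print NR, FNR and the line itself for every record read.
# NR keeps counting across files, FNR restarts at 1 for the second file.
counters=$(awk '{ print NR, FNR, $0 }' "$tmpdir/1.txt" "$tmpdir/2.txt")
printf '%s\n' "$counters"

# With an empty first file, NR == FNR is true for every line of the
# second file, so each line is stored in b and nothing is printed.
empty_case=$(awk 'NR==FNR { b[$0] = 1; next } !b[$0]' "$tmpdir/empty.txt" "$tmpdir/2.txt")
printf 'output with empty first file: [%s]\n' "$empty_case"

rm -r "$tmpdir"
```

The first command shows NR == FNR only while the first file is being read. The second shows that the script silently prints nothing when the first file is empty, even though both lines of the second file are "missing" from it.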
If we're reading the first of the input files (we will assume that it is non-empty), b[$0] = 1 will use the contents of the current record as a hash key and store the value 1 for that key in the array b (array indexes may be strings in awk). The script then executes next, which means that it jumps back to the start of the script and reads the next record.
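As a rough illustration of what the array b holds once the first file has been read, the sketch below dumps its keys in an END block. The sample file is hypothetical, and the output is piped through sort because the order in which awk's for (k in b) visits keys is unspecified.

```shell
# Hypothetical first file containing a duplicate line.
tmpdir=$(mktemp -d)
printf 'a\nb\na\n' > "$tmpdir/1.txt"

# Store every line as a key with the value 1, then list the keys.
# The duplicate "a" overwrites the same entry, so only unique lines remain.
keys=$(awk '{ b[$0] = 1 } END { for (k in b) print k "=" b[k] }' "$tmpdir/1.txt" | sort)
printf '%s\n' "$keys"

rm -r "$tmpdir"
```

This prints a=1 and b=1: one entry per distinct line, which is why only the unique lines of the first file are kept in memory.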
If NR is not equal to FNR, then this means we're reading the second of the two input files, and !b[$0] is a test with the current input record (line) as a key into the array b that we populated earlier. If a 1 is stored for the current record in b, then we know that this was previously found in the first file. The ! negates the test.
If the test is true, i.e. if the current line from the second file was not previously seen in the first file, then the default action is performed. The default action for a test without a corresponding {...} block is to output the current line (i.e. it acts as if the code was !b[$0] { print }).
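The equivalence between the bare test and the explicit { print } form can be checked with a short sketch. The sample files below are hypothetical.

```shell
# Hypothetical sample files; "c" appears twice in the second file.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/1.txt"
printf 'b\nc\nc\n' > "$tmpdir/2.txt"

# Bare pattern: the default action (print the current line) is used.
short=$(awk 'NR==FNR { b[$0] = 1; next } !b[$0]' "$tmpdir/1.txt" "$tmpdir/2.txt")

# Same pattern with the action written out explicitly.
long=$(awk 'NR==FNR { b[$0] = 1; next } !b[$0] { print }' "$tmpdir/1.txt" "$tmpdir/2.txt")

printf 'short form:\n%s\n' "$short"
printf 'long form:\n%s\n' "$long"

rm -r "$tmpdir"
```

Both forms print c twice here, once for each occurrence in the second file, since b is only ever assigned while reading the first file.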
Since this awk script reads all (unique) lines from the first file into memory, it may not be a good idea to run it on very large files.
In those cases, it may be better to do something like
comm -13 <( sort -u file1 ) <( sort -u file2 )
(which requires a shell that knows about process substitutions), or just
comm -13 file1 file2
if the files are already sorted.
This does not generate exactly the same output as the awk script: the awk script will output any line from file2 that occurs multiple times once for each time it occurs, whereas the comm command above won't if sort -u is used on the input.
See the comm manual on your system for further information.
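The difference in how duplicates are handled can be seen with a small sketch. The sample files are hypothetical, and to avoid depending on process substitution the sorted copies are written to temporary files instead, which is an assumption of this example rather than part of the commands above.

```shell
# Hypothetical sample files; "c" appears twice in file2.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/file1"
printf 'b\nc\nc\n' > "$tmpdir/file2"

# awk approach: prints each unmatched line once per occurrence.
awk_out=$(awk 'NR==FNR { b[$0] = 1; next } !b[$0]' "$tmpdir/file1" "$tmpdir/file2")

# comm approach on sorted, de-duplicated copies of the files.
sort -u "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort -u "$tmpdir/file2" > "$tmpdir/file2.sorted"
comm_out=$(comm -13 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

printf 'awk:\n%s\n' "$awk_out"
printf 'comm:\n%s\n' "$comm_out"

rm -r "$tmpdir"
```

Here awk prints c twice while comm prints it once, because sort -u has already collapsed the duplicate.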
Addressing the questions in the comments:
1. FNR is the number of records read from the current input file.
2. NR and FNR don't "belong" to either file, they are just counters. The FNR counter resets as the end of a file is reached.
3. NR and FNR are incremented upon reading a line from a file. The next command forces a jump to the beginning of the script, which also causes the next line to be read. NR and FNR are incremented by this action since a new line is read.
4. If NR != FNR, then this means we have gone past the first file. FNR was reset to zero upon reaching the end of the previous file, but NR just keeps on counting.
5. $0 is a variable holding the current line. It holds the complete line as read from the file. $1, $2 etc. hold the fields of the current line as split on the value of the variable FS (by default, any whitespace). If the current line is hello world, then $0 would have the value hello world while $1 has the value hello and $2 has the value world (since the line was split on the space). This script only uses $0 though, and you may think of $0 as "the contents of the current input line".
6. b[$0] = 1 is an assignment of a value to a particular location/index in the array b. The location is determined by the current line, $0, and the value assigned is 1. This makes the array b act like a "lookup table"; if b[i] is 1 for any particular index i, then this means that it was seen in the first input file.
7. !b[$0] will be true if the value stored at index $0 of b is zero (or if it's uninitialized), i.e. if b[$0] was never assigned the value 1, i.e. the line just read from the second file was not previously seen in the first file. Since there is no action (no {...} block) corresponding to this test, the default action of printing $0 is performed. This has the effect of printing every line from the second file that is not present in the first file.

I still couldn't get the logic, so could you please explain the queries below? 1. FNR is the number of records read from the current input file, so is the current input file here 2.txt or 1.txt, as both are input files to awk? 2. Does NR belong to 2.txt and FNR belong to 1.txt? 3. When 'next' is performed, is NR incremented or FNR? 4. Could you brief me more on the case NR!=FNR? 5. What is $0? 6. If b[$0] = 1, does that mean the line is present in 1.txt, when NR==FNR? 7. Does !b[$0] show the missing lines, as it shows all the array elements which don't have 1 stored in them?
– Prashant BJ Sep 17 '17 at 08:21