I have a file with many numbers in it (only numbers, one per line). I want to find out the number of lines in which the number is greater than 100 (or, in fact, any other value). How can I do that?
2 Answers
Let's consider this test file:
$ cat myfile
98
99
100
101
102
103
104
105
Now, let's count the number of lines with a number greater than 100:
$ awk '$1>100{c++} END{print c+0}' myfile
5
How it works
$1>100{c++}
Every time that the number on the line is greater than 100, the variable c is incremented by 1.
END{print c+0}
After we have finished reading the file, the variable c is printed.
By adding 0 to c, we force awk to treat c as a number. If there were any lines with numbers >100, then c is already a number. If there were not, then c would be an empty string (hat tip: iruvar). By adding zero to it, we change the empty string to 0, giving a more correct output.
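As a small sketch of the points above (the helper name count_gt and the temporary file are made up for illustration, not part of the original answer), the threshold can be passed in with awk's -v option, and the c+0 is what makes the no-match case print 0 instead of an empty line:

```shell
# count_gt THRESHOLD FILE: count lines whose first field is greater than THRESHOLD.
# (count_gt is a hypothetical helper name used only for this example.)
count_gt() {
    awk -v t="$1" '$1>t{c++} END{print c+0}' "$2"
}

# Same sample data as the test file above: 98..105, one number per line.
tmp=$(mktemp)
seq 98 105 > "$tmp"

gt100=$(count_gt 100 "$tmp")     # lines 101..105 qualify
gt1000=$(count_gt 1000 "$tmp")   # nothing qualifies, so c is never set
bare=$(awk '$1>1000{c++} END{print c}' "$tmp")   # same count, but without +0

echo "greater than 100:  $gt100"
echo "greater than 1000: $gt1000"
echo "without +0:        '$bare'"

rm -f "$tmp"
```

The last line shows why the +0 matters: with plain print c, an unset c prints as an empty string.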
Similar solution with perl
$ seq 98 105 | perl -ne '$c++ if $_ > 100; END{print $c+0 ."\n"}'
5
Speed comparison: numbers reported for 3 consecutive runs
Random file:
$ perl -le 'print int(rand(200)) foreach (0..10000000)' > rand_numbers.txt
$ perl -le 'print int(rand(100200)) foreach (0..10000000)' >> rand_numbers.txt
$ shuf rand_numbers.txt -o rand_numbers.txt
$ tail -5 rand_numbers.txt
114
100
66125
84281
144
$ wc rand_numbers.txt
20000002 20000002 93413515 rand_numbers.txt
$ du -h rand_numbers.txt
90M rand_numbers.txt
With awk
$ time awk '$1>100{c++} END{print c+0}' rand_numbers.txt
14940305
real 0m7.754s
real 0m8.150s
real 0m7.439s
With perl
$ time perl -ne '$c++ if $_ > 100; END{print $c+0 ."\n"}' rand_numbers.txt
14940305
real 0m4.145s
real 0m4.146s
real 0m4.196s
And just for fun with grep
(Updated: faster than even Perl with LC_ALL=C)
$ time grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' rand_numbers.txt
14940305
real 0m10.622s
$ time LC_ALL=C grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' rand_numbers.txt
14940305
real 0m0.886s
real 0m0.889s
real 0m0.892s
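Since the grep approach replaces a numeric comparison with a pattern, it is worth checking that the two agree. A minimal sketch, assuming the input contains only non-negative integers without signs, leading zeros, decimals or surrounding whitespace (the range 0..20000 is an arbitrary choice for the check):

```shell
# The pattern from the grep command above, kept in a variable for reuse.
pattern='10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}'

# Count numbers greater than 100 in 0..20000, once by regex and once numerically.
by_grep=$(seq 0 20000 | LC_ALL=C grep -xcE "$pattern")
by_awk=$(seq 0 20000 | awk '$1>100{c++} END{print c+0}')

echo "grep: $by_grep  awk: $by_awk"
[ "$by_grep" = "$by_awk" ] && echo "counts agree"
```

If the input can contain values such as 0150, +150 or 150.5, the regex and the numeric comparison will disagree, and awk or perl is the safer choice.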
sed is no fun:
$ time sed -nE '/^10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}$/p' rand_numbers.txt | wc -l
14940305
real 0m11.929s
$ time LC_ALL=C sed -nE '/^10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}$/p' rand_numbers.txt | wc -l
14940305
real 0m6.238s
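One caveat on the sed commands above: alternation has the lowest precedence, so ^ applies only to the first alternative and $ only to the last. For a file of plain non-negative integers the count still comes out the same, but on other input the middle alternatives can match anywhere in the line. A sketch of a grouped variant (not timed here; the sample line a150b is invented for illustration):

```shell
pattern='10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}'

# Ungrouped, as in the commands above: "150" inside a150b is enough to match.
loose=$(printf 'a150b\n' | LC_ALL=C sed -nE "/^${pattern}\$/p" | wc -l | tr -d ' ')

# Grouped: every alternative must span the whole line.
strict=$(printf 'a150b\n' | LC_ALL=C sed -nE "/^(${pattern})\$/p" | wc -l | tr -d ' ')

# On clean numeric input the grouped form counts the numbers greater than 100.
by_sed=$(seq 0 20000 | LC_ALL=C sed -nE "/^(${pattern})\$/p" | wc -l | tr -d ' ')

echo "loose: $loose  strict: $strict  count: $by_sed"
```

grep -x does not have this problem, because it requires the whole pattern to match the whole line.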

To be fair, compare apples to apples: compare grep w/o -c piped through wc -l to the sed solution, but I expect sed would still be slower. – Dani_l Dec 12 '17 at 04:48
Yeah, I had included sed only because it was tagged by OP. sed isn't the tool to use for arithmetic, and I was actually surprised when I checked grep + LC_ALL=C today, which prompted the edit. – Sundeep Dec 12 '17 at 05:05
print c to print 0+c or even print +c so a sane value of 0 is printed when no line exists with a number greater than 100 – iruvar Sep 26 '16 at 04:05
+0 to force conversion to a number. – John1024 Sep 26 '16 at 05:32