7

I have a file with many numbers in it (only numbers, one number per line). I want to find out the number of lines in which the number is greater than 100 (or in fact any other threshold). How can I do that?

Innocent

2 Answers

13

Let's consider this test file:

$ cat myfile
98
99
100
101
102
103
104
105

Now, let's count the number of lines with a number greater than 100:

$ awk '$1>100{c++} END{print c+0}' myfile
5

How it works

  • $1>100{c++}

    Every time that the number on the line is greater than 100, the variable c is incremented by 1.

  • END{print c+0}

    After we have finished reading the file, the variable c is printed.

    By adding 0 to c, we force awk to treat c as a number. If there were any lines with numbers >100, then c is already a number. If there were not, c was never assigned and is the empty string (hat tip: iruvar). Adding zero to it converts the empty string to 0, giving the correct output.
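As a quick sanity check of the c+0 behavior, and to avoid hard-coding the threshold, the limit can be passed in with awk's standard -v option (a minimal sketch; the variable name limit is just an illustration):

```shell
# Pass the threshold in as an awk variable instead of hard-coding it.
printf '98\n99\n100\n101\n102\n103\n104\n105\n' |
    awk -v limit=100 '$1 > limit {c++} END {print c+0}'

# With no matching lines, c is never set; c+0 still prints 0
# instead of an empty line.
printf '1\n2\n3\n' | awk '$1 > 100 {c++} END {print c+0}'
```

The first command prints 5 and the second prints 0, matching the explanation above.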

John1024
    I would change the print c to print 0+c or even print +c so a sane value of 0 is printed when no line exists with a number greater than 100 – iruvar Sep 26 '16 at 04:05
  • @iruvar Good point! Thanks. answer updated with +0 to force conversion to a number. – John1024 Sep 26 '16 at 05:32
2

Similar solution with perl

$ seq 98 105 | perl -ne '$c++ if $_ > 100; END{print $c+0 ."\n"}'
5


Speed comparison: numbers reported for 3 consecutive runs

Random file:

$ perl -le 'print int(rand(200)) foreach (0..10000000)' > rand_numbers.txt
$ perl -le 'print int(rand(100200)) foreach (0..10000000)' >> rand_numbers.txt

$ shuf rand_numbers.txt -o rand_numbers.txt 
$ tail -5 rand_numbers.txt 
114
100
66125
84281
144
$ wc rand_numbers.txt 
20000002 20000002 93413515 rand_numbers.txt
$ du -h rand_numbers.txt 
90M rand_numbers.txt

With awk

$ time awk '$1>100{c++} END{print c+0}' rand_numbers.txt 
14940305

real    0m7.754s
real    0m8.150s
real    0m7.439s

With perl

$ time perl -ne '$c++ if $_ > 100; END{print $c+0 ."\n"}' rand_numbers.txt 
14940305

real    0m4.145s
real    0m4.146s
real    0m4.196s

And, just for fun, with grep (updated: with LC_ALL=C, it is faster than even Perl)

$ time grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' rand_numbers.txt 
14940305

real    0m10.622s

$ time LC_ALL=C grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' rand_numbers.txt
14940305

real    0m0.886s
real    0m0.889s
real    0m0.892s
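The extended regex enumerates the decimal integers greater than 100: 101-109, 110-199, numbers of three or more digits starting with 2-9, and numbers of four or more digits starting with 1; -x requires the whole line to match. A quick boundary check (a sketch for illustration, not part of the timing runs above):

```shell
# Boundary values around the threshold: only the five values > 100
# should be counted.
printf '99\n100\n101\n110\n200\n999\n1000\n' |
    grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}'
```

This prints 5: 99 and 100 are rejected, while 101, 110, 200, 999, and 1000 match.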

sed is no fun:

$ time sed -nE '/^10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}$/p' rand_numbers.txt | wc -l
14940305

real    0m11.929s

$ time LC_ALL=C sed -nE '/^10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}$/p' rand_numbers.txt | wc -l
14940305

real    0m6.238s
Sundeep
    To be fair compare apples to apples: compare grep w/o -c piped through wc -l to the sed solution, but I expect sed would still be slower. – Dani_l Dec 12 '17 at 04:48
  • yeah, I had included sed only because it was tagged by OP.. sed isn't the tool to use for arithmetic.. and I was actually surprised when I checked grep + LC_ALL=C today which prompted the edit.. – Sundeep Dec 12 '17 at 05:05