10

How can I count the number of scientific numbers in a file? The file also has a few header lines, which need to be skipped.

A portion of the file's content is shown below.

FileHeaderLine1
FileHeaderLine2
FileHeaderLine3
FileHeaderLine4
2.91999996E-001 2.97030300E-001 3.02060604E-001 3.07090908E-001 3.12121212E-001 3.17151517E-001
3.22181821E-001 3.27212125E-001 3.32242429E-001 3.37272733E-001 3.42303038E-001 3.47333342E-001
3.52363646E-001 3.57393950E-001 3.62424254E-001 3.67454559E-001 3.72484863E-001 3.77515137E-001
3.82545441E-001 3.87575746E-001 3.92606050E-001 3.97636354E-001 4.02666658E-001 4.07696962E-001
4.12727267E-001 4.17757571E-001 4.22787875E-001 4.27818179E-001 4.32848483E-001 4.37878788E-001
4.42909092E-001 4.47939396E-001 4.52969700E-001

So, how can I skip the first four lines of the example above and count the number of scientific numbers in the file?

Braiam
  • 35,991

6 Answers

14

With core module Scalar::Util, you can do:

$ perl -MScalar::Util=looks_like_number -anle '
    $count += grep { looks_like_number($_) } @F;
    END { print $count }
' file
33

You can read more about looks_like_number in perldoc perlapi.

cuonglm
  • 153,898
7

Using GNU grep

You can use grep to do this, using the PCRE facilities. Incidentally the same pattern can be used in Perl too:

$ grep -oP '\d+E[-+]?\d+' file.txt  | wc -l
33

You could also use wc -w to count words. I'm counting lines above, but because grep -o prints each match on a line of its own, both give the same result in this scenario.
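Note that the grep command above does not skip the header lines, which only matters if a header happens to contain something that looks like a number. A minimal sketch that combines it with tail to skip them, assuming exactly four header lines as in the question; the function name count_sci is made up for illustration, and -E with [0-9] is swapped in for -P with \d so that PCRE support is not needed (-o itself is not in POSIX, but GNU and BSD grep both have it):

```shell
# count_sci FILE
# Hypothetical helper: count exponent-style numbers after a 4-line header.
count_sci() {
    # tail -n +5 starts output at line 5, i.e. it drops the 4 header lines.
    # grep -o prints every match on its own line, so wc -l counts matches.
    tail -n +5 "$1" | grep -oE '[0-9]+E[-+]?[0-9]+' | wc -l
}
```

Called as count_sci file.txt, it prints the number of matches found after the header.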

Using Perl

For Perl you could use this one liner:

$ perl -lane '$c += grep /\d+E[-+]?\d+/, @F; END { print $c; }' file.txt 
33

slm
  • 369,824
  • @StephaneChazelas - thanks for the edit. Sorry I only ever am on GNU systems so do tend to forget this point all the time. I'll try to not make that mistake. – slm Jun 20 '14 at 11:55
4

egrep will work:

egrep "[0-9].[0-9]E-[0-9]" YourFile | wc -w

UPDATE:

if a line happened to contain both a number and some other string, we can use awk to solve the problem:

awk -F' ' '{for(i=1;i<=NF;i++)if(!(i%1))$i=$i "\n"}1' YourFile | egrep "[0-9].[0-9]E-[0-9]" | wc -w

(wc -l can be used in place of wc -w here.)

Nidal
  • 8,956
  • This would give incorrect results if a line happened to contain both a number and some other string. The answer above that uses grep's -o option to output only matches is more correct. – Johnny Jun 20 '14 at 02:16
  • I didn't know about -oP option mentioned in slm answer before, but I have fixed my problem using awk @Johnny – Nidal Jun 20 '14 at 02:36
3

Assuming you have only scientific numbers after the 4th line, you can do something like this:

tail -n +5 filename | wc -w

For the input you have provided, the output is 33 after running the above command.
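The same idea can be sketched as a single awk process, under the same assumption that everything after the fourth line is a number. The function name count_fields is hypothetical:

```shell
# count_fields FILE
# Hypothetical helper: count whitespace-delimited fields after line 4.
count_fields() {
    # NR is the current line number, NF the number of fields on that line.
    # "n + 0" forces a numeric 0 when there are no lines past the header.
    awk 'NR > 4 { n += NF } END { print n + 0 }' "$1"
}
```

Like the tail and wc pipeline, this counts every word, so any stray non-numeric text after the header is counted as well.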

Ramesh
  • 39,297
3

If you need to simply count the number of whitespace delimited fields following the header lines in perl, I think you could just do

perl -lane '$sum += $#F+1 if $. > 4; END{print $sum}' file

If you really need to count only scientifically-formatted numbers then one approach might be to search and replace numbers according to a suitable regex and then count the number of replacements (the perl substitution expression returns the number of replacements when you bind it to a variable)

perl -lane '$sum += s/[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?//g if $. > 4; END{print $sum}' file
steeldriver
  • 81,074
2

It all comes down to what you actually want to consider a scientific number, what you can expect your input to contain, and where in the input you are willing to accept those numbers.

For instance, in:

That's inferior to the LK2E2000 model.

I can find either 0 or 2 (inf and 2E2000) or 3 (inf, 2E200, 0) numbers (or taken to the extreme, looking for all the sequences of characters that form a valid number: 17 (inf, 2, 2E2, 2E20, 2E200, 2E200, 2E2000, 2, 20, 200, 2000, 0, 00, 000, 0, 00, 0)).
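A small sketch of how the matching strategy changes the count for that sentence. The two function names are made up for illustration, and the pattern is the simple exponent pattern from the grep answer rather than anything definitive:

```shell
# Both helpers read text on standard input.

# Substring matching: a number embedded in a longer word still counts.
substring_count() {
    grep -oE '[0-9]+E[-+]?[0-9]+' | wc -l
}

# Whole-word matching: put one word per line, then require the whole
# line to match (-x). grep -c prints 0 but exits non-zero when nothing
# matches, hence the "|| true".
whole_word_count() {
    tr -s '[:blank:]' '[\n*]' | grep -xEc '[0-9]+E[-+]?[0-9]+' || true
}
```

Fed the sentence above, substring_count reports one match (the 2E2000 inside LK2E2000), while whole_word_count reports none.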

If you know your input has only numbers in the X.XXXXXXXXE-XXX format, and that they're on words of their own, it may be safer to look just for that in whole words like:

tr -s '[[:blank:]]' '[\n*]' | LC_ALL=C grep -xEc '[0-9]\.[0-9]{8}E-[0-9]{3}'

The idea there is to get one word per line and to match the whole line (-x) against the pattern you want. To allow any scientific notation number (-1.2e+1234... as long as there's an e or E), you could change the pattern to:

[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)[eE][-+]?[0-9]+

Or make the e... part optional to allow all sorts of decimal floating point numbers:

[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?

That all gives the same answer for your specific input, but where that would make a difference is where there is input that departs from the strict pattern shown in your sample.
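Putting the pieces together, here is a minimal sketch that skips the header and counts whole words in scientific notation. It assumes four header lines as in the question. The function name count_sci_words is hypothetical, and the second alternative of the pattern is written as \.[0-9]+ so that mantissas such as .55 also match, which is an assumption about the intent:

```shell
# count_sci_words FILE
# Hypothetical helper: count whole words in scientific notation after
# a 4-line header.
count_sci_words() {
    # 1. tail drops the 4 header lines.
    # 2. tr puts every blank-separated word on its own line.
    # 3. grep -x matches whole lines only, -c counts the matching lines.
    # "|| true" hides grep's non-zero exit status when the count is 0.
    tail -n +5 "$1" |
        tr -s '[:blank:]' '[\n*]' |
        LC_ALL=C grep -xEc '[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)[eE][-+]?[0-9]+' ||
        true
}
```

Words such as LK2E2000 or a plain 3.14 are not counted, because the whole word has to match and the exponent part is mandatory in this pattern.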