4

I have several hundred text files, each consisting of five tab-delimited columns. The first column contains an index and the following four the counts of occurrences. Now I would like to count the number of rows that contain exactly 3 columns with a 0 (i.e. 7 rows in the example below).

1   0   0   0   9
2   0   9   0   0
3   10  0   0   0
4   0   10  4   0
5   0   0   0   10
6   0   0   0   10
7   0   0   0   10
8   0   10  0   0
9   5   0   5   0

I can code this as a loop in R, but as the original files each contain 60+ million rows, I wonder whether there is a faster way using awk or sed together with wc -l.

terdon
  • 242,166
KL_STKBK
  • 143

8 Answers

5

Yes, you can do this in awk:

awk '{ 
       k=0; 
       for(i=2;i<=NF;i++){ 
         if($i == 0){
             k++
         }
       }
       if(k==3){
         tot++
       }
      }
      END{
          print tot
      }' file 
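For a quick sanity check, you can recreate the sample from the question with printf (the file name file is just a placeholder) and run the same logic condensed to one line:

```shell
# Recreate the question's tab-separated sample data.
printf '1\t0\t0\t0\t9\n2\t0\t9\t0\t0\n3\t10\t0\t0\t0\n4\t0\t10\t4\t0\n5\t0\t0\t0\t10\n6\t0\t0\t0\t10\n7\t0\t0\t0\t10\n8\t0\t10\t0\t0\n9\t5\t0\t5\t0\n' > file

# Count rows in which exactly three of the count columns ($2..$NF) are 0.
awk '{k=0; for(i=2;i<=NF;i++) if($i==0) k++; if(k==3) tot++} END{print tot}' file
# 7
```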

And also with (GNU) sed and wc:

$ sed -nE '/\b0\b.*\b0\b.*\b0\b/p' file | wc -l
7

But, personally, I would do it in perl instead:

$ perl -ale '$tot++ if (grep{$_ == 0 } @F) == 3 }{ print $tot' file 
7

Or, the slightly less condensed:

$ perl -ale 'if( (grep{$_ == 0 } @F) == 3 ){
                  $tot++ 
              }
              END{
                  print $tot
              }' file 
7

And the same thing, for the golfers among you:

$ perl -ale '(grep{$_==0}@F)==3&&$t++}{print$t' file
7

Explanation

  • -ale: -a makes perl behave like awk. It will read each line of the input file and split it on whitespace into the array @F. The -l adds a \n to each call of print and removes trailing newlines from the input and the -e is the script that should be applied to each line of input.
  • $tot++ if (grep{$_ == 0 } @F) == 3 : increment $tot by one, for every time where there are exactly 3 fields that are 0. Since the 1st field starts from 1, we know it will never be 0 so we don't need to exclude it.
  • }{: this is just a shorthand way of writing END{}: it gives a block of code that will be executed after the whole file has been processed. So, }{ print $tot will print the total number of lines with exactly three fields with a value of 0.
terdon
  • 242,166
  • The sed solution counts lines with 3 or more zeros. –  Dec 20 '19 at 18:14
  • The perl solution treats a field with a letter (a,b,etc.) as a variable which value is what is tested. –  Dec 20 '19 at 18:15
  • @Isaac yes, the sed counts 3 or more, but since the OP stated that the file only has five fields, the first is an index (starting at 1) and the last is the sum of fields 2,3 and 4, finding three 0s is sufficient. And yes, again, the perl does that, but why is it a problem? We know the file only has numerical values, and even if it doesn't, the match will simply fail, as it should. – terdon Dec 20 '19 at 19:44
  • No, wait, why would it treat a letter any differently? It performs arithmetic comparison, yes, but all fields are treated as variables. – terdon Dec 20 '19 at 19:47
  • The title asks for a general n solution. Limiting the answer to precisely 3 zeros and assuming the first one is never zero doesn't look like a general (for n) solution. –  Dec 20 '19 at 20:51
  • Add a line like abc 0 0 3 4 or ++2 0 0 2 3 or 0x3 0 2 0 7 or !4 0 0 3 5 and then try: perl -ale 'print if (grep{$_ == 0 } @F) == 3' file. You will see the effect. –  Dec 20 '19 at 20:51
2

With GNU grep or ripgrep

$ LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' ip.txt 
7

$ rg -c '\t0\b.*\b0\b.*\b0\b' ip.txt
7

where $'\t' matches a tab character, so this works even if the first column is 0. (The tab has to be a separate $'\t' string: inside $'…', the shell would turn the \b escapes of the rest of the pattern into backspace characters.)
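The effect of the leading tab can be seen on a hypothetical input whose index column contains a 0 (GNU grep assumed for \b):

```shell
# First line has index 0 but only TWO zero count columns, so it must not match.
printf '0\t0\t0\t1\t2\n1\t0\t0\t0\t9\n' > ip.txt

# Anchoring the first 0 to a preceding tab skips the index column:
LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' ip.txt   # 1

# Without the tab anchor, the index 0 is miscounted as a third zero:
LC_ALL=C grep -c '\b0\b.*\b0\b.*\b0\b' ip.txt      # 2
```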


Sample run with large file:

$ perl -0777 -ne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
92M f1

$ time LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' f1 > f2
real    0m0.416s

$ time rg -c '\t0\b.*\b0\b.*\b0\b' f1 > f3  
real    0m1.271s

$ time LC_ALL=C awk 'gsub(/\t0/,"")==3{c++} END{print c+0}' f1 > f4
real    0m8.645s

$ time perl -ale '$tot++ if (grep{$_ == 0 } @F) == 3 }{ print $tot' f1 > f5
real    0m14.349s

$ time LC_ALL=C sed -n 's/\t0\>//4;t;s//&/3p' f1 | wc -l > f6
real    0m14.075s
$ time LC_ALL=C sed -n 's/\t0\>/&/3p' f1 | wc -l > f8    
real    0m6.772s

$ time LC_ALL=C awk '{ 
       k=0; 
       for(i=2;i<=NF;i++){ 
         if($i == 0){
             k++
         }
       }
       if(k==3){
         tot++
       }
      }
      END{
          print tot
      }' f1 > f7 
real    0m10.675s

Remove LC_ALL=C if the file can contain non-ASCII characters. ripgrep is usually faster than GNU grep, but in this test run GNU grep was faster. As per ripgrep's author, (?-u:\b) can be used to avoid Unicode word boundaries, but that resulted in similar times for the above case.

Sundeep
  • 12,008
  • Why are you adding the LC_ALL, is collation relevant here? The \b shouldn't be affected by the locale, what am I missing? – terdon Dec 18 '19 at 13:38
  • @terdon I'm aiming for speed here as OP mentions original files each contain 60+ million rows – Sundeep Dec 18 '19 at 13:40
  • OK, so how does LC_ALL improve speed in this case? I am not saying you're wrong, I'm sure you're right, I'm just trying to learn. – terdon Dec 18 '19 at 13:44
  • I'm trying to find a Q&A where Stéphane Chazelas explains how it speeds up.. till then, here's one where grep was much faster with LC_ALL=C Counting the number of lines having a number greater than 100 – Sundeep Dec 18 '19 at 13:47
  • Not able to find other than What does “LC_ALL=C” do?, but as far as I understand, the speed benefit is because LC_ALL=C means all characters are single bytes (the search space is smaller) – Sundeep Dec 18 '19 at 13:55
  • Hmm, interesting. But I bet the reason is that in that case you were using character classes ([0-9] etc.) and those are indeed affected by the order of collation. I doubt it would make a difference in this case. I'll make some fake files and test though. – terdon Dec 18 '19 at 13:55
  • 1
    I just tried this: perl -0777 -ne 'print $_ x 1000000' ip.txt > f1 and time grep -wc '0\b.*\b0\b.*\b0' f1 > f2 takes about 0m9.994s whereas time LC_ALL=C grep -wc '0\b.*\b0\b.*\b0' f1 > f2 takes about 0m0.449s – Sundeep Dec 18 '19 at 14:01
  • 1
    Well, time for me to ask a question then! Thanks :) – terdon Dec 18 '19 at 14:04
  • 1
    You can use non-Unicode word boundaries in ripgrep with (?-u:\b). – BurntSushi5 Dec 20 '19 at 14:24
1

Using GNU sed:

sed -E 's/\t0\>/&/3;t;d' file  | wc -l

As pointed out by Isaac, if we want to count exact 3 then do this :

sed -n 's/\t0\>//4;t;s//&/3p' file | wc -l
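The question's sample has no row with more than three zeros, so both commands print 7 on it; appending a hypothetical row in which all four count columns are 0 shows the difference:

```shell
# Sample plus an extra row (index 10) with FOUR zero count columns.
printf '1\t0\t0\t0\t9\n2\t0\t9\t0\t0\n3\t10\t0\t0\t0\n4\t0\t10\t4\t0\n5\t0\t0\t0\t10\n6\t0\t0\t0\t10\n7\t0\t0\t0\t10\n8\t0\t10\t0\t0\n9\t5\t0\t5\t0\n10\t0\t0\t0\t0\n' > file

sed -E 's/\t0\>/&/3;t;d' file | wc -l        # 8: three OR MORE zeros
sed -n 's/\t0\>//4;t;s//&/3p' file | wc -l   # 7: exactly three zeros
```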
1
$ awk 'gsub(/\t0/,"")==3{c++} END{print c+0}' file
7
Ed Morton
  • 31,617
  • @Isaac /[ \t]+/ would recognize spaces as the separator in any awk but then you could get failures if any of the tab-separated fields contains blanks (e.g. a field that's foo 0 bar) so it's just not worth it - the OP said the fields are tab-separated so we should just believe them. – Ed Morton Dec 21 '19 at 14:57
0

Use Perl to count the number of lines in which a zero, bounded on the left by a TAB and on the right by a word boundary, occurs exactly three times. At the end, print the count of such lines.

perl -lne '$c += 3 == (() = /\t0\b/g)}{print $c' file
7

Another way to do it is by looking at the fields:

perl -F'\t' -lane '$c++ if 3 == grep ! $_, @F[1..$#F]}{print $c' file

Yet another way is to use the s/// command in a scalar context:

perl -lne '$c += s/\t0\b//g == 3}{print $c' file 

We use GNU awk to do this:

awk -F'\t' '
  {
      gsub(FS, FS FS)
      $0 = $0 FS
      if ($0 != gensub(FS"0"FS, "", 3, $0))  ++c
  }
  END{print c}
' file

GNU grep can also help you:

grep -cP '(.*\t0\b.*){3}' file
0

Another awk solution

awk '{x=$0; if(gsub(/\y0\y/, "", x) == 3) y++}; END{print y+0}' file

This assumes the index column never contains a 0. We copy $0 into a variable x, then use the gsub call to delete from x every 0 surrounded by word boundaries. Since gsub returns the number of substitutions made, we increment y if the return value equals 3.

iruvar
  • 16,725
0

It is possible to match exactly n zeros (not more) with (GNU) sed:

$ i=3;               # use the n value wanted.
$ sed -E 's/\<0\>/&/'"$i"';Ta;{s//&/'"$((i+1))"';T};:a;d' file | wc -l
7

Explanation

i=3                  # define how many zeros to match.
sed -E               # use extended regexes; reduces the need for `\`.
/\<0\>/              # match a zero (0) surrounded by word boundaries.
s/\<0\>/&/           # replace zeros with themselves (detect that they exist).
'"$i"'               # count how many zeros to change.
Ta                   # branch to label `a` (delete line) if
                     # there were fewer than `i` zeros.
{s//&/'"$((i+1))"'   # replace again, but now i+1 zeros.
T                    # branch to end of script (line is printed) if
                     # there is no (i+1)-th zero, i.e. not more than `i`.
:a;d                 # label to delete the line (and catch any other line).
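With another value of i the same command counts other multiplicities; for example i=2 finds the two rows of the sample (4 and 9) that contain exactly two zeros (a sketch, GNU sed assumed; the file name is a placeholder):

```shell
# Recreate the question's tab-separated sample data.
printf '1\t0\t0\t0\t9\n2\t0\t9\t0\t0\n3\t10\t0\t0\t0\n4\t0\t10\t4\t0\n5\t0\t0\t0\t10\n6\t0\t0\t0\t10\n7\t0\t0\t0\t10\n8\t0\t10\t0\t0\n9\t5\t0\t5\t0\n' > file

i=2   # count rows with exactly two zeros
sed -E 's/\<0\>/&/'"$i"';Ta;{s//&/'"$((i+1))"';T};:a;d' file | wc -l
# 2
```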

The equivalent with (GNU) awk is even simpler (gsub counts exactly how many zeros were replaced).

$ awk -vn=3 'gsub(/\<0\>/,"&")==n{c++}END{print c+0}' file
7
0
$ grep -E '(\<0\>.*){3}' file | wc -l
       7

The pattern \<0\> will match a lone digit 0. The full pattern matches any line that has at least three lone zeros.

To count only the lines that contain exactly three zeros, not more, remove any line that contains four or more zeros:

grep -E '(\<0\>.*){3}' file | grep -v -E '(\<0\>.*){4,}' | wc -l
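On the sample extended with a hypothetical row in which all four count columns are 0, the second grep filters that row out:

```shell
# Sample plus an extra row (index 10) with four lone zeros.
printf '1\t0\t0\t0\t9\n2\t0\t9\t0\t0\n3\t10\t0\t0\t0\n4\t0\t10\t4\t0\n5\t0\t0\t0\t10\n6\t0\t0\t0\t10\n7\t0\t0\t0\t10\n8\t0\t10\t0\t0\n9\t5\t0\t5\t0\n10\t0\t0\t0\t0\n' > file

grep -E '(\<0\>.*){3}' file | wc -l                               # 8: at least three
grep -E '(\<0\>.*){3}' file | grep -v -E '(\<0\>.*){4,}' | wc -l  # 7: exactly three
```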
Kusalananda
  • 333,661