Count total number of occurrences using grep

Question

grep -c is useful for finding how many times a string occurs in a file, but it only counts each occurrence once per line. How can I count multiple occurrences per line?

I'm looking for something more elegant than:

perl -e '$_ = <>; print scalar ( () = m/needle/g ), "\n"'

I know grep is specified, but for anyone using ack, the answer is simply ack -ch <pattern>. — Kyle Strand, May 19 '16 at 15:56
@KyleStrand For me ack -ch only counted the lines with occurrences and not the number of occurrences — Marc Kees, Apr 30 '20 at 12:01
@MarcKees Looking at the man page, that sounds like the correct behavior. Thanks for pointing that out! — Kyle Strand, May 13 '20 at 15:44
Similar: Counting occurrences of [a] word in [a] text file. — G-Man Says 'Reinstate Monica', Oct 12 '23 at 08:19
As @user4518 notes in a comment below, the perl code example given above erroneously "only counts the occurrences in the first line." — jubilatious1, Nov 17 '23 at 05:53

score 586 · Accepted Answer · edited Apr 09 '20 at 09:12

grep's -o will only output the matches, ignoring lines; wc can count them:

grep -o 'needle' file | wc -l

This will also match 'needles' or 'multineedle'.

To match only single words use one of the following commands:

grep -ow 'needle' file | wc -l
grep -o '\bneedle\b' file | wc -l
grep -o '\<needle\>' file | wc -l
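A quick way to see the difference between the substring and whole-word forms is to run both on a small sample. This is only a sketch: count_matches is a hypothetical helper name, the sample text is made up, and it assumes a grep that supports -o (GNU grep, for instance).

```shell
# count_matches PATTERN [GREP_OPTIONS...]: hypothetical helper, reads stdin.
# Assumes a grep with -o support (not required by POSIX).
count_matches() {
    pattern=$1
    shift
    # -o prints every match on its own line; wc -l counts those lines.
    # tr strips the padding that some wc implementations add.
    grep -o "$@" -e "$pattern" | wc -l | tr -d ' '
}

# Made-up sample input.
sample='needle needles multineedle
one needle, two needle'

printf '%s\n' "$sample" | count_matches 'needle'      # prints 5 (substrings)
printf '%s\n' "$sample" | count_matches 'needle' -w   # prints 3 (whole words)
```

The two extra hits in the first call come from "needles" and "multineedle", which -w rejects.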

edited Apr 09 '20 at 09:12 by Alexander Pozdneev · answered Feb 06 '11 at 16:27 by wag
9

Note that this requires GNU grep (Linux, Cygwin, FreeBSD, OSX). – Gilles 'SO- stop being evil' May 15 '11 at 14:37
@wag What magic does \b and \B do here? – Geek Jun 12 '14 at 08:36
10

@Geek \b matches a word boundary, \B matches NOT a word boundary. The answer above would be more correct if it used \b at both ends. – Liam Sep 25 '15 at 21:02
1

For a count of occurrences per line, combine with grep -n option and uniq -c ... grep -no '<needle>' file | uniq -c – jameswarren Oct 07 '16 at 13:56
@jameswarren uniq only removes adjacent identical lines, you need to sort before feeding to uniq if you are not already sure that duplicates will always be immediately adjacent. – tripleee Nov 03 '16 at 12:21
How to find the occurrences for multiple words separately? – ZhaoGang Sep 26 '18 at 03:11
1

Doesn't seem to work on WSL, it reports a smaller number of occurrences on large files. grep 'needle' file -c works in my case – quent Sep 13 '21 at 07:40
@tripleee For efficiency, use sort -u rather than sort | uniq. Here, sort is not necessary since matches from the same line in the source will be consecutive lines in the output. – Jivan Pal May 17 '22 at 16:44
2

@JivanPal This was in the context of uniq -c, which sort cannot do. Of course, if you know identical lines will always be adjacent, you don't need sort at all, which they will be if your pattern is just a static string, but not in the general case. – tripleee May 17 '22 at 17:06
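The per-line count discussed in these comments can be sketched as follows. It sidesteps the adjacency concern by discarding the matched text and grouping on the line number only, which grep always emits in order. per_line_counts is a hypothetical helper name, the sample file is made up, and a grep that supports -o is assumed.

```shell
# per_line_counts PATTERN FILE: hypothetical helper.
# Prints "COUNT LINENO" for every line that has at least one match.
per_line_counts() {
    # -n prefixes each match with "LINENO:", -o emits one match per output line.
    # cut keeps only the line number, so different match texts on the same
    # input line fall into one group; line numbers arrive in order, so uniq -c
    # needs no sort. awk only normalises the padding uniq -c adds.
    grep -no -e "$1" "$2" | cut -d: -f1 | uniq -c | awk '{print $1, $2}'
}

# Made-up sample file.
tmp=$(mktemp)
printf '%s\n' 'needle neeedle' 'nothing here' 'needle' > "$tmp"
per_line_counts 'nee*dle' "$tmp"
# prints:
# 2 1
# 1 3
rm -f "$tmp"
```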

score 23 · Answer 2 · edited Apr 13 '17 at 12:36

If you have GNU grep (always on Linux and Cygwin, occasionally elsewhere), you can count the output lines from grep -o: grep -o needle | wc -l.

With Perl, here are a few ways I find more elegant than yours (even after it's fixed).

perl -lne 'END {print $c} map ++$c, /needle/g'
perl -lne 'END {print $c} $c += s/needle//g'
perl -lne 'END {print $c} ++$c while /needle/g'

With only POSIX tools, one approach, if possible, is to split the input into lines with a single match before passing it to grep. For example, if you're looking for whole words, then first turn every non-word character into a newline.

# equivalent to grep -ow 'needle' | wc -l
tr -c '[:alnum:]' '[\n*]' | grep -c '^needle$'
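As a small illustration of the split-then-count idea, the sketch below wraps the same tr call in a hypothetical count_word helper and feeds it made-up input. It uses grep -xF, so the word is taken as a fixed string rather than a regular expression; that is a choice made for the example, not part of the command above.

```shell
# count_word WORD: hypothetical helper, reads stdin.
count_word() {
    # Turn every non-alphanumeric character into a newline, so each word
    # ends up alone on a line, then count the lines that equal WORD exactly.
    tr -c '[:alnum:]' '[\n*]' | grep -c -x -F -e "$1"
}

printf '%s\n' 'needle, needles; needle.needle' | count_word 'needle'   # prints 3
```

One difference from grep -w worth keeping in mind: the underscore is not in [:alnum:], so this version counts the needle in needle_x, whereas grep -w treats the underscore as part of the word and does not.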

Otherwise, there's no standard command to do this particular bit of text processing, so you need to turn to sed (if you're a masochist) or awk.

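# awk: count a match, drop the text up to and including it, and try again
awk '{while (match($0, /set/)) {++c; $0=substr($0, RSTART+RLENGTH)}}
     END {print c}'

# sed: put every match on its own line, delete everything else, count the lines
sed -n -e 's/set/\n&\n/g' -e 's/^/\n/' -e 's/$/\n/' \
       -e 's/\n[^\n]*\n/\n/g' -e 's/^\n//' -e 's/\n$//' \
       -e '/./p' | wc -l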
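Another awk option, not part of the answer above, relies on the return value of gsub, which POSIX awk defines as the number of substitutions made. The sketch below is one way to use it; count_awk is a hypothetical helper name and the sample input is made up.

```shell
# count_awk ERE: hypothetical helper, reads stdin, prints the total match count.
count_awk() {
    # gsub returns the number of substitutions it made; replacing each
    # match with itself ("&") leaves the line unchanged.
    # c + 0 prints 0 instead of an empty line when nothing matched.
    awk -v re="$1" '{ c += gsub(re, "&") } END { print c + 0 }'
}

printf '%s\n' 'set reset setset' 'no match' 'sets' | count_awk 'set'   # prints 5
```

Because the pattern is passed with -v, awk processes backslash escapes in it, so patterns containing backslashes may need extra escaping.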

Here's a simpler solution using sed and grep, which works for strings or even by-the-book regular expressions but fails in a few corner cases with anchored patterns (e.g. it finds two occurrences of ^needle or \bneedle in needleneedle).

sed 's/needle/\n&\n/g' | grep -cx 'needle'

Note that in the sed substitutions above, I used \n to mean a newline. This is standard in the pattern part, but in the replacement text, for portability, substitute backslash-newline for \n.
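For reference, the portable spelling of the simpler sed and grep command might look like this, with each \n in the replacement written as a backslash followed by a real newline. count_needle is a hypothetical wrapper name and the sample input is made up.

```shell
# Portable form of: sed 's/needle/\n&\n/g' | grep -cx 'needle'
# The continuation lines must start in column 0; any indentation there
# would become part of the replacement text and break the -x match.
count_needle() {
    sed 's/needle/\
&\
/g' | grep -cx 'needle'
}

printf '%s\n' 'needle in a needlestack' 'needleneedle' | count_needle   # prints 4
```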

OJFord · Answer 3 · 2018-07-15T18:20:45.383

7

If, like me, you actually wanted "either; twice" (not, as I first wrote, "both; each exactly once"), then it's simple:

grep -E "thing1|thing2" -c

and check for the output 2.

The benefit of this approach (if exactly once is what you want) is that it scales easily.

edited Jul 15 '18 at 18:20 · answered Jan 13 '17 at 13:20 by OJFord

I'm not sure you're actually checking it's only appearing once? All you're looking for there is that either one of those words exist at least once. – Steve Gore Jul 11 '18 at 02:29
1

This should be the accepted answer. No need to use wc -l, grep has a built-in option to count things, and it is even named as obvious as -c for “count”! – rugk Aug 06 '20 at 20:03
6

@rugk You completely missed the first sentence in OP's post, which explicitly explains that -c only counts one occurrence per line. If a string occurs 1000 times on the same line, grep -c will still only count it as one. This answer makes no sense at all for this question. – Alexia Luna Aug 06 '21 at 21:52
The whole point of the question is exactly that the -c option does not work. – Hi-Angel Nov 04 '23 at 13:58

ripat · Answer 4 · 2011-05-15T14:03:28.147

4

Another solution using awk and needle as field separator:

awk -F'^needle | needle | needle$' '{c+=NF-1}END{print c}'

If you want to match needle followed by punctuation, change the field separator accordingly, for example:

awk -F'^needle[ ,.?]|[ ,.?]needle[ ,.?]|[ ,.?]needle$' '{c+=NF-1}END{print c}'

Or use the class [^[:alnum:]] to cover all non-alphanumeric characters.

edited May 15 '11 at 14:03 · answered May 15 '11 at 13:54 by ripat

1

Note that this requires an awk that supports regexp field separators (such as GNU awk). – Gilles 'SO- stop being evil' May 15 '11 at 14:38

score 3 · Answer 5 · answered Nov 05 '21 at 18:39

I had a need to do this but for more than one search term. And I wanted them to be listed in columns with the number of occurrences of each.

My bash one-liner solution is as follows:

grep -o -E 'borp|flarb' flarb.log  | sort | uniq -c
 910 borp
9090 flarb
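To try the same pipeline without a log file at hand, a sketch along these lines may help. count_terms is a hypothetical helper name and the input is made up; the final awk step only tidies the padding that uniq -c adds and puts the term first.

```shell
# count_terms ERE: hypothetical helper, reads stdin.
# ERE is an alternation such as 'borp|flarb'; prints "TERM COUNT" per term found.
count_terms() {
    # grep -o emits one match per line, sort groups identical matches,
    # uniq -c counts each group.
    grep -o -E -e "$1" | sort | uniq -c | awk '{print $2, $1}'
}

printf '%s\n' 'borp flarb borp' 'flarb' 'borp' | count_terms 'borp|flarb'
# prints:
# borp 3
# flarb 2
```

Terms that never occur produce no output line at all, so a missing term means a count of zero.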

score 1 · Answer 6 · answered Aug 08 '12 at 21:31

This is my bash solution:

#!/bin/bash

B=$(for i in $(cat /tmp/a | sort -u); do
echo "$(grep $i /tmp/a | wc -l) $i"
done)

echo "$B" | sort --reverse

answered Aug 08 '12 at 21:31 by Felipe

This is rather inefficient and brittle. Don't read lines with for; the broken quoting will cause this to fail where the input file contains lines with whitespace or shell metacharacters. – tripleee May 17 '22 at 17:13

score 1 · Answer 7 · answered Feb 06 '11 at 15:41

Your example only prints out the number of occurrences per-line, and not the total in the file. If that's what you want, something like this might work:

perl -nle '$c+=scalar(()=m/needle/g);END{print $c}'

answered Feb 06 '11 at 15:41 by jsbillings

You are right -- my example only counts the occurrences in the first line. – Feb 06 '11 at 15:49
