135

How can I count the number of occurrences of a specific character in each line with standard text-processing utilities?

For example, to count " in each line of the following text

"hello!" 
Thank you!

The first line has two, and the second line has 0.

Another example is to count ( in each line.

Anthon
Tim
    Just going to add that you would get much better performance by writing your own 10-line C program for this rather than using regular expressions with sed. You should consider doing so, depending on the size of your input files. – user606723 Aug 14 '11 at 22:23

20 Answers

157

You can do it with sed and awk:

$ sed 's/[^"]//g' dat | awk '{ print length }'
2
0

Where dat is your example text, sed deletes (for each line) all non-" characters and awk prints for each line its size (i.e. length is equivalent to length($0), where $0 denotes the current line).

For another character you just have to change the sed expression. For example for ( to:

's/[^(]//g'
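As a quick check of that variant (a sketch; the two-line sample fed in via printf is hypothetical):

```shell
# Same sed+awk pipeline, with the character class switched to '('.
$ printf '(a)(b)\nno parens\n' | sed 's/[^(]//g' | awk '{ print length }'
2
0
```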

Update: sed is kind of overkill for the task - tr is sufficient. An equivalent solution with tr is:

$ tr -d -c '"\n' < dat | awk '{ print length; }'

Meaning that tr deletes all characters which are not (-c means complement) in the character set "\n.
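A quick check of the tr variant against the question's own sample text (a sketch; printf stands in for the dat file):

```shell
# tr keeps only '"' and newlines; awk then prints each line's length.
$ printf '"hello!"\nThank you!\n' | tr -d -c '"\n' | awk '{ print length }'
2
0
```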

maxschlepzig
  • +1 should be more efficient than the tr&wc version. – Stéphane Gimenez Aug 14 '11 at 19:41
  • Yes, but can it handle Unicode? – amphetamachine Aug 15 '11 at 10:51
  • @amphetamachine, yes - at least a quick test with ß (utf hex: c3 9f) (instead of ") works as expected, i.e. tr, sed and awk do complement/replacement/counting without a problem - on a Ubuntu 10.04 system. – maxschlepzig Aug 15 '11 at 18:29
  • Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant (quoted from Wikipedia, tr (Unix)). Try this snippet: echo "aā⧾c" | tr "ā⧾" b ... on Ubuntu 10.04 ... ß is a single-byte Extended Latin char and is handled by tr... The real issue here is not that tr doesn't handle Unicode (because ALL characters are Unicode), it is really that tr only handles one byte at a time. – Peter.O Aug 15 '11 at 19:32
  • @fred, no, ß is not a single byte character - its Unicode position is U+00DF, which is coded as 'c3 9f' in UTF-8, i.e. two bytes. – maxschlepzig Aug 16 '11 at 07:20
  • What if I need to count the number of occurrences of two specific consecutive characters (e.g. ,,)? I imagine it should be easy but the sed pattern s/[^,,]//g didn't work. – Amelio Vazquez-Reina Feb 12 '14 at 22:40
  • @AmelioVazquez-Reina It cannot work, by design. Translated into human-readable prose, s/[^,,]//g means: find everything that is not a comma and remove it. Note that this is a [^character] construction, which excludes the character(s) following the caret ^. This should explain why your multiple commas are ignored and interpreted as a single one. – syntaxerror Nov 25 '15 at 04:48
  • Consider wc -c as a potentially less-overkill alternative to awk here. – Ahmed Fasih Aug 04 '16 at 03:00
  • @AhmedFasih, wc -c counts all characters over all lines - including newlines - not the number of characters for each line. Thus, you can't use it as direct replacement for the awk part in my answer. – maxschlepzig Aug 04 '16 at 06:46
76

I would just use awk

awk -F\" '{print NF-1}' <fileName>

Here we set the field separator (with the -F flag) to the character ", then all we do is print the number of fields minus one, NF-1. The number of occurrences of the target character will be one less than the number of separated fields.

For characters that are special to the shell, you just need to make sure you escape them, otherwise the shell will try to interpret them. So for both " and ) you need to escape the field separator (with \).
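A quick check against the question's sample text (a sketch; printf stands in for the file, and the single-quote form -F'"' is used to avoid backslash escaping):

```shell
# FS is '"'; a line with two quotes splits into three fields, so NF-1 = 2.
$ printf '"hello!"\nThank you!\n' | awk -F'"' '{print NF-1}'
2
0
```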

  • Maybe edit your answer to use single quotes instead for escaping. It will work with any character (except '). Also, it has strange behavior with empty lines. – Stéphane Gimenez Aug 15 '11 at 16:08
    The question specifically uses " so I feel obliged to make the code work with it. Whether the character needs to be escaped depends on what shell you are using, but bash/tcsh will both need " escaped. – Martin York Aug 15 '11 at 16:10
  • Of course, but there is no problem with -F'"'. – Stéphane Gimenez Aug 15 '11 at 16:12
  • +1 What a good idea to use FS.... This will resolve the blank-line showing -1, and, for example, the "$1" from the bash commandline. ... awk -F"$1" '{print NF==0?NF:NF-1}' filename – Peter.O Aug 15 '11 at 22:19
  • Also works with multiple chars as a separator... useful! – COil Sep 30 '16 at 15:35
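The empty-line caveat raised in the comments can be sanity-checked like this (a sketch; the ternary is parenthesized for portability across awk implementations, and the second input line is deliberately empty):

```shell
# An empty line has NF=0, so plain NF-1 would print -1; the guard prints 0.
$ printf '"hello!"\n\n' | awk -F'"' '{print (NF==0 ? NF : NF-1)}'
2
0
```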
19

Using tr and wc:

function countchar()
{
    while IFS= read -r i; do printf "%s" "$i" | tr -dc "$1" | wc -m; done
}

Usage:

$ countchar '"' <file.txt  #returns one count per line of file.txt
1
3
0

$ countchar ')'           #will count parenthesis from stdin
$ countchar '0123456789'  #will count numbers from stdin
15

Yet another implementation that does not rely on external programs, in bash, zsh, yash and some implementations/versions of ksh:

while IFS= read -r line; do 
  line="${line//[!\"]/}"
  echo "${#line}"
done <input-file

Use line="${line//[!(]/}" for counting (.

enzotib
  • When the last line doesn't have a trailing \n, the while loop exits, because although it reads the last line, it also returns a non-zero exit code to indicate EOF... to get around it, the following snippet works (..it has been bugging me for a while, and I've just discovered this workaround)... eof=false; IFS=; until $eof; do read -r || eof=true; echo "$REPLY"; done – Peter.O Aug 15 '11 at 21:42
  • @Gilles: you added a trailing / that is not needed in bash. It is a ksh requirement? – enzotib Aug 16 '11 at 07:35
  • The trailing / is needed in older versions of ksh, and IIRC in older versions of bash as well. – Gilles 'SO- stop being evil' Aug 16 '11 at 08:15
13

The answers using awk fail if the number of matches is too large (which happens to be my situation). For the answer from loki-astari, the following error is reported:

awk -F" '{print NF-1}' foo.txt 
awk: program limit exceeded: maximum number of fields size=32767
    FILENAME="foo.txt" FNR=1 NR=1

For the answer from enzotib (and the equivalent from manatwork), a segmentation fault occurs:

awk '{ gsub("[^\"]", ""); print length }' foo.txt
Segmentation fault

The sed solution by maxschlepzig works correctly, but is slow (timings below).

Some solutions not yet suggested here. First, using grep:

grep -o \" foo.txt | wc -w

And using perl:

perl -ne '$x+=s/\"//g; END {print "$x\n"}' foo.txt

Here are some timings for a few of the solutions (ordered slowest to fastest); I limited things to one-liners here. 'foo.txt' is a file with one line and one long string which contains 84922 matches.

## sed solution by [maxschlepzig]
$ time sed 's/[^"]//g' foo.txt | awk '{ print length }'
84922
real    0m1.207s
user    0m1.192s
sys     0m0.008s

## using grep
$ time grep -o \" foo.txt | wc -w
84922
real    0m0.109s
user    0m0.100s
sys     0m0.012s

## using perl
$ time perl -ne '$x+=s/\"//g; END {print "$x\n"}' foo.txt
84922
real    0m0.034s
user    0m0.028s
sys     0m0.004s

## the winner: updated tr solution by [maxschlepzig]
$ time tr -d -c '\"\n' < foo.txt |  awk '{ print length }'
84922
real    0m0.016s
user    0m0.012s
sys     0m0.004s
josephwb
  • Good idea! I expanded your table in a new answer, feel free to edit (the final picture is not so clear, but I believe @maxschlepzig is still the fastest solution). – JJoao Mar 04 '15 at 08:35
  • maxschlepzig's solution is super fast! – petertc Apr 01 '16 at 06:36
  • For your Perl answer, if you're printing the final $x in an END block, then won't you only get a single-number return? But the OP asked for a count ___per line___ ... ? – jubilatious1 Oct 17 '23 at 18:25
  • @jubilatious1 I explained in the text that my test example only had a single line, which was my use case. I came to this page (almost 10 years ago :) ) originally trying to find a way that wouldn't break with the number of matches I was dealing with. So you are correct, it does not fit the original question if a file contains more than one line. – josephwb Oct 18 '23 at 19:21