135

How can I count the number of occurrences of a specific character in each line with standard text-processing utilities?

For example, to count " in each line of the following text

"hello!" 
Thank you!

The first line has two, and the second line has 0.

Another example is to count ( in each line.

Anthon
Tim
    Just going to add that you would get much better performance by writing your own 10-line C program for this rather than using regular expressions with sed. You should consider doing so, depending on the size of your input files. – user606723 Aug 14 '11 at 22:23

20 Answers

157

You can do it with sed and awk:

$ sed 's/[^"]//g' dat | awk '{ print length }'
2
0

Where dat is your example text, sed deletes (for each line) all non-" characters and awk prints for each line its size (i.e. length is equivalent to length($0), where $0 denotes the current line).

For another character you just have to change the sed expression. For example for ( to:

's/[^(]//g'
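As a quick check of that variant (a sketch; the two-line sample fed in via printf is hypothetical):

```shell
# Same sed+awk pipeline, with the character class switched to '('.
$ printf '(a)(b)\nno parens\n' | sed 's/[^(]//g' | awk '{ print length }'
2
0
```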

Update: sed is kind of overkill for the task - tr is sufficient. An equivalent solution with tr is:

$ tr -d -c '"\n' < dat | awk '{ print length; }'

Meaning that tr deletes all characters which are not (-c means complement) in the character set "\n.
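A quick check of the tr variant against the question's own sample text (a sketch; printf stands in for the dat file):

```shell
# tr keeps only '"' and newlines; awk then prints each line's length.
$ printf '"hello!"\nThank you!\n' | tr -d -c '"\n' | awk '{ print length }'
2
0
```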

maxschlepzig
  • +1 should be more efficient than the tr&wc version. – Stéphane Gimenez Aug 14 '11 at 19:41
  • Yes, but can it handle Unicode? – amphetamachine Aug 15 '11 at 10:51
  • @amphetamachine, yes - at least a quick test with ß (utf hex: c3 9f) (instead of ") works as expected, i.e. tr, sed and awk do complement/replacement/counting without a problem - on a Ubuntu 10.04 system. – maxschlepzig Aug 15 '11 at 18:29
  • Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant (quoted from Wikipedia, tr (Unix)). Try this snippet: echo "aā⧾c" | tr "ā⧾" b ... on Ubuntu 10.04 ... ß is a single-byte Extended Latin char and is handled by tr... The real issue here is not that tr doesn't handle Unicode (because ALL characters are Unicode), it is really that tr only handles one byte at a time. – Peter.O Aug 15 '11 at 19:32
  • @fred, no, ß is not a single byte character - its Unicode position is U+00DF, which is coded as 'c3 9f' in UTF-8, i.e. two bytes. – maxschlepzig Aug 16 '11 at 07:20
  • What if I need to count the number of occurrences of two specific consecutive characters (e.g. ,,)? I imagine it should be easy but the sed pattern s/[^,,]//g didn't work. – Amelio Vazquez-Reina Feb 12 '14 at 22:40
  • @AmelioVazquez-Reina It cannot work, by design. Translated into human-readable prose, s/[^,,]//g means: find everything that is not a comma and remove it. Note that this is a [^character] construction, which excludes the character(s) following the caret ^. This should explain why your multiple commas are ignored and interpreted as a single one. – syntaxerror Nov 25 '15 at 04:48
  • Consider wc -c as a potentially less-overkill alternative to awk here. – Ahmed Fasih Aug 04 '16 at 03:00
  • @AhmedFasih, wc -c counts all characters over all lines - including newlines - not the number of characters for each line. Thus, you can't use it as direct replacement for the awk part in my answer. – maxschlepzig Aug 04 '16 at 06:46
76

I would just use awk

awk -F\" '{print NF-1}' <fileName>

Here we set the field separator (with the -F flag) to the character ", then all we do is print the number of fields minus one, NF-1. The number of occurrences of the target character will be one less than the number of separated fields.

For characters that are special to the shell, you just need to make sure you escape them, otherwise the shell will try to interpret them. So for both " and ) you need to escape the field separator (with \).
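A quick check against the question's sample text (a sketch; printf stands in for the file, and the single-quote form -F'"' is used to avoid backslash escaping):

```shell
# FS is '"'; a line with two quotes splits into three fields, so NF-1 = 2.
$ printf '"hello!"\nThank you!\n' | awk -F'"' '{print NF-1}'
2
0
```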

  • Maybe edit your answer to use single quotes instead for escaping. It will work with any character (except '). Also, it has strange behavior with empty lines. – Stéphane Gimenez Aug 15 '11 at 16:08
    The question specifically uses " so I feel obliged to make the code work with it. Whether the character needs to be escaped depends on what shell you are using, but bash/tcsh will both need " escaped. – Martin York Aug 15 '11 at 16:10
  • Of course, but there is no problem with -F'"'. – Stéphane Gimenez Aug 15 '11 at 16:12
  • +1 What a good idea to use FS.... This will resolve the blank-line showing -1, and, for example, the "$1" from the bash commandline. ... awk -F"$1" '{print NF==0?NF:NF-1}' filename – Peter.O Aug 15 '11 at 22:19
  • Also works with multiple chars as a separator... useful! – COil Sep 30 '16 at 15:35
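The empty-line caveat raised in the comments can be sanity-checked like this (a sketch; the ternary is parenthesized for portability across awk implementations, and the second input line is deliberately empty):

```shell
# An empty line has NF=0, so plain NF-1 would print -1; the guard prints 0.
$ printf '"hello!"\n\n' | awk -F'"' '{print (NF==0 ? NF : NF-1)}'
2
0
```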
19

Using tr and wc:

function countchar()
{
    while IFS= read -r i; do printf "%s" "$i" | tr -dc "$1" | wc -m; done
}

Usage:

$ countchar '"' <file.txt  #returns one count per line of file.txt
1
3
0

$ countchar ')'           #will count parenthesis from stdin
$ countchar '0123456789'  #will count numbers from stdin
15

Yet another implementation that does not rely on external programs, in bash, zsh, yash and some implementations/versions of ksh:

while IFS= read -r line; do 
  line="${line//[!\"]/}"
  echo "${#line}"
done <input-file

Use line="${line//[!(]/}" for counting (.

enzotib
  • When the last line doesn't have a trailing \n, the while loop exits, because although it reads the last line, it also returns a non-zero exit code to indicate EOF... to get around it, the following snippet works (..it has been bugging me for a while, and I've just discovered this workaround)... eof=false; IFS=; until $eof; do read -r || eof=true; echo "$REPLY"; done – Peter.O Aug 15 '11 at 21:42
  • @Gilles: you added a trailing / that is not needed in bash. It is a ksh requirement? – enzotib Aug 16 '11 at 07:35
  • The trailing / is needed in older versions of ksh, and IIRC in older versions of bash as well. – Gilles 'SO- stop being evil' Aug 16 '11 at 08:15
13

The answers using awk fail if the number of matches is too large (which happens to be my situation). For the answer from loki-astari, the following error is reported:

awk -F" '{print NF-1}' foo.txt 
awk: program limit exceeded: maximum number of fields size=32767
    FILENAME="foo.txt" FNR=1 NR=1

For the answer from enzotib (and the equivalent from manatwork), a segmentation fault occurs:

awk '{ gsub("[^\"]", ""); print length }' foo.txt
Segmentation fault

The sed solution by maxschlepzig works correctly, but is slow (timings below).

Some solutions not yet suggested here. First, using grep:

grep -o \" foo.txt | wc -w

And using perl:

perl -ne '$x+=s/\"//g; END {print "$x\n"}' foo.txt

Here are some timings for a few of the solutions (ordered slowest to fastest); I limited things to one-liners here. 'foo.txt' is a file with one line and one long string which contains 84922 matches.

## sed solution by [maxschlepzig]
$ time sed 's/[^"]//g' foo.txt | awk '{ print length }'
84922
real    0m1.207s
user    0m1.192s
sys     0m0.008s

## using grep
$ time grep -o \" foo.txt | wc -w
84922
real    0m0.109s
user    0m0.100s
sys     0m0.012s

## using perl
$ time perl -ne '$x+=s/\"//g; END {print "$x\n"}' foo.txt
84922
real    0m0.034s
user    0m0.028s
sys     0m0.004s

## the winner: updated tr solution by [maxschlepzig]
$ time tr -d -c '\"\n' < foo.txt |  awk '{ print length }'
84922
real    0m0.016s
user    0m0.012s
sys     0m0.004s
josephwb
  • Good idea! I expanded your table in a new answer, feel free to edit (the final picture is not so clear, but I believe @maxschlepzig is still the fastest solution). – JJoao Mar 04 '15 at 08:35
  • maxschlepzig's solution is super fast! – petertc Apr 01 '16 at 06:36
  • For your Perl answer, if you're printing the final $x in an END block, then won't you only get a single-number return? But the OP asked for a count ___per line___ ... ? – jubilatious1 Oct 17 '23 at 18:25
  • @jubilatious1 I explained in the text that my test example only had a single line, which was my use case. I came to this page (almost 10 years ago :) ) originally trying to find a way that wouldn't break with the number of matches I was dealing with. So you are correct, it does not fit the original question if a file contains more than one line. – josephwb Oct 18 '23 at 19:21