Consider this test file:
$ cat text.txt
this file has "many" words, some
with punctuation. some repeat,
many do not.
To get a word count:
$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
2 some
2 many
1 words
1 with
1 this
1 repeat
1 punctuation
1 not
1 has
1 file
1 do
How it works
grep -oE '[[:alpha:]]+' text.txt
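This returns all words, minus any spaces or punctuation, with one word per line. The -o option prints only the matched text, each match on its own line, and -E enables extended regular expressions.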
sort
This sorts the words into alphabetical order.
uniq -c
This counts the number of times each word occurs. (uniq only collapses adjacent duplicate lines, so its input must be sorted for the counts to be correct.)
sort -nr
This sorts the output numerically (-n) in reverse order (-r) so that the most frequent word is at the top.
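To see what each stage contributes, the pipeline can be built up one command at a time. This is only a sketch: it uses a tiny inline sample (the variable name sample is made up for illustration) rather than text.txt, so it runs on its own.

```shell
# Tiny inline sample (hypothetical), so no input file is needed.
sample='b a, b.'

# Stage 1: one word per line, spaces and punctuation dropped.
# prints b, a, b (one per line)
printf '%s\n' "$sample" | grep -oE '[[:alpha:]]+'

# Stage 2: sorted, so identical words end up on adjacent lines.
# prints a, b, b (one per line)
printf '%s\n' "$sample" | grep -oE '[[:alpha:]]+' | sort

# Stage 3: each distinct word prefixed with its count (left-padded).
# prints "1 a" then "2 b"
printf '%s\n' "$sample" | grep -oE '[[:alpha:]]+' | sort | uniq -c

# Stage 4: highest count first.
# prints "2 b" then "1 a"
printf '%s\n' "$sample" | grep -oE '[[:alpha:]]+' | sort | uniq -c | sort -nr
```

The amount of padding in front of each count differs between uniq implementations, so scripts that parse this output should split on whitespace rather than rely on fixed columns.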
Handling mixed case
Consider this mixed-case test file:
$ cat Text.txt
This file has "many" words, some
with punctuation. Some repeat,
many do not.
If we want to count some and Some as the same:
$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
2 some
2 many
1 words
1 with
1 This
1 repeat
1 punctuation
1 not
1 has
1 file
1 do
Here, we added the -f option to sort and the -i option to uniq so that both ignore case.
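With sort -f and uniq -i, the spelling reported for each word is whichever variant happens to come first in the sorted input. If it is acceptable to report every word in lowercase, one alternative is to fold case up front with tr and leave sort and uniq unchanged. The following is a minimal sketch of that variant, not the approach used above; the function name wordfreq_ci is made up for illustration.

```shell
# Hypothetical helper: case-insensitive word counts, reported in lowercase.
# Reads the files given as arguments, or standard input if there are none.
wordfreq_ci() {
    cat -- "$@" |
        grep -oE '[[:alpha:]]+' |
        tr '[:upper:]' '[:lower:]' |   # fold case before sorting
        sort | uniq -c | sort -nr
}

# Example on inline text: "Some" and "some" are counted as one word.
printf 'Some words, some not.\n' | wordfreq_ci
```

With the file from this section, the call would be wordfreq_ci Text.txt.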
Excluding stop words
Suppose that we want to exclude these stop words from the count:
$ cat stopwords
with
not
has
do
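So, we add grep -vwFf stopwords to eliminate these words. Here, -f stopwords reads the patterns from the file, -F treats them as fixed strings rather than regular expressions, -w matches whole words only, and -v keeps only the lines that do not match: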
$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
2 some
2 many
1 words
1 This
1 repeat
1 punctuation
1 file
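Note that the stop-word filter as written is case-sensitive: a capitalized With in the input would not be matched by the with entry in the list. If stop words should be dropped regardless of case, grep's -i option can be added to the filter. A minimal sketch, assuming a stop-word file with one word per line (the function name drop_stopwords is made up for illustration):

```shell
# Hypothetical helper: drop stop words listed one per line in file $1,
# ignoring case. Reads candidate words (one per line) on standard input.
drop_stopwords() {
    grep -viwFf "$1"
}

# Example with a throwaway stop-word list.
stop=$(mktemp)
printf 'with\nnot\n' > "$stop"

# "With" and "not" are removed; only "words" is printed.
printf 'With\nnot\nwords\n' | drop_stopwords "$stop"

rm -f "$stop"
```

In the pipeline above, this would correspond to replacing grep -vwFf stopwords with grep -viwFf stopwords.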