find n most frequent words in a file with a stop words list from the command line

Question

I want to find the most frequent words in a text file, using a stop words list. I already have this code:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head  -10 > test.txt

from an old post but my file contains something like this:

240 
 21 ipsum
 20 Lorem
 11 Textes
 9 Blindtexte
 7 Text
 5 F
 5 Blindtext
 4 Texte
 4 Buchstaben

The first entry is just whitespace; in the text those are punctuation marks (such as periods). I don't want it counted, so what do I have to add?

John1024 · Accepted Answer · 2016-10-15T06:21:29.497

Consider this test file:

$ cat text.txt
this file has "many" words, some
with punctuation.  some repeat,
many do not.

To get a word count:

$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 this
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

How it works

grep -oE '[[:alpha:]]+' text.txt

This returns all words, minus any spaces or punctuation, with one word per line.

sort

This sorts the words into alphabetical order.

uniq -c

This counts the number of times each word occurs. (For uniq to work, its input must be sorted.)

sort -nr

This sorts the output numerically so that the most frequent word is at the top.
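
To see what each stage contributes, the pipeline can be run one stage at a time. The following is only a sketch: it recreates the sample text in a temporary file, and the `sample` variable is an illustrative name, not part of the answer.

```shell
# Recreate the sample text in a temporary file (illustrative only).
sample=$(mktemp)
printf '%s\n' 'this file has "many" words, some' \
              'with punctuation.  some repeat,' \
              'many do not.' > "$sample"

echo '--- after grep -o: one word per line'
grep -oE '[[:alpha:]]+' "$sample"

echo '--- after sort: identical words are adjacent'
grep -oE '[[:alpha:]]+' "$sample" | sort

echo '--- after uniq -c: one line per distinct word, prefixed by its count'
grep -oE '[[:alpha:]]+' "$sample" | sort | uniq -c
```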

Handling mixed case

Consider this mixed-case test file:

$ cat Text.txt
This file has "many" words, some
with punctuation.  Some repeat,
many do not.

If we want to count some and Some as the same:

$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 This
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

Here, we added the -f option to sort so that it would ignore case and the -i option to uniq so that it also would ignore case.
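
With sort -f and uniq -i, the spelling that gets printed for a word is whichever variant happens to come first after sorting. An alternative, offered here as an assumption rather than as part of the answer, is to fold everything to lower case before counting, so that plain sort and uniq -c are enough and the output is uniformly lower case. The `sample` temp file is illustrative.

```shell
# Recreate the mixed-case sample in a temporary file (illustrative only).
sample=$(mktemp)
printf '%s\n' 'This file has "many" words, some' \
              'with punctuation.  Some repeat,' \
              'many do not.' > "$sample"

# Fold to lower case first, then count as in the case-sensitive pipeline.
grep -oE '[[:alpha:]]+' "$sample" |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -nr
```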

Excluding stop words

Suppose that we want to exclude these stop words from the count:

$ cat stopwords 
with
not
has
do

So, we add grep -v to eliminate these words:

$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 This
      1 repeat
      1 punctuation
      1 file
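
Two things are still missing for the "n most frequent" part of the question: a limit on the number of lines, and case-insensitive stop word matching (grep -vwFf as used above is case-sensitive, so a capitalized "With" in the text would not be removed by the stop word "with"). A minimal sketch that combines both, assuming one stop word per line; the function name `top_words` and its argument order are hypothetical:

```shell
# top_words FILE STOPWORDS N   (hypothetical helper, not from the answer)
# Prints the N most frequent words in FILE, ignoring case and skipping
# every word listed (one per line) in STOPWORDS.
top_words() {
  grep -oE '[[:alpha:]]+' "$1" |      # one word per line
    tr '[:upper:]' '[:lower:]' |      # fold case so "Some" and "some" merge
    grep -vixFf "$2" |                # drop lines that equal a stop word, any case
    sort | uniq -c | sort -nr |       # count and order by frequency
    head -n "$3"                      # keep only the top N
}
```

For example, `top_words Text.txt stopwords 3` prints at most three lines. The -x option is used instead of -w because every input line is already exactly one word.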

score 1 · Answer 2 · answered Oct 15 '16 at 11:35

Command:

cat text.txt | tr ' ' '\n' | grep -v 'word1\|word2' | sort | uniq -c | sort -nk1

How this works

The file content is as follows:

$ cat file.txt

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

$ cat file.txt | tr ' ' '\n' | grep -v -w 'an\|a\|is' | sort | uniq -c | sort -nk1 | tail
      1 unknown
      1 when
      2 and
      2 dummy
      2 Ipsum
      2 Lorem
      2 of
      2 text
      2 type
      3 the

Description: Translate each space into a newline, exclude the words on the list, then sort and count the remaining words so that the most frequently used ones come last.
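
Splitting only on spaces leaves punctuation attached to words, so "text" and "text." would be counted separately. A sketch of a variant that splits on every run of non-letters instead: tr -c complements the set, and -s squeezes each run of replacement newlines into one, which also avoids counting empty lines. The shortened sample text and the `sample` temp file are illustrative, not part of the answer.

```shell
# Recreate a short sample in a temporary file (illustrative only).
sample=$(mktemp)
printf '%s\n' 'Lorem Ipsum is simply dummy text. Lorem Ipsum has been' \
              "the industry's standard dummy text." > "$sample"

# Turn every run of non-letters into a single newline, then filter and count.
tr -cs '[:alpha:]' '[\n*]' < "$sample" |
  grep -v -w 'an\|a\|is' |
  sort | uniq -c | sort -nk1 | tail
```

Note that an apostrophe also counts as a non-letter here, so "industry's" is split into "industry" and "s".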
