
I have a simple problem, but unfortunately I don't know where to even start (I'm just starting out). What I ultimately want to do is increase my vocabulary. My idea is to strip the most commonly used words out of news articles. I found a list of the 5,000 most commonly used words and saved it. Once those common words are stripped out, I can create a corpus in TextSTAT, do a word frequency count, and choose which words I want to learn from there. But how do I remove the words in my common-words list from the articles I'll be saving?

  • Missing information: what is the source, and how is it formatted? One word per line? – Gilles Quénot Nov 08 '13 at 20:23
  • It could be done, but this kind of problem requires too much manipulation to be solved comfortably using shell scripts, especially for a beginner. Try a programming language such as Python, Ruby, or Perl. – 200_success Nov 08 '13 at 20:33

1 Answer


Assuming you've got files named "news.articles1", "news.articles2", etc., and your commonly used words in a file named "stop.words":

cat news.articles* | tr -s '[:blank:]' '[\n*]' |
tr '[:upper:]' '[:lower:]' | fgrep -v -f stop.words 

The output of that pipeline ought to contain none of your commonly-used words. You may need to remove all punctuation with an additional step in the pipeline, like:

tr -d '[:punct:]'

A good English-language version of "stop.words" is often in /usr/share/groff/<version>/eign.
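
If you'd rather go straight to the frequency count mentioned in the question instead of importing the result into TextSTAT, a rough sketch of the whole pipeline follows, using the same assumed file names as above ("news.articles*" and "stop.words"; the output name "word.freq" is just an example). The -w is included so that stop words only match whole words, for the reason given in the comment below:

# one word per line, lower-cased, punctuation stripped, stop words removed,
# then count the remaining words, most frequent first
cat news.articles* | tr -s '[:blank:]' '[\n*]' |
tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' |
fgrep -v -w -f stop.words |
sort | uniq -c | sort -rn > word.freq

If you don't have your own list yet, pointing -f at the groff eign file mentioned above should work the same way.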

  • I needed to add the -w option to the fgrep command in order to get it to work. Here's the full command (I also piped in the remove punctuation command): cat news.articles* | tr -s '[:blank:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | fgrep -v -w -f stop.words – skathan Mar 23 '15 at 13:57
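
For what it's worth, the -w matters because without it fgrep treats each stop word as a substring, so a short entry like "a" would knock out every word containing that letter. A toy demonstration (the printf input is made up just to show the difference):

printf 'cat\ndog\n' | fgrep -v a       # prints only "dog"; "a" matches inside "cat"
printf 'cat\ndog\n' | fgrep -v -w a    # prints both; "a" never matches a whole word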