I have a simple problem, but unfortunately I don't know where to even start (I'm just starting out). What I ultimately want to do is increase my vocabulary, and my idea is to strip the most commonly used words out of news articles. I found a list of the 5,000 most commonly used words and saved it. Once those common words are stripped out, I can create a corpus in TextSTAT, do a word frequency count, and choose which words I want to learn from what remains. But how do I remove the words in my most-commonly-used-words list from the articles I'll be saving?
- Missing information: What is the source, and how is it designed? One word per line? – Gilles Quénot Nov 08 '13 at 20:23
- It could be done, but this kind of problem requires too much manipulation to be solved comfortably using shell scripts, especially for a beginner. Try a programming language such as Python, Ruby, or Perl. – 200_success Nov 08 '13 at 20:33
1 Answer
Assuming you've got files named "news.articles1", "news.articles2", etc., and you've got your commonly used words in a file named "stop.words":
cat news.articles* | tr -s '[:blank:]' '[\n*]' |
tr '[:upper:]' '[:lower:]' | fgrep -v -f stop.words
The output of that pipeline ought to contain none of your commonly-used words. You may need to remove all punctuation with an additional step in the pipeline, like:
tr -d '[:punct:]'
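For example (a sketch, assuming the same news.articles* and stop.words names as above), the punctuation step slots in after the lowercasing and before the fgrep, so the words being compared against stop.words are already bare:

cat news.articles* | tr -s '[:blank:]' '[\n*]' |
    tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' |
    fgrep -v -f stop.words > remaining.words   # "remaining.words" is just an example output name

The redirected file could then be loaded into TextSTAT for the frequency count.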
A good English-language version of "stop.words" is often found in /usr/share/groff/<version>/eign.
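If that file is present on your system, you could point fgrep at it instead of (or in addition to) your own list. A sketch only; the exact version directory varies between systems, so check the path first:

ls /usr/share/groff/*/eign                   # confirm where the word list lives on your system
cat news.articles* | tr -s '[:blank:]' '[\n*]' |
    tr '[:upper:]' '[:lower:]' |
    fgrep -v -f /usr/share/groff/1.22.4/eign   # example path; substitute your groff version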

Stéphane Chazelas
- I needed to add the -w option to the fgrep command in order to get it to work. Here's the full command (I also piped in the punctuation-removal step): cat news.articles* | tr -s '[:blank:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | fgrep -v -w -f stop.words – skathan Mar 23 '15 at 13:57