
I want to find, say, the 10 most common words in a text file. First, the solution should be optimized for keystrokes (in other words, my time). Second, for performance. Here is what I have so far to get the top 10:

cat test.txt | tr -c '[:alnum:]' '[\n*]' | uniq -c | sort -nr | head  -10
  6 k
  2 g
  2 e
  2 a
  1 r
  1 k22
  1 k
  1 f
  1 eeeeeeeeeeeeeeeeeeeee
  1 d

I could write a Java, Python, etc. program where I store (word, numberOfOccurrences) in a dictionary and sort by the value, or I could use MapReduce, but I'm optimizing for keystrokes.

Are there any false positives? Is there a better way?

6 Answers


That's pretty much the most common way of finding "N most common things", except you're missing a sort, and you've got a gratuitous cat:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -ci | sort -nr | head  -10

If you don't put in a sort before the uniq -ci you'll probably get a lot of false singleton words. uniq only does unique runs of lines, not overall uniqueness.
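
To illustrate (a minimal Python sketch of my own, with a made-up word list, not part of the answer): itertools.groupby behaves like uniq in that it only merges consecutive equal items, so counting without sorting first splits the same word across several runs.

import itertools

words = ["the", "cat", "the", "dog", "the"]

# Without sorting, "the" appears in three separate runs, each counted as 1:
print([(w, len(list(g))) for w, g in itertools.groupby(words)])
# [('the', 1), ('cat', 1), ('the', 1), ('dog', 1), ('the', 1)]

# Sorting first merges the runs, which is what sort | uniq -c relies on:
print([(w, len(list(g))) for w, g in itertools.groupby(sorted(words))])
# [('cat', 1), ('dog', 1), ('the', 3)]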

You might want to use a trick, "stop words". If you're looking at English text (sorry, monolingual North American here), words like "of", "and", "the" almost always take the top two or three places. You probably want to eliminate them. The GNU Groff distribution has a file named eign in it which contains a pretty decent list of stop words. My Arch distro has /usr/share/groff/current/eign, but I think I've also seen /usr/share/dict/eign or /usr/dict/eign in old Unixes.

You can use stop words like this:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -i -f /usr/share/groff/current/eign |
sort | uniq -ci | sort -nr | head  -10

My guess is that most human languages need similar "stop words" removed from meaningful word-frequency counts, but I don't know where to suggest finding stop-word lists for other languages.

The -w flag on fgrep enables whole-word matching. This avoids false positives on words that merely contain short stop words, like "a" or "i". The -i flag on uniq and fgrep ignores case while comparing words.

Stephen Kitt
  • Does cat add some significant performance overhead? I like the pipe syntax. What does the * in '[\n*]' do? – Lukasz Madon Jun 24 '12 at 01:04
  • If you like the "cat test.txt", then by all means use it. I've read an article someplace where Dennis Ritchie says that the "cat something | somethingelse" syntax is more widely used, and that the '< something' syntax was something of a mistake, since it's single-purpose. –  Jun 24 '12 at 22:25
  • What if I want to find the most common directory name in a find output? That is, split words on / instead of whitespace characters and similar. – erb Feb 15 '16 at 13:03
  • @erb - you would probably do something like: find somewhere options | tr '/' '\n' | sort | uniq -c | sort -k1.1nr | head -10 –  Feb 15 '16 at 15:03
  • @BruceEdiger That did it, thanks! Now, I have an additional question which proved more difficult: How do I list the most common directory paths in a find output? (the directory path with the most files in them) Or alternatively: how do I list the most common string beginnings, where only string beginnings ending with / are valid (in order to prevent [./d, ./di, ./dir, ./dir/]). – erb Feb 15 '16 at 15:16
  • @erb - ask that as a question, not in a comment. You will have more room to frame your question, so as to get the answer you need. Give example input, and desired output. You might get some reputation points for asking a good question, and I will get points for giving a better answer than I can in a comment. –  Feb 15 '16 at 17:39
  • @BruceEdiger I solved it another way: https://github.com/ErikBjare/filelists/blob/master/common_paths.py Thanks for the help none the less. – erb Feb 15 '16 at 21:46
  • You could use grep: cat test.txt | grep -o -E '\w+' | sort | uniq -c | sort -nr | head -10 – jrc Aug 19 '17 at 22:13
  • How can I skip words containing 3 chars or less? – morpheus Apr 01 '20 at 22:13
    @morpheus - by inserting an extra step in the pipeline after the fgrep entry, something like: grep '^.....*$' | (that regex keeps only words of at least four characters) –  Apr 02 '20 at 13:42
  • @BruceEdiger thanks! that works! you the man! – morpheus Apr 02 '20 at 14:59
  • On a mac, this will be in a similar directory: find /usr/share/groff/ -name 'eign' – Jake Ireland Sep 21 '20 at 01:14

This works better with UTF-8:

$ sed -e 's/\s/\n/g' < test.txt | sort | uniq -c | sort -nr | head  -10

Let's use AWK!

This function lists the frequency of each word occurring in the provided file, in descending order:

function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

and for the top 10 words:

$ cat your_file.txt | wordfrequency | head -10

Source: AWK-ward Ruby

Sheharyar

This is a classic problem that got some resonance in 1986, when Donald Knuth implemented a fast solution with hash tries in an eight-page program to illustrate his literate programming technique, while Doug McIlroy, the godfather of Unix pipes, responded with a one-liner that was not as quick, but got the job done:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q

Of course, McIlroy's solution has time complexity O(N log N), where N is the total number of words. There are much faster solutions. For example:

Here is a C++ implementation with an upper-bound time complexity of O((N + k) log k), which is typically nearly linear.

Below is a fast Python implementation using a hash dictionary and a heap, with time complexity O(N + k log Q), where Q is the number of unique words:

import collections, re, sys

filename = sys.argv[1]
k = int(sys.argv[2]) if len(sys.argv)>2 else 10

text = open(filename).read()
# Count runs of lowercase letters as words; most_common(k) uses a heap to pick the top k.
counts = collections.Counter(re.findall('[a-z]+', text.lower()))
for word, count in counts.most_common(k):
    print(word, count)

Here is an extremely fast solution in Rust by Anders Kaseorg.

CPU time comparison (in seconds):

                                     bible32       bible256
Rust (prefix tree)                   0.632         5.284
C++ (prefix tree + heap)             4.838         38.587
Python (Counter)                     9.851         100.487
Sheharyar (AWK + sort)               30.071        251.301
McIlroy (tr + sort + uniq)           60.251        690.906

Notes:

  • bible32 is the Bible concatenated with itself 32 times (135 MB); bible256 is the same repeated 256 times (1.1 GB).
  • The Python script's non-linear slowdown is caused purely by the fact that it reads the whole file into memory, so its overhead grows for huge files.
  • If there were a Unix tool that could construct a heap and pick n elements from its top, the AWK solution could achieve near-linear time complexity, whereas it is currently O(N + Q log Q); a sketch of such a tool follows.
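
As a rough sketch of what such a tool might look like (my illustration, not part of the benchmark; the name topn.py is hypothetical), a few lines of Python with heapq can stand in for the final sort -rn stage and keep only the n largest counts:

import heapq, sys

# Keep the n "count word" lines with the largest leading count,
# in O(Q log n) time instead of the O(Q log Q) of a full sort.
n = int(sys.argv[1]) if len(sys.argv) > 1 else 10
top = heapq.nlargest(n, sys.stdin, key=lambda line: int(line.split()[0]))
sys.stdout.writelines(top)

For example, the awk part of the wordfrequency function above, piped into python3 topn.py 10, would pick the top 10 without sorting every unique word.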

Let's use Haskell!

This is turning into a language war, isn't it?

import Data.List
import Data.Ord

main = interact $ (=<<) (\x -> show (length x) ++ " - " ++ head x ++ "\n")  -- format each group as "count - word"
                . sortBy (flip $ comparing length)                          -- largest groups (most frequent words) first
                . group . sort                                              -- collect identical words into runs
                . words                                                     -- split the input into words

Usage:

cat input | wordfreq

Alternatively:

cat input | wordfreq | head -10
BlackCap
  • a modified version ignoring case: https://pastebin.com/57T5B6BY – Axel Latvala Oct 24 '17 at 06:08
  • Works much slower than classic sort | uniq -c | sort -nr. – Andriy Makukha Jul 09 '19 at 09:43
  • @AndriyMakukha The bottleneck is that strings are linked lists of characters in Haskell. We could get C-like speeds by switching to Text or ByteString instead, which is as simple as importing it qualified and prefixing the functions with the qualifier. – BlackCap Jul 11 '19 at 23:31
  • https://pastebin.com/QtJjQwT9 significantly faster version, written for readability – BlackCap Jul 12 '19 at 00:55

Something like this should work using Python, which is commonly available:

cat slowest-names.log | python -c 'import collections, sys; print collections.Counter(sys.stdin);'

This assumes one word per line. If there are more, splitting should be easy as well.
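
For instance, one way (my variant, not from the answer) to split on any whitespace and keep only the top 10, using python3:

cat slowest-names.log | python3 -c 'import collections, sys; print(*collections.Counter(sys.stdin.read().split()).most_common(10), sep="\n")'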

  • python3 and nicer output cat README.md | python -c 'import collections, sys, pprint; pprint.pprint(collections.Counter(sys.stdin));' – Lukasz Madon Aug 27 '19 at 12:23