2

How can I find the count of every word in a file?

I want a histogram of the words in a text pipe or document. Newlines and empty lines will exist in the document. I have stripped everything except [a-zA-Z].

> cat doc.txt
word second third

word really
> cat doc.txt | ... # then count occurrences of each word
                    # and print in descending order separated by delimiter
word 2
really 1
second 1
third 1

It needs to be reasonably efficient, as the file is 1 GB of text, so an approach with exponential running time will not work.

user14492
  • 853
  • Does https://unix.stackexchange.com/q/398413/117549 help? – Jeff Schaller Aug 12 '20 at 14:55
  • Do you really want a fish-specific answer? I'd use perl – glenn jackman Aug 12 '20 at 14:58
  • @glennjackman perl is perfectly fine. The fish tag is there to make sure the answer works in fish and isn't just a bash/zsh-specific answer, as those aren't that useful for me – user14492 Aug 12 '20 at 15:26
  • Is Unix-system one word or two? What about GNU/Linux? – Kusalananda Aug 12 '20 at 16:42
  • @Kusalananda like I said, only [a-zA-Z], so a hyphen cannot exist; only lowercase and uppercase letters :) What about GNU/Linux? I chose to tag macOS because I want people to assume the macOS flavor of tools and to expect only the default macOS tools to exist (installing a separate tool is overkill imo). – user14492 Aug 12 '20 at 20:22
  • @user14492 Ah, I missed the bit where you deleted all non-letters. By GNU/Linux I meant to ask whether it was to be counted as one or two words, but by deleting the non-letter / it's clear that it's a single word. – Kusalananda Aug 12 '20 at 21:21

2 Answers

7

Try this:

grep -o '\w*' doc.txt | sort | uniq -c | sort -nr
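  • -o prints each match on its own line instead of the whole matching line.
  • \w* matches runs of word characters (letters, digits, and underscore).
  • sort sorts the matches so that identical words are adjacent, which uniq requires.
  • uniq -c prints each unique line prefixed with its number of occurrences.
  • sort -nr sorts numerically by that count, in descending order.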

Output:

  2 word
  1 third
  1 second
  1 really
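The steps above can also be sketched with tr doing the word splitting, which avoids relying on \w support in grep. This is only a sketch: word_hist is a hypothetical helper name, the wrapper is POSIX sh function syntax (in fish, run the pipeline inside it directly), and the sample input is the one from the question.

```shell
# word_hist is a hypothetical helper name; it reads text on stdin.
word_hist() {
  tr -cs 'A-Za-z' '\n' |               # turn every run of non-letters into one newline
    sort |                             # bring identical words next to each other
    uniq -c |                          # prefix each distinct word with its count
    awk 'NF == 2 { print $2, $1 }' |   # emit "word count"; skip a possible empty first line
    sort -k2,2nr -k1,1                 # count descending, then word ascending
}

# Sample input from the question
printf 'word second third\n\nword really\n' | word_hist
```

The final sort uses an explicit numeric key on the count and a second key on the word, so ties come out in alphabetical order. On large inputs, running the sort calls with LC_ALL=C often speeds them up.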

Alternative:

Use awk for the exact output:

$ grep -o '\w*' doc.txt \
| awk '{seen[$0]++} END{for(s in seen){print s,seen[s]}}' \
| sort -k2,2nr

word 2
really 1
second 1
third 1
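Since the question says the input contains only letters and whitespace, the grep step may not be needed at all: awk can count its own whitespace-separated fields. A minimal sketch under that assumption (word_hist_awk is a hypothetical helper name, written as a POSIX sh function):

```shell
# word_hist_awk is a hypothetical helper name; it reads text on stdin.
# Assumes the input holds only letters and whitespace, as stated in the question.
word_hist_awk() {
  awk '
    { for (i = 1; i <= NF; i++) seen[$i]++ }   # count every whitespace-separated word
    END { for (w in seen) print w, seen[w] }   # one "word count" line per distinct word
  ' | sort -k2,2nr -k1,1                       # count descending, then word ascending
}

# Sample input from the question
printf 'word second third\n\nword really\n' | word_hist_awk
```

Here awk keeps one array entry per distinct word, and only those distinct words are passed to sort, rather than every occurrence in the file.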

pLumo
  • 22,565
  • With GNU awk, don't need external sort: END {PROCINFO["sorted_in"] = "@val_num_desc"; for (s in seen) print s, seen[s]} – glenn jackman Aug 12 '20 at 16:35
0
perl -lnE '
  $count{$_}++ for /[[:alpha:]]+/g;
  END {
    say "@$_" for
      sort {$b->[1] <=> $a->[1] || $a->[0] cmp $b->[0]}
      map {[$_, $count{$_}]}
      keys %count
  }
' doc.txt

This will consume lots more memory than pLumo's initial solution.

glenn jackman
  • 85,964