15

I need to find the 10 most frequent words in a .csv file. The file is structured so that each line contains comma-separated words. If the same word is repeated more than once in the same line, it should be counted as one. So, in the example below:

green,blue,blue,yellow,red,yellow
red,blue,green,green,green,brown

green, blue and red should be counted as 2 and yellow and brown as 1

I know similar questions have been asked before, and one solution was:

<file.csv tr -c '[:alnum:]' '[\n*]' | sort|uniq -c|sort -nr|head  -10

But this counts the number of times a word appears anywhere in the file, including repeats within the same line, like this:

  4 green
  3 blue
  2 yellow
  2 red
  1 brown

and this is not actually what I need. Any help? I would also appreciate a short explanation of the command, and of why the command I found in similar questions does not do what I need.

ginopino
  • 380
  • If a word is repeated within a line, are the duplicated words always next to another copy of the same word, or may a line read red,blue,green,brown,green,green? – Kusalananda Jun 01 '20 at 21:20
  • no, the order is arbitrary, so they may or may not be next to each other. – ginopino Jun 01 '20 at 21:22

8 Answers

16

With GNU grep or compatible:

$ grep -nEo '\w+' file.csv|sort -u|cut -d: -f2-|sort|uniq -c|sort -k1rn|head
      2 blue
      2 green
      2 red
      1 brown
      1 yellow
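
In case it is not obvious how this works: grep -n prefixes every match with its input line number, so sort -u can drop repeated occurrences of the same word on the same line; cut -d: -f2- then strips the line numbers before the usual counting. On the sample file, the first two stages would produce something like this (assuming GNU grep):

$ grep -nEo '\w+' file.csv
1:green
1:blue
1:blue
1:yellow
1:red
1:yellow
2:red
2:blue
2:green
2:green
2:green
2:brown
$ grep -nEo '\w+' file.csv | sort -u
1:blue
1:green
1:red
1:yellow
2:blue
2:brown
2:green
2:red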
  • 1
    I wouldn't have thought to combine grep's -o and -n like that to identify which come from the same line. That's brilliantly simple. – JoL Jun 03 '20 at 00:51
11

I would probably reach for perl

  • Use uniq from the List::Util module to de-duplicate each row.
  • Use a hash to count the resulting occurrences.

For example (the }{ sequence closes the implicit -n read loop and opens a new block, so the code after it runs once after all input has been read, much like an END block)

perl -MList::Util=uniq -F, -lnE '
  map { $h{$_}++ } uniq @F 
  }{ 
  foreach $k (sort { $h{$b} <=> $h{$a} } keys %h) {say "$h{$k}: $k"}
' file.csv
2: red
2: green
2: blue
1: yellow
1: brown

If you have no option except the sort and uniq coreutils, then you can implement a similar algorithm with the addition of a shell loop

while IFS=, read -a words; do 
  printf '%s\n' "${words[@]}" | sort -u
done < file.csv | sort | uniq -c | sort -rn
  2 red
  2 green
  2 blue
  1 yellow
  1 brown

However, please refer to Why is using a shell loop to process text considered bad practice?

steeldriver
  • 81,074
  • 1
    The shell solution implies calling sort -u once for each line in the file. For short files that may be OK, but for longer files it is very slow. –  Jun 02 '20 at 13:00
9

You can use awk with an associative array and a simple logic check.

awk -F, '
  {split("", c); for (i=1; i<=NF; i++) 
      if (!c[$i]){c[$i]++;wds[$i]++}}
  END{for (wd in wds) print wds[wd], wd}' file

Output

1 brown
2 red
1 yellow
2 blue
2 green

Walkthrough

Set the field separator to ,

awk -F, '

You are going to use c to check whether a word has already occurred on the current line, so make sure that count is cleared at the beginning of each line with delete c or split("", c), and then iterate over the fields

      {split("", c); for (i=1; i<=NF; i++) 

or

      {delete c; for (i=1; i<=NF; i++) 

If you have not yet seen the word $i on this line (!c[$i]), increment the per-line counter for that word (c[$i]++, to 1, so the condition fails if it occurs again on the same line) and also increment the overall count for that word (wds[$i]++)

      if (!c[$i]){c[$i]++;wds[$i]++}}

When the file is done, just iterate over the wds array and print the count wds[wd] and the word wd

      END{for (wd in wds) print wds[wd], wd}' file
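
Since the END loop prints the words in arbitrary order, to get the 10 most frequent as asked you would pipe the output through the same tail end as the command in the question, for example:

awk -F, '
  {split("", c); for (i=1; i<=NF; i++) 
      if (!c[$i]){c[$i]++;wds[$i]++}}
  END{for (wd in wds) print wds[wd], wd}' file | sort -rn | head -10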

just for fun

A hacky one with no awk associative array bits

awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file | 
    sort | uniq | awk '{print $2}'| sort | uniq -c | sort -nr

awk prints each field preceded by its line number, sort | uniq drops the within-line duplicates, the second awk drops the numbering, and then we revert to your original counting pipeline.
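
On the sample file, the first two stages would look roughly like this (the line number keeps identical words from different lines apart, so uniq only collapses the within-line repeats):

$ awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file
1 green
1 blue
1 blue
1 yellow
1 red
1 yellow
2 red
2 blue
2 green
2 green
2 green
2 brown
$ awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file | sort | uniq
1 blue
1 green
1 red
1 yellow
2 blue
2 brown
2 green
2 red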

bu5hman
  • 4,756
8

Using awk:

awk -F , '
        {
                delete seen
                for (i = 1; i <= NF; ++i) seen[$i]++ || ++count[$i]
        }
        END {
                for (word in count) printf "%d\t%s\n", count[word], word
        }' file |
sort -rn | head

The loop in the first block counts a word only if it hasn't been seen before on the current line: seen[$i]++ is zero (false) the first time a word occurs on a line, so the right-hand side ++count[$i] runs exactly once per word per line. The loop in the END block outputs counts and words.

For those who like "one-liners":

awk -F, -v OFS="\t" '{delete s;for(i=1;i<=NF;++i)s[$i]++||++c[$i]} END {for(w in c)print c[w],w}' file | sort -rn | head

A similar approach in the zsh shell:

IFS=,
declare -aU words
declare -A count
while read -r -A words; do
        for word in $words; do
                count[$word]=$(( $count[$word] + 1 ))
        done
done <file

printf '%2$d\t%1$s\n' ${(kv)count} |
sort -rn | head

This reads each comma-delimited line into an array, words, which is kept unique (only the first copy of each word is added to the array).

For each line read, the unique words are counted by incrementing the corresponding entry in the associative array count.

After reading all words, the accumulated words and their counts are outputted and sorted on the counts. The output is truncated at 10 lines using head.

The ${(kv)count} expansion will evaluate to a list of keys and values from the count associative array. These are used by printf to print the values and keys as newline-delimited pairs. The format string used with printf first picks out the value, then the key, but since these come in the wrong order from the ${(kv)count} expansion we use 2$ and 1$ to select arguments out of order.
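
As a standalone illustration (not part of the answer's pipeline), recycling a two-argument format over a flat list of words and counts should behave something like this in zsh:

$ printf '%2$d\t%1$s\n' blue 2 brown 1
2	blue
1	brown

Each pass over the format consumes two arguments; %2$d picks the second of the pair (the count) and %1$s the first (the word).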

Kusalananda
  • 333,661
5

Here is a script that does what is asked, mostly in awk:

awk -F, '
{ 
       i = split( "" , seen ) ;
       while( ++i <= NF ) if( ++seen[$i] == 1 ) count[$i]++; 
}END{
       for( word in count ) print count[word] , word
}'     file | sort -rn | head

It works by:

  • For each line on the input file:
  • re-initialize i to zero and clear the array seen for each new line i=split("",seen).
  • generate the array seen for each field ++seen[$i]
  • the first time (on this line) that a field is seen, count it (count[$i]++).
  • After all lines have been processed END,
  • for each word that has been counted for( word in count ),
  • print all words and their counts print count[word] , word.
  • Finally, after awk has generated its output, sort it numerically sort -rn
  • and select the first 10 lines head.

We can write it in a slightly more cryptic one-liner:

awk -F, '{i=split("",a);while(++i<=NF)a[$i]++||c[$i]++}END{for(i in c)print c[i],i}' file|sort -rn|head
2

I'll assume that the words don't contain any embedded commas and that they aren't quoted, or each word is consistently quoted (e.g. "foo",bar,"foo" is fine but "foo",bar,foo would consider "foo" and foo to be different words and "foo,bar",qux being two words is right out). Otherwise you need a proper CSV handling tool.

I'll also assume that empty “words” don't appear in the data or don't matter.

Under these assumptions, you can use sed to remove duplicate words on each line without too much pain. I do it this way: first add commas at the beginning and end of the line, and duplicate commas inside the line, so that each word is fully surrounded by commas on each side that “belong” to it. Then remove duplicate comma-surrounded words on the line. After that, you can use your planned strategy of converting commas to line breaks and counting duplicated lines.

sed -e 's/,/,,/g; s/^/,/; s/$/,/; :s; s/\(,[^,][^,]*,\)\(.*\)\1/\1\2/; t s' |
tr , '\n' |
sort | uniq -c |
sort -nr | head
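
As a rough illustration (assuming GNU sed), the first sample line would pass through the sed stage something like this:

green,blue,blue,yellow,red,yellow            (input)
,green,,blue,,blue,,yellow,,red,,yellow,     (after s/,/,,/g; s/^/,/; s/$/,/)
,green,,blue,,yellow,,red,                   (after the :s loop removes the later copies)

The tr then turns the commas into newlines, including a few empty lines, which is why empty “words” must not matter, as assumed above.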

2

I'd use Python for this because the code is easy to read and modify.

import collections
import operator
import sys

word_count = collections.Counter()

with open(sys.argv[1]) as f:
    for l in f:
        words = set(l.strip().split(','))
        word_count.update(words)

words_kv = word_count.items()
words_kv = sorted(words_kv, key = operator.itemgetter(0)) # First sort alphabetically.
words_kv = sorted(words_kv, key = operator.itemgetter(1), reverse = True) # Then sort by most common.

The end result is a list that is sorted first by most common, then alphabetically. This works because sorting is stable in Python, according to: https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts

for word, count in words_kv:
    print(count, word)

Output:

2 blue
2 green
2 red
1 brown
1 yellow
  • 1
    I think if we're importing collections in Python then Counter would also be a nice way to go: with open('file.csv') as f: from collections import Counter; c = Counter(); [c.update(set(l.strip().split(','))) for l in f] ; [print(f"{count} {color}") for color, count in c.most_common()] – user1717828 Jun 20 '20 at 04:07
  • @user1717828 that made a big improvement in code quality. Thank you for the suggestion. – Mr. Chem Question Jun 20 '20 at 21:15
-1

You can stick with your proposed solution with minor modifications to count line-by-line:

$ while read line; do echo $line | tr , '\n' | sort | uniq; done <file.csv | sort | uniq -c | sort -nr
      2 red
      2 green
      2 blue
      1 yellow
      1 brown

It's doing the same thing as the code you provided, except it does the uniq part one line at a time and then counts at the end.

user1717828
  • 3,542