15

I need to find the 10 most frequent words in a .csv file. The file is structured so that each line contains comma-separated words. If the same word is repeated more than once in the same line, it should be counted as one. So, in the example below:

green,blue,blue,yellow,red,yellow
red,blue,green,green,green,brown

green, blue and red should be counted as 2 and yellow and brown as 1

I know similar questions have been asked before, and one solution was:

<file.csv tr -c '[:alnum:]' '[\n*]' | sort|uniq -c|sort -nr|head  -10

But this counts the number of times a word appears anywhere in the file, including repeats within the same line, like this:

  4 green
  3 blue
  2 yellow
  2 red
  1 brown

and this is not actually what I need. Any help? I would also appreciate a short explanation of the command, and of why the command I found in similar questions does not do what I need.

ginopino
  • 380
  • If a word is repeated within a line, are the duplicated words always next to another copy of the same word, or may a line read red,blue,green,brown,green,green? – Kusalananda Jun 01 '20 at 21:20
  • no, the order is arbitrary, so they may or may not be next to each other. – ginopino Jun 01 '20 at 21:22

8 Answers

16

With GNU grep or compatible:

$ grep -nEo '\w+' file.csv|sort -u|cut -d: -f2-|sort|uniq -c|sort -k1rn|head
      2 blue
      2 green
      2 red
      1 brown
      1 yellow
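
In case it is not obvious how this works: grep -n prefixes every match with its input line number, so sort -u can drop repeated occurrences of the same word on the same line; cut -d: -f2- then strips the line numbers before the usual counting. On the sample file, the first two stages would produce something like this (assuming GNU grep):

$ grep -nEo '\w+' file.csv
1:green
1:blue
1:blue
1:yellow
1:red
1:yellow
2:red
2:blue
2:green
2:green
2:green
2:brown
$ grep -nEo '\w+' file.csv | sort -u
1:blue
1:green
1:red
1:yellow
2:blue
2:brown
2:green
2:red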
  • 1
    I wouldn't have thought to combine grep's -o and -n like that to identify which come from the same line. That's brilliantly simple. – JoL Jun 03 '20 at 00:51
11

I would probably reach for perl

  • Use uniq from the List::Util module to de-duplicate each row.
  • Use a hash to count the resulting occurrences.

For example (the }{ sequence closes the implicit -n read loop and opens a new block, so the code after it runs once after all input has been read, much like an END block)

perl -MList::Util=uniq -F, -lnE '
  map { $h{$_}++ } uniq @F 
  }{ 
  foreach $k (sort { $h{$b} <=> $h{$a} } keys %h) {say "$h{$k}: $k"}
' file.csv
2: red
2: green
2: blue
1: yellow
1: brown

If you have no option except the sort and uniq coreutils, then you can implement a similar algorithm with the addition of a shell loop

while IFS=, read -a words; do 
  printf '%s\n' "${words[@]}" | sort -u
done < file.csv | sort | uniq -c | sort -rn
  2 red
  2 green
  2 blue
  1 yellow
  1 brown

However, please refer to Why is using a shell loop to process text considered bad practice?

steeldriver
  • 81,074
  • 1
    The shell solution implies calling sort -u once for each line in the file. For short files that may be OK, but for longer files it is very slow. –  Jun 02 '20 at 13:00
9

You can use awk with an associative array and a simple logic check.

awk -F, '
  {split("", c); for (i=1; i<=NF; i++) 
      if (!c[$i]){c[$i]++;wds[$i]++}}
  END{for (wd in wds) print wds[wd], wd}' file

Output

1 brown
2 red
1 yellow
2 blue
2 green

Walkthrough

Set the field separator to ,

awk -F, '

You are going to use c to check whether a word has already occurred on the current line, so make sure that count is cleared at the beginning of each line with delete c or split("", c), and then iterate over the fields

      {split("", c); for (i=1; i<=NF; i++) 

or

      {delete c; for (i=1; i<=NF; i++) 

If you have not yet seen the word $i on this line (!c[$i]), increment the per-line counter for that word (c[$i]++, to 1, so the condition fails if it occurs again on the same line) and also increment the overall count for that word (wds[$i]++)

      if (!c[$i]){c[$i]++;wds[$i]++}}

When the file is done, just iterate over the wds array and print the count wds[wd] and the word wd

      END{for (wd in wds) print wds[wd], wd}' file
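
Since the END loop prints the words in arbitrary order, to get the 10 most frequent as asked you would pipe the output through the same tail end as the command in the question, for example:

awk -F, '
  {split("", c); for (i=1; i<=NF; i++) 
      if (!c[$i]){c[$i]++;wds[$i]++}}
  END{for (wd in wds) print wds[wd], wd}' file | sort -rn | head -10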

just for fun

A hacky one with no awk associative array bits

awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file | 
    sort | uniq | awk '{print $2}'| sort | uniq -c | sort -nr

awk prints each field preceded by its line number, sort | uniq drops the within-line duplicates, the second awk drops the numbering, and then we revert to your original counting pipeline.
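
On the sample file, the first two stages would look roughly like this (the line number keeps identical words from different lines apart, so uniq only collapses the within-line repeats):

$ awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file
1 green
1 blue
1 blue
1 yellow
1 red
1 yellow
2 red
2 blue
2 green
2 green
2 green
2 brown
$ awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file | sort | uniq
1 blue
1 green
1 red
1 yellow
2 blue
2 brown
2 green
2 red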

bu5hman
  • 4,756
8

Using awk:

awk -F , '
        {
                delete seen
                for (i = 1; i <= NF; ++i) seen[$i]++ || ++count[$i]
        }
        END {
                for (word in count) printf "%d\t%s\n", count[word], word
        }' file |
sort -rn | head

The loop in the first block counts a word only if it hasn't been seen before on the current line: seen[$i]++ is zero (false) the first time a word occurs on a line, so the right-hand side ++count[$i] runs exactly once per word per line. The loop in the END block outputs counts and words.

For those who like "one-liners":

awk -F, -v OFS="\t" '{delete s;for(i=1;i<=NF;++i)s[$i]++||++c[$i]} END {for(w in c)print c[w],w}' file | sort -rn | head

A similar approach in the zsh shell:

IFS=,
declare -aU words
declare -A count
while read -r -A words; do
        for word in $words; do
                count[$word]=$(( $count[$word] + 1 ))
        done
done <file

printf '%2$d\t%1$s\n' ${(kv)count} |
sort -rn | head

This reads each comma-delimited line into an array, words, which is kept unique (only the first copy of each word is added to the array).

For each line read, the unique words are counted by incrementing the corresponding entry in the associative array count.

After reading all words, the accumulated words and their counts are outputted and sorted on the counts. The output is truncated at 10 lines using head.

The ${(kv)count} expansion will evaluate to a list of keys and values from the count associative array. These are used by printf to print the values and keys as newline-delimited pairs. The format string used with printf first picks out the value, then the key, but since these come in the wrong order from the ${(kv)count} expansion we use 2$ and 1$ to select arguments out of order.
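
As a standalone illustration (not part of the answer's pipeline), recycling a two-argument format over a flat list of words and counts should behave something like this in zsh:

$ printf '%2$d\t%1$s\n' blue 2 brown 1
2	blue
1	brown

Each pass over the format consumes two arguments; %2$d picks the second of the pair (the count) and %1$s the first (the word).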

Kusalananda
  • 333,661
5

Here is a script that does what is asked, mostly in awk:

awk -F, '
{ 
       i = split( "" , seen ) ;
       while( ++i <= NF ) if( ++seen[$i] == 1 ) count[$i]++; 
}END{
       for( word in count ) print count[word] , word
}'     file | sort -rn | head

It works by:

  • For each line on the input file:
  • re-initialize i to zero and clear the array seen for each new line i=split("",seen).
  • generate the array seen for each field ++seen[$i]
  • the first time (on this line) that a field is seen, count it (count[$i]++).
  • After all lines have been processed END,
  • for each word that has been counted for( word in count ),
  • print all words and their counts print count[word] , word.
  • Finally, after awk has generated its output, sort it numerically sort -rn
  • and select the first 10 lines head.

We can write it in a slightly more cryptic one-liner:

awk -F, '{i=split("",a);while(++i<=NF)a[$i]++||c[$i]++}END{for(i in c)print c[i],i}' file|sort -rn|head
2

I'll assume that the words don't contain any embedded commas and that they aren't quoted, or each word is consistently quoted (e.g. "foo",bar,"foo" is fine but "foo",bar,foo would consider "foo" and foo to be different words and "foo,bar",qux being two words is right out). Otherwise you need a proper CSV handling tool.

I'll also assume that empty “words” don't appear in the data or don't matter.

Under these assumptions, you can use sed to remove duplicate words on each line without too much pain. I do it this way: first add commas at the beginning and end of the line, and duplicate commas inside the line, so that each word is fully surrounded by commas on each side that “belong” to it. Then remove duplicate comma-surrounded words on the line. After that, you can use your planned strategy of converting commas to line breaks and counting duplicated lines.

sed -e 's/,/,,/g; s/^/,/; s/$/,/; :s; s/\(,[^,][^,]*,\)\(.*\)\1/\1\2/; t s' |
tr , '\n' |
sort | uniq -c |
sort -nr | head
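
As a rough illustration (assuming GNU sed), the first sample line would pass through the sed stage something like this:

green,blue,blue,yellow,red,yellow            (input)
,green,,blue,,blue,,yellow,,red,,yellow,     (after s/,/,,/g; s/^/,/; s/$/,/)
,green,,blue,,yellow,,red,                   (after the :s loop removes the later copies)

The tr then turns the commas into newlines, including a few empty lines, which is why empty “words” must not matter, as assumed above.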

2

I'd use Python for this because the code is easy to read and modify.

import collections
import operator
import sys

word_count = collections.Counter()

with open(sys.argv[1]) as f:
    for l in f:
        words = set(l.strip().split(','))
        word_count.update(words)

words_kv = word_count.items()
words_kv = sorted(words_kv, key = operator.itemgetter(0)) # First sort alphabetically.
words_kv = sorted(words_kv, key = operator.itemgetter(1), reverse = True) # Then sort by most common.

The end result is a list that is sorted first by most common, then alphabetically. This works because sorting is stable in Python, according to: https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts

for word, count in words_kv:
    print(count, word)

Output:

2 blue
2 green
2 red
1 brown
1 yellow
  • 1
    I think if we're importing collections in Python then Counter would also be a nice way to go: with open('file.csv') as f: from collections import Counter; c = Counter(); [c.update(set(l.strip().split(','))) for l in f] ; [print(f"{count} {color}") for color, count in c.most_common()] – user1717828 Jun 20 '20 at 04:07
  • @user1717828 that made a big improvement in code quality. Thank you for the suggestion. – Mr. Chem Question Jun 20 '20 at 21:15
-1

You can stick with your proposed solution with minor modifications to count line-by-line:

$ while read line; do echo $line | tr , '\n' | sort | uniq; done <file.csv | sort | uniq -c | sort -nr
      2 red
      2 green
      2 blue
      1 yellow
      1 brown

It's doing the same thing as the code you provided, except it does the uniq part one line at a time and then counts at the end.

user1717828
  • 3,542