I have a large file consisting of a single column of city names. I want to write a program that counts how many times each city occurs and writes the result to a new file. How should I go about doing that?

This is an example; the real file has many more values.

If the file looks like this:

City           
Manhattan   
Cork       
Manhattan  
Chennai
Chennai

the output should look like this:

City
Manhattan 2 
Cork      1
Chennai   2
Jeff Schaller
Alex
  • Looks like homework. See uniq -c after a sort. – xenoid Aug 06 '19 at 14:47
  • See these: https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file/174421#174421 https://unix.stackexchange.com/questions/434860/trying-to-find-the-frequency-of-words-in-a-file-using-a-script https://unix.stackexchange.com/questions/322909/word-frequency-gawk-memory-leak https://unix.stackexchange.com/questions/62056/whats-the-easiest-way-to-make-a-list-of-most-common-words-in-a-list?noredirect=1&lq=1 https://unix.stackexchange.com/questions/80017/how-to-count-total-number-of-words-in-a-file – Jeff Schaller Aug 06 '19 at 14:57
  • your question is an easier version of the duplicate... – pLumo Aug 06 '19 at 14:57

2 Answers

Do it with datamash:

datamash -g1 -s -H count 1 < infile

Output:

GroupBy(City)   count(City)
Chennai 2
Cork    1
Manhattan   2
Thor

Use the standard commands sort and uniq:

sort DATAFILE | uniq --count

This will give you something like:

  2 Chennai
  1 City           
  1 Cork       
  2 Manhattan

Explanation: uniq normally collapses each run of repeated adjacent lines into a single line. With the option "--count" (short form -c) it also prefixes each line with the number of repetitions. Because uniq only compares adjacent lines, the input must be sorted first; otherwise duplicates would not be on consecutive lines. sort does exactly that: it orders the lines alphabetically.
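To see why the sort step matters, here is a small sketch that runs uniq -c on unsorted and then on sorted input. The file name demo_cities.txt is made up for the demonstration and is not part of the question.

```shell
# Hypothetical three-line sample with a non-adjacent duplicate.
printf '%s\n' Manhattan Cork Manhattan > demo_cities.txt

# Unsorted: the two Manhattan lines are not adjacent,
# so uniq reports three separate lines, each with a count of 1.
uniq -c demo_cities.txt

# Sorted: the duplicates become adjacent,
# so uniq reports Cork once and Manhattan with a count of 2.
sort demo_cities.txt | uniq -c
```

The exact width of the count column can differ between uniq implementations, so it is safer not to rely on the leading spaces.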

Do you need these in a different column order? Does the first line have to be ignored? If so, please also let us know whether city names may have more than one word in them.
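In case the answers are "yes", one possible sketch uses awk to keep the header on top, print the count after the city name, and treat the whole line as the name so multi-word cities work. The file names cities.txt and city_counts.txt are assumptions for illustration, and stripping trailing whitespace is my guess at what the sample data needs.

```shell
# Hypothetical sample input; replace cities.txt with the real file.
printf '%s\n' City Manhattan Cork Manhattan Chennai Chennai > cities.txt

awk '
    NR == 1 { print; next }                     # pass the header line through unchanged
    {
        sub(/[ \t]+$/, "")                      # strip trailing whitespace
        if ($0 == "") next                      # ignore blank lines
        if (!($0 in count)) order[++n] = $0     # remember order of first appearance
        count[$0]++
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i], count[order[i]]     # city name, then its count
    }
' cities.txt > city_counts.txt

cat city_counts.txt
```

Unlike sort | uniq, this keeps the cities in the order they first appear, which matches the desired output in the question, at the cost of holding one array entry per distinct city in memory.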

Ned64