
I have been asked to write a word frequency analysis program using Unix shell scripts, with the following requirements:

  • Input is a text file with one word per line
  • Input words are drawn from the Compact Oxford English Dictionary New Edition
  • Character encoding is UTF-8
  • Input file is 1 Pebibyte (PiB) in length
  • Output is of the format "Word occurred N times"

I am aware of one way to begin, as below: cat filename | xargs -n1 | sort | uniq -c > newfilename

What would be the optimal way to do this, considering performance?

  • Have you tried using the uniq and sort commands? Also add sample input/output (say 5-10 lines). The title and description don't match. – Sundeep Dec 29 '17 at 06:46
  • This really sounds like homework. The optimal solution is to do what you did. Ask someone else. The best thing is to learn how to use cat, xargs, uniq, and |. – bu5hman Dec 29 '17 at 06:48
  • If you look over to the right on this very page and scroll down a bit, you will find a header "Related." Click on the links there and read them. You will find what you need to do your homework. Yes, it will involve learning. – Wildcard Dec 29 '17 at 06:54
  • You cannot use sort with a petabyte of data. You have to split the input into data sets of a reasonable size (which you probably have to determine by trial and error) and add up the results. You need not write smaller files: you can use e.g. awk to create a new sort | uniq -c pipeline for each data set and save a lot of I/O that way (see the first sketch after these comments). You can save even more I/O by writing the partial results to tmpfs and adding them up there. – Hauke Laging Dec 31 '17 at 12:03
  • The COED is about 150K entries (according to Wikipedia), which is fairly easy to do with associative arrays; bash can likely do it. With 1 PiB your main problem is I/O: you have to read the file only once, which an associative array allows you to do (see the second sketch below). Of course, you can wonder where they will find the 1 PiB file, since this isn't a trivial size... – xenoid Dec 31 '17 at 14:50
  • @HaukeLaging Splitting the file triples the I/O (one read, one write, one read). Not something you want to do if you have 1 PiB of input. – xenoid Dec 31 '17 at 14:52
  • @xenoid And that's why I explained that splitting the data does not require reading or writing files. – Hauke Laging Dec 31 '17 at 15:35
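
A minimal sketch of the chunked approach described in the comments above, assuming GNU awk, an input file named words.txt (a placeholder), a chunk size of 100 million lines (tune to your memory), and /dev/shm as the tmpfs for partial results:

awk '
    NR % 100000000 == 1 {                # start a new chunk
        if (cmd) close(cmd)              # flush the previous sort | uniq -c pipeline
        cmd = "sort | uniq -c >> /dev/shm/partials.txt"
    }
    { print | cmd }                      # stream each line into the current chunk
    END { if (cmd) close(cmd) }
' words.txt

# add up the per-chunk counts (field 1 is the count, field 2 the word)
awk '{ total[$2] += $1 }
     END { for (w in total) print w, "occurred", total[w], "times" }
' /dev/shm/partials.txt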
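And a sketch of the single-pass associative-array approach, again with words.txt standing in for the real input: with only ~150K distinct dictionary words the array stays small no matter how large the file is, and the file is read exactly once.

awk '{ count[$0]++ }
     END { for (w in count) print w, "occurred", count[w], "times" }
' words.txt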

1 Answer


NOTICE:

This is a paid product, although it is open source, so you could install and run it yourself for free. You can also get a free trial to test it in our cloud if you like. I don't necessarily expect you to purchase an account, but if you have a need to process data in very large text files, Manta will do exactly that, beautifully.

Additionally, I work for Joyent, the company that sells the product, so take my opinion with a grain of salt, but I encourage you to try the product for yourself and let it prove itself.

Joyent's object store, Manta, is perfectly suited to working with large data inputs and running compute against them on our systems.

The uses for Manta are vast but I'll focus on your question specifically:

Running Compute on Data

Upload some datasets:

$ curl -sL http://www.gutenberg.org/ebooks/1661.txt.utf-8 | \
    mput -H 'content-type: text/plain' ~~/stor/books/sherlock_holmes.txt
$ curl -sL http://www.gutenberg.org/ebooks/76.txt.utf-8 | \
    mput -H 'content-type: text/plain' ~~/stor/books/huck_finn.txt
$ curl -sL http://www.gutenberg.org/ebooks/2701.txt.utf-8 | \
    mput -H 'content-type: text/plain' ~~/stor/books/moby_dick.txt
$ curl -sL http://www.gutenberg.org/ebooks/345.txt.utf-8 | \
    mput -H 'content-type: text/plain' ~~/stor/books/dracula.txt

Running jobs on your data

Here's an example job that counts the number of times the word "vampire" appears in Dracula.

$ echo ~~/stor/books/dracula.txt | mjob create -o -m "grep -ci vampire"
added 1 input to 7b39e12b-bb87-42a7-8c5f-deb9727fc362
32

This command creates a job to run the user script grep -ci vampire on each input object and then submits ~~/stor/books/dracula.txt as the only input to the job. The name of the job is (in this case) 7b39e12b-bb87-42a7-8c5f-deb9727fc362. When the job completes, the result is placed in an output object, which you can see with the mjob outputs command.
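
For example, a sketch of fetching that result (the job ID is the one printed above; the exact output object path will differ per job):

$ mjob outputs 7b39e12b-bb87-42a7-8c5f-deb9727fc362 | xargs mget
32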


You can use a similar invocation to run the same job on all of the objects under ~~/stor/books:

$ mfind -t o ~~/stor/books | mjob create -o -m "grep -ci human"
added 5 inputs to 69219541-fdab-441f-97f3-3317ef2c48c0
13
48
18
4
6

In this example, the system runs 5 invocations of grep. Each of these is called a task. Each task produces one output, and the job itself winds up with 5 separate outputs.


Map and Reduce Phases

We've just described the "map" phase of traditional map-reduce computations. The "map" phase performs the same computation on each of the input objects. The reduce phase typically combines the outputs from the map phase to produce a single output.


One of the earlier examples computed the number of times the word "human" appeared in each book. We can use a simple awk script in the reduce phase to get the total number of times "human" appears in all the books.

$ mfind -t o ~~/stor/books | \
        mjob create -o -m "grep -ci human" -r "awk '{s+=\$1} END{print s}'"
added 5 inputs to 12edb303-e481-4a39-b1c0-97d893ce0927
89

This job has two phases: the map phase runs grep -ci human on each input object, then the reduce phase runs the awk script on the concatenated output from the first phase. awk '{s+=$1} END {print s}' sums a list of numbers, so it sums the list of numbers that come out of the first phase. You can combine several map and reduce phases. The outputs of any non-final phases become inputs for the next phase, and the outputs of the final phase become job outputs.


I'm not quite sure exactly what you are looking for, but this is closer to the command in your question:

echo ~~/stor/books/dracula.txt | mjob create -o -m "cat" -r "tr -s '[:blank:]' '[\n*]'" -r "sort" -r "uniq -c" > ./tmp/test.txt

Output

   2559
      1 "'Are
      1 "'E's
      1 "'I
      1 "'Ittin'
      1 "'Little
      1 "'Lucy,
      1 "'Maybe
      1 "'Miss
      2 "'My
      1 "'Never
      1 "'No'
      1 "'Ow
      1 "'Silence!
      1 "'That's
      1 "'Tyke
      1 "'Wilhelmina'--I
      1 "'Yes,
      8 "A
      ...
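
The output above still counts raw tokens, punctuation and all. Since the input in your question is already one word per line, a sketch that produces the exact "Word occurred N times" format could count within each object in the map phase and merge in the reduce phase (untested; ~~/stor/words is a hypothetical directory holding your input):

$ mfind -t o ~~/stor/words | \
    mjob create -o -m "sort | uniq -c" \
        -r "awk '{ sum[\$2] += \$1 } END { for (w in sum) print w, \"occurred\", sum[w], \"times\" }'"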
– jesse_b