NOTICE:
This is a paid product, although it is open source, so you could install and run it yourself for free. You can also get a free trial to test it in our cloud if you like. I don't necessarily expect you to purchase an account, but if you have a need to process data in very large text files, Manta will do exactly that beautifully.
Additionally, I work for Joyent, the company that sells the product, so take my opinion with a grain of salt, but I encourage you to try the product for yourself and let it prove itself.
Joyent's object store, Manta, is perfectly suited to storing large data inputs and running compute against them on our systems.
The uses for Manta are vast, but I'll focus on your question specifically:
Upload some datasets:
$ curl -sL http://www.gutenberg.org/ebooks/1661.txt.utf-8 | \
mput -H 'content-type: text/plain' ~~/stor/books/sherlock_holmes.txt
$ curl -sL http://www.gutenberg.org/ebooks/76.txt.utf-8 | \
mput -H 'content-type: text/plain' ~~/stor/books/huck_finn.txt
$ curl -sL http://www.gutenberg.org/ebooks/2701.txt.utf-8 | \
mput -H 'content-type: text/plain' ~~/stor/books/moby_dick.txt
$ curl -sL http://www.gutenberg.org/ebooks/345.txt.utf-8 | \
mput -H 'content-type: text/plain' ~~/stor/books/dracula.txt
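If you want to confirm the uploads, mls lists the objects in a directory. The listing below is what I'd expect given the four uploads above:
$ mls ~~/stor/books
dracula.txt
huck_finn.txt
moby_dick.txt
sherlock_holmes.txt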
Running jobs on your data
Here's an example job that counts the number of times the word "vampire" appears in Dracula.
$ echo ~~/stor/books/dracula.txt | mjob create -o -m "grep -ci vampire"
added 1 input to 7b39e12b-bb87-42a7-8c5f-deb9727fc362
32
This command creates a job to run the user script grep -ci vampire on each input object, and then submits ~~/stor/books/dracula.txt as the only input to the job. The name of the job is (in this case) 7b39e12b-bb87-42a7-8c5f-deb9727fc362. When the job completes, the result is placed in an output object, which you can see with the mjob outputs command.
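A quick sketch of what that looks like (the output object path below is a placeholder; the real path includes your account name and the job id):
$ mjob outputs 7b39e12b-bb87-42a7-8c5f-deb9727fc362
~~/jobs/7b39e12b-bb87-42a7-8c5f-deb9727fc362/stor/...
$ mget $(mjob outputs 7b39e12b-bb87-42a7-8c5f-deb9727fc362)
32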
You can use a similar invocation to run the same job on all of the objects under ~~/stor/books:
$ mfind -t o ~~/stor/books | mjob create -o -m "grep -ci human"
added 5 inputs to 69219541-fdab-441f-97f3-3317ef2c48c0
13
48
18
4
6
In this example, the system runs 5 invocations of grep. Each of these invocations is called a task. Each task produces one output, and the job itself winds up with 5 separate outputs.
We've just described the "map" phase of traditional map-reduce computations. The "map" phase performs the same computation on each of the input objects. The reduce phase typically combines the outputs from the map phase to produce a single output.
One of the earlier examples computed the number of times the word "human" appeared in each book. We can use a simple awk script in the reduce phase to get the total number of times "human" appears in all the books.
$ mfind -t o ~~/stor/books | \
mjob create -o -m "grep -ci human" -r "awk '{s+=\$1} END{print s}'"
added 5 inputs to 12edb303-e481-4a39-b1c0-97d893ce0927
89
This job has two phases: the map phase runs grep -ci human on each input object, then the reduce phase runs the awk script on the concatenated output from the first phase. awk '{s+=$1} END{print s}' sums a list of numbers, so it sums the list of numbers that come out of the first phase. You can combine several map and reduce phases. The outputs of any non-final phases become inputs for the next phase, and the outputs of the final phase become job outputs.
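You can see what the reducer is doing by running the same awk one-liner locally on the five per-book counts the earlier job produced:
$ printf '13\n48\n18\n4\n6\n' | awk '{s+=$1} END{print s}'
89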
I'm not quite sure exactly what you are looking for, but this is closer to the command in your question:
$ echo ~~/stor/books/dracula.txt | mjob create -o -m "cat" -r "tr -s '[:blank:]' '[\n*]'" -r "sort" -r "uniq -c" > ./tmp/test.txt
Output
2559
1 "'Are
1 "'E's
1 "'I
1 "'Ittin'
1 "'Little
1 "'Lucy,
1 "'Maybe
1 "'Miss
2 "'My
1 "'Never
1 "'No'
1 "'Ow
1 "'Silence!
1 "'That's
1 "'Tyke
1 "'Wilhelmina'--I
1 "'Yes,
8 "A
...
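If you want the same word count across all of the books rather than just Dracula, the same pattern as the earlier jobs should work. This is a sketch, not something I've run: it moves the tr step into the map phase and assumes a single reduce phase running sort | uniq -c gives an equivalent result; the local output file name is just a placeholder.
$ mfind -t o ~~/stor/books | \
mjob create -o -m "tr -s '[:blank:]' '[\n*]'" -r "sort | uniq -c" > ./tmp/all_books.txt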