1

I have a file that contains all the search strings. I'm taking the strings from that file and grepping for them, one by one, in another file, and this is taking too long. How do I use GNU parallel to speed this up?

while read line; do
    line2=$(grep -w "$line" "$file2")

    if [[ ! -z $line2 ]]; then
        echo "$line: present" >> exclusion_list_$$.txt
        echo "$line2" >> exclusion_list_$$.txt
        echo "grep line $line2"
    fi
done < exclusion.txt

I was thinking of taking the body of the while loop, putting it in a function, and calling that function in parallel.
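Something like this is what I had in mind (untested sketch; check_line is just a name I picked, and I'm assuming GNU parallel can run an exported bash function):

check_line() {
    # grep for one search string and record it if present
    line2=$(grep -w "$1" "$file2")
    if [[ -n $line2 ]]; then
        echo "$1: present"
        echo "$line2"
    fi
}
export -f check_line
export file2

parallel -a exclusion.txt check_line {} >> exclusion_list_$$.txt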

I'm new to this, so please let me know if this is the right way to go about it, or if another approach would be more efficient.

amar2108
  • 145

2 Answers

1

It seems your problem is the one covered in https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions

EXAMPLE: Grepping n lines for m regular expressions.

The simplest solution to grep a big file for a lot of regexps is:

grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

grep -F -f regexps.txt bigfile

There are 3 limiting factors: CPU, RAM, and disk I/O.

RAM is easy to measure: If the grep process takes up most of your free memory (e.g. when running top), then RAM is a limiting factor.

CPU is also easy to measure: If the grep takes >90% CPU in top, then the CPU is a limiting factor, and parallelization will speed this up.

It is harder to see if disk I/O is the limiting factor, and depending on the disk system it may be faster or slower to parallelize. The only way to know for certain is to test and measure.

Limiting factor: RAM

The normal grep -f regexs.txt bigfile works no matter the size of bigfile, but if regexps.txt is so big it cannot fit into memory, then you need to split this.

grep -F takes around 100 bytes of RAM and grep takes about 500 bytes of RAM per 1 byte of regexp. So if regexps.txt is 1% of your RAM, then it may be too big.
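A quick back-of-the-envelope check along those lines (a sketch; the 100 and 500 factors are just the rule of thumb quoted above):

pattern_bytes=$(wc -c < regexps.txt)
free_kb=$(awk '/MemFree/ { print $2 }' /proc/meminfo)
echo "grep -F would need roughly $((pattern_bytes * 100 / 1024)) kB"
echo "plain grep would need roughly $((pattern_bytes * 500 / 1024)) kB"
echo "MemFree is ${free_kb} kB"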

If you can convert your regexps into fixed strings, do that. E.g. if the lines you are looking for in bigfile all look like:

ID1 foo bar baz Identifier1 quux
fubar ID2 foo bar baz Identifier2

then your regexps.txt can be converted from:

ID1.*Identifier1
ID2.*Identifier2

into:

ID1 foo bar baz Identifier1
ID2 foo bar baz Identifier2

This way you can use grep -F which takes around 80% less memory and is much faster.
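If you want to verify the difference on your own data, GNU time can report the peak memory of each variant (this assumes /usr/bin/time is the GNU implementation and that fixedstrings.txt holds the converted patterns):

/usr/bin/time -v grep -F -f fixedstrings.txt bigfile > /dev/null
/usr/bin/time -v grep -f regexps.txt bigfile > /dev/null
# compare the "Maximum resident set size" lines in the two reports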

If it still does not fit in memory you can do this:

parallel --pipepart -a regexps.txt --block 1M grep -Ff - -n bigfile | \
  sort -un | perl -pe 's/^\d+://'

The 1M should be your free memory divided by the number of CPU threads and divided by 200 for grep -F and by 1000 for normal grep. On GNU/Linux you can do:

free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
           END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-threads)))k

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - -n bigfile | \
  sort -un | perl -pe 's/^\d+://'

If you can live with duplicated lines and wrong order, it is faster to do:

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - bigfile

Limiting factor: CPU

If the CPU is the limiting factor parallelization should be done on the regexps:

cat regexp.txt | parallel --pipe -L1000 --roundrobin --compress \
  grep -f - -n bigfile | \
  sort -un | perl -pe 's/^\d+://'

The command will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of regexp.txt it may be faster to use --block 10m instead of -L1000.
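With --block the same pipeline would look roughly like this (a sketch of the substitution suggested above):

cat regexp.txt | parallel --pipe --block 10m --roundrobin --compress \
  grep -f - -n bigfile | \
  sort -un | perl -pe 's/^\d+://'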

Some storage systems perform better when reading multiple chunks in parallel. This is true for some RAID systems and for some network file systems. To parallelize the reading of bigfile:

parallel --pipepart --block 100M -a bigfile -k --compress \
  grep -f regexp.txt

This will split bigfile into 100MB chunks and run grep on each of these chunks. To parallelize both reading of bigfile and regexp.txt combine the two using --fifo:

parallel --pipepart --block 100M -a bigfile --fifo cat regexp.txt \
  \| parallel --pipe -L1000 --roundrobin grep -f - {}

If a line matches multiple regexps, the line may be duplicated.

Bigger problem

If the problem is too big to be solved by this, you are probably ready for Lucene.

Ole Tange
  • 35,514
0

You might have heard of GNU parallel. This won't work here...

To take advantage of parallelization, the file has to be gigantic, and bash won't do; you'll have to switch to C or another real programming language.

Your code will have to do the following (a rough shell approximation is sketched after the list):

  • Determine the length L of the file
  • Fork into X processes
  • Each process n has to start reading at byte offset n*(L/X) of the file
  • Each has to stop after reading L/X bytes
  • Examine the part it read
  • Use IPC to synchronize the writing to the final output file
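For comparison, here is a rough shell approximation of that scheme, using GNU coreutils' split to hand out line-aligned chunks and plain background jobs in place of the IPC step (file names and the worker count are only placeholders):

X=4    # number of worker processes (placeholder)
for n in $(seq "$X"); do
    # print the n-th of X line-aligned chunks of bigfile and grep it
    split -n l/"$n"/"$X" bigfile | grep -f regexps.txt > part_"$n".out &
done
wait    # stands in for the IPC synchronization step
cat part_*.out > results.txt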

It's certainly possible, but the length of your file has to be a couple of terabytes before this option is even worth considering.

Garo
  • 2,059
  • 11
  • 16