1

I got this script to work against a file composed of many lines (>500 MB), with this scheme:

odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT  # on even lines there's a DNA sequence

Its function is to recalculate the value after "cov." using parameters passed as arguments, replace the old one, and compute the percentage of "G" and "C" in the DNA sequence on the even lines.

So, the output looks like:

> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same as the even line)
> GC_CONT: XX

Here's the code (only the loop):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
    echo "scale=4;${odd##*_} / $Y" | bc 
    read even
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "A")  
    GCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: " 
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc  
done < "$1"

It's incredibly slow when run against a huge text file (bigger than 500 MB) on a 16-core server. Any idea how to increase the speed of this script?

EDIT

As requested, the desired I/O is provided via pastebin: https://pastebin.com/FY0Z7kUW

Shred
    see https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – Sundeep Feb 27 '18 at 09:22
  • adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see https://bioinformatics.stackexchange.com/ – Sundeep Feb 27 '18 at 09:24
  • Question updated; I had written it the wrong way. – Shred Feb 27 '18 at 09:34
  • given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done.. – Sundeep Feb 27 '18 at 09:44

4 Answers

3

You’ve reached (to put it mildly) the limits of what can be reasonably done in the shell — you should re-write your script in something like AWK, or Perl, or Python. Using a more advanced language like those will avoid having to run multiple processes for all your text processing; you’ll be able to do it using built-in functions.

Stephen Kitt
  • I'm a beginner, as can be seen from the question, in bash as well as in python. Could you please give me some references on how to do work like this with python? – Shred Feb 27 '18 at 09:37
  • Text Processing in Python is an old book covering the topic, but you’d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl. – Stephen Kitt Feb 27 '18 at 09:55
  • This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process. – Gordon Davisson Mar 09 '18 at 08:24
2

The percentage calculation can be reduced to a single operation like this

 echo "${even##}" | awk '{x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%\b", (y*100)/(x+y) }'

gsub substitutes a pattern and returns the count of substitutions it has made, so that count can be used to calculate the percentage quickly.
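
For example, a minimal illustration with a made-up sequence:

echo "AGCAGCT" | awk '{ n = gsub(/[GC]/, ""); print n }'    # prints 4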

You could also process the odd and even lines in awk. It is not entirely clear what you are doing with the odd lines, but your complete function can be put into a single awk script:

awk -F '_' -v Y="$Y" '{ if(NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,""); 
    y=gsub(/[GC]/,""); 
    printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
    }
 }' large_file

EDIT: Based on the OP's requirement, I changed the if block for the odd lines. The gsub would remove the "cov." from the number. After passing the shell variable $Y to awk, we can now divide and print in the required format.

Using a single awk script instead of multiple external commands per line will speed things up significantly.
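
For example, the whole loop could be replaced by computing Y once in the shell and handing it to awk (a sketch, reusing the READLENGHT and KMER variables and the "$1" argument from the original script):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)
awk -F '_' -v Y="$Y" '
    NR%2==1 { printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n", $1, $2, $3, $4, $5, ($6 / Y) }
    NR%2==0 { x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }
' "$1"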

amisax
  • In odd lines the value after "cov." is used to calculate a new value, called "nucleotidic_cov", which is printed on a new line. – Shred Feb 27 '18 at 10:52
  • I have updated based on your explanation – amisax Feb 27 '18 at 11:07
  • I can't figure out where the split happens between the "cov." value and the variable called Y. Please improve; using only awk the script runs very fast. – Shred Mar 02 '18 at 15:39
  • @Shred in the if block we modified $3 (cov.) into nucleotidic_cov. $3 in the line is cov.XX – amisax Mar 08 '18 at 10:48
  • Perhaps no division is made. The cov. value needs to be divided by $Y, as in the script provided, in echo "scale=4;${odd##*_} / $Y" | bc. Please improve, and this will be the best answer, because that's what I need. – Shred Mar 08 '18 at 16:12
  • @Shred , your last comment clarifies the division operation. I have updated the answer. – amisax Mar 09 '18 at 05:39
  • Division doesn't work. It prints "nucleotidic_cov : 0.000" . Just to clarify, Y is always lower than "cov." , so the value is always bigger than 1. – Shred Mar 09 '18 at 09:33
  • Please include a sample line - because it is not clear what is cov. and what is XX. – amisax Mar 09 '18 at 09:44
  • Sorry, Y is always lower than 1. Could this be the trouble here, because awk works only with integers? Anyway, question updated with sample input/output. – Shred Mar 09 '18 at 09:55
1

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "$even" | tr -cd 'A' | wc -m)
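
The GC count and the total can be taken the same way (a sketch along the same lines; it still spawns a couple of processes per line):

GCCONT=$(echo "$even" | tr -cd 'GC' | wc -m)
TOTALBASES=$(echo "$even" | tr -cd 'ACGT' | wc -m)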

But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...;                              # calculate your Y value here
while(my $odd = <ARGV>) {                 # Read a line from the file(s) passed
                                          # on the command line
    chomp $odd;                           # lose the newline
    my @split = split /_/, $odd;          # split the read line on a "_" boundary
                                          # into an array
    print join("_", @split[0..3]) . "\n"; # print the first four elements of the
                                          # array, separated by "_"
    print $split[$#split] / $y . "\n";    # Treat the final element of the
                                          # @split array as a number, divide it
                                          # by $y, and output the result
    my %charcount = (                     # Initialize a hash table
        A => 0,
        G => 0,
        C => 0,
        T => 0
    );
    my $even = <ARGV>;                    # read the even line
    chomp $even;
    foreach my $char(split //,$even) {    # split the string into separate
                                          # characters, and loop over them
        $charcount{$char}++;              # Count the correct character
    }
    my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
    my $gc = $charcount{G} + $charcount{C};
    my $perc = $gc / $total * 100;        # as a percentage, like the original script
    print "GC_CONT: $perc\n";             # Do our final calculations and
                                          # output the result
}

Note: not tested (beyond "does perl accept this code")
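
Assuming the script is saved as, say, gc_cov.pl (the name is just for illustration), it reads the file passed on the command line:

perl gc_cov.pl <input_file> > <output_file>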

If you want to learn more about perl, run perldoc perlintro and get started ;-)

  • I already switched away from using tr, which slowed the work down even more. – Shred Feb 27 '18 at 09:37
  • Found some errors in the code while running it. Btw, thanks a lot, still a good starting point for further research. – Shred Feb 28 '18 at 13:32
  • What error? Probably should update it then :-) – Wouter Verhelst Mar 01 '18 at 12:01
  • Here's the console log. https://pastebin.com/uXZW0802 – Shred Mar 02 '18 at 10:57
  • So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked. – Wouter Verhelst Mar 05 '18 at 10:22
0

You are reading a long file line by line and executing multiple commands on each iteration. The main problem you are facing is the overhead of spawning those commands and reading very small chunks of the file at a time.

Stephen Kitt's answer is good: you want to rewrite this in a higher-level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /mnt/tmpfs
# <script> /mnt/tmpfs/<input_file>

This should make the process faster to the extent that you are I/O-constrained, but it will never be as fast as it could be if rewritten in C, Ruby, or Python.
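
A quick way to tell which case you are in (a rough sketch using the shell's time builtin): if the user plus sys times are close to the real time, the script is CPU-bound and the tmpfs copy will not help much.

time <script> /mnt/tmpfs/<input_file> > /dev/null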

Pedro