Improvement of a bash script to make some basic statistics on files

Question

I have some files like this:

file_1

chrV    20924149
chrX    17718866
chrIV   17493793
chrII   15279345
chrI    15072423
chrIII  13783700
chrM    13794

file_2

chrI    230218
chrII   813184
chrIII  316620
chrIV   1531933
chrIX   439888
chrM    85779
chrV    576874
chrVI   270161
chrVII  1090940
chrVIII 562643
chrX    745751
chrXI   666816
chrXII  1078177
chrXIII 924431
chrXIV  784333
chrXV   1091291
chrXVI  948066

And I need to get the mean and the total values from column 2 and the max and min values from the files. I got some ideas from stackoverflow and made this ugly bash script.

#!usr/bin/env bash
for VARIABLE in Data/*.sizes
do
    echo $VARIABLE
    echo  'Genome length:'
    awk -F '\t' '{ sum += $2 } END { print sum }' $VARIABLE
    echo 'Chr number:'
    awk -F '\t' '{ NR $1 } END { print NR }' $VARIABLE
    echo 'Chr mean length:'
    awk -F '\t' '{ total += $2 } END { print total/NR }' $VARIABLE
    echo 'Longest Chr:'
    awk -v max=0 '{if($2>max){want=$1" "$2; max=$2}}END{print want}' $VARIABLE
    echo 'Smallest Chr:'
    awk 'NR == 1 || $2 < min {line = $1; min = $2}END{print line " " min}' $VARIABLE
    echo " "
done

It worked but if you have any better ideas and maybe a way to make this more general, because time to time this is done in some similar files.

I would appreciate any inputs because I don't usually use awk and bash.

I got this printed out:

Data/file_1
Genome length:
100286070
Chr number:
7
Chr mean length:
1.43266e+07
Longest Chr:
chrV 20924149
Smallest Chr:
chrM 13794

If you need your code reviewed, the site for that is [codereview.se] — muru, Sep 12 '20 at 14:36
You should be using a single awk script for the body of the loop at least, not a bunch of individual awk scripts, and chances are you don't need the shell loop anyway and could just call awk with Data/*.sizes as the argument. I recommend the book Effective Awk Programming, 5th Edition, by Arnold Robbins to learn how to use awk. — Ed Morton, Sep 12 '20 at 16:39

AdminBee · Accepted Answer · 2020-09-14T13:53:21.550

The following awk script will accomplish the task. I will write it as explicit awk program file because of the length - which is mainly due to the function to print the analysis results; the actual calculations are rather short:

If you have GNU `awk` for the `ENDFILE` block:

Program file (let's call it analyze_genome_g.awk):

#!/usr/bin/gawk -f
Begin of file, characterized by FNR, the per-file line-counter, being 1.
Initialize statistics: set sum, min, and max to first chromosome length
and name of longest/shortest ('long'/'short') to first chromosome name.
FNR==1{s=min=max=$2; short=long=$1}
All other lines: Update sum, min, and max lengths
FNR>1{s=s+$2;if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1}}
End-of-file (GNU awk feature!): Print statistics
ENDFILE{
    printf("%s\n",FILENAME);
    printf("- Genome length         : %d\n",s);
    printf("- Nr. of chromosomes    : %d\n",FNR);
    printf("- Mean chomosome length : %.1f\n",s/FNR);
    printf("- Shortest chromosome   : %s (length=%d)\n",short,min);
    printf("- Longest chromosome    : %s (length=%d)\n",long,max);
    printf("\n");
}

You can call it as

gawk -f analyze_genome_g.awk file_1 file_2 ...

Output:

file_1
- Genome length         : 100286070
- Nr. of chromosomes    : 7
- Mean chomosome length : 14326581.4
- Shortest chromosome   : chrM (length=13794)
- Longest chromosome    : chrV (length=20924149)
file_2

Genome length         : 12157105
Nr. of chromosomes    : 17
Mean chomosome length : 715123.8
Shortest chromosome   : chrM (length=85779)
Longest chromosome    : chrIV (length=1531933)

Other `awk` variants:

If your awk doesn't know the ENDFILE condition, a little workaround is required - basically saving the file properties in temporary variables and print the statistics at either the beginning of a new file (for the previous file), or in the END block when the last file was processed.

To make this more convenient, we define a function printstats() which does the output.

Program file (analyze_genome.awk):

#!/usr/bin/awk -f
function printstats()
{
    printf("%s\n",last_fn);
    printf("- Genome length         : %d\n",s);
    printf("- Nr. of chromosomes    : %d\n",last_fnr);
    printf("- Mean chomosome length : %.1f\n",s/last_fnr);
    printf("- Shortest chromosome   : %s (length=%d)\n",short,min);
    printf("- Longest chromosome    : %s (length=%d)\n",long,max);
    printf("\n");
}
Begin of file
FNR==1 always works, but now we have to save file properties, too.
If it is not the first file (NR, the global line counter, is larger than
FNR, the per-file line-counter), print statistics (of the previous file).
FNR==1{
    if (NR>1) printstats();
    s=min=max=$2; short=long=$1;
    last_fn=FILENAME; last_fnr=1;
}
FNR>1{
    s=s+$2; if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1};
    last_fnr++;
}
END{printstats()}

You can call it similarly as

awk -f analyze_genome.awk file_1 file_2 ...

As a general note, using shell loops to process text files is disrecommended as it is rather inefficient; awk and the like can perform almost all text-processing tasks and many statistical calculations much faster.

AdminBee one of the best answers I have receive in a long time in any site of stackoverflow. I really really appreciate your time for a so nice answer. Thank you very much. — Paulo Sergio Schlogl, Sep 14 '20 at 13:23

Rakesh Sharma · Answer 2 · 2020-09-13T14:47:53.503

You can produce your stat reports in a yaml fashion using the GNU version of the desk calculator. The below is a heavily commented version of the dc code.

#!/usr/bin/env bash
for VARIABLE in Data/*.sizes
do
    printf '%s:\n' "$VARIABLE" 
< "$VARIABLE" awk '{$1="["$1"]";sub(/^-/,"_",$2)}1' \
| dc -e "
[32adnn]si  # two-spaces indent in reporting
[
lix[Genome length:]   n32an lsp
lix[Chr number:]      n32an lkp
lix[Chr mean length:] n32an /1.0*p
lix[Longest Chr:]     n32an lM     n32an lmp
lix[Smallest Chr:]    n32an lN     n32an lnp
q
]sR
[dsmrdsMr]s+
[dsnrdsNr]s-
[
?z0=R  # report stats @ eof
lk1+sk # increment line kounter
dls+ss # update running sum
dlm<+  # update max
dln>-  # update min
cz0=?  # call myself recursively to read next line 
]s?
[
?       # read the first line
1skdss  # initialize knt, sum
dsmrdsM # initialize max
sNsn    # initialize min
cl?x    # read next line
]sI
lIx     # set the ball rolling, kinda like main() 
"

Results:

Data/file_1.sizes:
  Genome length: 100286070
  Chr number: 7
  Chr mean length: 14326581.0
  Longest Chr: chrV 20924149
  Smallest Chr: chrM 13794
Data/file_2.sizes:
  Genome length: 12157105
  Chr number: 17
  Chr mean length: 715123.0
  Longest Chr: chrIV 1531933
  Smallest Chr: chrM 85779

Thank you Rakesh. I really appreciate. – Paulo Sergio Schlogl Sep 14 '20 at 13:25 — Paulo Sergio Schlogl, Sep 14 '20 at 13:25

Praveen Kumar BS · Answer 3 · 2020-09-13T08:33:58.857

-2

awk 'BEGIN{sum=0}{sum=sum+$2}END{print sum/finalcountofline}' filename ====Mean

awk 'BEGIN{sum=0}($2 > sum){sum=$2}END{print sum}' filename ===Max

awk 'NR==1{sum=$2}($2 < sum){sum=$2}END{print sum}' filename ===min

edited Sep 13 '20 at 08:33

answered Sep 12 '20 at 20:04

Praveen Kumar BS

5,211

2

Why do you divide by 74 for the mean? Should it it be like the OP has, and divide by NR? Also, using "sum" is a legal, but distracting variable name when you're using it as a min or max. I'd also encourage initializing "max" with the same NR==1 technique that you (and the OP) used for min. – Jeff Schaller Sep 13 '20 at 01:23

Improvement of a bash script to make some basic statistics on files

3 Answers3

If you have GNU awk for the ENDFILE block:

Begin of file, characterized by FNR, the per-file line-counter, being 1.

Initialize statistics: set sum, min, and max to first chromosome length

and name of longest/shortest ('long'/'short') to first chromosome name.

All other lines: Update sum, min, and max lengths

End-of-file (GNU awk feature!): Print statistics

Other awk variants:

Begin of file

FNR==1 always works, but now we have to save file properties, too.

If it is not the first file (NR, the global line counter, is larger than

FNR, the per-file line-counter), print statistics (of the previous file).

If you have GNU `awk` for the `ENDFILE` block:

Other `awk` variants: