0

I have some files like this:

  • file_1
    chrV    20924149
    chrX    17718866
    chrIV   17493793
    chrII   15279345
    chrI    15072423
    chrIII  13783700
    chrM    13794
    
  • file_2
    chrI    230218
    chrII   813184
    chrIII  316620
    chrIV   1531933
    chrIX   439888
    chrM    85779
    chrV    576874
    chrVI   270161
    chrVII  1090940
    chrVIII 562643
    chrX    745751
    chrXI   666816
    chrXII  1078177
    chrXIII 924431
    chrXIV  784333
    chrXV   1091291
    chrXVI  948066
    

And I need to get the mean and the total values from column 2 and the max and min values from the files. I got some ideas from stackoverflow and made this ugly bash script.

#!usr/bin/env bash

for VARIABLE in Data/*.sizes do echo $VARIABLE echo 'Genome length:' awk -F '\t' '{ sum += $2 } END { print sum }' $VARIABLE echo 'Chr number:' awk -F '\t' '{ NR $1 } END { print NR }' $VARIABLE echo 'Chr mean length:' awk -F '\t' '{ total += $2 } END { print total/NR }' $VARIABLE echo 'Longest Chr:' awk -v max=0 '{if($2>max){want=$1" "$2; max=$2}}END{print want}' $VARIABLE echo 'Smallest Chr:' awk 'NR == 1 || $2 < min {line = $1; min = $2}END{print line " " min}' $VARIABLE echo " " done

It worked but if you have any better ideas and maybe a way to make this more general, because time to time this is done in some similar files.

I would appreciate any inputs because I don't usually use awk and bash.

I got this printed out:

Data/file_1
Genome length:
100286070
Chr number:
7
Chr mean length:
1.43266e+07
Longest Chr:
chrV 20924149
Smallest Chr:
chrM 13794
AdminBee
  • 22,803
  • 4
    If you need your code reviewed, the site for that is [codereview.se] – muru Sep 12 '20 at 14:36
  • 2
    You should be using a single awk script for the body of the loop at least, not a bunch of individual awk scripts, and chances are you don't need the shell loop anyway and could just call awk with Data/*.sizes as the argument. I recommend the book Effective Awk Programming, 5th Edition, by Arnold Robbins to learn how to use awk. – Ed Morton Sep 12 '20 at 16:39

3 Answers3

2

The following awk script will accomplish the task. I will write it as explicit awk program file because of the length - which is mainly due to the function to print the analysis results; the actual calculations are rather short:

If you have GNU awk for the ENDFILE block:

Program file (let's call it analyze_genome_g.awk):

#!/usr/bin/gawk -f

Begin of file, characterized by FNR, the per-file line-counter, being 1.

Initialize statistics: set sum, min, and max to first chromosome length

and name of longest/shortest ('long'/'short') to first chromosome name.

FNR==1{s=min=max=$2; short=long=$1}

All other lines: Update sum, min, and max lengths

FNR>1{s=s+$2;if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1}}

End-of-file (GNU awk feature!): Print statistics

ENDFILE{ printf("%s\n",FILENAME); printf("- Genome length : %d\n",s); printf("- Nr. of chromosomes : %d\n",FNR); printf("- Mean chomosome length : %.1f\n",s/FNR); printf("- Shortest chromosome : %s (length=%d)\n",short,min); printf("- Longest chromosome : %s (length=%d)\n",long,max); printf("\n"); }

You can call it as

gawk -f analyze_genome_g.awk file_1 file_2 ...

Output:

file_1
- Genome length         : 100286070
- Nr. of chromosomes    : 7
- Mean chomosome length : 14326581.4
- Shortest chromosome   : chrM (length=13794)
- Longest chromosome    : chrV (length=20924149)

file_2

  • Genome length : 12157105
  • Nr. of chromosomes : 17
  • Mean chomosome length : 715123.8
  • Shortest chromosome : chrM (length=85779)
  • Longest chromosome : chrIV (length=1531933)

Other awk variants:

If your awk doesn't know the ENDFILE condition, a little workaround is required - basically saving the file properties in temporary variables and print the statistics at either the beginning of a new file (for the previous file), or in the END block when the last file was processed.

To make this more convenient, we define a function printstats() which does the output.

Program file (analyze_genome.awk):

#!/usr/bin/awk -f
function printstats()
{
    printf("%s\n",last_fn);
    printf("- Genome length         : %d\n",s);
    printf("- Nr. of chromosomes    : %d\n",last_fnr);
    printf("- Mean chomosome length : %.1f\n",s/last_fnr);
    printf("- Shortest chromosome   : %s (length=%d)\n",short,min);
    printf("- Longest chromosome    : %s (length=%d)\n",long,max);
    printf("\n");
}

Begin of file

FNR==1 always works, but now we have to save file properties, too.

If it is not the first file (NR, the global line counter, is larger than

FNR, the per-file line-counter), print statistics (of the previous file).

FNR==1{ if (NR>1) printstats(); s=min=max=$2; short=long=$1; last_fn=FILENAME; last_fnr=1; }

FNR>1{ s=s+$2; if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1}; last_fnr++; }

END{printstats()}

You can call it similarly as

awk -f analyze_genome.awk file_1 file_2 ...

As a general note, using shell loops to process text files is disrecommended as it is rather inefficient; awk and the like can perform almost all text-processing tasks and many statistical calculations much faster.

AdminBee
  • 22,803
1

You can produce your stat reports in a yaml fashion using the GNU version of the desk calculator. The below is a heavily commented version of the dc code.

#!/usr/bin/env bash
for VARIABLE in Data/*.sizes
do
    printf '%s:\n' "$VARIABLE" 
< "$VARIABLE" awk '{$1="["$1"]";sub(/^-/,"_",$2)}1' \
| dc -e "
[32adnn]si  # two-spaces indent in reporting
[
lix[Genome length:]   n32an lsp
lix[Chr number:]      n32an lkp
lix[Chr mean length:] n32an /1.0*p
lix[Longest Chr:]     n32an lM     n32an lmp
lix[Smallest Chr:]    n32an lN     n32an lnp
q
]sR
[dsmrdsMr]s+
[dsnrdsNr]s-
[
?z0=R  # report stats @ eof
lk1+sk # increment line kounter
dls+ss # update running sum
dlm<+  # update max
dln>-  # update min
cz0=?  # call myself recursively to read next line 
]s?
[
?       # read the first line
1skdss  # initialize knt, sum
dsmrdsM # initialize max
sNsn    # initialize min
cl?x    # read next line
]sI
lIx     # set the ball rolling, kinda like main() 
"

Results:

Data/file_1.sizes:
  Genome length: 100286070
  Chr number: 7
  Chr mean length: 14326581.0
  Longest Chr: chrV 20924149
  Smallest Chr: chrM 13794

Data/file_2.sizes: Genome length: 12157105 Chr number: 17 Chr mean length: 715123.0 Longest Chr: chrIV 1531933 Smallest Chr: chrM 85779

-2
awk 'BEGIN{sum=0}{sum=sum+$2}END{print sum/finalcountofline}' filename ====Mean

awk 'BEGIN{sum=0}($2 > sum){sum=$2}END{print sum}' filename ===Max

awk 'NR==1{sum=$2}($2 < sum){sum=$2}END{print sum}' filename ===min
  • 2
    Why do you divide by 74 for the mean? Should it it be like the OP has, and divide by NR? Also, using "sum" is a legal, but distracting variable name when you're using it as a min or max. I'd also encourage initializing "max" with the same NR==1 technique that you (and the OP) used for min. – Jeff Schaller Sep 13 '20 at 01:23