The following awk script will accomplish the task. I will write it as an explicit awk program file because of its length, which is mainly due to the code that prints the analysis results; the actual calculations are rather short.
If you have GNU awk (needed for the ENDFILE block):
Program file (let's call it analyze_genome_g.awk):
#!/usr/bin/gawk -f
# Beginning of a file, characterized by FNR, the per-file line counter, being 1.
# Initialize statistics: set sum, min, and max to the first chromosome length,
# and the names of the longest/shortest ('long'/'short') to the first chromosome name.
FNR==1{s=min=max=$2; short=long=$1}
# All other lines: update sum, min, and max lengths
FNR>1{s=s+$2;if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1}}
# End-of-file (GNU awk feature!): print statistics
ENDFILE{
    printf("%s\n",FILENAME);
    printf("- Genome length : %d\n",s);
    printf("- Nr. of chromosomes : %d\n",FNR);
    printf("- Mean chromosome length : %.1f\n",s/FNR);
    printf("- Shortest chromosome : %s (length=%d)\n",short,min);
    printf("- Longest chromosome : %s (length=%d)\n",long,max);
    printf("\n");
}
You can call it as
gawk -f analyze_genome_g.awk file_1 file_2 ...
Output:
file_1
- Genome length : 100286070
- Nr. of chromosomes : 7
- Mean chromosome length : 14326581.4
- Shortest chromosome : chrM (length=13794)
- Longest chromosome : chrV (length=20924149)

file_2
- Genome length : 12157105
- Nr. of chromosomes : 17
- Mean chromosome length : 715123.8
- Shortest chromosome : chrM (length=85779)
- Longest chromosome : chrIV (length=1531933)
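To try the per-line bookkeeping in isolation, here is a minimal sketch on a made-up input. The file name sample.sizes and its three entries are hypothetical; the only assumption about the real data is the layout the script relies on (chromosome name in column 1, length in column 2). It reports in END rather than ENDFILE, so it should run in any POSIX awk, at the cost of handling a single file only.

```shell
# Hypothetical sample input: chromosome name, then length (whitespace-separated).
tmpdir=$(mktemp -d)
printf '%s\n' 'chrI 300' 'chrII 100' 'chrIII 500' > "$tmpdir/sample.sizes"

# Same min/max/sum logic as in the program file, but printed once in END.
# Output fields: total length, number of lines, mean, shortest name, longest name.
result=$(awk '
FNR==1 { s=min=max=$2; short=long=$1 }
FNR>1  { s+=$2; if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1} }
END    { printf("%d %d %.1f %s %s\n", s, NR, s/NR, short, long) }
' "$tmpdir/sample.sizes")

echo "$result"
rm -rf "$tmpdir"
```

With a single input file NR and FNR are identical, which is why NR can stand in for the chromosome count here.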
Other awk variants:
If your awk doesn't know the ENDFILE pattern, a small workaround is required: save the file properties in temporary variables and print the statistics either at the beginning of a new file (for the previous file) or in the END block after the last file has been processed.
To make this more convenient, we define a function printstats() that does the output.
Program file (analyze_genome.awk):
#!/usr/bin/awk -f
function printstats()
{
printf("%s\n",last_fn);
printf("- Genome length : %d\n",s);
printf("- Nr. of chromosomes : %d\n",last_fnr);
printf("- Mean chromosome length : %.1f\n",s/last_fnr);
printf("- Shortest chromosome : %s (length=%d)\n",short,min);
printf("- Longest chromosome : %s (length=%d)\n",long,max);
printf("\n");
}
# Beginning of a file
# FNR==1 always works, but now we have to save the file properties, too.
# If it is not the first file (NR, the global line counter, is larger than
# FNR, the per-file line counter), print the statistics of the previous file.
FNR==1{
if (NR>1) printstats();
s=min=max=$2; short=long=$1;
last_fn=FILENAME; last_fnr=1;
}
FNR>1{
s=s+$2; if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1};
last_fnr++;
}
END{printstats()}
You can call it in the same way:
awk -f analyze_genome.awk file_1 file_2 ...
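The "remember the previous file, report it when the next one starts" idea can be seen on its own in the following sketch, which only counts lines per file. The file names one.txt and two.txt and the function name report() are made up for the illustration; only FNR, NR, FILENAME, and END are used, so no GNU extension should be needed.

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\n'    > "$tmpdir/one.txt"
printf 'c\nd\ne\n' > "$tmpdir/two.txt"

counts=$(cd "$tmpdir" && awk '
# Print the saved name and line count of the file that has just ended.
function report() { printf("%s %d\n", last_fn, last_fnr) }
# First line of a file: report the previous file (if any), then remember the new name.
FNR==1 { if (NR>1) report(); last_fn=FILENAME }
# Every line: remember how far we are in the current file.
       { last_fnr=FNR }
# After the last file there is no "next file" to trigger the report, so do it here.
END    { if (NR>0) report() }
' one.txt two.txt)

echo "$counts"
rm -rf "$tmpdir"
```

The order inside the FNR==1 rule matters: report() has to run before last_fn is overwritten, otherwise the count of the previous file would be printed under the name of the new one.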
As a general note, using shell loops to process text files is discouraged because it is rather inefficient; awk and similar tools can perform almost all text-processing tasks and many statistical calculations much faster.
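As a small illustration of the difference in style, the sketch below sums the second column of a made-up file (sample.sizes, hypothetical contents) once with a shell while-read loop and once with awk. Both give the same total; the toy input is far too small to show any timing difference, so this only demonstrates that the awk version needs no explicit loop.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'chrI 300' 'chrII 100' 'chrIII 500' > "$tmpdir/sample.sizes"

# Shell loop: the shell itself reads and interprets every single line.
loop_sum=0
while read -r name len; do
    loop_sum=$((loop_sum + len))
done < "$tmpdir/sample.sizes"

# awk: the loop over input lines is implicit, and the arithmetic is built in.
awk_sum=$(awk '{ s += $2 } END { print s }' "$tmpdir/sample.sizes")

echo "shell loop: $loop_sum, awk: $awk_sum"
rm -rf "$tmpdir"
```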
Data/*.sizes
as the argument. I recommend the book Effective Awk Programming, 5th Edition, by Arnold Robbins to learn how to use awk. – Ed Morton Sep 12 '20 at 16:39