I am trying to parse the output from antismash to count the number of BGC. I found the scripts that people have with python which I know nothing about, so I am trying to figure this out using bash script.
The gene bank format files with predicted clusters look like this:
head -30 sca_11_chr8_3_0.region001.gbk
LOCUS sca_11_chr8_3_0 45390 bp DNA linear UNK 01-JAN-1980
DEFINITION sca_11_chr8_3_0.
ACCESSION sca_11_chr8_3_0
VERSION sca_11_chr8_3_0
KEYWORDS .
SOURCE
ORGANISM
.
COMMENT ##antiSMASH-Data-START##
Version :: 6.0.1-a859617(changed)
Run date :: 2021-10-31 18:00:02
NOTE: This is a single cluster extracted from a larger record!
Orig. start :: 169481
Orig. end :: 214871
##antiSMASH-Data-END##
FEATURES Location/Qualifiers
protocluster 1..45390
/aStool="rule-based-clusters"
/contig_edge="False"
/core_location="join{[194767:194871](-),
[194650:194652](-), [191596:194619](-),
[189481:191503](-)}"
/cutoff="20000"
/detection_rule="cds(Condensation and (AMP-binding or
A-OX))"
/neighbourhood="20000"
/product="NRPS"
/protocluster_number="1"
/tool="antismash"
proto_core complement(join(20001..22022,22116..25138,25170..25171,
tail sca_11_chr8_3_0.region001.gbk
44881 ggagcttgtg gagagaagtg agacgtatcg cacgaatgct cttcagcaga tgctgggcag
44941 ttagaggatt tgcactttag tttcatagag ttgatgtgtc gaggagataa tttgagatac
45001 cagtatatgt aatttaccta cctacctagt cgagattgga cattgtacaa gagaaataac
45061 aactaactat acgagacaag cctgatgtgt tgatagtttc attcatgtct ggtgtttgtg
45121 gcatgtttat gttggagtag ctgtacagaa gataccgcgc tattcccagt gatcatggcc
45181 cccacgcctc caactcggca cctgaccttg atcccctttg ggaagcatgt ctcagtgtct
45241 cagccgtgag ccgtagaggc tgcacagcat ggagaagctg tcctgtcaat tcaggggatt
45301 tgcccacggg ggctatcata tgatgaatct cggacaccct acacgttgtt accgcctttc
45361 ttagctcctg ctggtagccg tcccctgaac
//
At first I concatenated the gbk files into one so it contains all the predicted clusters and then grep the characters that gave me the locus id, start and end of cluster and cluster type.
cat sca_*.gbk > Necha2_SMclusters.gbk
grep "DEFINITION\|Orig\|product=" Necha2_SMclusters.gbk > Necha2_SMclusters_filtered.txt
which gives me a file like this
DEFINITION sca_32_chr11_3_0.
Orig. start :: 381231
Orig. end :: 428233
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
DEFINITION sca_32_chr11_3_0.
Orig. start :: 464307
Orig. end :: 486217
/product="terpene"
/product="terpene"
/product="terpene"
/product="terpene"
DEFINITION sca_33_chr6_1_0.
Orig. start :: 140267
Orig. end :: 227928
/product="NRPS-like"
/product="T1PKS"
/product="NRPS-like"
/product="T1PKS"
/product="NRPS-like"
/product="NRPS-like"
/product="NRPS-like"
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
DEFINITION sca_39_chr11_5_0.
Orig. start :: 270154
Orig. end :: 324310
/product="NRPS"
/product="NRPS"
/product="NRPS"
/product="NRPS"
From this file I want to obtain a file which looks like this.
Locus name start end ClusterType
sca_9_chr7_10_0. 369577 421460 T1PKS,NRPS
sca_33_chr6_1_0. 140267 227928 NRPS-like, T1PKS
sca_32_chr11_3_0 381231 428233 T1PKS
For now this is what I need a file with all predicted clusters in it.
Thank you so much!!
ClusterType
isterpene
? How can we know to skip that? I have written scripts using SwissKnife in Perl for this and I am sure there will be dedicated parses included in both bioPython and bioPerl. – terdon Nov 01 '21 at 16:20cat sca_*.gbk > Necha2_SMclusters.gbk grep "DEFINITION\|Orig\|product=" Necha2_SMclusters.gbk
as the sample input while you don't need to do any of that if you're using awk and it's probably making what you need to do with the awk script harder than it has to be as that's stripping whatever separates the blocks and/or combining them into one stream, you can just run awk directly onsca_*.gbk
so THAT is the sample input you should provide, not the output of running some other tools on them. – Ed Morton Nov 01 '21 at 16:26sca_*.gbk
files in terms of what each record contains (the 3 fields you want plus a couple you don't), 1 or more records per file, and separators between records to be the sample input and show the ouput you expect given that input. – Ed Morton Nov 01 '21 at 16:37//
which always (at least in my experience) appear by themselves and as the only characters on the line. – terdon Nov 01 '21 at 16:38grep "DEFINITION\|Orig\|product="
(if that's not true the existing approach falls apart), other lines they don't want in between of whatever content, and blocks are separated by lines that are just//
per your comment. Given that, it's trivial to create a MCVE. I'm just going to do it... – Ed Morton Nov 01 '21 at 16:59