I am trying to parse several files to extract specific lines and output them into another file. However, the location of this information within my files can change according to a specific parameter.
For this I am thinking of using an if-statement. In most cases, what I need to extract is located in lines 6 and 7:
# IGBLASTN 2.5.1+
# Query: RL0575_B2_no210_RL0575_B2_ACPA_positive_LC
# Database: human_gl_V human_gl_D human_gl_J BCR_C_all.fa
# Domain classification requested: imgt
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches having the same score and percent identity, if present, are$
IGLV4-69*01 IGLJ1*01 VL No In-frame Yes +
and for this I do:
a=`ls *LC.fa | awk -F "." '{print $1}'`; #here i just strip the name of the files for the loop
for i in $a;
do cat $i.fmt7 | awk 'NR==6, NR==7' > $i.parsed.txt;
done
However there are some cases in which the file has that information in lines 8 and 9, because there is an additional note in line 6:
# IGBLASTN 2.5.1+
# Query: RL0624_B10_no15_RL0624_B10_ACPA_positive_LC
# Database: human_gl_V human_gl_D human_gl_J BCR_C_all.fa
# Domain classification requested: imgt
# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches having the same score and percent identity, if present, are$
IGKV3-20*01 IGKJ2*01 VK Yes In-frame No -
I would like to proceed in a way similar to the above, but with a condition that selects the right pair of lines:
a=`ls *LC.fa | awk -F "." '{print $1}'`; #here i just strip the name of the files for the loop
for i in $a;
do if [my condition?] # <== here I do not know how to formulate the condition!
then
cat $i.fmt7 | awk 'NR==8, NR==9' > $i.parsed.txt;
else
cat $i.fmt7 | awk 'NR==6, NR==7' > $i.parsed.txt;
fi
done
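One possible way to formulate such a condition, as a minimal sketch: test for the presence of the extra "# Note that your query represents the minus strand" line with grep -q, and choose the line range from that. The function name pick_lines is hypothetical, the *LC.fmt7 glob assumes the file naming shown above, and the line numbers (6-7 versus 8-9) are taken as stated in the question rather than verified against real files.

```shell
# pick_lines FILE (hypothetical helper):
# print lines 8-9 if the minus-strand note is present, otherwise lines 6-7.
pick_lines() {
    if grep -q '^# Note that your query represents the minus strand' "$1"; then
        awk 'NR==8, NR==9' "$1"
    else
        awk 'NR==6, NR==7' "$1"
    fi
}

# Assumption: the IgBLAST outputs are named <name>LC.fmt7 in the current directory.
for f in *LC.fmt7; do
    [ -e "$f" ] || continue                      # skip if the glob matched nothing
    pick_lines "$f" > "${f%.fmt7}.parsed.txt"    # same output name as in the question
done
```

Looping over the glob directly avoids parsing the output of ls, and quoting "$f" keeps file names with unusual characters intact.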
How can I make sure I extract the correct lines despite the varying length of the preamble? Note that the files contain more data lines than shown here, so it is not simply the last two lines that I need to extract.
Any idea is really appreciated.
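A sketch of an approach that does not depend on line numbers at all: let awk find the "# V-(D)-J rearrangement summary" header and print it together with the line that follows it. This assumes the wanted pair is always that header plus the single data line right after it, as in the two samples above. The function name summary_lines and the *LC.fmt7 glob are illustrative assumptions.

```shell
# summary_lines FILE (hypothetical helper):
# print the "# V-(D)-J rearrangement summary" header and the one line after it.
summary_lines() {
    awk '/^# V-\(D\)-J rearrangement summary/ { print; n = 1; next }
         n > 0 { print; n-- }' "$1"
}

# Assumption: the IgBLAST outputs are named <name>LC.fmt7 in the current directory.
for f in *LC.fmt7; do
    [ -e "$f" ] || continue                         # skip if the glob matched nothing
    summary_lines "$f" > "${f%.fmt7}.parsed.txt"
done
```

Because the match is on the header text, an extra note line in the preamble does not shift the result. If a file contained several such headers, each header and its following line would be printed.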
if, was it a wrong idea @Kusalananda? If not, what could have been the strategy on that occasion? Thanks in advance for any explanation. – fusion.slope Jun 09 '20 at 12:10
tail -n 2 would give you that. – Kusalananda Jun 09 '20 at 12:38
# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence
some files need tail -2, others need tail -4. – fusion.slope Jun 09 '20 at 12:49