1

I am trying to parse several files to extract specific lines and output them into another file. However, the location of this information within my files can change depending on a specific parameter.

For this I am thinking of using an if statement. In most cases, what I need to extract is located in lines 6 and 7:

# IGBLASTN 2.5.1+
# Query: RL0575_B2_no210_RL0575_B2_ACPA_positive_LC
# Database: human_gl_V human_gl_D human_gl_J BCR_C_all.fa
# Domain classification requested: imgt

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches having the same score and percent identity, if present, are$
IGLV4-69*01     IGLJ1*01        VL      No      In-frame        Yes     +

and for this I do:

a=`ls *LC.fa | awk -F "." '{print $1}'`; #here i just strip the name of the files for the loop
for i in $a;
            do cat $i.fmt7 | awk 'NR==6, NR==7' > $i.parsed.txt;
done

However there are some cases in which the file has that information in lines 8 and 9, because there is an additional note in line 6:

# IGBLASTN 2.5.1+
# Query: RL0624_B10_no15_RL0624_B10_ACPA_positive_LC
# Database: human_gl_V human_gl_D human_gl_J BCR_C_all.fa
# Domain classification requested: imgt

# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches having the same score and percent identity, if present, are$
IGKV3-20*01     IGKJ2*01        VK      Yes     In-frame        No      -

I would like to proceed in a way similar to above, but

a=`ls *LC.fa | awk -F "." '{print $1}'`; #here i just strip the name of the files for the loop
for i in $a;  
            if [my condition?]  # <== here I do not know how to formulate the condition!
            then
               cat $i.fmt7 | awk 'NR==8, NR==9' 
            else
               cat $i.fmt7 | awk 'NR==6, NR==7' > $i.parsed.txt;
            fi
done

How can I ensure to extract the correct lines despite the varying length of the preamble? Note that the files contain more data lines than presented here, so it is not simply the last two lines I need to extract.

Any idea is really appreciated.

AdminBee
fusion.slope
  • can you edit the question and be more specific about what you want? You only want a comment in the source? Or in the output? In the first case, put the comment on a new line after the do and the cat on a new line after the comment. – ruud Jun 09 '20 at 07:44
  • Would it be correct to say you want to extract the last two rows of the file, always? – Kusalananda Jun 09 '20 at 07:53
  • No, the files are much longer; I took only the first n lines for this example, because I would like to work on the first lines for the if statement. – fusion.slope Jun 09 '20 at 07:56
  • In general, was using if conditions a wrong idea, @Kusalananda? If not, what could the strategy have been in this case? Thanks in advance for any explanation. – fusion.slope Jun 09 '20 at 12:10
  • @fusion.slope Well, if you just want the last two lines of a file, tail -n 2 would give you that. – Kusalananda Jun 09 '20 at 12:38
  • No, the files differ in the number of rows for the reason mentioned above. Some files have more rows because of this row: # Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence. So some files need tail -2, others need tail -4. – fusion.slope Jun 09 '20 at 12:49

2 Answers

2

It would seem that your files contain one relevant data line, and the rest is either empty or a comment line starting with #; however the last one is a header you want to retain. Your problem appears to be that the number of comment lines varies.

If the task actually is to extract the header and this one data line for outputting into a "parsed" summary file, you could instruct awk to ignore all lines that are empty or start with the # character, without being the header identified by the # V-(D)-J starting pattern, as in:

awk '$0~/^# V-\(D\)-J/ || ($0!~/^#/ && NF>0) {print}' input_file > parsed_file

If, on the other hand, your file contains more than one data line, and you want to print only the header and the first data line, the awk command would have to look like this:

awk '$0~/^# V-\(D\)-J/ {print} ($0!~/^#/ && NF>0) {print;exit}' input_file > parsed_file

To do this in a shell loop, you could do

for file in *LC.fa
do
    infile="${file%.*}.fmt7"
    outfile="${file%.*}.parsed.txt"
    awk '$0~/^# V-\(D\)-J/ || ($0!~/^#/ && NF>0) {print}' "$infile" > "$outfile"
done

or

for file in *LC.fa
do
    infile="${file%.*}.fmt7"
    outfile="${file%.*}.parsed.txt"
    awk '$0~/^# V-\(D\)-J/ {print} ($0!~/^#/ && NF>0) {print;exit}' "$infile" > "$outfile"
done

respectively.

This loop is more robust than parsing the output of ls, which is strongly discouraged.
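One caveat with a glob-based loop: if no *LC.fa file exists, the pattern is passed through unexpanded, and a missing .fmt7 file would make awk fail. A minimal sketch of a guarded version could look like this; the function name parse_all and its optional directory argument are assumptions made for illustration and are not part of the loops above.

```shell
# parse_all is a hypothetical wrapper; it takes the directory to scan (default: current).
parse_all() {
    dir=${1:-.}
    for file in "$dir"/*LC.fa
    do
        # With no match the glob stays literal, so skip anything that does not exist.
        [ -e "$file" ] || continue
        infile="${file%.*}.fmt7"
        outfile="${file%.*}.parsed.txt"
        # Report and skip inputs that have no readable .fmt7 counterpart.
        if [ ! -r "$infile" ]; then
            echo "skipping: $infile not found" >&2
            continue
        fi
        awk '$0~/^# V-\(D\)-J/ {print} ($0!~/^#/ && NF>0) {print;exit}' "$infile" > "$outfile"
    done
}
```

Calling parse_all without an argument processes the current directory; the message on stderr plays the same role as an echo placed above the awk line for checking which input file is used.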

Some explanation on the awk command

awk works on a "condition-rule" syntax (the awk documentation calls these patterns and actions), where conditions are in the "main" program space and the corresponding rules in { ... }.

In the first example, we have one condition and one rule:

  • If the line (dereferenced by $0) matches the regular expression ^# V-\(D\)-J, i.e. starts (^) with the string # V-(D)-J
  • or (||) it does not start with a # (the $0!~/^#/ expression) and in addition is non-empty, i.e. it has at least one field (NF>0 - we could also have abbreviated this as simply NF) as defined by the "field separator" variable (defaults to whitespace)

then print the line.

This will print the header and every data line, i.e. every non-empty line that does not start with #.

In the second example, we have two conditions with associated rules:

  • If the line starts with the string # V-(D)-J, print the line.
  • If the line does not start with the #, and is not empty, print the line and then immediately exit, i.e. terminate the awk processing of the file.

That way, the "header" is printed, but once the first "data" line is encountered and printed, we stop the execution, which would then only print the header plus the first data line for each file.
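To check this behaviour without touching real data, the second command can be tried on two small sample files, one without and one with the extra note line. This is only a sketch: the scratch directory, the file names without_note.fmt7 and with_note.fmt7, and the shortened file contents are made up for the test.

```shell
# Scratch directory and file names are made up for this demonstration.
tmpdir=$(mktemp -d)

# Sample 1: header and data line on lines 6 and 7.
printf '%s\n' \
  '# IGBLASTN 2.5.1+' \
  '# Query: sample_without_note' \
  '# Database: human_gl_V human_gl_D human_gl_J' \
  '# Domain classification requested: imgt' \
  '' \
  '# V-(D)-J rearrangement summary for query sequence' \
  'IGLV4-69*01 IGLJ1*01 VL No In-frame Yes +' \
  '' \
  'a later data line that must not be printed' > "$tmpdir/without_note.fmt7"

# Sample 2: an extra note pushes header and data line to lines 8 and 9.
printf '%s\n' \
  '# IGBLASTN 2.5.1+' \
  '# Query: sample_with_note' \
  '# Database: human_gl_V human_gl_D human_gl_J' \
  '# Domain classification requested: imgt' \
  '' \
  '# Note that your query represents the minus strand of a V gene.' \
  '' \
  '# V-(D)-J rearrangement summary for query sequence' \
  'IGKV3-20*01 IGKJ2*01 VK Yes In-frame No -' \
  '' \
  'a later data line that must not be printed' > "$tmpdir/with_note.fmt7"

# Same awk program as in the second example: header, first data line, then exit.
for f in "$tmpdir"/*.fmt7
do
    awk '$0~/^# V-\(D\)-J/ {print} ($0!~/^#/ && NF>0) {print;exit}' "$f" > "${f%.*}.parsed.txt"
done

cat "$tmpdir"/*.parsed.txt
```

Each .parsed.txt file should then hold exactly two lines, the # V-(D)-J header followed by its data line, regardless of whether the note is present.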

AdminBee
  • I want to keep the name of the file, this is why I do the ls. – fusion.slope Jun 09 '20 at 07:57
  • The outfile= ... statement in the loop is designed to strip everything after the last ., and then append ".parsed.txt". – AdminBee Jun 09 '20 at 07:58
  • Yes, thanks, I noticed it now. However, the outputs are not what I expect: they are identical to the input and do not return the row that I want. – fusion.slope Jun 09 '20 at 07:59
  • Not working, the output is identical to the input. Is the fmt7 extension used for the input file? – fusion.slope Jun 09 '20 at 08:19
  • You are right, I overlooked that (although that shouldn't produce output identical to input ... strange). You can add an echo "processing $infile" line above the awk line to verify that the correct input filename is used. Also, please check manually if the pure awk statement, when applied to the correct file, produces the desired output. I tried it with your example content and it worked on my end ... – AdminBee Jun 09 '20 at 08:23
  • Thank you very much, now it works. Could you please give me a small explanation of this step: '$0~/^# V-\(D\)-J/ {print} ($0!~/^#/ && NF>0) ? – fusion.slope Jun 09 '20 at 12:07
  • Added, see revised post - hope it is understandable :) – AdminBee Jun 09 '20 at 12:18
  • Yes, it is perfect, and I learned the solution to the problem. – fusion.slope Jun 09 '20 at 13:02
  • If you think the question is clear, well formulated and shows effort, an upvote would be nice @AdminBee – fusion.slope Jun 12 '20 at 11:16
-1

You may set up a for loop and invoke sed to generate the parsed file.

for f in *LC.fa; do
  if=${f%.*}.fmt7
  of=${f%.*}.parsed.txt
  sed -e '
    8N;9q
    6N;/\n./q;d
  ' < "$if" > "$of"
done
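
The sed script above keys on fixed line numbers (6 to 9). A sketch of a variant that keys on the header text instead is shown below, under the assumption that the data line always directly follows the # V-(D)-J header; the helper name extract_summary is hypothetical.

```shell
# extract_summary is a hypothetical helper name; $1 is the .fmt7 file to read.
extract_summary() {
    # -n suppresses automatic printing. On the header line: print it (p),
    # load the next line (n), print that too (p), then stop reading (q).
    # In a basic regular expression the parentheses match literally.
    sed -n '/^# V-(D)-J/{p;n;p;q;}' "$1"
}
```

Inside the loop it could replace the sed call, e.g. extract_summary "$if" > "$of".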