I came across this one-liner script fu for getting rid of newline characters in a fixed width text file. The idea is to change a file full of entries like:

>IGHV1-18*01
CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAG
GTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGC
TGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC
AATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACA
GACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCC
GTGTATTACTGTGCGAGAGA

to

>IGHV1-18*01
CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA

I am not very experienced with AWK, so I figured it would be a good learning experience to try to decipher it. However, I am having difficulties. Specifically, I am confused by the multiple blocks following one another: is the first block an implicit for-loop?

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa
terdon
  • 242,166
posdef
  • 579

2 Answers


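awk reads its input line by line (strictly, record by record; by default a record is a line ending in a newline) and runs the whole program against each line in turn. So the blocks are not loops: the loop over the lines is built into awk itself.

Let's break that code down: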

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'

As you can see in man awk, awk programs take the form /pattern/ { actions }, so the program breaks down into:

  • /^>/ {printf("\n%s\n",$0);next; }

    • for lines that begin with > ( /^>/ )
    • print the line surrounded by \n ( printf("\n%s\n",$0) )
    • skip the remaining rules and move on to the next input line ( next )
  • { printf("%s",$0);}

    • for every line that gets this far ( the pattern is empty, so it matches everything )
    • print the line without a trailing newline ( printf("%s",$0); )
  • END {printf("\n");}

    • after the end of the file (or files) ( END )
    • print a newline ( printf("\n"); )
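As a rough illustration of the breakdown above, here is the same program spread over several lines, one rule per line, and run on a tiny made-up input. The function name join_fasta_lines and the sample records are only there for the example; the awk rules themselves are the ones from the question.

```shell
# Hypothetical wrapper name; the awk program is the one-liner from the question.
join_fasta_lines() {
    awk '
        # Rule 1: the pattern /^>/ selects header lines.
        /^>/ { printf("\n%s\n", $0); next }
        # Rule 2: no pattern, so it runs for every line that reaches it.
        { printf("%s", $0) }
        # Rule 3: END runs once, after the last line has been read.
        END { printf("\n") }
    ' "$@"
}

# Made-up sample: two records, the first wrapped over two lines.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' | join_fasta_lines
```

The output starts with an empty line, followed by >seq1, ACGTTTGA, >seq2 and GGCC, each on its own line. The empty first line comes from the "\n" that rule 1 prints before every header, including the very first one.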
terdon
  • 242,166
Archemar
  • 31,554

Kinda, yes. Only it's not implicit. The format is actually:

/foo/{something}

Which is the same as

if(/foo/){something}

In other words, if the current line matches foo (in your example, if it starts with >, since the pattern is /^>/), then print a newline, the current line and another newline.

The next ensures that if the 1st block is executed, the script skips the rest of the blocks and moves on to the next line. The oneliner could also be written like this:

awk '{
        if(/^>/){
            printf("\n%s\n",$0);
        }
        else{
            printf("%s",$0);
        }
     }
     END {
            printf("\n");
     }' < file.fa

Finally, since the simple print call of awk adds a newline, you could use a slightly simpler version of the above:

awk '/^>/{print "\n"$0;next;}{printf("%s",$0);} END{print ""}' file.fa
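One way to convince yourself of what next contributes is to run the one-liner with and without it on a small made-up input. This is only a sketch; the function names with_next and without_next are invented for the comparison.

```shell
# Hypothetical helper names; both programs are the one-liner from the question,
# the second with the next statement removed.
with_next() {
    awk '/^>/ {printf("\n%s\n",$0); next} {printf("%s",$0)} END {printf("\n")}' "$@"
}
without_next() {
    awk '/^>/ {printf("\n%s\n",$0)} {printf("%s",$0)} END {printf("\n")}' "$@"
}

# Made-up sample record wrapped over two lines.
printf '>id\nAC\nGT\n' | with_next
printf '>id\nAC\nGT\n' | without_next
```

With next, the header is handled by the first block only and the sequence comes out as ACGT on its own line. Without it, the header line also falls through to the second block, so it is printed a second time and glued to the front of the sequence (>idACGT).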
terdon
  • 242,166
  • The implicit part was the second {} block which seemingly corresponds to the else. Thanks for the explanation! As a side question, one shortcoming of this script is that it adds an extra newline in the beginning of the file, which breaks some tools. Would it be possible to avoid that somehow? – posdef Sep 12 '16 at 14:22
  • @posdef ah, I see, but that's not implicit either. It's just that the next in the 1st block ensures that the second is only run if the 1st fails. So yes, it acts like an else but isn't really implicit (there's an explicit next). And yes, that newline was there in the biostars answer. The simplest way to get rid of it would be to pass the output through | tail -n +2. You might also be interested in the scripts in my answer here, by the way. I find the tbl format much more useful than the fake fasta with the whole seq on one line. – terdon Sep 12 '16 at 14:26
  • you mean tail -n +2 i suppose? – posdef Sep 13 '16 at 11:12
  • @posdef whoops, yes, indeed I do. I'll use my magic mod powers to edit my precious comment. Sorry! – terdon Sep 13 '16 at 11:26
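On the extra leading newline raised in the comments: besides piping the output through tail -n +2, the blank line can probably be avoided inside awk itself by printing the separating newline only before the second and later headers. A minimal sketch, assuming the input starts with a header line; the function name unwrap_fasta and the counter seen are made up for illustration.

```shell
# Hypothetical wrapper; assumes the first line of the input is a ">" header.
unwrap_fasta() {
    awk '
        # seen is 0 for the first header, so no newline is printed before it;
        # for later headers the newline ends the previous sequence line.
        /^>/ { if (seen++) printf("\n"); print; next }
        # Sequence lines are printed without a trailing newline, as before.
        { printf("%s", $0) }
        # Terminate the last sequence line, but print nothing for empty input.
        END { if (seen) printf("\n") }
    ' "$@"
}

# Made-up sample: two records, the first wrapped over two lines.
printf '>a\nAC\nGT\n>b\nTT\n' | unwrap_fasta
```

For this sample the output is >a, ACGT, >b and TT, each on its own line, with no empty line at the top.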