
I have 2 files. The first, fileA looks like

TCONS_00000066  XLOC_000030     -       u       q1:XLOC_000030|TCONS_00000066|0|0.000000|0.000000|0.000000|0.000000|-
TCONS_00000130  XLOC_000057     -       u       q1:XLOC_000057|TCONS_00000130|0|0.000000|0.000000|0.000000|0.000000|-
TCONS_00000395  XLOC_000206     -       u       q1:XLOC_000204|TCONS_00000393|0|0.000000|0.000000|0.000000|0.000000|-

FileB looks like:

>TCONS_00000001 gene=XLOC_000001
AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAGATAAACCACACCCACAGGCAC
CACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAG
ATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAG
TTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAAACCCTTGAACCAAGGAGTCCTCGCTG
AGGAAGCTTTGGATCCACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC
TAGGTCGCATGCATCATCAGATTTCAATCTCCCTTCGTTCCCTGTCCCTAATCCAATACCAATAGGGAGC
AATCAGCTGCTCCTCGACGGCGAGGGAGATGTCGTCGGCCGCGGGCCAAGACAACGGAGATACCGCTGGG
GACTACATCAAGTGGATGTGCGGCGCCGGTGGCCGTGCGGGCGGCGCCATGGCCAACCTCCAGCGCGGCG
TTGGCTCCCTCGTCCGTGACATTGGCGACCCCTGCCTCAACCCATCCCCCGTTAAGGGGAGCAAAATGCT
CAAACCGGAAAAATGGCACACATGTTTTGATAATGATGGAAAGGTCATAGGTTTCCGTAAAGCCCTAAAA
TTCATTGTCTTAGGGGGTGTGGATCCCACTATTCGAGCTGAAGTTTGGGAATTTCTTCTTGGCTGCTATG
CCTTGAGTAGTACCTCAGAGTATAGGAGGAAACTAAGAGCTGTTAGAAGGGAAAAATATCAAATTTTAGT
TAGACAGTGCCAGAGCATGCACCCAAGCATTGGTACAGGTGAGCTTGCTTACGCTGTTGGATCAAAGCTA

Now, fileA contains selected transcript numbers in its first column and fileB contains the sequences of all transcripts. I want to search fileB for the entries in the first column of fileA and, for each match, print the transcript number along with the sequence that follows it.

dinesh
  • So just to be clear: if TCONS[^ ]* in fileA is also found at the head of a [GTAC]* block in fileB then that block should be printed - but not otherwise, right? It is a little confusing because, if I understand you correctly, your example data does not produce a single desirable match. – mikeserv Mar 12 '15 at 11:42
  • Yes, you are correct: if a TCONS[^ ]* from fileA is found before any block in fileB, that block is to be printed - all of those into a third file, mentioning the TCONS_* number. Here, in this case, starting from TCONS_00000066 XLOC_000030. – dinesh Mar 12 '15 at 11:47

2 Answers

sed -n 's|^\(TCONS_[0-9]*\) .*|\
         /^>\1 /,/\\n>/P|p' fileA |
sed -e '$!N' -f - -e D fileB >outfile

...that should accomplish what you're after. For every line in fileA that begins with the string TCONS_ plus any digits followed by a space character, the first sed will print a line like:

/^>TCONS_000001 /,/\n>/P

...where the 000001 is whatever the generating line's numeric sequence is.

The second sed is given three scripts - all of which it applies to its named input file - fileB.

  1. The first is $!N which instructs it to append the Next input line to pattern space for every line !but the $last.

  2. The next is stdin (-f -), which the first sed constructs for it as noted.

    • Each line the first sed prints at the second is a 2-address range //,// instructing the second sed to Print up to the first \newline in pattern space for every line falling between the two addresses.
  3. The last script is just D which instructs sed to Delete up to the first occurring \newline in pattern space and try the three scripts again.

The result is that any range the first sed scripts for the second begins matching when the block heading is at the ^head of pattern space - an iteration after it is pulled in with Next and the previous line is Deleted - and stops matching when the next block heading first appears and is still trailing pattern space, as delimited by a \newline character. Because the second sed never Prints further than that delimiter, it slides through fileB on a one-line lookahead - printing all blocks headed by strings found in column 1 of fileA and nothing else.
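
To make that concrete: for the three sample lines of fileA above (assuming a space really does follow the TCONS_ field, as the s||| pattern expects), the effective script handed to the second sed would look something like this - modulo the blank lines and indentation the substitution leaves behind, which sed ignores; the $!N and D come from the -e options on the command line:

$!N
/^>TCONS_00000066 /,/\n>/P
/^>TCONS_00000130 /,/\n>/P
/^>TCONS_00000395 /,/\n>/P
D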


ANOTHER (more complicated/efficient) METHOD


Polished up:

also_fasta()( IFS='
';  get_match_lines(){
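        # one fileA field per line; keep those matching "$1", then find them as
        # fixed strings in "$3", printing matching lines with their line numbers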
        tr   -s  '[:space:]'    \\n  |
        grep "$1" | grep -nFf - "$3"
    }
    get_script(){
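        # rewrite each LINENO-prefixed grep result into a small sed script:
        # a :LINENO label plus the skip and print loops described further below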
        set \\ '\1,$p;n\' '1,\1b\1\' \
               '/^>/!b\1|p' '$c\' q
        sed -n "s|\([^:]*\).*|:\1$*"
    }
    get_match_lines "$@" < "$2"      |
    get_script | sed -nf - "$3"
)

If you define that function in your shell and call it like:

also_fasta 'TCONS_[0-9]*' fileA fileB

...it should do the trick.
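
As with the sed | sed version, the output can be collected in a third file by redirecting it:

also_fasta 'TCONS_[0-9]*' fileA fileB >outfile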

I came up with that after doing some tests. While the first method definitely works, it is also definitely slow for large inputs.

After copying all of the [ACGT]* lines in your example to my clipboard, I created sample data like:

seq -w -s " gene=XLOC_dummy
$(        xsel -bo | sed 's/ *$//')
>TCONS_"  0 512 99999999 >/tmp/temp

The above uses seq to write blocks like:

>TCONS_99993600 gene=XLOC_dummy
AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAGATAAACCACACCCACAGGCAC
CACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAG
ATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAG
TTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAAACCCTTGAACCAAGGAGTCCTCGCTG
AGGAAGCTTTGGATCCACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC
TAGGTCGCATGCATCATCAGATTTCAATCTCCCTTCGTTCCCTGTCCCTAATCCAATACCAATAGGGAGC
AATCAGCTGCTCCTCGACGGCGAGGGAGATGTCGTCGGCCGCGGGCCAAGACAACGGAGATACCGCTGGG
GACTACATCAAGTGGATGTGCGGCGCCGGTGGCCGTGCGGGCGGCGCCATGGCCAACCTCCAGCGCGGCG
TTGGCTCCCTCGTCCGTGACATTGGCGACCCCTGCCTCAACCCATCCCCCGTTAAGGGGAGCAAAATGCT
CAAACCGGAAAAATGGCACACATGTTTTGATAATGATGGAAAGGTCATAGGTTTCCGTAAAGCCCTAAAA
TTCATTGTCTTAGGGGGTGTGGATCCCACTATTCGAGCTGAAGTTTGGGAATTTCTTCTTGGCTGCTATG
CCTTGAGTAGTACCTCAGAGTATAGGAGGAAACTAAGAGCTGTTAGAAGGGAAAAATATCAAATTTTAGT
TAGACAGTGCCAGAGCATGCACCCAAGCATTGGTACAGGTGAGCTTGCTTACGCTGTTGGATCAAAGCTA

... to the file /tmp/temp, where the numeric portion of the >TCONS_[0-9]* string is incremented from 00000512 through 99999999 at an interval of 512 per block. The whole file came to about 185MB, give or take.

I then did:

grep -F 00\  </tmp/temp | cut -d\> -f2 >/tmp/tempA

...for the pattern file (which just narrows the selection to any TCONS_[0-9]*00<space> matches).
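
With the data generated above, the lines of /tmp/tempA should therefore look something like the following - only the headers whose number happens to end in 00, with the leading > cut away:

TCONS_99993600 gene=XLOC_dummy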

The line counts for both files were:

wc -l /tmp/temp*
  2734369 /tmp/temp
     7813 /tmp/tempA
  2742182 total

I didn't run the two seds against input of even that size - the largest I tried with the sed | sed suggestion was a data file a tenth of that size, with a pattern file selected on 2\  instead - but that does come reasonably close.

Anyway, I thought about what was going on and I realized that with a pattern file approaching 8000 regexps, every time a line is Deleted sed has to work its way back through checking all of those regexps that came before - over and over again. This is probably not optimally done.
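
A rough back-of-the-envelope estimate (not a measurement) of why that hurts: D restarts the script once per data line, and every pass tests an address for each of the //,// ranges, so with roughly 7,800 generated ranges and roughly 2.7 million data lines that is on the order of 7,800 × 2,700,000 ≈ 2 × 10^10 regexp checks in the worst case.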

At least in my generated input, though, I did have one thing going for me: it was more or less sequential. Following that thread, I realized it didn't even have to be sequential if I could work by line number rather than by regexp - so I turned to grep.

The command I ran (and the basis for my function above) is:

sed  -n 's|^\(TCONS_[0-9]*\) .*|>\1 |p
'    < /tmp/tempA              |
grep -nFf - /tmp/temp          |
sed  -n 's|\([^:]*\):.*|:\1\
        \1,$p;n;1,\1b\1\
        /^>/!b\1|p;$c\' -e q   |
sed  -nf - /tmp/temp

If you use this you should substitute in fileB for /tmp/temp and fileA for /tmp/tempA.

grep -F works with Fixed strings rather than regexps - and is far faster (especially considering we're working with head-of-line matches) - and the rest basically ignores all of grep's results except the line numbers it returns at the head of each. The sed in the middle massages grep's output in such a way that the sed which winds up processing the data file proper will never backtrack its script even once - it progresses through its script file as it progresses through its input.
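
For reference, the lines grep -nFf - hands to that middle sed look something like this - the line number, a colon, then the matched header (the particular number here is only illustrative):

2732801:>TCONS_99993600 gene=XLOC_dummy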

The sed in the middle writes something like the following at the last sed for every line grep prints:

:2732801                     #define branch label :LINENO
2732801,$p                   #if LINENO thru last line print
n                            #overwrite current line w/ next line
1,2732801b2732801            #if first line thru LINENO branch to :LINENO
/^>/!b2732801                #if !not /^>/ branch to :LINENO

So for every one of grep's matches sed implements two tiny little read loops: the first is a noop in which sed overwrites the current line with the next until it has incremented the current line number to grep's next match. The second is a print loop in which sed continually prints the current line then overwrites it with the next until a line matches /^>/. In this way sed processes its script in lockstep with its infile - never advancing any further in the script than the next matching line number will allow.

This outperforms the other script by several orders of magnitude. It processes the 185MB file like...

( also_fasta 'TCONS_[0-9]*' /tmp/tempA /tmp/temp; ) \
   3.93s user 0.39s system 100% cpu 4.281 total

...where the other handled input 10% of that size in...

( sed -n 's|^\(TCONS_[0-9]*\) .*|/>\1 /,/\\n>/P|p' /tmp/tempA  |sed -e  -f... ) \
108.58s user 0.04s system 100% cpu 1:48.56 total

More specifically, the line counts from the other script's input were:

wc -l /tmp/temp*
  273435 /tmp/temp
     782 /tmp/tempA
  274217 total
mikeserv

If you're going to do this kind of thing often, I suggest you use a couple of tools I have[1] which are very useful for this sort of thing:

  1. FastaToTbl:

    #! /bin/sh
    gawk '{
            # header line: print the sequence name (without the ">") and a tab;
            # start a new output line first for every record after the first
            if (substr($1,1,1)==">") {
                    if (NR>1)
                            printf "\n%s\t", substr($0,2,length($0)-1)
                    else
                            printf "%s\t", substr($0,2,length($0)-1)
            }
            # sequence line: append it to the current output line
            else
                    printf "%s", $0
    }END{printf "\n"}' "$@"
    

    Save that file as FastaToTbl in ~/bin and make it executable with chmod 755 ~/bin/FastaToTbl.

  2. TblToFasta

    #! /bin/sh
    gawk '{
            # the last field is the sequence; everything before it is the header
            sequence = $NF
            ls  = length(sequence)
            is  = 1
            fld = 1

            # print the header fields as ">name ...", ending with a newline
            while (fld < NF) {
                    if (fld == 1) { printf ">" }
                    printf "%s ", $fld
                    if (fld == NF-1) { printf "\n" }
                    fld = fld + 1
            }
            # print the sequence wrapped at 60 characters per line
            while (is <= ls) {
                    printf "%s\n", substr(sequence, is, 60)
                    is = is + 60
            }
    }' "$@"
    

    Save that file as TblToFasta in ~/bin and make it executable with chmod 755 ~/bin/TblToFasta.

Now that you have those two scripts set up, you can use FastaToTbl to change your FASTA sequences to Tbl format (sequence name, TAB, sequence), which is far easier to search through:

FastaToTbl fileB | grep -f <(awk '{print $1}' fileA)  | TblToFasta 

That uses the FastaToTbl script to put the transcript ID and its sequence on the same line. The awk '{print $1}' fileA prints the 1st field of fileA and, because we're using <(), this can be passed as a "file" to grep's -f option, which asks for a file of patterns to search for. TblToFasta then converts the result back to FASTA format.
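
To see why that helps, here is roughly what the intermediate Tbl line for the first record of fileB looks like (<TAB> stands for a literal tab character, and the sequence is truncated here):

TCONS_00000001 gene=XLOC_000001<TAB>AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAG...

Each record is a single line, so a pattern from fileA selects the whole transcript at once, and TblToFasta can then unfold it back into a header plus wrapped sequence lines.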


Finally, you might want to check out my retrieveseqs.pl script, which can do this and many other things. For example, in this case, you could have just done:

retrieveseqs.pl -f fileB fileA

[1] Thanks to J.F. Abril, who originally wrote these two scripts.

terdon
  • Nothing is working for me :(. I'm just getting fileB with all sequences again and again. – dinesh Mar 13 '15 at 04:50
  • @dinesh sorry, the awk script was wrong, my bad. I replaced it with a different approach. However, if you work with sequences often, I recommend you use one of the other two, more sophisticated solutions. – terdon Mar 13 '15 at 13:03
  • I think you can make each of those scripts just a little more efficient by dropping sh altogether. They should work, I think, with #!/path/to/gawk bang-lines - in which case you can drop everything outside the '' layer. If you do that the kernel will just pass "$@" to gawk automatically. If the /bin/gawk pathing is uncertain you can use /bin/env - gawk instead - which is a little more likely to resolve on any machine and is very likely still way cheaper than getting a whole shell to call it up. Another advantage is that you'll free up single-quotes to use in-script. – mikeserv Mar 13 '15 at 22:29
  • @terdon ...where is the perl script you wrote? It worked like a miracle! Thank you for that. – dinesh Mar 14 '15 at 04:02
  • @mikeserv....thank you for the follow through. i will try it – dinesh Mar 14 '15 at 04:02
  • @dinesh you're welcome but if one of these answers solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites. – terdon Mar 14 '15 at 12:48