sed -n 's|^\(TCONS_[0-9]*\) .*|\
/^>\1 /,/\\n>/P|p' fileA |
sed -e '$!N' -f - -e D fileB >outfile
...that should accomplish what you're after. For every line in fileA that begins with the string TCONS_
and any digits followed by a space character the first sed
will print a line like:
/^>TCONS_000001 /,/\n>/P
...where the 000001
is whatever the generating line's numeric sequence is.
The second sed
is given three scripts - all of which it applies to its named input file - fileB.
The first is $!N
which instructs it to append the N
ext input line to pattern space for every line !
but the $
last.
The next is stdin - -f -
- which the first sed
constructs for it as noted.
- Each line the first
sed
prints at the second is a 2-address range//,//
instructing the second sed
to P
rint up to the first \n
ewline in pattern space for every line falling between the two addresses.
The last script is just D
which instructs sed
to D
elete up to the first occurring \n
ewline in pattern space and try the three scripts again.
The result is that a match for any range the first sed
scripts for the second begins when the block heading is at the ^
head of pattern space - an iteration after it is pulled in w/ N
ext and the previous line is D
eleted - and ends when the next block heading first appears and is still trailing pattern space as delimited by a \n
ewline character. Because the second sed
never P
rints further than that delimiter, sed
slides through fileB on a one-line lookahead - printing all blocks headed by strings found in column 1 of fileA and nothing else.
ANOTHER (more complicated/efficient) METHOD
Polished up:
also_fasta()( IFS='
'; get_match_lines(){
tr -s '[:space:]' \\n |
grep "$1" | grep -nFf - "$3"
}
get_script(){
set \\ '\1,$p;n\' '1,\1b\1\' \
'/^>/!b\1|p' '$c\' q
sed -n "s|\([^:]*\).*|:\1$*"
}
get_match_lines "$@" < "$2" |
get_script | sed -nf - "$3"
)
If you define that function in your shell and call it like:
also_fasta 'TCONS_[0-9]*' fileA fileB
...it should do the trick.
I came up with that after doing some tests. While the first definitely works, it is also definitely slow for large inputs.
After copying all of the [ACGT]*
lines in your example to my clipboard, I created sample data like:
seq -w -s " gene=XLOC_dummy
$( xsel -bo | sed 's/ *$//')
>TCONS_" 0 512 99999999 >/tmp/temp
The above uses seq
to write blocks like:
>TCONS_99993600 gene=XLOC_dummy
AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAGATAAACCACACCCACAGGCAC
CACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAG
ATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAG
TTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAAACCCTTGAACCAAGGAGTCCTCGCTG
AGGAAGCTTTGGATCCACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC
TAGGTCGCATGCATCATCAGATTTCAATCTCCCTTCGTTCCCTGTCCCTAATCCAATACCAATAGGGAGC
AATCAGCTGCTCCTCGACGGCGAGGGAGATGTCGTCGGCCGCGGGCCAAGACAACGGAGATACCGCTGGG
GACTACATCAAGTGGATGTGCGGCGCCGGTGGCCGTGCGGGCGGCGCCATGGCCAACCTCCAGCGCGGCG
TTGGCTCCCTCGTCCGTGACATTGGCGACCCCTGCCTCAACCCATCCCCCGTTAAGGGGAGCAAAATGCT
CAAACCGGAAAAATGGCACACATGTTTTGATAATGATGGAAAGGTCATAGGTTTCCGTAAAGCCCTAAAA
TTCATTGTCTTAGGGGGTGTGGATCCCACTATTCGAGCTGAAGTTTGGGAATTTCTTCTTGGCTGCTATG
CCTTGAGTAGTACCTCAGAGTATAGGAGGAAACTAAGAGCTGTTAGAAGGGAAAAATATCAAATTTTAGT
TAGACAGTGCCAGAGCATGCACCCAAGCATTGGTACAGGTGAGCTTGCTTACGCTGTTGGATCAAAGCTA
... to the file /tmp/temp
where the numeric portion of the >TCONS_[0-9]*
string is incremented from 00000512
through 99999999
at an interval of 512 per block. The whole file came to about 185mbs give or take.
I then did:
grep -F 00\ </tmp/temp | cut -d\> -f2 >/tmp/tempA
...for the pattern file (which just narrows the selection to any TCONS_[0-9]*00<space>
matches).
The line counts for both files were:
wc -l /tmp/temp*
2734369 /tmp/temp
7813 /tmp/tempA
2742182 total
While I didn't run the two sed
s against input of even that size - the largest I tried with the sed | sed
suggestion was a data file a tithe of that size - and a pattern file selected on 2\
- which does come close to this, though.
Anyway, I thought about what was going on and I realized that with a pattern file approaching 8000 regexps then every time a line is D
eleted it's got to work it's way back through checking all of those regexps that came before - over and over again. This is probably not optimally done.
At least in my generated input though I did have one thing going for me - it was more or less sequential. Following that thread I realized it didn't even have to be sequential if I could work by line number rather than regexp - so I turned to grep
.
The command I ran (and the basis for my function above) is:
sed -n 's|^\(TCONS_[0-9]*\) .*|>\1 |p
' < /tmp/tempA |
grep -nFf - /tmp/temp |
sed -n 's|\([^:]*\):.*|:\1\
\1,$p;n;1,\1b\1\
/^>/!b\1|p;$c\' -e q |
sed -nf - /tmp/temp
If you use this you should substitute in fileB
for /tmp/temp
and fileA
for /tmp/tempA
.
grep -F
works with Fixed strings rather than regexp - and is far faster (especially considering we're working w/ head-of-line matches) - and the rest basically ignores all of grep
's results excepting the line numbers it returns at the head of each. The sed
in the middle there massages grep
's output in such a way that the sed
which winds up processing the data file proper will never backtrack its script even once - it will progress through its script file as it progresses through its input.
The sed
in the middle writes something like the following at the last sed
for every line grep
prints:
:2732801 #define branch label :LINENO
2732801,$p #if LINENO thru last line print
n #overwrite current line w/ next line
1,2732801b2732801 #if first line thru LINENO branch to :LINENO
/^>/!b2732801 #if !not /^>/ branch to :LINENO
So for every one of grep
's matches sed
implements two tiny little read loops: the first is a noop in which sed
overwrites the current line with the next until it has incremented the current line number to grep
's next match. The second is a print loop in which sed
continually prints the current line then overwrites it with the next until a line matches /^>/
. In this way sed
processes its script in lockstep with its infile - never advancing any further in the script than the next matching line number will allow.
This outperforms the other script by several orders of magnitude. It processes the 185mbs like...
( also_fasta 'TCONS_[0-9]*' /tmp/tempA /tmp/temp; ) \
3.93s user 0.39s system 100% cpu 4.281 total
...where the other handled input 10% of that size in...
( sed -n 's|^\(TCONS_[0-9]*\) .*|/>\1 /,/\\n>/P|p' /tmp/tempA |sed -e -f... ) \
108.58s user 0.04s system 100% cpu 1:48.56 total
More specifically, the line counts from the other script's input were:
wc -l /tmp/temp*
273435 /tmp/temp
782 /tmp/tempA
274217 total
TCONS[^ ]*
in file a is also found at the head of a[GTAC]*
block in b then that block should be printed - but not otherwise, right? It is a little confusing because if i understand you correctly then your example data does not produce a single desirable match. – mikeserv Mar 12 '15 at 11:42