match two files and filter out

Question

I have two files

file A

>TCONS_00000075 gene=XLOC_000030
CCGCCGGCTGCTGCGCGCACCGACTTGTCACCACCCCAGCACGTCCTCCACGTATACAAGCGCTACGGTC
CACCGCGGCAGCGTCGACGTCCTTGTCCGCAAACATGGTGGTGGCAGCTTCCTCATCGAGCAGCAGCAAC
TCATCCTCGAGGGGAAGGGCCCAGAGCTTCTAATCCTACACGGCAACAACACTTTATACTTGTGTATAAT
>TCONS_00013830 gene=XLOC_006942
AAACACGGTTAGCTTGATATCACTGATGATCGATGGGATAGAGTCAGAGAACATCTTGTTCCTTAATTAT
CTCAATTCGTGAGATGTTGGACGATATCTCGATAGGGAGAGAAGGCGTTGTTCTGGATCATCACCGTGCT
CAGGGGTCAATTTTACACTGAGCAGGGGCAAAGACGTAAATTTTTACTTCCTTACTTGAGTAAGAGCAAG
TTTAATACTACAACCAACTACTACAAACTCCAATTCATTTATAACCAATCTAATAACTTATTCATACAAT
AGTTACCTATAAGCATATACTACACACACAACGTATTGGAATCCTCCGTGCTGCTGCTGGCTACAGATCT

file B

XLOC_000030
XLOC_000059
XLOC_000210

FileA is a FASTA sequence file. Each line starting with > is a sequence name and the lines beneath it are the sequence. I want to extract the sequences of those IDs mentioned in FileB. In this case:

file C

>TCONS_00000075 gene=XLOC_000030
CCGCCGGCTGCTGCGCGCACCGACTTGTCACCACCCCAGCACGTCCTCCACGTATACAAGCGCTACGGTC
CACCGCGGCAGCGTCGACGTCCTTGTCCGCAAACATGGTGGTGGCAGCTTCCTCATCGAGCAGCAGCAAC
TCATCCTCGAGGGGAAGGGCCCAGAGCTTCTAATCCTACACGGCAACAACACTTTATACTTGTGTATAAT

I tried this command:

perl -pe 's/\n//; s/>(.*)/\n>$1\t/' A |grep -f <(awk '{print $1}' B) |sed 's/\t/\n/' | fold -w 60 > C

but it's not working.

"It's not working" What is not working? That is not a a perl command, it is a command sequence. Which programs of the four that you run does not do what you want? Make a minimal example of what goes wrong by throwing away 3 of the four steps in your commandline, and provide the input and required output and actual for that step only. — Anthon, Apr 10 '15 at 05:31
@rahul questions on another site are not duplicates. In any case, that question is irrelevant, this is completely different. — terdon, Apr 10 '15 at 12:49
@user106326 I just tried your command and it returns the first sequence of fileA, exactly as you want. How is it failing? — terdon, Apr 10 '15 at 12:50
My answer to the duplicate question (the one with TblToFasta and FastaToTbl) will work for you. — terdon, Apr 10 '15 at 13:00

Trockton · Answer 1 · 2015-04-14T13:06:59.050

0

#!/bin/bash
while read line
do
        grep -A 1 $line filea >> filec
done < fileb

edited Apr 14 '15 at 13:06

answered Apr 10 '15 at 09:18

Trockton

1

The output is >TCONS_00000075 gene=XLOC_000030 and that's simply wrong. – A.B. Apr 10 '15 at 12:59
I wrongly assumed the patterns to match were on the same line as the required content, adding adding -A 1 flag to grep will fix this. – Trockton Apr 14 '15 at 13:03

score 0 · Answer 2 · edited Jun 11 '20 at 12:04

Tip: Use database tools for database work.

If you end up spending all of your time just working out the mechanics of looking things up as opposed to doing the lookups themselves, and your commands to look things up are long combinations of perl, sed, awk, and grep that start to resemble modem line noise, then it is time to consider using actual database tools for your database instead of cobbling bespoke queries together with text-processing tools.

select * from RNA where gene in ('XLOC_000030','XLOC_000059','XLOC_000210');

is a lot simpler to work with.

That said, have some text processing modem line noise.

Don't duplicate work.

You are transforming your database on the fly from a multiple-line-per-record form into a one-line-per-record form. Do that once, not every query. Have a Makefile that says:

A.flat: A
        sed -e '/^>/s/$$/ /;:a;$$!N;s/\n //;ta;s/^>//;P;D' $^ > $@.tmp
        mv $@.tmp $@

Then just run make every time that A could have changed. (Don't forget that make relies on that whitespace at the start of those lines being a TAB character.)

That sed program is:

# Append a space to the first line in the record.
/^>/s/$$/ /
:a
# Join if not EOF
$!N
# Eliminate newline and space if the joined line began with a space.
s/\n //
# Loop around if so.
ta
# Eliminate the initial start of record character, as that is now the line break.
s/^>//
# Print and remove the whole record from the buffer.  Then start again.
P
D

Querying

You've over-complicated your query command, which is actually just

grep -f B A.flat > C.flat

Converting the query output back to multiple-line records is more line noise:

sed -e 's/^/>/' C.flat|fold -s -w70|sed -e 's/^[^>]/ /' > C

Omitting the intermediate files thus gives:

grep -f B A.flat|sed -e 's/^/>/'|fold -s -w70|sed -e 's/^[^>]/ /' > C

And again …

To repeat: Actual database tools would be much better. If you have a large number of records, which seems likely given what you appear (from this and other questions) to be doing, searches, insertions, and deletions are going to be very inefficient using the text-processing tools method. An actual database could be indexed on the gene field, in contrast.

match two files and filter out

2 Answers2

Tip: Use database tools for database work.

Don't duplicate work.

Querying

And again …