
I have several files in a directory: file1.txt file2.txt file3.txt

For each file in the directory, I want to print only the unique columns. Column 1 will match either column 3 or column 4; I want to print the columns that are unique and save the result as f_parsed.txt

file1.txt:

gene1   description1    gene1   gene88 
gene56  description2    gene67  gene56
gene6   description3    gene95  gene6

file1_parsed.txt:

gene1   description1    gene88  
gene56  description2    gene67  
gene6   description3    gene95  

Here is the code that I have so far:

for f in *.txt ; do
    awk '{ if ($1 == $3) print $1, $2, $4; else print $1, $2, $3 }' "$f" > "${f%.txt}_parsed.txt"
done

Then, for each parsed file, I want to take the gene in column 3 of f_parsed.txt, look it up in file_B.fasta, and return the matching line plus the following line. All lines with a match should be saved as match1.txt (the next file will become match2.txt).

file_B.fasta looks like this:

>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene6 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene4 | jeiai | dhdhd
GTCAGTTTTTA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...
...
while read -r col1 col2 col3; do
    grep -A1 "$col3" file_B.fasta
done < f_parsed.txt > match1.txt

My final output for the example file I began with should be called match1.txt and look like

>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...

Thanks in advance! I know the code is very rough, but I'm a beginner.

  • Also, please remember that this site is not full of bioinformaticians so you will need to explain things like "fasta". That said, I am a bioinformatician and have been working with fasta files for almost 20 years now, and I don't understand what a "linearized fasta file" means. – terdon Mar 04 '21 at 18:32
  • Related: https://unix.stackexchange.com/q/169716/135943 – Wildcard Mar 04 '21 at 18:51
  • Thanks @Wildcard – lepomis8 Mar 04 '21 at 18:53
  • Thanks for the edit, that helps a lot! One thing though, fasta allows for multiple sequence lines under the same identifier. Your examples are all on a single line (I guess that's what you meant by linearized). Does this mean we can assume that all of your sequence entries in your fasta will have exactly one line of sequence? – terdon Mar 04 '21 at 18:58
  • @terdon no problem! Your suggestions made it possible. Yes, I've got a different script which put all of the sequences on one line for all entries. – lepomis8 Mar 04 '21 at 19:01
  • You might also be interested in the two scripts I've posted here. Those two are my workhorses for simple fasta searching. – terdon Mar 04 '21 at 19:02

2 Answers


One approach is as follows. We read the fasta file first and build up an array keyed on the gene name. The value corresponding to each key is the header line and the next (sequence) line, newline-separated.

Output is saved in match*.txt files.

awk -F '|' '
  # at the beginning of each file, work out which type it is
  FNR == 1 { inCsv = !(inFasta = (FS == "|")) }

  # fasta header: get the gene name and record the next line number
  inFasta && /^>/ {
    t = $0
    gene = $1
    gsub(/^>|[[:space:]]*$/, "", gene)
    nxtln = NR + 1
  }

  # fasta sequence line: fill up the value for the current gene
  inFasta && NR == nxtln { a[gene] = t ORS $0 }

  # CSV file: on its first line, close the previously open filehandle
  # and open a fresh one (match*.txt); then write the stored record for
  # whichever of field 3 / field 4 differs from field 1
  inCsv && NF > 3 {
    if (FNR == 1) {
      close(outf)
      outf = "match" ++k ".txt"
    }
    print a[$($1 == $3 ? 4 : 3)] > outf
  }
' file_B.fasta FS=, file*.txt

$ cat match1.txt
>gene88 | shahid | ahifehhuh
TAGTCTTTCAAAAGA...
>gene67 | vdiic | behej
GTCAGTTTTTA...
>gene95 | siis | ahifehhniniuh
TAGTCTTTCAAAAGA...
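The `FS=,` between the two file names is worth noting: awk evaluates variable assignments in its argument list at the point it reaches them, so the fasta file is split on `|` and the files that follow are split on `,`. A minimal sketch of the idea (the demo file names here are made up for illustration):

```shell
# Create two throwaway files with different delimiters.
printf 'a|b|c\n' > demo_pipe.txt
printf 'x,y,z\n' > demo_comma.txt

# The assignment FS=, takes effect before demo_comma.txt is read,
# so each file is split with its own separator.
awk -F '|' '{ print FILENAME, NF }' demo_pipe.txt FS=, demo_comma.txt
# prints:
#   demo_pipe.txt 3
#   demo_comma.txt 3
```

This is why a single awk invocation can handle both the fasta file and the comma-separated files in one pass.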
awk '{ if ($1 == $3) print $1, $2, $NF; else if ($1 == $NF) print $1, $2, $3 }' filename

output

gene1 description1 gene88
gene56 description2 gene67
gene6 description3 gene95
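The second part of the question could be handled with `grep -A1`, which prints each matching line plus the line after it. This is a hedged sketch, not a tested solution for the real data: it assumes the parsed files end in `_parsed.txt`, that each fasta record has exactly one sequence line (as the OP confirmed in the comments), and it recreates the question's sample data so it runs standalone:

```shell
# Sample data from the question, so the sketch runs on its own.
printf '%s\n' \
  'gene1 description1 gene88' \
  'gene56 description2 gene67' \
  'gene6 description3 gene95' > file1_parsed.txt
printf '%s\n' \
  '>gene88 | shahid | ahifehhuh'   'TAGTCTTTCAAAAGA...' \
  '>gene6 | shahid | ahifehhuh'    'TAGTCTTTCAAAAGA...' \
  '>gene67 | vdiic | behej'        'GTCAGTTTTTA...' \
  '>gene95 | siis | ahifehhniniuh' 'TAGTCTTTCAAAAGA...' > file_B.fasta

n=1
for f in *_parsed.txt; do
    # Column 3 of each parsed line holds the gene to look up.
    while read -r c1 c2 c3; do
        # -F matches a fixed string; the trailing space stops ">gene6"
        # from also matching ">gene67".
        grep -A1 -F ">$c3 " file_B.fasta
    done < "$f" > "match$n.txt"
    n=$((n + 1))
done

cat match1.txt
# match1.txt now holds the three header+sequence pairs
# for gene88, gene67 and gene95
```

Because each gene is looked up with a separate `grep` call, no `--` group separators appear in the output.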
  • Please note that this answers only the first part of the OP's question; you may want to expand it to solve the other part, too. – AdminBee Mar 05 '21 at 08:29