I've been trying to merge two big files based on two key columns (Chromosome and Position) and I found out that the most efficient way seems to be awk
.
A sample of how my files look like is:
file1.txt
Gene_ID Chromosome Position Fst
ENSG00000141424 18 33688658 0
ENSG00000141424 18 33688669 0
ENSG00000141424 18 33688681 0
ENSG00000141424 18 33688683 0.0111734
ENSG00000141424 18 33688720 0
ENSG00000141424 18 33688726 0
ENSG00000141424 18 33688743 0
ENSG00000141424 18 33688745 0
ENSG00000141424 18 33688763 0
And the other file:
file2.txt
Chromosome Start End Ref Alt RS_ID
1 10019 10020 TA T rs775809821
1 10020 10020 A - rs775809821
1 10055 10055 - A rs768019142
1 10055 10055 T TA rs768019142
1 10108 10108 C T rs62651026
1 10109 10109 A T rs376007522
1 10128 10128 A AC rs796688738
1 10128 10128 - C rs796688738
1 10139 10139 A T rs368469931
1 10144 10145 TA T rs144773400
I want to get a third file looking like this:
Gene_ID Chromosome Position RS_ID Fst
ENSG00000141424 18 33688658 rs1504554... 0
I've tried using awk
and I think the syntax is OK but what I get is a file containing file1.txt
and file2.txt
concatenated.
awk 'FS=" "; OFS=" ";NR=FNR{A[$1,$2]=$6;next}{$5=A[$2,$3];print}' file1.txt file2.txt > file3.txt
Any ideas of what I might be doing wrong?