Comparing two tsv files

Question

I am trying to compare two tsv files. The file to be queried (file1) looks like:

Chr      Start      End
chr1    234738546   234738934
chr1    234792654   234793537
chr1    234908151   234908864
chr1    235097868   235098170
chr1    236080566   236081347
chr1    240307621   240308262
chr1    240308207   240308637
chr1    240308546   240308962
chr1    242627058   242627262
chr1    243923195   243923709

The second column of another file (file2) contains numbers that I wish to check, if lying, between the numbers in column 2 and 3 in the and repeat it until condition is satisfied.

eg: 242627060 lies between 242627058 & 242627262

File2 looks like:

Chr    Centre_Coord Ignore_this_col   Secondary Information
chr1    234765055   234765056   NR_033927_LINC00184     .   +
chr1    234782033   234782034   NR_125944_LOC101927787  .   +
chr1    234859787   234859788   NR_038856_LINC01132     .   +
chr1    234895802   234895803   NR_148962_PP2672        .   -
chr1    235099745   235099746   NR_125945_LOC101927851  .   -
chr1    235324564   235324565   NR_144491_RBM34         .   -
chr1    235097888   235291252   NR_002956_SNORA14B      .   -
chr1    235097869   235353431   NR_039908_MIR4753       .   -
chr1    235324564   235324565   NR_027762_RBM34         .   -
chr1    235324564   235324565   NM_001346738_RBM34      .   -

and gives me output as follows:

chr1:242627058-242627262,  242627060

where the - separated coordinates are from file1 and the comma separated from second column of file2.

I have already tried using awk and while loop but for some reason I couldn't do it.

while read a b c; do col2=$b; col3=$3; tail -n +1 path/to/file2 | awk 'BEGIN{OFS="\t"}{if($2>=$col2 && $2<=$col3) {print $a,$col2,$col3,$2}; break; else continue}' > rohit_TSS.txt; done < file1

Do you only have data on chromosome 1 (chr1), or are there other chromosomes in the files (other values in the first column)? What are the actual size of these files (data files used in bioinformatics can be quite big)? — Kusalananda, Nov 25 '19 at 14:39
There are other chromosome too. The file sizes are not greater than 5 MBs — Rohit Satyam, Nov 26 '19 at 06:29

bu5hman · Answer 1 · 2019-11-26T16:42:31.797

Probably easier to do this in 2 steps.

Throw everything into a helper file and sort.

awk 'FNR>1{print $1, $2, $3, $4 }' file1 file2 | sort -k1 >> file3

Then just run through them with awk in a single pass.

awk '{if (NF == 3) {chr=$1; lo=$2; hi=$3} else { if ($1==chr && $2>=lo && $2<=hi) print $1":"lo"-"hi", "$2}}' file3

A walk through the awk...... You know which lines in file3 came from file1 because they only have 3 fields, file2 has more so....

if (NF == 3) {chr=$1; lo=$2; hi=$3}

is a test that is true when you are on a line (in file3) that came from file1. Every time you find a line from file1 then you want to get the lo and hi values as well as the current chromosome

else

Otherwise we are on a line from file2 so just.....

 if ($1==chr && $2>=lo && $2<=hi) print $1":"lo"-"hi", "$2}

And if we are in the same chromosome and the value of interest $2 is between the lo and hi limits we remember from before then we print in your format.

Output was

chr1:235097868-235098170, 235097869
chr1:235097868-235098170, 235097888

note

In fact you can forget the first awk and just

cat file1 file2 | sort > file3

And since it sorts on the entire line it should be chr agnostic.

hi @bu5hman. Can you help me understand what will (NF == 3) {chr=$1; lo=$2; hi=$3} do? Because when I print it, it gives me awk: program limit exceeded: maximum number of fields size=32767 FILENAME="file3" FNR=1 NR=1. The command used was awk '{if (NF == 3) {chr=$1; lo=$2; hi=$3} print $chr, $lo, $hi}' file3 > filex — Rohit Satyam, Nov 26 '19 at 05:46
@RohitSatyam I don't think that the code that you quote appears at all in the answer. — Kusalananda, Nov 26 '19 at 09:29
The code you quote awk '{if (NF == 3) {chr=$1; lo=$2; hi=$3} print $chr, $lo, $hi}' file3 runs fine on my machine. Have included a bit of an explanation above, Hope it helps. — bu5hman, Nov 26 '19 at 15:38

score -2 · Answer 2 · answered Nov 25 '19 at 15:31

-2

for C in `cat file2 |awk -F" " '{ print $2 }' ` ; do 
   echo "Checking $C .." ;
   cat file1 | awk -v var=$C -F" " '{ if ( var  >=$2 && var <=$3 ) print $1":"$2"-"$3", "var  ;  }'; 
done

Later u can remove echo "Checking $C .." ;

Checking 234765055 ..
Checking 234782033 ..
Checking 234859787 ..
Checking 234895802 ..
Checking 235099745 ..
Checking 235324564 ..
Checking 235097888 ..
chr1:235097868-235098170, 235097888
Checking 235097869 ..
chr1:235097868-235098170, 235097869
Checking 235324564 ..
Checking 235324564 ..

answered Nov 25 '19 at 15:31

vkkindia

11

1

Note that this will hurt if file2 is several gigabytes big. You are also reading the entirety of file1 for each line in file2. This is very very slow. Depending on the sizes of the files (which we know nothing about), you may want to read some of the data into an array in awk and then read the other file in the same awk program, or you may have to resort to reading lines from each file in an alternating fashion if both are very big (this would require that both files are sorted). – Kusalananda Nov 25 '19 at 15:46
There's also http://porkmail.org/era/unix/award.html, https://mywiki.wooledge.org/Quotes, http://mywiki.wooledge.org/BashFAQ/082, and https://unix.stackexchange.com/q/321697/133219 issues plus redundantly setting FS to its default value and putting a condition inside the action part of the awk script instead of in the condition part and other issues. – Ed Morton Nov 25 '19 at 17:14

Comparing two tsv files

2 Answers2