I am trying to compare two tsv files. The file to be queried (file1) looks like:
Chr Start End
chr1 234738546 234738934
chr1 234792654 234793537
chr1 234908151 234908864
chr1 235097868 235098170
chr1 236080566 236081347
chr1 240307621 240308262
chr1 240308207 240308637
chr1 240308546 240308962
chr1 242627058 242627262
chr1 243923195 243923709
The second column of another file (file2) contains numbers. For each of them, I want to check whether it lies between the numbers in columns 2 and 3 of a row in file1, repeating the check row by row until the condition is satisfied.
For example, 242627060 lies between 242627058 and 242627262.
File2 looks like:
Chr Centre_Coord Ignore_this_col Secondary Information
chr1 234765055 234765056 NR_033927_LINC00184 . +
chr1 234782033 234782034 NR_125944_LOC101927787 . +
chr1 234859787 234859788 NR_038856_LINC01132 . +
chr1 234895802 234895803 NR_148962_PP2672 . -
chr1 235099745 235099746 NR_125945_LOC101927851 . -
chr1 235324564 235324565 NR_144491_RBM34 . -
chr1 235097888 235291252 NR_002956_SNORA14B . -
chr1 235097869 235353431 NR_039908_MIR4753 . -
chr1 235324564 235324565 NR_027762_RBM34 . -
chr1 235324564 235324565 NM_001346738_RBM34 . -
The result should give me output as follows:
chr1:242627058-242627262, 242627060
where the hyphen-separated coordinates are from file1 and the value after the comma is from the second column of file2.
I have already tried using awk and a while loop, but for some reason I couldn't get it to work:
while read a b c; do col2=$b; col3=$3; tail -n +1 path/to/file2 | awk 'BEGIN{OFS="\t"}{if($2>=$col2 && $2<=$col3) {print $a,$col2,$col3,$2}; break; else continue}' > rohit_TSS.txt; done < file1
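For comparison, here is a minimal sketch of how the check described above might be done in a single awk call, without the shell loop. It assumes both files have one header line, that fields are whitespace- or tab-separated, and that matches should also agree on the chromosome in column 1. The file names sample_file1.tsv, sample_file2.tsv and overlaps.txt are hypothetical; the sample rows are taken from the data shown above.

```shell
# Small sample files built from rows shown in the question (names are hypothetical).
printf 'Chr\tStart\tEnd\n' > sample_file1.tsv
printf 'chr1\t234908151\t234908864\n' >> sample_file1.tsv
printf 'chr1\t235097868\t235098170\n' >> sample_file1.tsv
printf 'chr1\t236080566\t236081347\n' >> sample_file1.tsv

printf 'Chr\tCentre_Coord\tIgnore_this_col\tSecondary\tInformation\n' > sample_file2.tsv
printf 'chr1\t235099745\t235099746\tNR_125945_LOC101927851\t.\t-\n' >> sample_file2.tsv
printf 'chr1\t235097888\t235291252\tNR_002956_SNORA14B\t.\t-\n' >> sample_file2.tsv
printf 'chr1\t235097869\t235353431\tNR_039908_MIR4753\t.\t-\n' >> sample_file2.tsv

awk '
    # Skip the header line of each input file.
    FNR == 1 { next }

    # First file: remember every interval (chromosome, start, end).
    NR == FNR { chr[++n] = $1; lo[n] = $2; hi[n] = $3; next }

    # Second file: compare column 2 against every stored interval.
    {
        pos = $2 + 0    # force a numeric comparison
        for (i = 1; i <= n; i++)
            if ($1 == chr[i] && pos >= lo[i] + 0 && pos <= hi[i] + 0)
                printf "%s:%s-%s, %s\n", chr[i], lo[i], hi[i], $2
    }
' sample_file1.tsv sample_file2.tsv > overlaps.txt

cat overlaps.txt
```

This sketch scans every stored interval for every row of the second file, which is simple but may be slow on large inputs; a dedicated interval tool could be a better fit in that case.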
chr1), or are there other chromosomes in the files (other values in the first column)? What are the actual sizes of these files (data files used in bioinformatics can be quite big)? – Kusalananda Nov 25 '19 at 14:39