
I have a large file that needs to be filtered by the first field (which is never repeated). Example:

NC_056429.1_398 2   3   0.333333    0.333333    0.333333    0.941178
NC_056429.1_1199    2   0   0.333333    0.333333    0.333333    0.941178
NC_056442.1_7754500 0   3   0.800003    0.199997    0.000000    0.000001
NC_056442.1_7754657 1   2   0.000000    0.199997    0.800003    0.888891
NC_056442.1_7754711 2   0   0.888891    0.111109    0.000000    0.800002
NC_056442.1_7982565 0   1   0.800003    0.199997    0.000000    0.666580
NC_056442.1_7982610 1   0   0.800003    0.199997    0.000000    0.000000
NC_056442.1_7985311 2   0   0.888891    0.111109    0.000000    0.000000

I am trying to use awk in a shell script to filter the file by its first column, and I need to pass the bounds as variables because the filtering happens inside a while loop. The while loop reads a text file such as:

NC_056442.1 7870000    # 1st field = $chrname, 2nd field = $pos
NC_056443.1 1570000 

Earlier in the script, I do a calculation with $pos to get $startpos and $endpos, as shown below:

chrname="NC_056442.1" # column 1 in pulled file
startpos=7754657 # calculated in prior script
endpos=7982610 # calculated in prior script
start=${chrname}_${startpos} # this was an attempt to simplify the awk command
end=${chrname}_${endpos} 
awk -v s="$start" -v e="$end" '/s/,/e/' file.txt > cut_file.txt 

If I manually type in the values, as below, I get exactly the lines I want, from the start marker through the end marker:

awk '/NC_056442.1_7754657/,/NC_056442.1_7982610/' file.txt > cut_file.txt

Output File

NC_056442.1_7754657 1   2   0.000000    0.199997    0.800003    0.888891
NC_056442.1_7754711 2   0   0.888891    0.111109    0.000000    0.800002
NC_056442.1_7982565 0   1   0.800003    0.199997    0.000000    0.666580
NC_056442.1_7982610 1   0   0.800003    0.199997    0.000000    0.000000

I am struggling because I do not know how to get the s and e variables to actually be used in the match. I have tried a variety of options, including ENVIRON[]. As someone relatively new to bash (and this being my first post here), I do not know how to troubleshoot this further. I am open to answers outside of awk. Please let me know if I need to rephrase my question or add more information.

1 Answer


Don't try to do this by matching regular expressions. Instead, use _ or space as awk's field separator so you have the chromosome and positions in easy-to-use fields:

start=1234567
end=7654321
awk -v s="$start" -v e="$end" -F '[ _]' '$3 >= s && $3 <= e' file.txt > cut_file.txt 
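
With that field separator, a line like NC_056442.1_7754657 1 2 ... splits into $1=NC, $2=056442.1 and $3=7754657, so the position ends up in $3 and the test is a numeric comparison instead of a regex match. This assumes the columns are separated by single spaces as shown; if the real file is tab-separated, add \t to the separator class, e.g. -F '[\t _]'.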

Also, avoid using CAPS for variable names in shell scripts. By convention, global environment variables are capitalized, so if you use capital letters for your own variables, that can lead to naming collisions and hard-to-find bugs.
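
For instance (a hypothetical snippet, not from your script), reusing an environment variable name can break everything that follows:

PATH=7754657          # clobbers the shell's command search path
sort cut_file.txt     # sort (and awk, etc.) can no longer be found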


Now, you haven't shown us the loop you are using. Whatever it is, though, you would be better off looping in awk itself instead of in the shell. Shell loops are slow.
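
For example, here is a rough sketch of doing the whole job in a single awk pass. It is not your actual pipeline: the pos ± 150000 window below is only a placeholder (your real start/end calculation isn't shown), regions.txt stands in for whatever file your while loop reads, and the output naming is just one option:

awk -F '[ \t_]+' '
    NR == FNR {                  # first file: regions.txt (chromosome and pos)
        chr = $1 "_" $2          # rebuild the chromosome name, e.g. NC_056442.1
        lo[chr] = $3 - 150000    # placeholder for the real startpos calculation
        hi[chr] = $3 + 150000    # placeholder for the real endpos calculation
        next
    }
    {                            # second file: file.txt (the data to filter)
        chr = $1 "_" $2
        if (chr in lo && $3 >= lo[chr] && $3 <= hi[chr])
            print > ("cut_" chr ".txt")   # one output file per chromosome
    }
' regions.txt file.txt

This reads the small regions file into arrays once, then streams through the big data file a single time, instead of re-reading it for every region in a shell loop.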

terdon
  • Thank you @terdon! After a little bit of troubleshooting with my data (since I didn't provide you enough info), I was able to use code based on your answer. Since there are multiple chromosomes, I still had to add that filter, but since "NC_056430.1" also has an underscore, I used a new variable chrnum=${chrname:3} so it could match $2 in the following awk line: awk -v c=$chrnum -v s=$startpos -v e=$endpos -F'[\t_]' '$2 == c && $3 >= s && $3 <= e' file > cut_file. Thanks for answering the question, even though I surely could improve the speed and quality of the code. – natasha15 Jan 16 '23 at 21:07
  • @natasha15 you're welcome! You might be interested in our sister site, [bioinformatics.se]. People here won't know about VCF files (for example, they won't know they require a header or that they are a tab-separated format etc.) since they are only well known in the specific sub-field of bioinformatics. In any case, I urge you to ask a question about the entire job you have to do. Whenever you find yourself using a shell loop to process files, you are doing it wrong by definition and any other approach will be faster and easier. – terdon Jan 16 '23 at 21:54