
I have a large file that needs to be filtered by the first field (which is never repeated). Example:

NC_056429.1_398 2   3   0.333333    0.333333    0.333333    0.941178
NC_056429.1_1199    2   0   0.333333    0.333333    0.333333    0.941178
NC_056442.1_7754500 0   3   0.800003    0.199997    0.000000    0.000001
NC_056442.1_7754657 1   2   0.000000    0.199997    0.800003    0.888891
NC_056442.1_7754711 2   0   0.888891    0.111109    0.000000    0.800002
NC_056442.1_7982565 0   1   0.800003    0.199997    0.000000    0.666580
NC_056442.1_7982610 1   0   0.800003    0.199997    0.000000    0.000000
NC_056442.1_7985311 2   0   0.888891    0.111109    0.000000    0.000000

I am trying to use awk in a shell script to filter the file by its first column, and I need to pass the bounds as variables because the filtering happens inside a while loop. The while loop reads a text file such as:

NC_056442.1 7870000    # 1st field = $chrname, 2nd field = $pos
NC_056443.1 1570000 

Earlier in the script, I do a calculation with $pos to get $startpos and $endpos, as shown below:

chrname="NC_056442.1" # column 1 in pulled file
startpos=7754657 # calculated in prior script
endpos=7982610 # calculated in prior script
start=${chrname}_${startpos} # this was an attempt to simplify the awk command
end=${chrname}_${endpos} 
awk -v s="$start" -v e="$end" '/s/,/e/' file.txt > cut_file.txt 

If I manually type in the values, as below, I get exactly the lines I want, from the start marker through the end marker:

awk '/NC_056442.1_7754657/,/NC_056442.1_7982610/' file.txt > cut_file.txt

Output File

NC_056442.1_7754657 1   2   0.000000    0.199997    0.800003    0.888891
NC_056442.1_7754711 2   0   0.888891    0.111109    0.000000    0.800002
NC_056442.1_7982565 0   1   0.800003    0.199997    0.000000    0.666580
NC_056442.1_7982610 1   0   0.800003    0.199997    0.000000    0.000000

I am struggling because I do not know how to get the s and e variables to actually be used in the match. I have tried a variety of options, including ENVIRON[]. As someone relatively new to bash (and this being my first post here), I do not know how to troubleshoot this further. I am open to answers outside of awk. Please let me know if I need to rephrase my question or add more information.

1 Answer


Don't try to do this by matching regular expressions. Instead, use _ or space as awk's field separator so you have the chromosome and positions in easy-to-use fields:

start=1234567
end=7654321
awk -v s="$start" -v e="$end" -F '[ _]' '$3 >= s && $3 <= e' file.txt > cut_file.txt 
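
With that field separator, a line like NC_056442.1_7754657 1 2 ... splits into $1=NC, $2=056442.1 and $3=7754657, so the position ends up in $3 and the test is a numeric comparison instead of a regex match. This assumes the columns are separated by single spaces as shown; if the real file is tab-separated, add \t to the separator class, e.g. -F '[\t _]'.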

Also, avoid using CAPS for variable names in shell scripts. By convention, global environment variables are capitalized, so if you use capital letters for your own variables, that can lead to naming collisions and hard-to-find bugs.
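
For instance (a hypothetical snippet, not from your script), reusing an environment variable name can break everything that follows:

PATH=7754657          # clobbers the shell's command search path
sort cut_file.txt     # sort (and awk, etc.) can no longer be found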


Now, you haven't shown us the loop you are using. Whatever it is, though, you would be better off looping in awk itself instead of in the shell. Shell loops are slow.
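
For example, here is a rough sketch of doing the whole job in a single awk pass. It is not your actual pipeline: the pos ± 150000 window below is only a placeholder (your real start/end calculation isn't shown), regions.txt stands in for whatever file your while loop reads, and the output naming is just one option:

awk -F '[ \t_]+' '
    NR == FNR {                  # first file: regions.txt (chromosome and pos)
        chr = $1 "_" $2          # rebuild the chromosome name, e.g. NC_056442.1
        lo[chr] = $3 - 150000    # placeholder for the real startpos calculation
        hi[chr] = $3 + 150000    # placeholder for the real endpos calculation
        next
    }
    {                            # second file: file.txt (the data to filter)
        chr = $1 "_" $2
        if (chr in lo && $3 >= lo[chr] && $3 <= hi[chr])
            print > ("cut_" chr ".txt")   # one output file per chromosome
    }
' regions.txt file.txt

This reads the small regions file into arrays once, then streams through the big data file a single time, instead of re-reading it for every region in a shell loop.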

terdon
  • Thank you @terdon! After a little bit of troubleshooting with my data (since I didn't provide you enough info), I was able to use code based on your answer. Since there are multiple chromosomes, I still had to add that filter, but since "NC_056430.1" also has an underscore, I used a new variable chrnum=${chrname:3} so it could match $2 in the following awk line: awk -v c=$chrnum -v s=$startpos -v e=$endpos -F'[\t_]' '$2 == c && $3 >= s && $3 <= e' file > cut_file. Thanks for answering the question, even though I surely could improve the speed and quality of the code. – natasha15 Jan 16 '23 at 21:07
  • @natasha15 you're welcome! You might be interested in our sister site, [bioinformatics.se]. People here won't know about VCF files (for example, they won't know they require a header or that they are a tab-separated format etc.) since they are only well known in the specific sub-field of bioinformatics. In any case, I urge you to ask a question about the entire job you have to do. Whenever you find yourself using a shell loop to process files, you are doing it wrong by definition and any other approach will be faster and easier. – terdon Jan 16 '23 at 21:54