I am dealing with the post-processing of a multi-column CSV file that contains many (10000+) lines:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
9000, lig662, 1, 0.421573, -7.8400, 153
10V2, lig807, 1, 0.42692, -8.0300, 149
3000, lig158, 1, 0.427342, -8.1900, 147
3001, lig158, 1, 0.427342, -8.1900, 147
10V2, lig342, 1, 0.432943, -9.4200, 137
10V1, lig807, 1, 0.434338, -8.0300, 147
4000, lig236, 1, 0.440377, -7.3200, 156
10V1, lig342, 1, 0.441205, -9.4200, 135
4000, lig497, 1, 0.442088, -7.7900, 148
9000, lig28, 1, 0.442239, -7.5200, 152
3001, lig296, 1, 0.444512, -7.8900, 146
10V2, lig166, 1, 0.447681, -7.1500, 157
....
4000, lig612, 1, 0.452904, -7.0200, 158
9000, lig123, 1, 0.461601, -6.8000, 160
10V1, lig166, 1, 0.463963, -7.1500, 152
10V1, lig369, 1, 0.465029, -7.3600, 148
What I have done so far
I am using the following awk code, integrated into a bash function, which takes the top 1% of lines from the CSV and saves them as a new CSV (thus containing a reduced number of lines):
take_top44 () {
    # Take the top 1% of lines from the initial CSV
    awk -v lines="$(wc -l < original.csv)" '
        BEGIN { top = int(lines/100) }  # number of lines to keep
        FNR > top { exit }              # stop once the top slice is done
        1                               # print every line up to that point
    ' original.csv >> csv_with_top_lines.csv
}
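The function is invoked with no arguments:

take_top44

(Note that >> appends, so repeated runs keep adding lines to csv_with_top_lines.csv.)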
What I want to do now
How could I modify my awk code to apply a more selective filter to the original CSV? For example, how could I filter the data based on the value (a float) of the 4th column, dG(rescored)?
Specifically, I need to use the lowest value (which is always on the second line, minForth = 0.421573) as the reference and print all lines of the CSV where $4 is below a selected threshold (say, 20% above minForth):
$4 <= (1+0.2)*min
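For reference, here is a minimal single-pass sketch of what I imagine, relying on the fact that the minimum is always on line 2 (the 20% factor and the output file name csv_filtered.csv are just placeholders):

awk -F', ' '
    FNR==2 { min = $4 }             # lowest dG(rescored) is always on line 2
    FNR==1 || $4 <= (1+0.2)*min     # keep header + lines within 20% of min
' original.csv > csv_filtered.csv

Is something like this the idiomatic way to do it, or is there a better approach?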