
I am dealing with the post-processing of a multi-column CSV file containing many (10000+) lines:

ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
9000, lig662, 1, 0.421573, -7.8400, 153
10V2, lig807, 1, 0.42692, -8.0300, 149
3000, lig158, 1, 0.427342, -8.1900, 147
3001, lig158, 1, 0.427342, -8.1900, 147
10V2, lig342, 1, 0.432943, -9.4200, 137
10V1, lig807, 1, 0.434338, -8.0300, 147
4000, lig236, 1, 0.440377, -7.3200, 156
10V1, lig342, 1, 0.441205, -9.4200, 135
4000, lig497, 1, 0.442088, -7.7900, 148
9000, lig28, 1, 0.442239, -7.5200, 152
3001, lig296, 1, 0.444512, -7.8900, 146
10V2, lig166, 1, 0.447681, -7.1500, 157
....
4000, lig612, 1, 0.452904, -7.0200, 158
9000, lig123, 1, 0.461601, -6.8000, 160
10V1, lig166, 1, 0.463963, -7.1500, 152
10V1, lig369, 1, 0.465029, -7.3600, 148

What I did so far

I am using the following awk code integrated into a bash function, which takes the top 1% of lines from the CSV and saves them as a new CSV (thus containing a reduced number of lines):

take_top44 () {
    # Take the top lines from the initial CSV
    awk -v lines="$(wc -l < original.csv)" '
    BEGIN{
      top=int(lines/100)
    }
    FNR>(top){exit}
    1
    ' original.csv >> csv_with_top_lines.csv
}
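The function is called without arguments; note that the redirection uses >>, so repeated runs append to the same output file:

take_top44   # appends the top 1% of original.csv to csv_with_top_lines.csv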

What I want to do now

How could I modify my awk code to apply a more selective filter on the original CSV? For example, how could I filter the data based on the value (a float) of the 4th column (dG(rescored))?

For example, I need to use the lowest value (which is always on the second line, minForth = 0.421573) as the reference and print all lines from the CSV where $4 is smaller than a selected threshold (say, 20% above minForth):

$4<=(1+0.2)*min))'
    When I suggested providing more concise sample input/output in your other questions I didn't mean show the start and end of some input and stick .... in the middle of it as we can't copy/paste that as-is to test with. Simply come up with minimal, complete, textual, testable sample input and expected output and post that in your questions so we can easily see what it is you're trying to do AND copy/paste it to test with without having to edit it to get it into shape. – Ed Morton May 31 '21 at 17:17

2 Answers


This is a rather verbose script - it does not take any shortcuts and it prints information to stderr. As for the sh part, you could typically add options to set the "Globals" at the top, so that you could call it with options in addition to arguments. I.e.:

my_script --max-factor 0.15 -p 20 --out-file foo.csv *.csv

This way you can filter on both rescored and the percentage of lines. The verbose parts can obviously be deleted.

#!/bin/sh

# Globals with defaults set
num_lines=0
max_percent_lines=10
max_factor=0.2

fn_in=""

# Default out. Optionally set to another file.
fn_out=/dev/stdout
# Set to /dev/null to quiet information
fn_err=/dev/stderr

get_num_lines() {
    num_lines=$(wc -l < "$1")
}

print_filtered() {
    awk \
        -v num_lines="$num_lines" \
        -v max_percent_lines="$max_percent_lines" \
        -v max_factor="$max_factor" \
        -v fn_err="$fn_err" \
    '
    BEGIN {
        FS = ", "
        # Exclude header
        max_line = (1 + num_lines / 100 * max_percent_lines)
        # Truncate
        max_line -= max_line % 1
        printf "Lines       : %d\n", num_lines - 1 >>fn_err
        printf "Line Max    : %d (%d%%)\n", max_line, max_percent_lines >>fn_err
    }
    NR == 2 {
        max_rescored = ($4 + $4 * max_factor)
        printf "Rescored Max: %f\n", max_rescored >>fn_err
    }
    NR > 1 {
        print $0
    }
    NR >= max_line {
        printf "Max Line    : %d (Exit)\n", max_line >>fn_err
        exit
    }
    $4 >= max_rescored && NR > 2 {
        printf "Max Rescored: %f (Exit)\n", $4 >>fn_err
        exit
    }
    ' "$fn_in" >>"$fn_out"
}
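Driving these functions for a single file is then just (a minimal sketch, assuming the data is in original.csv):

fn_in="original.csv"
get_num_lines "$fn_in"   # sets the num_lines global
print_filtered           # filters $fn_in and writes to $fn_out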

Here one could loop over multiple input files.


Command line options

As requested in the comments.

There are numerous ways to get the options. The simplest would be positional arguments. For example:

Usage: script percent margin <files ...>

In the script one would then say:

percent=$1
margin=$2
shift
shift

for fn_in in "$@"; do
    # ... process each file ...
done

If one would like to be a bit more fancy/flexible, one could do something like this:

First write a help function. It could be something like the following (the use of basename and $0 can likely be debated):

print_help() {
    printf "Usage: %s [OPTIONS] <FILES ...>\n" "$(basename "$0")"
    printf "\nSome description\n"
    printf "\nOPTIONS\n"
    printf "  -p --percent-lines  V  Print percent of file. Default %s\n" "$max_percent_lines"
    printf "  -r --max-factor     V  Max rescored increase. Default %s\n" "$max_factor"
    printf "  -o --out-file       V  Output file. Default stdout\n"
    printf "  -q --quiet             Silence information\n"
    printf "  -h --help              This help\n"
    printf "  --                     Everything after this is input files\n"
    printf "\nEverything after first unrecognized option is treated as a file.\n"
}

One would typically call it as print_help >&2 so as to print to stderr rather than stdout.

The help text above reflects a semi-standard approach: it does not accept combined options like -abc or --foo=123; each option and its argument has to be separated by spaces.

Optionally, no pun intended, check out existing posts on option parsing for more elaborate approaches.

Then a simple way to handle the rest of the script, with some naive error checking, could be:


# While not empty
while ! [ -z "$1" ]; do
    case "$1" in
    -h|--help)
        print_help >&2
        exit 1
        ;;
    -p|--percent-lines)
        shift
        max_percent_lines="$1"
        ;;
    -r|--max-factor)
        shift
        max_factor="$1"
        ;;
    -o|--out-file)
        shift
        fn_out="$1"
        ;;
    -q|--quiet)
        fn_err="/dev/null"
        ;;
    --)
        shift
        break
        ;;
    *)
        break
        ;;
    esac
    # Next argument
    shift
done
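With this loop in place, the invocation from the top of the answer works, and -- can be used to protect file names that would otherwise look like options (a sketch; my_script is a placeholder name):

./my_script --max-factor 0.15 -p 20 --out-file foo.csv *.csv
./my_script -q -- --odd-name.csv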

if ! [ -r "$1" ]; then
    printf "Unable to read file: '%s'\n" "$1" >&2
    exit 1
fi

Print the header from the first file:

head -n 1 "$1" >>"$fn_out"

for fn_in in "$@"; do
    printf "Processing '%s'\n" "$fn_in" >>"$fn_err"
    if ! [ -r "$fn_in" ]; then
        printf "Unable to read file: '%s'\n" "$fn_in" >&2
        exit 1
    fi
    get_num_lines "$fn_in"
    print_filtered
done

One could implement more validation of the options, e.g. ensuring that they are numbers, etc.
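For example, a minimal numeric check in POSIX sh could look like this (a sketch; it accepts only plain unsigned decimal numbers):

case "$max_factor" in
    ''|*[!0-9.]*|*.*.*)
        printf "Invalid numeric value: '%s'\n" "$max_factor" >&2
        exit 1
        ;;
esac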

ibuprofen
  • excellent solution! Indeed, I had written the condition incorrectly; thank you for the correction. I will test your version asap. Cheers – Hot JAMS Jun 01 '21 at 09:11
  • ...just one question: in order to use all of those option flags with the bash script (my_script --max-factor 0.15 -p 20), how should these two flags --max-factor and -p be defined in the script specifically? Or are they automatically associated with the global variables? – Hot JAMS Jun 01 '21 at 09:41
  • @HotJAMS: You would typically loop and set. Can update with a short example. Only have to finish some work first ;P – ibuprofen Jun 01 '21 at 12:04
  • no problem! I ought to accept the other answer, which does the same job with more compact code. However, thank you very much for your elegant solution as well as the excellent explanation of the command-line options. Cheers – Hot JAMS Jun 01 '21 at 13:48

If you simply want to filter all lines where the 4th field is below a threshold, the following awk command would work:

awk -F',' -v margin=0.2 'FNR==2 {min=$4} FNR>1&&($4<=(1+margin)*min)' input.csv

or, if you want to have the header in the filtered output, too:

awk -F',' -v margin=0.2 'FNR==2 {min=$4} FNR==1||($4<=(1+margin)*min)' input.csv

This will set the field separator to , (but note that your file is non-standard CSV, since you have additional spaces separating the fields) and import a variable margin with a value of 0.2 into the awk program.

Inside the program, it sets the variable min to the value in the 4th column when on line 2 (FNR==2). It then only prints the current line if either we are on line 1 (the header, if you want it) or we are in the data part of the file and the 4th field is at most 1+margin times the minimum value.
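If you also want to keep the top-percentage restriction from your original take_top44 function, both conditions can be combined in a single awk call (a sketch building on the command above; it assumes the file is sorted ascending on the 4th column, as in your sample):

awk -F',' -v margin=0.2 -v lines="$(wc -l < input.csv)" '
BEGIN { top = int(lines/100) }        # keep at most the top 1% of lines
FNR == 2 { min = $4 }                 # lowest dG(rescored) is on line 2
FNR > top + 1 { exit }                # stop after the header plus top lines
FNR == 1 || ($4 <= (1+margin)*min)    # print header, or lines within the margin
' input.csv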

AdminBee