
I am dealing with the post-processing of a multi-column CSV file containing many (10000+) lines:

ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
9000, lig662, 1, 0.421573, -7.8400, 153
10V2, lig807, 1, 0.42692, -8.0300, 149
3000, lig158, 1, 0.427342, -8.1900, 147
3001, lig158, 1, 0.427342, -8.1900, 147
10V2, lig342, 1, 0.432943, -9.4200, 137
10V1, lig807, 1, 0.434338, -8.0300, 147
4000, lig236, 1, 0.440377, -7.3200, 156
10V1, lig342, 1, 0.441205, -9.4200, 135
4000, lig497, 1, 0.442088, -7.7900, 148
9000, lig28, 1, 0.442239, -7.5200, 152
3001, lig296, 1, 0.444512, -7.8900, 146
10V2, lig166, 1, 0.447681, -7.1500, 157
....
4000, lig612, 1, 0.452904, -7.0200, 158
9000, lig123, 1, 0.461601, -6.8000, 160
10V1, lig166, 1, 0.463963, -7.1500, 152
10V1, lig369, 1, 0.465029, -7.3600, 148

What I did so far

I am using the following awk code integrated into a bash function, which takes the top 1% of lines from the CSV and saves them as a new CSV (thus containing a reduced number of lines):

take_top44 () {
    # Take the top lines from the initial CSV
    awk -v lines="$(wc -l < original.csv)" '
    BEGIN{
      top=int(lines/100)
    }
    FNR>(top){exit}
    1
    ' original.csv >> csv_with_top_lines.csv
}
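The function is called without arguments; note that the redirection uses >>, so repeated runs append to the same output file:

take_top44   # appends the top 1% of original.csv to csv_with_top_lines.csv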

What I want to do now

How could I modify my awk code to apply a more selective filter on the original CSV? For example, how could I filter the data based on the value (a float) of the 4th column (dG(rescored))?

For example, I need to use the lowest value (which is always on the second line, minForth = 0.421573) as the reference and print all lines from the CSV where $4 is smaller than a selected threshold (say, 20% above minForth):

$4<=(1+0.2)*min))'
    When I suggested providing more concise sample input/output in your other questions I didn't mean show the start and end of some input and stick .... in the middle of it as we can't copy/paste that as-is to test with. Simply come up with minimal, complete, textual, testable sample input and expected output and post that in your questions so we can easily see what it is you're trying to do AND copy/paste it to test with without having to edit it to get it into shape. – Ed Morton May 31 '21 at 17:17

2 Answers


This is a rather verbose script - it does not take any shortcuts and it prints information to stderr. As for the sh part, you could typically add options to set the "Globals" at the top, so that you could call it with options in addition to arguments. I.e.:

my_script --max-factor 0.15 -p 20 --out-file foo.csv *.csv

This way you can filter on both rescored and the percentage of lines. The verbose parts can obviously be deleted.

#!/bin/sh

# Globals with defaults set
num_lines=0
max_percent_lines=10
max_factor=0.2

fn_in=""

# Default out. Optionally set to another file.
fn_out=/dev/stdout
# Set to /dev/null to quiet information
fn_err=/dev/stderr

get_num_lines() {
    num_lines=$(wc -l < "$1")
}

print_filtered() {
    awk \
        -v num_lines="$num_lines" \
        -v max_percent_lines="$max_percent_lines" \
        -v max_factor="$max_factor" \
        -v fn_err="$fn_err" \
    '
    BEGIN {
        FS = ", "
        # Exclude header
        max_line = (1 + num_lines / 100 * max_percent_lines)
        # Truncate
        max_line -= max_line % 1
        printf "Lines       : %d\n", num_lines - 1 >>fn_err
        printf "Line Max    : %d (%d%%)\n", max_line, max_percent_lines >>fn_err
    }
    NR == 2 {
        max_rescored = ($4 + $4 * max_factor)
        printf "Rescored Max: %f\n", max_rescored >>fn_err
    }
    NR > 1 {
        print $0
    }
    NR >= max_line {
        printf "Max Line    : %d (Exit)\n", max_line >>fn_err
        exit
    }
    $4 >= max_rescored && NR > 2 {
        printf "Max Rescored: %f (Exit)\n", $4 >>fn_err
        exit
    }
    ' "$fn_in" >>"$fn_out"
}
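Driving these functions for a single file is then just (a minimal sketch, assuming the data is in original.csv):

fn_in="original.csv"
get_num_lines "$fn_in"   # sets the num_lines global
print_filtered           # filters $fn_in and writes to $fn_out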

Here one could loop over multiple input files.


Command line options

As requested in the comments.

There are numerous ways to get the options. The simplest would be positional arguments. For example:

Usage: script percent margin <files ...>

In the script one would then say:

percent=$1
margin=$2
shift
shift

for fn_in in "$@"; do
    # ... process each file ...
done

If one would like to be a bit more fancy/flexible, one could do something like this:

First write a help function. It could be something like the following (the use of basename and $0 can likely be debated):

print_help() {
    printf "Usage: %s [OPTIONS] <FILES ...>\n" "$(basename "$0")"
    printf "\nSome description\n"
    printf "\nOPTIONS\n"
    printf "  -p --percent-lines  V  Print percent of file. Default %s\n" "$max_percent_lines"
    printf "  -r --max-factor     V  Max rescored increase. Default %s\n" "$max_factor"
    printf "  -o --out-file       V  Output file. Default stdout\n"
    printf "  -q --quiet             Silence information\n"
    printf "  -h --help              This help\n"
    printf "  --                     Everything after this is input files\n"
    printf "\nEverything after first unrecognized option is treated as a file.\n"
}

One would typically call it as print_help >&2 so as to print to stderr rather than stdout.

The help text above reflects a semi-standard approach: it does not accept combined options like -abc or --foo=123; each option and its argument has to be separated by spaces.

Optionally, no pun intended, check out existing posts on option parsing for more elaborate approaches.

Then a simple way to handle the rest of the script, with some naive error checking, could be:


# While not empty
while ! [ -z "$1" ]; do
    case "$1" in
    -h|--help)
        print_help >&2
        exit 1
        ;;
    -p|--percent-lines)
        shift
        max_percent_lines="$1"
        ;;
    -r|--max-factor)
        shift
        max_factor="$1"
        ;;
    -o|--out-file)
        shift
        fn_out="$1"
        ;;
    -q|--quiet)
        fn_err="/dev/null"
        ;;
    --)
        shift
        break
        ;;
    *)
        break
        ;;
    esac
    # Next argument
    shift
done
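With this loop in place, the invocation from the top of the answer works, and -- can be used to protect file names that would otherwise look like options (a sketch; my_script is a placeholder name):

./my_script --max-factor 0.15 -p 20 --out-file foo.csv *.csv
./my_script -q -- --odd-name.csv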

if ! [ -r "$1" ]; then
    printf "Unable to read file: '%s'\n" "$1" >&2
    exit 1
fi

Print the header from the first file:

head -n 1 "$1" >>"$fn_out"

for fn_in in "$@"; do
    printf "Processing '%s'\n" "$fn_in" >>"$fn_err"
    if ! [ -r "$fn_in" ]; then
        printf "Unable to read file: '%s'\n" "$fn_in" >&2
        exit 1
    fi
    get_num_lines "$fn_in"
    print_filtered
done

One could implement more validation of the options, e.g. ensuring that they are numbers, etc.
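For example, a minimal numeric check in POSIX sh could look like this (a sketch; it accepts only plain unsigned decimal numbers):

case "$max_factor" in
    ''|*[!0-9.]*|*.*.*)
        printf "Invalid numeric value: '%s'\n" "$max_factor" >&2
        exit 1
        ;;
esac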

ibuprofen
  • excellent solution! Indeed, I had written the condition incorrectly; thank you for the correction. I will test your version asap. Cheers – Hot JAMS Jun 01 '21 at 09:11
  • ...just one question: in order to use all of those option flags with the bash script (my_script --max-factor 0.15 -p 20), how should these two flags --max-factor and -p be defined in the script specifically? Or are they automatically associated with the global variables? – Hot JAMS Jun 01 '21 at 09:41
  • @HotJAMS: You would typically loop and set. Can update with a short example. Only have to finish some work first ;P – ibuprofen Jun 01 '21 at 12:04
  • no problem! I ought to accept the other answer, which does the same job with more compact code. However, thank you very much for your elegant solution as well as the excellent explanation of the command-line options. Cheers – Hot JAMS Jun 01 '21 at 13:48

If you simply want to filter all lines where the 4th field is below a threshold, the following awk command would work:

awk -F',' -v margin=0.2 'FNR==2 {min=$4} FNR>1&&($4<=(1+margin)*min)' input.csv

or, if you want to have the header in the filtered output, too:

awk -F',' -v margin=0.2 'FNR==2 {min=$4} FNR==1||($4<=(1+margin)*min)' input.csv

This will set the field separator to , (but note that your file is non-standard CSV, since you have additional spaces separating the fields) and import a variable margin with a value of 0.2 into the awk program.

Inside the program, it sets the variable min to the value in the 4th column when on line 2 (FNR==2). It then only prints the current line if either we are on line 1 (the header, if you want it) or we are in the data part of the file and the 4th field is at most 1+margin times the minimum value.
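If you also want to keep the top-percentage restriction from your original take_top44 function, both conditions can be combined in a single awk call (a sketch building on the command above; it assumes the file is sorted ascending on the 4th column, as in your sample):

awk -F',' -v margin=0.2 -v lines="$(wc -l < input.csv)" '
BEGIN { top = int(lines/100) }        # keep at most the top 1% of lines
FNR == 2 { min = $4 }                 # lowest dG(rescored) is on line 2
FNR > top + 1 { exit }                # stop after the header plus top lines
FNR == 1 || ($4 <= (1+margin)*min)    # print header, or lines within the margin
' input.csv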

AdminBee