0

I have a scenario where I want to calculate the sum of multiple columns containing floating point data.

My below code works perfect with integer values but, not with floating paint values.

Code works perfect for this data

ID|NAME|SAL|COST|PER|TAG

1|A|10|10|20|10|

1|B|10|15|20|10|

1|C|10|17|25|80|

1|D|115|110|20|100|

1|E|10|10|10|10|

code :

#!/bin/bash
FILE="$1"
COLUMNS="${@:2}"
for col in $COLUMNS; do
  colnum=$(awk -v RS='|' '/'$col'/{ print NR; exit}' $FILE)
  awk '{FS="|"}{s+='$colnum'}END{print "'$col' ", s}' $FILE
done | column -t

How do I achieve sum for below data using above code?

ID|NAME|SAL|COST|PER|TAG

    1|A|9.234|123.12|20.123|67.1|

    1|B|10.12|153.234|20.90|190.34|

    1|C|8.234|17.01|25.777|80.09|

    1|D|11.2|11.222|10.1|10.00000|

    1|E|16.23|10.1|145.22|11.77278|

New code not working:

#!/bin/bash
FILE="$1"
COLUMNS="${@:2}"
for col in $COLUMNS; do
            awk -F'|' '{T+=$col} END { printf "%.10f\n", T }' $FILE
    done | column -t
Kusalananda
  • 333,661
  • @markgraf can you look this above question : https://unix.stackexchange.com/questions/559767/how-to-calculate-multiple-decimal-column-sum-linux – genip26057 Jan 01 '20 at 06:35

2 Answers2

0

Start by not doing this in the shell, but in something like awk. The shell does not know how to do floating point arithmetics.

Your script seems to be getting a set of parameters that corresponds to the filename to parse and the names of the fields to consider.

You can do this a bit more efficiently than to call awk twice for each column.

BEGIN { OFS = FS = "|" }

FNR == 1 {
        # Handle the header row. Assuming "cols" is a FS-separated
        # string of column names that we want to sum, pick out the
        # column numbers that we want to process and put them in the
        # array "col". This also converts the "cols" string into a
        # corresponding "header" array.

        nf = split(cols, header)
        for (i = 1; i <= NF; ++i)
                for (j = 1; j <= nf; ++j)
                        if ($i == header[j])
                                col[j] = i
        next
}

{
        # Sum each column that we have enumerated in the "col" array.

        for (i = 1; i <= nf; ++i)
                sum[i] += $(col[i])
}

END {
        # Output a two row table containing only the headers that we
        # have summed over and the accumulated sums.

        $0 = ""

        for (i = 1; i <= nf; ++i)
                $i = header[i]
        print

        for (i = 1; i <= nf; ++i)
                $i = sum[i]
        print
}

Running this on your first file with:

$ awk -v cols="SAL|COST|PER|TAG" -f script.awk file1
SAL|COST|PER|TAG
155|162|95|210

Listing the same column twice gives the sum twice:

$ awk -v cols="SAL|SAL" -f script.awk file1
SAL|SAL
155|155

On the second file:

$ awk -v cols="SAL|COST|PER|TAG" -f script.awk file2
SAL|COST|PER|TAG
55.018|314.686|222.12|359.303

With post-processing using column -t:

$ awk -v cols="SAL|COST|PER|TAG" -f script.awk file2 | column -s '|' -t
SAL     COST     PER     TAG
55.018  314.686  222.12  359.303

Putting this into a simple script:

#!/bin/sh

infile=$1
shift

IFS='|'

awk -v cols="$*" -f script.awk "$infile" |
column -s '|' -t

You would use this as

$ ./script.sh file1 PER TAG
PER  TAG
95   210

You could also use a CSV-aware tool, like the tools in CSVkit, to do the work for you.

csvstat is able to compute some basic "stats" of a CSV file. Assuming your files are properly formatted CSV files (they are not, as each data row has a trailing |-delimiter), the sum for each column can be had with

$ csvstat --sum file1
  1. ID: None
  2. NAME: None
  3. SAL: 155
  4. COST: 162
  5. PER: 95
  6. TAG: 210
$ csvstat --sum file2
  1. ID: None
  2. NAME: None
  3. SAL: 55.018
  4. COST: 314.686
  5. PER: 222.12
  6. TAG: 359.30278

Or, for just a single column:

$ csvstat --sum -c 'SAL' file2
55.018
Kusalananda
  • 333,661
  • can you tell me how to make my script work i have updated the question ... added newly created code at the end ... where as your solution is really good .... is it possible to do with my code with liitle chnages pls look question updated – genip26057 Jan 01 '20 at 07:15
  • @genip26057 You're using $col in the awk code, but this would refer to column number col, and there's no col variable in the program. Therefore, $col is the same as $0, which is the complete line. – Kusalananda Jan 01 '20 at 07:20
  • @genip26057 Also see https://unix.stackexchange.com/questions/50044/use-a-shell-variable-in-awk – Kusalananda Jan 01 '20 at 07:21
  • @can you modify the provide code to make it work ...pls – genip26057 Jan 01 '20 at 07:21
  • @genip26057 Note that you have not indicated in the slightest way what you expect that a user give as arguments to the script. In the first coed it seems to be column names, in the second, it's a (single) column number? – Kusalananda Jan 01 '20 at 07:23
  • sum of entire column SAL COST PER TAG : run this command you will get to see the output for SAL : awk -F'|' '{T+=$3} END { printf "%.10f\n", T }' demo.txt same for COST PER TAG – genip26057 Jan 01 '20 at 07:26
  • @genip26057 Yes. You do not want to have to parse your file once for each column though, surely? – Kusalananda Jan 01 '20 at 07:27
  • this is how the code is "bash scriptname.sh filename.txt SAL COST PER TAG" it should automatically take the column name and sum the particular column name for SAL COST PER TAG – genip26057 Jan 01 '20 at 07:29
  • @genip26057 That is what my solution does. But with the column names being |-delimited. – Kusalananda Jan 01 '20 at 07:29
  • Ok i am happy with your code you created another script ...is it possible to do in one script only like this code : #!/bin/bash FILE="$1" COLUMNS="${@:2}" for col in $COLUMNS; do colnum=$(awk -v RS='|' '/'$col'/{ print NR; exit}' $FILE) awk '{FS="|"}{s+='$colnum'}END{print "'$col' ", s}' $FILE done | column -t – genip26057 Jan 01 '20 at 07:32
  • @genip26057 Sorry, you're obviously ignoring what I'm writing. I'm not answering more follow-up questions. – Kusalananda Jan 01 '20 at 07:33
  • can you refer solution of markgraf : https://unix.stackexchange.com/questions/558377/how-to-create-a-function-to-sum-the-required-multiple-column/558381?noredirect=1#comment1037610_558381 in this code only decimal should be able to calculate ... your code is awsome then him ... but our environment work does not allow new file to create. only in one file i need to implement this all – genip26057 Jan 01 '20 at 07:34
  • @genip26057 Calculating the sum over a single column is a different problem from calculating the sum over a variable number of columns. Imagine someone with terabytes of data. They do not want to parse the file six times just to sum over six columns. – Kusalananda Jan 01 '20 at 07:35
  • i have new idea is it possible to implement ... as the script requires column names for which sum need to be calculate what if we can store the position of column name like SAL comes in position $3 and COST = $4 and $5 , $6 and so on. by using looping passing $3 $4 $5 $6 for this code awk -F'|' '{T+=$3} END { printf "%.10f\n", T }' demo.txt ... inplace or $3 it will go on replacing values first time it will be $3 then $4 ... $5 $6 – genip26057 Jan 01 '20 at 07:40
  • i have created a another solution like your solution see the below answer ... suggest any improvement needed in script ... this is working as expected scenario which i am work environment – genip26057 Jan 01 '20 at 17:54
0

My solution will work like charm

For this DATA set 1 - tested on this

ID|NAME|SAL|COST|PER|TAG
1"|"A"|"50.1123"|"10.1"|"25.22"|"10.2"|"
2"|"B"|"50.11"|"15.45"|"25.1"|"10.1118"|"

For this DATA set 2 - tested on this

ID|NAME|SAL|COST|PER|TAG
1|A|9.234|123.12|20.123|67.1
2|B|10.12|153.234|20.90|190.34
3|C|8.234|17.01|25.777|80.09
4|D|11.2|11.222|10.1|10.00000
5|E|16.23|10.1|145.22|11.77278

Effective solution

FILE_NAME="$1"
COLUMN_NAME="$2"

alpha=( $(awk -F"|" 'NR==1{for(i=1;i<=NF;i++){if ($i ~ /'$COLUMN_NAME'/){print i;}}}' $FILE_NAME) )

for each in "${alpha[@]}"
do
      #echo "$each"
      awk -F'"?\\|"?' '{T+=$('$each')} END { printf "%.4f\n", T }' $FILE_NAME
done

Run like this

bash script.sh DEMO.txt 'SAL|COST|PER|TAG'

output 1 :

100.2223
25.5500
50.3200
20.3118

output 2 :

55.0180
314.6860
222.1200
359.3028
  • This input data is different from what's in the question. When providing a dataset in a question, make sure that it is representative of the actual data that you are working on. – Kusalananda Jan 01 '20 at 17:58
  • @Kusalananda previously i was not known what will be the DATA set .... i got with a requirement now ... so whole process got changed ... this is also another way we can achieve – genip26057 Jan 01 '20 at 18:28
  • @Kusalananda even i tested with another DATA set it work like charm ... updated the answer please have a look – genip26057 Jan 01 '20 at 18:35