I'm trying to calculate the geometric mean of a file full of numbers (1 column).

The basic formula for the geometric mean is to average the natural log (or the base-10 log) of all the values and then raise e (or 10) to that average.

My current bash-only script looks like this:

# Geometric Mean
count=0;
total=0; 

for i in $( awk '{ print $1; }' input.txt )
  do
    if (( $(echo " "$i" > "0" " | bc -l) )); then
        total="$(echo " "$total" + l("$i") " | bc -l )"
        ((count++))
    else
      total="$total"
    fi
  done

Geometric_Mean="$( printf "%.2f" "$(echo "scale=3; e( "$total" / "$count" )" | bc -l )" )"
echo "$Geometric_Mean"

Essentially:

  1. Check every entry in the input file to make sure it is larger than 0, calling bc every time
  2. If the entry is > 0, take the natural log (l) of that value and add it to the running total, calling bc every time
  3. If the entry is <=0, I do nothing
  4. Calculate the Geometric Mean

This works perfectly fine for a small data set. Unfortunately, I am trying to use this on a large data set (input.txt has 250,000 values). While I believe this will eventually work, it is extremely slow. I've never been patient enough to let it finish (45+ minutes).

I need a way of processing this file more efficiently.

There are alternative ways, such as using Python:

# Import the library you need for math
import numpy as np

# Open the file
# Load the lines into a list of float objects
# Close the file
infile = open('time_trial.txt', 'r')
x = [float(line) for line in infile.readlines()]
infile.close()

# Define a function called geo_mean
# Use numpy to create a variable "a" with the ln of all the values
# Use numpy to EXP() the sum of all of a and divide it by the count of a
# Note ... this will break if you have values <=0
def geo_mean(x):
    a = np.log(x)
    return np.exp(a.sum()/len(a))

print("The Geometric Mean is: ", geo_mean(x))

I would like to avoid using Python, Ruby, Perl ... etc.

Any suggestions on how to write my bash script more efficiently?

Paulo Tomé
Matt
  • 2
    You are running two subshells and two bc external processes per input value, so around a million processes in total. awk will deal with the whole input in a single process, probably in under 30 seconds. – Paul_Pedant Mar 02 '20 at 18:48
  • 2
    If you need efficiency, forget bash (or any other shell). The shell is not designed for this sort of thing and will always be the slowest and least efficient solution possible. – terdon Mar 02 '20 at 18:49
  • 1
    Since you're already using awk, use awk throughout: awk '$1 > 0 {n++; s += log($1)} END{if(n)print exp(s/n)}' your_file. Use -v OFMT=%.16g if you want more digits. –  Mar 02 '20 at 18:58
  • Paul_Pedant & mosvy thank you so much. awk was able to perform all the calculations in < 5 sec. I clearly need to do some awk homework. I really appreciate your help! – Matt Mar 02 '20 at 19:00
  • Also, if your awk is GNU awk you may be able to do the calculation with arbitrary precision numbers instead of doubles by using the -M or --bignum option (check with gawk --version if it was compiled with gmp/mpfr support). –  Mar 02 '20 at 19:10
  • @mosvy thanks so much again for the help, sincerely appreciate it. For reference the wall clock time on the awk code you provided was 0.122 sec compared to the 0.208 sec for the Python script. Additionally, if we combine the 'user' and 'sys' time awk completed it in 0.125 sec while Python took 1.125 sec. Thanks again! – Matt Mar 02 '20 at 19:10
  • @mosvy please don't answer questions in comments. That circumvents the normal quality control procedures of the site since comments cannot be downvoted and also mean that the question isn't marked as answered. – terdon Mar 02 '20 at 19:14
  • You can always delete my comments instead of downvoting them. BTW, could you explain the purpose of E=exp(1) .. E^m instead of just exp(m) in your answer? –  Mar 02 '20 at 19:20
  • Well, I'd rather you post an answer so I can upvote it instead, @mosvy. As for the E=exp(1), that was just the first way that I found to get the value of e so that I could then raise it to the power returned by the tot/c. Your exp() approach seems much better, but I didn't know about it. Yet another reason why posting answers is better :). – terdon Mar 02 '20 at 19:22
  • 1
    Why the extra quotes and spaces in " "$i" > "0" "? Could be "$i > 0" – user253751 Mar 03 '20 at 11:36
  • I agree with user253751 — it’s rare that we criticize a post for having too many quotes, but this is such a case.   Shell variables should always be put inside quotes, unless you have a good reason not to, and you’re sure you know what you’re doing — see this and this for details.   For example, the next line should be echo "$total + l($i)". – G-Man Says 'Reinstate Monica' Mar 04 '20 at 04:16
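
To illustrate the OFMT suggestion from the comments, here is a minimal sketch. The function name geo_mean_ofmt and the three sample values are made up for the illustration; only awk's standard -v option and OFMT variable are assumed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the geometric mean of the
# positive values in column 1, using the awk output format passed as $1.
geo_mean_ofmt() {
    fmt=$1
    shift
    # OFMT controls how "print" formats non-integer numbers.
    awk -v OFMT="$fmt" '
        $1 > 0 { n++; s += log($1) }
        END    { if (n) print exp(s / n) }
    ' "$@"
}

# Same made-up input, two output formats.
printf '1\n2\n3\n' | geo_mean_ofmt '%.6g'    # awk's default format
printf '1\n2\n3\n' | geo_mean_ofmt '%.16g'   # more significant digits
```

With no file argument the function reads standard input, so it can also be called as geo_mean_ofmt '%.16g' input.txt.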

2 Answers

15

Please don't do this in the shell. There is no amount of tweaking that would ever make it remotely efficient. Shell loops are slow and using the shell to parse text is just bad practice. Your whole script can be replaced by this simple awk one-liner which will be orders of magnitude faster:

awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file

For example, if I run that on a file containing the numbers from 1 to 100, I get:

$ seq 100 > file
$ awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file
37.99
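
As a sketch of a slightly more defensive variant: the version below calls exp() directly, as suggested in the comments, and prints nothing when the input has no positive values, so it never divides by zero. The function name geo_mean is hypothetical and only there so the command can be reused.

```shell
#!/bin/sh
# Hypothetical wrapper (the name geo_mean is an assumption, not from the post).
# Reads the files given as arguments, or standard input when there are none.
geo_mean() {
    awk '
        $1 > 0 { tot += log($1); c++ }                   # sum logs of positive values
        END    { if (c) printf "%.2f\n", exp(tot / c) }  # skip if nothing was counted
    ' "$@"
}

# Small made-up check: the geometric mean of 2 and 8 is sqrt(16) = 4.
printf '2\n8\n' | geo_mean
```

Run against the seq 100 file above, geo_mean file should give the same 37.99.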

In terms of speed, I tested your shell solution, your python solution and the awk I gave above on a file containing the numbers from 1 to 10000:

## Shell
$ time foo.sh
3677.54

real    1m0.720s
user    0m48.720s
sys     0m24.733s

## Python
$ time foo.py
The Geometric Mean is: 3680.827182220091

real    0m0.149s
user    0m0.121s
sys     0m0.027s

## Awk
$ time awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' input.txt
3680.83

real    0m0.011s
user    0m0.010s
sys     0m0.001s

As you can see, the awk is even faster than the python and far simpler to write. You can also make it into a "shell" script, if you like. Either like this:

#!/bin/awk -f

BEGIN{ E = exp(1); }

$1>0{ tot+=log($1); c++; }

END{ m=tot/c; printf "%.2f\n", E^m }

or by saving the command in a shell script:

#!/bin/sh
awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++;} END{m=tot/c; printf "%.2f\n", E^m}' "$1"
terdon
0

Here are some suggestions. I can't test them without knowing exactly what is in your file but I hope this helps. There are always different, better ways to do things so this is not at all exhaustive.


Change the if condition

if (( $(echo " "$i" > "0" " | bc -l) )); then

Change it to:

if [[ "$i" -gt 0 ]]; then

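The first line spawns a subshell and a bc process for every value, even though it is only doing a simple comparison. The [[ shell keyword does the test inside the shell itself. Note that -gt compares integers only, so this replacement works only if every value in the file is a whole number.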
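
For input with decimal values, where an integer test like -gt does not apply, here is a rough sketch of a positivity check that still avoids starting a process per value. The function name is_positive is hypothetical, and it assumes plain decimal notation only (digits with at most one dot, no leading + and no exponents such as 1e5).

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 is a plain decimal number greater than 0.
# Uses only shell pattern matching, so no subshell or bc process is started.
is_positive() {
    case $1 in
        ''|-*|*[!0-9.]*|*.*.*) return 1 ;;  # empty, negative, non-numeric, or two dots
        *[1-9]*)               return 0 ;;  # at least one non-zero digit
        *)                     return 1 ;;  # zero, e.g. 0 or 0.00
    esac
}

# Made-up sample values to show the check in a loop.
for v in 1.5 12 0 0.00 -3 abc; do
    if is_positive "$v"; then
        echo "$v: positive"
    else
        echo "$v: skipped"
    fi
done
```

This only replaces the comparison. The logarithm still needs bc (or awk), so the loop stays slow for large files.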


Remove unneeded code

else
  total="$total"

This is basically a way to explicitly waste time doing nothing :). These 2 lines can be removed outright.

JamesL