
I have a text file called data.txt that contains 2 columns and 80,667 rows. The first 20 rows, which are representative of all rows, are:

  Column A     Column B

ACKR1/CCRL2     12.66
ACKR1/CMKLR1    10.41
ACKR1/CXCR4     13.01
ACKR1/FZD5      10.72
ACKR1/GPR107    10.22
ACKR1/GPR137    10.73
ACKR1/GPR146    11.93
ACKR1/TAPT1     11.24
ACKR1/TPRA1     11.81
ACKR1/XPR1      10.01
ACKR2/ACKR3     9.36
ACKR2/ACKR3     10.48
ACKR2/ACKR3     11.08
ACKR2/ACKR3     11.11
ACKR2/ACKR3     11.38
ACKR2/ACKR3     11.73
ACKR2/ACKR3     12.97
ACKR2/ADGRA2    8.68
ACKR2/ADGRA2    9.03
ACKR2/ADGRA2    9.34

As you can see, there are some duplicate values in column A (e.g. ACKR2/ACKR3, ACKR2/ADGRA2). What I'd like to do is create another file, results.txt, that contains, for each unique value in column A, the average, standard deviation and standard error of its column B values, along with how many times it appears in data.txt. For instance:

   Value         Avg    Stdev    StdErr    Count

ACKR1/CCRL2     12.66    -         -         1
ACKR1/CMKLR1    10.41    -         -         1
ACKR1/CXCR4     13.01    -         -         1
ACKR1/FZD5      10.72    -         -         1
ACKR1/GPR107    10.22    -         -         1
ACKR1/GPR137    10.73    -         -         1
ACKR1/GPR146    11.93    -         -         1
ACKR1/TAPT1     11.24    -         -         1 
ACKR1/TPRA1     11.81    -         -         1
ACKR1/XPR1      10.01    -         -         1
ACKR2/ACKR3     11.16   1.11      0.42       7
ACKR2/ADGRA2    9.02    0.33      0.19       3

I am not too fazed about getting both the standard deviation and the standard error, but it would be nice to get at least one.

I'm a biologist, so I'm not too savvy with writing Unix scripts. I've really got no idea where to start, so any help would be appreciated. Sorry if something like this has been answered before; I searched but could not find anything.

Many thanks!

DiscoA

2 Answers


I'm not sure about the Stdev and StdErr calculation, but using this method you can easily calculate them:

$ awk '{A[$1]++;B[$1]+=$2}END{print "Value\t\tAvg\tCount";for (i in A){print i"\t"B[i]/A[i]"\t"A[i]}}' input.txt
Value           Avg     Count
ACKR2/ADGRA2    9.01667 3
ACKR1/GPR107    10.22   1
ACKR1/XPR1      10.01   1
ACKR1/CMKLR1    10.41   1
ACKR1/CCRL2     12.66   1
ACKR1/GPR146    11.93   1
ACKR1/GPR137    10.73   1
ACKR1/CXCR4     13.01   1
ACKR2/ACKR3     11.1586 7
ACKR1/TPRA1     11.81   1
ACKR1/FZD5      10.72   1
ACKR1/TAPT1     11.24   1

awk '{
    A[$1]++          # count occurrences of the column 1 key
    B[$1] += $2      # running sum of column 2 for that key
}
END {
    print "Value\t\tAvg\tCount"
    for (i in A) {
        print i "\t" B[i]/A[i] "\t" A[i]
    }
}' input.txt
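
In case the Stdev and StdErr columns from the question are still wanted, here is a minimal sketch of how the same approach could be extended, keeping a running sum of squares per key (it assumes the data.txt and results.txt names from the question):

awk '{
    n[$1]++                # count per key
    sum[$1]   += $2        # running sum of column 2
    sumsq[$1] += $2 * $2   # running sum of squares of column 2
}
END {
    print "Value\t\tAvg\tStdev\tStdErr\tCount"
    for (i in n) {
        avg = sum[i] / n[i]
        if (n[i] > 1) {
            # sample standard deviation and standard error of the mean
            sd = sqrt((sumsq[i] - n[i] * avg * avg) / (n[i] - 1))
            se = sd / sqrt(n[i])
            printf "%s\t%.2f\t%.2f\t%.2f\t%d\n", i, avg, sd, se, n[i]
        } else {
            printf "%s\t%.2f\t-\t-\t%d\n", i, avg, n[i]
        }
    }
}' data.txt > results.txt

Keys that occur only once get "-" for Stdev and StdErr, as in the desired output. As above, the rows come out in no particular order, so results.txt may need to be sorted afterwards if alphabetical order matters.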
Kamaraj
  • I've thought about the stdev and stderr and I won't need them so much after all. So your solution is perfect (and it works a charm), thank you very much! – DiscoA Nov 22 '16 at 07:58

Since I had some time to spare, and since I will (apparently) never get around to learning awk properly, I threw this together to see if I could solve it in Python.

Not taking anything away from @Kamaraj's awk answer (which I upvoted).

#!/usr/bin/env python3

colDict = {}

with open("cols") as infile:            # Open the file
    for line in infile:                 # Read line by line
        splitLine = line.split()        # Split line into a list
        value = splitLine[0]            # Key from column A
        if value not in colDict:        # Is value _not_ already in dict?
            colDict[value] = {}                          # Create nested dict
            colDict[value]["Avg"] = float(splitLine[1])  # Running sum (divided by Count when printing)
            colDict[value]["Count"] = 1                  # Occurrence counter
        else:                           # Value _is_ in dict
            colDict[value]["Avg"] += float(splitLine[1]) # Add to running sum
            colDict[value]["Count"] += 1                 # Increment counter

print("Value\t\tAvg\tCount")
for value in sorted(colDict):
    print("{0}\t{1:.2f}\t{2}".format(value, 
                              colDict[value]["Avg"]/colDict[value]["Count"],
                              colDict[value]["Count"]))

Upon execution:

./parseCols.py
Value           Avg     Count
ACKR1/CCRL2     12.66   1   
ACKR1/CMKLR1    10.41   1   
ACKR1/CXCR4     13.01   1   
ACKR1/FZD5      10.72   1   
ACKR1/GPR107    10.22   1   
ACKR1/GPR137    10.73   1   
ACKR1/GPR146    11.93   1   
ACKR1/TAPT1     11.24   1   
ACKR1/TPRA1     11.81   1   
ACKR1/XPR1      10.01   1   
ACKR2/ACKR3     11.16   7   
ACKR2/ADGRA2    9.02    3  

The only differences in output are that mine is sorted (on Value) and that the Avg is rounded to two decimal places.

Again, credit to @Kamaraj for writing such a compact solution in awk!