1

I have many .txt files with the following information in 4 columns and thousands of rows:

chr10 73121691 18 SLC29A3
chr10 73121692 14 SLC29A3
chr10 73121693 10 SLC29A3
chr10 73120590 15 SLC29A3
chr10 73120591 15 SLC29A3
chr10 73120592 6 SLC29A3
chr10 12345678 25 COL1A1
chr10 12345679 8 COL1A1
chr10 12345680 6 COL1A1
chr17 48431036 5 LRP5
chr17 48431037 8 LRP5
chr17 48431038 5 LRP5

What I would like as an output is the following:

chr10 73121691 - 73121693 , 14, SLC29A3
chr10 73120590 - 73120592 , 12, SLC29A3
chr10 12345679 - 12345680 , 13, COL1A1
chr17 48431036 - 48431038 , 6, LRP5

The chr# range of consecutive numbers, average of column 3 and name associated with the range.

Is there a script I could use to do this on multiple files in one go?

Thanks

Gbardai
  • 51
  • What is the column separator? Spaces or tabs? Is the input file sorted by chromosome and position? Also, will all numbers be consecutive? Or can you have multiple sets of consecutive numbers in the same chromosome? – terdon Feb 03 '21 at 18:50
  • Thank you taking the time to format my input. The column separators are tabs. In the original excel file I concatenated the chr and position in a new column. I then sorted on the concatenated column. So it is separated on Chr and position. Yes you can have multiple sets of consecutive numbers in the same chromosome. I have tried code that was presented in a similar problem on this exchange but I was not successfull. – Gbardai Feb 03 '21 at 20:34
  • 1
    @EdMorton based on the type of data, there should never be a different chr (chromosome) for the same SLC (gene), but there can be different genes on the same chromosome. – terdon Feb 04 '21 at 09:55
  • @EdMorton, terdon is correct. I have updated the original post to clarify. – Gbardai Feb 04 '21 at 13:16

2 Answers2

5

Every bioinformatician needs Datamash in their toolkit :)

$ datamash -W groupby 1 min 2 max 2 mean 3 unique 4 < tmp/data.txt
chr10   73121691        73121693        14      SLC29A3
chr17   48431036        48431038        6       LRP5

The numbers in the command line arguments refer to columns. So, we are grouping by column 1, giving the min & max (range) of column 2, the mean of column 3, and a comma-separated list of corresponding entries from column 4.

cryptarch
  • 1,270
  • 1
    Thanks. After you posted your reply I significantly edited my original question. I used Datamash, very good program btw, but it did not give me the desired output. However, it is definitely a program I will use. – Gbardai Feb 04 '21 at 20:43
1

This may be what you want depending on the answers to the questions in my comment:

$ cat tst.awk
$2 != (prev[2] + 1) {
    if (NR > 1) {
        prt()
    }
    split($0,beg)
    sum = cnt = 0
}
{
    split($0,prev)
    sum += $3
    cnt++
}
END { prt() }

function prt( ave) { ave = (cnt ? sum / cnt : 0) print prev[1], beg[2], "-", prev[2], "," ave ",", prev[4] }

$ awk -f tst.awk file
chr10 73121691 - 73121693 ,14, SLC29A3
chr10 73120590 - 73120592 ,12, SLC29A3
chr10 12345678 - 12345680 ,13, COL1A1
chr17 48431036 - 48431038 ,6, LRP5
Ed Morton
  • 31,617
  • Thank you it worked. I was using excel and vlookup to do this. Rather long and tedious. However, I compared your script to the excel vlookup and they match exactly. I know this may asking to much: Would it possible to explain in the script what exactly is happening. Thanks again. – Gbardai Feb 04 '21 at 20:39
  • You're welcome. Please take a look at the man page, add some prints to it and figure out as much as you can yourself and then feel free to ask specific questions. – Ed Morton Feb 04 '21 at 20:41