2

I have a large database of this format:

0500506 gi|561108321|ref|NC_018143.2|   71446   0 
0500506 gi|561108321|ref|NC_018143.2|   71447   0
0500508 gi|561108321|ref|NC_018143.2|   71449   0
0500509 gi|561108321|ref|NC_018143.2|   71447   0

What I want to do is to create a new table with the number of occurrences of each of the third column, something like this

71446 1 0500506
71447 2 0500506,0500509
71449 1 0500508

Importantly, the whole lines won't match, so I want to make sure that I'm just matching the values in a single column. I'm assuming I can adapt one of the solutions from Count total number of occurrences using grep, but am not sure how.

Thanks very much, drea

drea
  • 35

1 Answers1

3

awk might be more up to the task

awk '{sep=$3 in a?",":"";b[$3]++;
     a[$3]=a[$3]sep$1}; END{for (k in a) print k, b[k], a[k]}' file
iruvar
  • 16,725