Searching file based on data range

Question

I previously posted requesting help with counting occurrences of a string. I'm now hoping to search for the occurrence of a string within a range of values and print out a similarly formatted file (the ranges below are sorted by the initial number in the range).

500506  genome  71445   71461   0
500506  genome  308369  308384  0
500506  genome  335450  335533  0
500506  genome  425268  425293  0
500506  genome  623326  623715  0
502289  genome  308370  308384  0
502289  genome  335462  335689  0
502289  genome  425268  425290  0

I want to get a list showing each range, the number of times that range occurs in my file, and which of the line identifiers have that range:

71445-71461 1 500506
308369-308369 1 500506
308370-308384 2 500506,502289 
335450-335461 1 500506
335462-335533 2 500506,502289
335534-335689 2 500506,502289
425268-425290 2 500506,502289
425291-425293 1 500506

In the example above, a range for 502289 could match a range for 500506 exactly, fall somewhere within it, or vice versa. Is this doable with a simple script, or should I be using something like a Perl script instead?

Can you explain in more detail what you mean by a range? 500506 has 308369-308384 but 502289 has 308370-308384. Please indicate which one should be chosen. – Costas, Jan 20 '15 at 22:26
The initial table I presented has the form: ID, info about the ID, minimum value in range, maximum value in range, then a number. Ideally, I would like the output to contain only the overlapping parts of the range in one row and, separately, the non-overlapping parts in another row. I've modified the post to reflect this clarification. – drea, Jan 20 '15 at 23:29

Costas · Accepted Answer · 2015-02-05T22:02:42.057

The following script should be tested on a much larger volume of data (more than 4 lines) to check that this statement behaves correctly: if ((A[1]<B[1] && B[2]<=A[2])||(A[1]<=B[1] && B[2]<A[2]))

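awk '
    BEGIN { SUBSEP = "-" }    # composite keys are stored as "start-end"
    # Collect the IDs seen for each exact range.
    {
        if (($3, $4) in ids)
            ids[$3,$4] = ids[$3,$4] "," $1
        else
            ids[$3,$4] = $1
    }
    END {
        # A range strictly nested inside another inherits the IDs of the outer range.
        for (rng1 in ids) {
            split(rng1, A, SUBSEP)
            for (rng2 in ids) {
                split(rng2, B, SUBSEP)
                if ((A[1]<B[1] && B[2]<=A[2]) || (A[1]<=B[1] && B[2]<A[2]))
                    ids[rng2] = ids[rng2] "," ids[rng1]
            }
        }
        # Print each range with the count and list of its distinct IDs.
        for (rng in ids) {
            m = split(ids[rng], D, ",")    # split once, not on every loop test
            for (i = 1; i <= m; i++)
                a[D[i]] = 1                # array keys remove duplicate IDs
            s = k = ""
            n = 0
            for (j in a) {
                k = k s j
                s = ","
                n++
            }
            print rng, n, k
            delete a
        }
    }' formatted.file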
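
To see what the if statement in the END block accepts, the same comparison can be tried on single pairs of ranges. This is only a sketch: strictly_inside is a hypothetical helper name that is not part of the answer, and it assumes integer bounds passed as four arguments.

```shell
# Hypothetical helper (not part of the answer): prints "yes" when range B
# lies inside range A without being identical to it, otherwise "no".
# Arguments: A_start A_end B_start B_end (assumed to be integers).
strictly_inside() {
    awk -v a1="$1" -v a2="$2" -v b1="$3" -v b2="$4" 'BEGIN {
        # Adding 0 forces a numeric comparison of the values passed with -v.
        a1 += 0; a2 += 0; b1 += 0; b2 += 0
        # Same condition as in the script, with A = (a1,a2) and B = (b1,b2).
        if ((a1 < b1 && b2 <= a2) || (a1 <= b1 && b2 < a2))
            print "yes"
        else
            print "no"
    }'
}
```

For example, strictly_inside 308369 308384 308370 308384 prints yes, while strictly_inside 335450 335533 335462 335689 prints no. The condition only detects nesting, so ranges that partially overlap are not combined or split by the script.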

edited Feb 05 '15 at 22:02

answered Jan 21 '15 at 01:03

Costas


Thanks Costas, looks pretty good. The only thing I'm noticing as I go through the results is that for some of the ranges, an ID is duplicated in the printed list. For example, ID 500506 has one range, 71445-71461, in my initial file. In the output file this ID appears in 2 regions, which makes sense, but in a 3rd region it is listed 3 times. Also, regarding the if statement you mention (if ((A[1]<... etc.): are you suggesting that I test the output of the larger awk script? – drea Jan 21 '15 at 20:15
@drea The input data can vary quite a bit, so the script should be tested. If you get undesired output, please show it along with an explanation of the problem and sample input. – Costas Jan 21 '15 at 21:00
I'm not sure how I can get you the input data you requested. I tried to add a few more clarifying lines to my example above in the hope that it helps. I've included one line of the output I'm concerned about as an example. The range listed in the output contains 500506 5 different times, even though there is only a single matching range (71445-71461) in the input file. OUTPUT LINE: 71451-71453 11 1400163,500506,901048,500506,800315,1101019,500506,901048,500506,900367,500506 Thanks again! – drea Jan 22 '15 at 22:15
@drea Now I see. Can you show the output of grep "1400163\|500506\|901048\|800315\|1101019\|900367" formatted.file so I can understand what is going on? – Costas Jan 22 '15 at 22:35
The output of grep "..." on the input file, as requested? It contains >500 lines of data. – drea Jan 23 '15 at 15:22
@drea You can use a pastebin site. – Costas Jan 23 '15 at 15:27
@drea I have edited the answer. Check whether the result is satisfactory. – Costas Jan 23 '15 at 17:22
OK, I have removed my previous stupid question. The script is still giving strange results. This time several parts of the range are repeated in different rows, and for pretty much all of them, every line identifier is listed. 71445 71461 12 .... \n 71445 71453 12 .... \n 71448 71449 12 .... \n 71448 71453 12 .... \n 71450 71453 12 .... \n (This output has been re-sorted by the lower number in the range; 12 is the total number of identifiers, and each line also contains the list of IDs.) – drea Feb 03 '15 at 17:05
@drea It is rather difficult to understand what result you need. Please take 50-60 lines from the input file and show what output you expect. – Costas Feb 03 '15 at 17:22
It's not 50-60 lines, because it would take me too long to generate the output I'm looking for by hand for that many observations. http://pastebin.com/yiL4VTDV – drea Feb 03 '15 at 17:50
Were you able to get it, or do I need to repaste it? – drea Feb 05 '15 at 20:43
Either edit your post or put the input and the expected output in a pastebin. – Costas Feb 05 '15 at 20:46
Edited. Hope that helps. – drea Feb 05 '15 at 20:55
@drea Edited. But I still do not understand where in your example you get the range 335534-335689 etc., so my output is a little bit different. – Costas Feb 05 '15 at 22:24
That works wonderfully. Thanks so much for all the effort! – drea Feb 06 '15 at 14:24
But oh so slow! Any speed ups possible? – drea Feb 26 '15 at 16:31
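
One possible speed-up, sketched here as an alternative and not as the accepted answer's method: instead of comparing every range with every other range, sort the range endpoints once and sweep over them. The function name segment_ranges is hypothetical, and the sketch assumes whitespace-separated input in the column layout from the question, with inclusive integer bounds in columns 3 and 4. Unlike the script above, it also splits partially overlapping ranges into separate segments.

```shell
# Sketch only: segment_ranges is a hypothetical name, not from the thread.
# Assumed input columns: ID, label, start, end, extra (start/end inclusive).
# Step 1: emit an "open" event at each start and a "close" event just past each end.
# Step 2: sort the events numerically by position.
# Step 3: sweep over the events; between two consecutive positions the set of
#         active IDs is constant, so each non-empty gap is one output segment.
segment_ranges() {
    awk 'NF >= 4 { print $3, 1, $1; print $4 + 1, -1, $1 }' "$@" |
    LC_ALL=C sort -k1,1n |
    awk '
        # Print the segment from prev to upto-1 if any ID is active on it.
        function emit(upto,    i, n, list, sep) {
            n = 0; list = ""; sep = ""
            for (i = 1; i <= nid; i++)
                if (active[order[i]] > 0) {
                    list = list sep order[i]; sep = ","; n++
                }
            if (n > 0) print prev "-" (upto - 1), n, list
        }
        NR == 1 { prev = $1 }
        NR > 1 && $1 != prev { emit($1); prev = $1 }
        {
            # Remember IDs in order of first appearance for stable output.
            if (!($3 in active)) order[++nid] = $3
            # A counter per ID copes with one ID owning overlapping ranges.
            active[$3] += $2
        }
    '
}
```

It would be called as segment_ranges formatted.file. The work is one sort plus a single pass over the events, so it should scale better than the nested loops over all pairs of ranges, although this has not been timed on real data. On the eight sample lines from the question it reports 335534-335689 with 502289 only, since the 500506 range ends at 335533.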
