
Yes, I know it sounds odd: sequential reads and binary splits don't mix... that is, unless the sequence is the byte offset within the file itself...

I've cobbled together a binary-split search in a bash script, using dd. It has search-and-find times of 3-9 seconds for an 8 GB file, so it works (but more slowly than I know it can)... I'd really prefer not to have to polish this wheel; it just took my fancy to do it as an exercise in bash (having a project is the best way to learn a language, etc). I think this would be pretty straightforward in C/C++, etc... I'm curious to see some other examples (particularly bash ones).
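Here's a minimal sketch of the approach (not my actual script, which is linked in the comments below). It assumes a file sorted in C collation on a leading whitespace-separated key, lines well under the 4 KiB block size, GNU stat, and a key with no regex metacharacters; bsearch.sh is just a name for the example:

    #!/bin/bash
    # bsearch.sh FILE KEY -- binary-split search over 4 KiB blocks of a sorted file.
    export LC_ALL=C                  # compare in the collation the file was sorted in
    bs=4096
    file=$1 key=$2
    size=$(stat -c %s "$file")       # file size is a metadata lookup: no sequential read
    lo=0                             # block whose first full line sorts before $key
    hi=$(( size / bs ))              # block whose first full line sorts at/after $key
    while (( hi - lo > 1 )); do
      mid=$(( (lo + hi) / 2 ))
      # dd lseek()s straight to block $mid; drop the (possibly partial) first
      # line of the chunk and compare the key of the first complete line.
      line=$(dd if="$file" bs=$bs skip=$mid count=2 2>/dev/null | sed -n 2p)
      read -r k _ <<<"$line"
      if [[ $k < $key ]]; then lo=$mid; else hi=$mid; fi
    done
    # The matching line (if any) now starts in block $lo or $hi, so one short
    # sequential read finishes the job (this also catches a match on line 1).
    dd if="$file" bs=$bs skip=$lo count=3 2>/dev/null | grep -m1 "^$key "

Each probe costs one lseek plus an 8 KiB read, so the whole search is about log2(size/bs) probes, roughly 21 for 8 GB, which is why it finishes in seconds where head/tail -c takes minutes.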

Peter.O
  • By binary split you mean just split in half once, or keep splitting the halves into smaller halves? If you just need it once, probably something along the lines of head -n $(($(cat file | wc -l) / 2)) file for the first half and tail ... for the second half. – Marcin Mar 09 '11 at 02:05
  • @marcin: Yes, that's what I mean. I've tried head/tail -c. It slows down dramatically, to 2 min for just one access at offset=8GB, and that's without using wc. wc alone takes 2 min on 8GB (see 3rd comment here: http://unix.stackexchange.com/questions/8444/is-it-possible-in-bash-to-start-reading-a-file-from-an-arbitary-byte-count-offse/8498#8498)... So a head/tail split may technically work, but a simple awk/sed search is faster... dd's skip avoids most sequential reads. (My max 9 sec comes from a file with each key having 86400 duplicates, for which I used a combo of split and sequential reads.) – Peter.O Mar 09 '11 at 03:08
  • Split up the problem into pieces and see which part is the expensive part. If it is the initial line count that makes it slow, see if you can arrive at the number of lines differently. You seem to have some knowledge about the dataset (how many of which values); can you use that to figure out where the middle is? Also, does it have to be the exact middle? If all you need is an approximate middle, read the size (a meta-info lookup, no sequential reads at all!), divide it in half, and start printing size/2 characters; once you reach that point, print the other half. – Marcin Mar 09 '11 at 05:41
  • Marcin, thanks. I think my main problem was that I didn't plan it well enough. Finding each key was relatively straightforward (1 second per key), but chasing through the duplicates got messy. I located the first/last duplicate key by binary-doubling out looking for a non-dupe, then used a sequential read to find the correct first/last key line (that doubling step is sketched just after these comments). This bogs down when there are thousands of lines with the same YYYY-MM-DD key (e.g. logging one line per second). If you are interested, here is the script: http://paste.ubuntu.com/577919/ – Peter.O Mar 09 '11 at 16:46
  • Please give an I/O example; it is hard to understand what you mean ;-) – Ciro Santilli OurBigBook.com Nov 22 '15 at 10:04
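To make the comment thread concrete, here is the "binary-doubling out" step sketched in isolation. It leans on a hypothetical helper key_at (echo the key of the first complete line at or after a byte offset — the same dd | sed trick as in the main sketch) and on $hit, a byte offset already known to land inside the run of duplicates:

    # Gallop backwards from a known hit to bracket the FIRST duplicate.
    # key_at OFFSET  -- hypothetical helper: key of first full line at/after OFFSET
    step=1
    lo=$hit
    while (( lo - step >= 0 )) && [[ $(key_at $(( lo - step ))) == "$key" ]]; do
      lo=$(( lo - step ))
      step=$(( step * 2 ))         # probe 1, 2, 4, 8, ... bytes further back
    done
    # The run's first line now lies in the window (lo - step, lo]; finish with an
    # ordinary binary split (or a short sequential read) over just that window.
    # Finding the LAST duplicate is the mirror image, galloping forward instead.

The doubling keeps the probe count logarithmic in the length of the run, so even 86400 duplicates per key cost only a couple of dozen extra probes before the final, small sequential read.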

1 Answer


All the basic text processing utilities are meant to act as filters, and most are meant to process their input as a stream (i.e. read a little input, process it, write the corresponding output, repeat). dd is a little unusual, both by its syntax and by the options it offers. dd is the only shell interface to lseek, and as you've noticed it's clumsy. When you reach this point, it's time to switch to a more powerful scripting language such as Perl or Python.
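For what it's worth, the lseek-via-dd idiom the answer refers to looks like this (bigfile is a placeholder name; the skip_bytes/count_bytes flags are GNU dd extensions):

    # Read 100 bytes starting at byte offset 8 GiB, without reading what precedes it.
    offset=$(( 8 * 1024**3 ))

    # Portable, but clumsy: skip is counted in bs-sized blocks, so bs=1 makes it
    # a byte offset -- at the cost of single-byte read()s for the payload.
    dd if=bigfile bs=1 skip="$offset" count=100 2>/dev/null

    # GNU dd: skip_bytes/count_bytes interpret skip and count as byte counts,
    # so bs can stay large for efficient reads.
    dd if=bigfile bs=64k skip="$offset" count=100 iflag=skip_bytes,count_bytes 2>/dev/null

In Perl or Python the same thing is a plain seek to the offset followed by a read, which is much of what makes the switch worthwhile.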

  • Thanks, Gilles... That sounds like good advice. I've been focusing on bash, but I suspect it's time now to "just use it for what it is" and start looking at something more heavyweight. I've touched on Python, so I'll probably continue there. Thanks – Peter.O Mar 08 '11 at 19:41
  • Use the bisect module in Python to do this. – Michael Dillon Mar 09 '11 at 00:00