
If I had a plain-text file containing 8 million lines and I wanted to print lines 4,000,000 to 4,000,010 to the screen, which would be more efficient: awk or sed?

There is no pattern to the text and, unfortunately, a database isn't an option. I know this isn't ideal; I'm just curious which one would complete the task more quickly.

Or maybe there is even a better alternative to sed or awk?

dukevin

1 Answer


Neither; use a combination of tail and head instead:

$ time tail -n 4000001 foo | head -n 11
real    0m0.039s
user    0m0.032s
sys     0m0.004s

$ time head -n 4000010 foo | tail -n 11
real    0m0.055s
user    0m0.064s
sys     0m0.036s
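
To see where those numbers come from: on an 8,000,000-line file, tail -n 4000001 emits lines 4,000,000 through 8,000,000 and head -n 11 keeps the first 11 of them, while head -n 4000010 emits the first 4,000,010 lines and tail -n 11 keeps the last 11. A quick sanity check, assuming a synthetic file where line N contains just the number N:

$ seq 8000000 > foo                                   # line N contains N
$ tail -n 4000001 foo | head -n 11 | sed -n '1p;$p'   # first and last line printed
4000000
4000010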

The tail | head version is in fact consistently faster. I ran both commands 100 times and averaged the results:

tail | head:

real    0.03962
user    0.02956
sys     0.01456

head | tail:

real    0.06284
user    0.07356
sys     0.07244
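
One way to collect averages like these in bash (a sketch; times.log is just an illustrative name):

$ for i in $(seq 100); do
>     { time tail -n 4000001 foo | head -n 11 > /dev/null; } 2>> times.log
> done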

I imagine tail is faster because, although it has to seek all the way to line 4,000,000, it does not actually print anything until it gets there, whereas head prints everything up to line 4,000,010.
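
You can actually watch tail being cut short: in bash, once head has printed its 11 lines it exits, and tail is then killed by SIGPIPE instead of writing the remaining ~4 million lines (an exit status of 141 is 128 + 13, i.e. SIGPIPE):

$ tail -n 4000001 foo | head -n 11 > /dev/null
$ echo "${PIPESTATUS[@]}"
141 0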


Compare this to some other methods:

sed:

$ time sed -n '4000000,4000010p;4000011q' foo
real    0m0.312s
user    0m0.236s
sys     0m0.072s

Perl:

$ time perl -ne 'next if $.<4000000; print; exit if $.>=4000010' foo 
real    0m1.000s
user    0m0.936s
sys     0m0.064s

awk:

$ time awk '(NR>=4000000 && NR<=4000010){print} (NR==4000010){exit}' foo 
real    0m0.955s
user    0m0.868s
sys     0m0.080s

Basically, the rule is: the less you parse, the faster you are. Treating the input as a stream of data that only needs to be printed to the screen (as tail does) will always be the fastest approach.
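
If you need this often, the head | tail form is easy to wrap in a small function, since it doesn't require knowing the file's total line count. A minimal sketch (printlines is a made-up name; no error checking):

$ printlines() { head -n "$3" "$1" | tail -n "$(( $3 - $2 + 1 ))"; }
$ printlines foo 4000000 4000010    # lines 4,000,000 through 4,000,010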

terdon
  • Great! Works visibly faster than sed. Just out of curiosity however would you know the efficiency difference between sed/awk for the same task? – dukevin Oct 09 '13 at 20:30
  • @KevinDuke make sure you use my updated version, there was a mistake in the first one. I will post some tests with other tools in a second but all of them will need to parse the lines somehow which should make them slower. – terdon Oct 09 '13 at 20:33
  • @KevinDuke comparisons added. – terdon Oct 09 '13 at 20:45
  • Is there any disadvantage to doing it the other way around? I mean: head -n 4000010 foo.txt | tail -n 10? It seems the more intuitive way to do it to me, since it doesn't require you to know the exact length of the file. It should even be faster, since AFAIK tail has to read the whole file (in order to know how many lines there are), whereas head stops reading after the appropriate number of lines. – evilsoup Oct 09 '13 at 20:47
  • Wow! Thanks a lot for spending the time to do all these tests, it is really interesting and useful. – dukevin Oct 09 '13 at 20:49
  • @evilsoup see updated answer. head will be slower because it has to print unnecessary lines. – terdon Oct 10 '13 at 00:01
  • Adding a NR>4000010{exit} to the end of the awk or a ;4000011q to the end of the sed function cuts the time down by almost half, largely because that allows the script to skip reading nearly half of the file. But head | tail is still faster. – dannysauer Oct 10 '13 at 01:36
  • @dannysauer good points, thanks. Answer updated. – terdon Oct 10 '13 at 01:55
  • Your results are interesting. On one of my slower systems, head | tail is consistently faster than tail | head (by close to 50%). :) Goes to show that benchmarks need to be verified in a given environment before being relied upon absolutely. :D – dannysauer Oct 10 '13 at 15:33