So, in general, I tend to look to sed for text processing - especially for large files - and usually avoid doing those sorts of things in the shell itself.
I think, though, that may change. I was poking around at man ksh and I noticed this:
<#pattern     Seeks forward to the beginning of the next line containing pattern.
<##pattern    The same as <# except that the portion of the file that is skipped is copied to standard output.
Skeptical of real-world usefulness, I decided to try it out. I did:
seq -s'foo bar
' 1000000 >file
...for a million lines of data that look like:
1foo bar
...
999999foo bar
1000000
...and pitted it against sed like:
p='^[^0-8]99999.*bar'
for c in "sed '/$p/q'" "ksh -c ':<##@(~(E)$p)'"
do </tmp/file eval "time ( $c )"
done | wc -l
So both commands should get up to 999999foo bar, and their pattern matching implementations must evaluate at least the beginning and end of each line in order to do so. They also have to verify the first char against a negated pattern. This is a simple thing, but... The results were not what I expected:
( sed '/^[^0-8]99999.*bar/q' ) \
0.40s user 0.01s system 99% cpu 0.419 total
( ksh -c ':<##@(~(E)^[^0-8]99999.*bar)' ) \
0.02s user 0.01s system 91% cpu 0.033 total
1999997
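The 1999997 total is the combined line count of both commands. A rough sketch of how it breaks down, assuming GNU seq and sed and a temporary file from mktemp in place of /tmp/file (the variable names are mine, not part of the original test):

```shell
# Rebuild the test data (GNU seq assumed; the separator ends in a newline).
f=$(mktemp)
seq -s'foo bar
' 1000000 >"$f"

p='^[^0-8]99999.*bar'

# sed prints each line up to and including the first match, then quits.
sed_lines=$(sed "/$p/q" <"$f" | wc -l)
sed_last=$(sed "/$p/q" <"$f" | tail -n1)

# Whatever is left of the reported total is ksh's share.
ksh_lines=$((1999997 - sed_lines))

echo "sed: $sed_lines lines, last one: $sed_last"
echo "ksh: $ksh_lines lines"
rm -f "$f"
```

If sed accounts for every line through the match, the remainder for ksh is one line fewer, which fits the man page's wording that <## copies only the skipped portion and stops at the beginning of the matching line.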
ksh uses ERE here and sed a BRE. I did the same thing with ksh and a shell pattern before but the results did not differ.
Anyway, that's a fairly significant discrepancy - ksh outperforms sed more than 10 times over (0.419s against 0.033s total). I've read before that David Korn wrote his own io lib and implements it in ksh - possibly this is related? - but I know next to nothing about it. How is it the shell does this so well?
Even more amazing to me is that ksh really does leave its offset right where you ask it. To get (almost) the same out of (GNU) sed you have to use -u - very slow.
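To show what I mean by the offset, here is a minimal sketch with GNU sed's -u. It assumes GNU seq and sed, and it uses a scaled-down file (1000 lines) and a scaled-down pattern of my own choosing, so it only illustrates the behavior and says nothing about timing:

```shell
# Scaled-down stand-in for the test file: 1foo bar ... 999foo bar, 1000.
f=$(mktemp)
seq -s'foo bar
' 1000 >"$f"

# Scaled-down pattern: the first matching line is "999foo bar".
p='^[^0-8]99.*bar'

# sed and head share one open file. With -u, GNU sed reads as little as
# it can, so head picks up at the line right after sed's match.
next_after_sed=$( { sed -u "/$p/q" >/dev/null; head -n1; } <"$f" )

echo "line after sed's match: $next_after_sed"
rm -f "$f"
```

The "almost" is that sed has already consumed the matching line by the time it quits, so the next reader starts one line later than it would after ksh's <#.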
Here's a grep v. ksh test:
1000000 #grep + head
( grep -qm1 '^[^0-8]99999.*bar'; head -n1; ) \
0.02s user 0.00s system 90% cpu 0.026 total
999999foo bar #ksh + head
( ksh -c ':<#@(~(E)^[^0-8]99999.*bar)'; head -n1; ) \
0.02s user 0.00s system 73% cpu 0.023 total
ksh beats grep here - but it doesn't always - they're pretty much tied. Still, that's pretty excellent, and ksh provides lookahead - head's input starts at the beginning of the matched line rather than after it.
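The grep half of that comparison can be reproduced at small scale. This is only a sketch, assuming GNU grep (which documents that with -m it leaves a regular-file standard input positioned just after the last matching line), GNU seq, and a scaled-down file and pattern that are my own stand-ins:

```shell
# Scaled-down stand-in for the test file: 1foo bar ... 999foo bar, 1000.
f=$(mktemp)
seq -s'foo bar
' 1000 >"$f"

# Scaled-down pattern: the first matching line is "999foo bar".
p='^[^0-8]99.*bar'

{
  # grep consumes the matching line itself...
  matched=$(grep -m1 "$p")
  # ...so head, reading from the same open file, sees the line after it.
  next_after_grep=$(head -n1)
} <"$f"

echo "grep matched: $matched"
echo "head then read: $next_after_grep"
rm -f "$f"
```

That is the difference from the ksh run above: after grep the match is already gone from the input, whereas after ksh's <# it is still the next line to be read.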
It just seems too good to be true, I guess. What are these commands doing differently under the hood?
Oh, and apparently there's not even a subshell here:
ksh -c 'printf %.5s "${<file;}"'
ksh's regex engine is efficient as its io? Anyway, thanks very much for the answer. My apologies to your laptop. What about the custom memory allocator, though? Do you have any more on that? – mikeserv Dec 22 '14 at 14:01
"Most of the standard POSIX commands are available in the AST collection. Many are coded as library functions which can be added to ksh as built-in command which dramatically improves performance." - Now I've just gotta figure out how to build it, – mikeserv Dec 22 '14 at 14:44