tr -c \\n 1 <testfile | #first transform every [^\n] char to a 1
grep -nF '' | #next get line numbers
paste -d: - testfile | #then paste that together with the original file
sort -t: -nk2,2 #then sort numerically on the second field
...and the winner is... line 2, it would seem.
2:1111:4for
4:11111:five!
1:1111111:seven/7
3:11111111:8 eight?
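For reference, the testfile behind that output (reconstructed from the pasted lines above) looks like:
seven/7
4for
8 eight?
five!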
But the problem with that is that every line must more than double in length for it to work - so LINE_MAX is effectively halved. The cause is that it uses - what, base 1? - to represent the length of the line. A similar - and perhaps tidier - approach might be to compress that information in-stream. The first idea along those lines that occurs to me is that I ought to unexpand it:
tr -c \\n \ <testfile | #transform all [^\n] to <space>
unexpand -t10 | #squeeze every series of 10 to one tab
grep -nF '' | #and get the line numbers
sed 's/:/!d;=;:/;h;:big #sed compares sequential lines
$P;$!N; /\(:[^ ]*\)\( *\)\n.*\1.*\2/!D #newest line is shorter or...
g;/:./!q;b big' | #not; quit input entirely for blank line
sed -f - -e q testfile #print only first occurrence of shortest line
That prints...
2
4for
Another one, just sed:
sed -n '/^\n/D;s/\(.\)\(\n.*\)*/\1/g
$p;h; s// /g;G;x;n;//!g;H;s// /g
G; s/^\( *\)\(\n \1 *\)\{0,1\}\n//
D' <infile >outfile
The syntax is standards compliant - but that is no guarantee that any old sed will handle the \(reference-group\)\{counts\} construct correctly - many do not.
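One quick, throwaway way to probe a given sed for that (not part of the script itself) is to hand it an interval applied to a group that carries a backreference:
echo aab | sed 's/\(a\)\(\1b\)\{0,1\}/X/' #should print only X
A sed that mishandles the interval will either complain or print aab untouched.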
It basically applies the same regexp to the input repeatedly - which can be very beneficial when it is time to compile it. That pattern is:
\(.\)\(\n.*\)*
Which matches different strings in different ways. For example:
string1\nstring2\nstring3
...is matched with s in \1 and '' - the null string - in \2.
1\nstring2\nstring3
...is matched with 1 in \1 and \nstring2\nstring3 in \2.
\nstring2\nstring3
...is matched with \n in \1 and '' - the null string - in \2.
This would be problematic if there was any chance of a newline occurring at the head of the pattern space - but the /^\n/D and //!g commands are used to prevent this. I did use [^\n] but other needs for this little script made portability a concern and I wasn't satisfied with the many ways it is often misinterpreted. Plus, . is faster.
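As an aside, the empty // patterns - the //!g above and the script's s// /g - simply reuse whatever regexp was applied last. For example:
echo foo | sed 's/o/0/;s//0/' #the empty // reuses o, so this prints f00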
\nstring2
string1
...match \n and s again in \1, and both get the '' null string in \2. Empty lines don't match at all.
When the pattern is applied globally the two biases - both the left-most standard bias and the lesser right-side newline bias - are counter-balanced to effect a skip. A few examples:
s/\(.\)\(\n.*\)*/\1:\2/g
s/\(.\)\(\n.*\)*/\2\1:/g
s/\(.\)\(\n.*\)*/\1:/g
s/\(.\)\(\n.*\)*/ :\2/g
...if each is applied (not in succession) to the following string...
string1\nstring2
...will transform it to...
s:t:r:i:n:g:1:\nstring2
s:t:r:i:n:g:\nstring21:
s:t:r:i:n:g:1:
: : : : : : :\nstring2
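Those are easy enough to try at a prompt - the N here just pulls both lines into one pattern space so the regexp has an embedded newline to work with (just a quick demonstration):
printf 'string1\nstring2\n' | sed 'N;s/\(.\)\(\n.*\)*/\1:\2/g'
That prints the first of the four results above - s:t:r:i:n:g:1: on one output line and string2 on the next.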
Basically I use the regexp to always handle only the first line in any pattern-space to which I apply it. That enables me to juggle two different versions of both a retained shortest-match-so-far line and the most recent line without resorting to test loops - every substitution applied handles the entire pattern-space at once.
The different versions are necessary for literal string/string comparisons - so there must be a version of each line where all characters are guaranteed to be equal. But of course if one or the other should wind up actually being the earliest occurring shortest line in input, then the line printed to output should probably be the original version of the line - not the one I've sanitized/homogenized for comparison's sake. And so I need two versions of each.
It is unfortunate that another necessity is a lot of buffer switching to handle all of this - but at least neither buffer ever holds more than the four lines needed to stay current - and so maybe it is not terrible.
Anyway, for each cycle the first thing that happens is a transformation on the remembered line - because the only copy actually saved is the literal original - into...
^ \nremembered line$
...and afterward the next input line overwrites any old buffer. If it does not contain at least a single character it is effectively ignored. It would be far easier just to quit at the first occurring blank line, but, well, my test data had a lot of those and I wanted to handle multiple paragraphs.
And so if it does contain a character its literal version is appended to the remembered line and its spaced comparison version is positioned at head of pattern space, like this:
^ \n \nremembered line\nnew$
Last a substitution is applied to that pattern space:
s/^\( *\)\(\n \1 *\)\{0,1\}\n//
So if the new line can fit within the space needed to contain the remembered line with at least one char to spare then the first two lines are substituted away, else only the first.
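That substitution is easy to poke at on its own. Faking up the four-line pattern space described above with printf and a few Ns (just a demonstration rig using the sample data):
printf '%s\n' '    ' '       ' 'seven/7' '4for' | #4 spaces for 4for, 7 for seven/7, then both literals
sed 'N;N;N;s/^\( *\)\(\n \1 *\)\{0,1\}\n//'
Because the four spaces fit inside the seven with room to spare, both space lines are removed and only seven/7 and 4for come out; swap the two widths and only the first line would go.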
Regardless of the outcome the first line in pattern space is always deleted by the D at end-of-cycle before starting again. This means that if the new line is shorter than the last the string...
new
...is sent back to the first substitution in the cycle which will always strip only from the first newline char on - and so it remains whole. But if it is not then the string...
remembered line\nnew
...will begin the next cycle instead, and the first substitution will strip from it the string...
\nnew
...every time.
On the very last line the remembered line is printed to standard out, and so for the example data given, it prints:
4for
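To tie the walkthrough together, here is the same script once more with my reading of each step spelled out as comment lines - the commands and their order are untouched, only the comments (and a little whitespace) are new:
sed -n '
#guard: a newline at the head of pattern space means the remembered line is
#empty - trim it off and restart the cycle
/^\n/D
#keep only the first line in pattern space - strip from the first newline on
s/\(.\)\(\n.*\)*/\1/g
#on the very last input line, print the survivor
$p
#save the literal line, space out the copy, append the literal back, and swap -
#the hold space now holds the spaced and literal remembered line
h; s// /g; G; x
#fetch the next input line; if it is blank, fall back to the held copy instead
n; //!g
#append its literal to the hold space, then space out the copy in pattern space
H; s// /g
#append the hold space: spaced new, spaced remembered, remembered, new
G
#if the new line fits in the remembered one with a char to spare, drop both
#space lines; otherwise drop only the first
s/^\( *\)\(\n \1 *\)\{0,1\}\n//
#Delete through the first newline and restart the cycle on whatever remains
D' <infile >outfile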
But, seriously, use tr.