
In this edit Stéphane Chazelas POSIXifies (again) my sed formatting by inserting an -expression break and another -expression statement. Now, I might just ask him why in the comments, I suppose, but it is already revision number 18 on that answer and almost all of the previous ones were already thanks to similar freebies (if you can see deleted comments you'll know what I mean). Also, I think I'm near enough to understanding why to phrase this in a way that might be more generally useful. So here's hoping...

I generally prefer to keep my total sed -expressions to one if I might, but I also have a greater preference for conforming to the spec as near as I can, especially when the difference amounts to no more than a <space> and an -e. But I cannot do this if I do not understand why I should. Here's a brief rundown of the current state of my understanding:

  • the ' -e ' break can portably stand in for a sed script \newline break in a sed command-line statement... I am admittedly fuzzy about why

  • the closing brace in a sed { function } must be preceded by a \newline break as stated here:

    • The <right-brace> shall be preceded by a <newline> and can be preceded or followed by <blank> characters.
  • a \newline break is similarly required following any use of... a, b, c, i, r, t, w, or :.

But I do not understand clearly how the { function } definition relates to the !not operator. The only mention I find of the negation operator in the spec states:

  • A function can be preceded by one or more ! characters, in which case the function shall be applied if the addresses do not select the pattern space.

Does this mean that use of a ! implies { braces }? What of $!commands - should they likewise be separated by ' -e ' breaks? Was this what was addressed when Stéphane most recently POSIXified my answer?

I think it is either the !negation operator, or it is the branch statement he addresses in his edit - or possibly it is both at once - but I do not know and should like to. If it is only the branch statement, then I believe a d would do in its place and eliminate the need for the ' -e ' break, but I'd rather be certain before hazarding a thrice POSIXified answer. Can you help?

I did risk it after all, but not with any great certainty...

mikeserv
  • With b;n;:b, you're branching to the label called ";n;:b" in historical and POSIX seds (and GNU sed is not in that regards). – Stéphane Chazelas Aug 05 '14 at 16:49
  • @StéphaneChazelas - I get the : part - you drove that home months ago. But I don't fully understand why the second sed command was similarly POSIXified. – mikeserv Aug 05 '14 at 16:54
  • Well because b;h;n;G;P;D branches to the ;h;n;G;P;D label. What do you mean? – Stéphane Chazelas Aug 05 '14 at 16:56
  • In any case, the POSIX spec for sed is very unclear to me. I've requested clarifications a few times in the past, but I don't think it was updated as a result. A good test is to try with the heirloom toolchest (Solaris one, derived from the original and which the POSIX spec is largely based on). – Stéphane Chazelas Aug 05 '14 at 16:59
  • @StéphaneChazelas - oh! And so the ! is not involved at all? I mean what I asked - how does the ! relate to a function and/or breaks required due to the }? Does it at all? Hmmm... any chance you could put something to that effect in an answer? – mikeserv Aug 05 '14 at 16:59
  • TBH, the questions in the OP are a bit too chaotic for me, but to explain things in a nutshell for sed learners: simply memorize -e as to be mandatory if, for instance, you want to do multiple replacements using regex, e. g. sed -e 's/foo/bar/' -e 's/foo/baz/g'. Mandatory means, "mandatory unless you want to do something really stupid" (like pipe a couple sed statements in a chain when there is no reason to do so; always gives me the shivers when I see this) – syntaxerror Dec 05 '14 at 16:22
  • @syntaxerror - i dont believe that is the case at all. if you read the spec youll find that s///ubstitutions are spec'd to accept chaining with a ; . it gets blurry around commands that must be delimited with a newline and how -e can stand in in that case - at least it does for me. ive yet to stumble on a sed that doesnt interpret them pretty interchangeably though. – mikeserv Dec 05 '14 at 16:27
  • Great point! So it's just a habit of mine to use -e then (still better than this stupid piping after all, no? ;)). But ... um ... of course you're right. I confess I always keep forgetting that you can use the ; as a separator even between different s/// regex statements. I'm just not used to it. Heck knows why. – syntaxerror Dec 05 '14 at 16:30
  • @syntaxerror - yes, in most cases it is preferable. sometimes though, a little parallelism is a benefit. still, i do generally prefer a single out per loop iteration and passing output wholly for each in a straight line as much as i can. there are some other cases when a single sed simply will not do though - particularly when it comes to line counting. you need fresh input for fresh counts following edits - and you need another sed - or else some algorithm which otherwise handles it. – mikeserv Dec 05 '14 at 16:35
  • @syntaxerror - actually, thinking twice, maybe s///ubs are not so clear cut as all that. For example: s///w file should need a newline or -e following it, and you need the same when s///;testing a substitution as well, i suppose. – mikeserv Dec 05 '14 at 16:46
  • Trust me, often THEY ARE! Thanks to your excellent "brain refresher", I've now done a very overdue update on an old answer of mine on SF, one of which I love to call one of my "rep cornucopiae": this one here. Now if you please compare to the old -e-concatenated sed lines and now! That's worlds apart, and has given a readability boost by at least 50 percent. – syntaxerror Dec 05 '14 at 17:01
  • @syntaxerror - I like it, but you should know that you don't need the ; before a newline - a newline is fine. Honestly, you could do without the -e and all entirely and just write a file like #!/bin/sed with each command on a newline - or those that don't require such delimiters instead delimited with ;. The ones that do require newlines are usually the ones that take arbitrary input - :label names and commands that refer to them like b or t or closing } curlies for functions, or read and write which take filename args. They all portably need to be followed by \n. – mikeserv Dec 05 '14 at 17:06
  • oh and a i and c which all accept any kind of input up to next newline. – mikeserv Dec 05 '14 at 17:07
  • You're really a human hints pool! Dang, you're right, works without this ; stuff indeed. OK, to my excuse I just thought it's better to be safe than sorry. ;-) But many thanks, we're getting there. :p BUT OTOH, fiddling with this stuff in a trial-and-error fashion will usually spawn zillions of unrelated, highly-confusing error messages and warnings, so this is why I'd commonly like to avoid that at all costs. – syntaxerror Dec 05 '14 at 17:09
  • @syntaxerror - I have a few complicated examples strewn about if you're interested. here is one from a month or two ago. And here's another from yesterday - that one coordinates a few sed in something like an eval chain on a file. – mikeserv Dec 05 '14 at 17:14
  • TY, cool stuff! Got them bookmarked. Gonna delve myself into those the next days for sure. BTW, please do no longer feel addressed when I say "sed learners". You definitely are no learner (LOL), more like a pro asking other uber-pros like Stéphane to squeeze out the very last quirks that remain. ;) grin – syntaxerror Dec 05 '14 at 17:15
  • @syntaxerror - there is no offense taken here. i am a learner - its why im here. there is always more to learn. i am pretty good with sed though - but the topics covered in this question ive pretty much come to terms with since i asked it. – mikeserv Dec 05 '14 at 17:21

1 Answer


So it's high time this question had an answer. Though I intuitively worked out how to do this correctly in pretty much every case some time ago, I only very recently managed to ground that understanding in the text of the standard. It's actually stated there fairly simply - I just stupidly overlooked it many times, I guess.

The relevant portions of the text are all found under the heading...

  • Editing Commands in sed:

    • The argument text shall consist of one or more lines. Each embedded \newline in the text shall be preceded by a \backslash. Other backslashes in text shall be removed, and the following character shall be treated literally.

    • The r and w command verbs, and the w flag to the s command, take an optional rfile (or wfile) parameter, separated from the command verb letter or flag by one or more <blank>s; implementations may allow zero separation as an extension.

    • Command verbs other than {, a, b, c, i, r, t, w, :, and # can be followed by a ;semicolon, optional <blank>s, and another command verb. However, when the s command verb is used with the w flag, following it with another command in this manner produces undefined results.

...in...

  • Options: Multiple -e and -f options may be specified. All commands shall be added to the script in the order specified, regardless of their origin.

    • -e script - Add the editing commands specified by the script option-argument to the end of the script of editing commands. The script option-argument shall have the same properties as the script operand, described in the OPERANDS section.

    • -f script_file - Add the editing commands in the file script_file to the end of the script.

And last in...

  • Operands:

    • script - A string to be used as the script of editing commands. The application shall not present a script that violates the restrictions of a text file except that the final character need not be a \newline.

So, when you take it all together, it makes sense that any command which is optionally followed by an arbitrary parameter without a predefined delimiter (as opposed to s d sub d repl d flag for example) should delimit at an unescaped \newline.
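As a small sketch of the chaining rule quoted above (the input string and the helper names chain_semicolon and chain_e are made up for illustration): s is not in the list of excluded command verbs, so two substitutions can be joined with ; or split into separate -e fragments, and both forms should give the same result.

```shell
# Hypothetical helpers. s is not among {, a, b, c, i, r, t, w, : and #,
# so it may be followed by ';' and another command.
chain_semicolon() { printf 'foo\n' | sed 's/foo/bar/;s/bar/baz/'; }

# The same script split into two -e fragments; each fragment is added
# to the script in the order given.
chain_e() { printf 'foo\n' | sed -e 's/foo/bar/' -e 's/bar/baz/'; }

chain_semicolon   # prints: baz
chain_e           # prints: baz
```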

It is arguable that the ; is a predefined delimiter but in that case using the ; for any of [aic] commands would necessitate that a separate parser be included in the implementation specifically for those three commands - separate, that is, from the parser used for [:brw], for example. Or else the implementation would have to require that ; also be backslash escaped within the text parameter and it only grows more complicated from there on.

If I were writing a sed which I desired to be both compliant and efficient, then I expect I would not write such a separate parser - except that maybe [aic] should generate a syntax error if not immediately followed by a \newline. But that is a simple tokenization problem - the end delimiter case is generally the more problematic one. I would just write it so:

sed -e w\ file\\ -e one -e '...;and more commands'

...and...

sed -e a\\ -e appended\\ -e text -e '...;and more commands'

...would behave very similarly, in that the first would create and write to a file named:

file
one

...and the second would append a block of text to the current line on output like...

appended
text

...because both would share the same parsing code for the parameter.
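For comparison with the hypothetical parser described above, here is how the text case can be tried against an existing sed. This is only a sketch: append_script and append_e are illustrative names, and the -e form reflects what has worked in practice rather than something the standard spells out directly.

```shell
# Text argument written inside one script: a real newline follows a\ and
# a backslash precedes the embedded newline. The text lines are not
# indented, because leading blanks would become part of the text.
append_script() {
  printf 'line\n' | sed 'a\
appended\
text'
}

# One-line text: the break between the -e fragments stands in for the
# newline that must follow a\.
append_e() { printf 'line\n' | sed -e 'a\' -e 'appended text'; }

append_script
append_e
```

The first helper should print line, appended and text on three lines; the second should print line followed by appended text.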

And regarding the { ... } and $! issue - well, I was way off there. A single command preceded by an address is not a function but rather just an addressed command. Almost all commands - including the { function definition } - are specified to accept /one/ or /one/,/two/ addresses, with the exception of #comment and :label definition. And an address can be either a line number or a regular expression and can be negated with !. So all of...

$!d
/address/s/ub/stitution/
5!y/d/c/

...can be followed by a ; and more commands according to standard, but if more commands are required for a single address, and that address should not be reevaluated following the execution of each command, then a { function } should be used like:

/address/{ s//replace addressed pattern/
           s/do other conditional/substitutions/
           s/in the same context/without/
           s/reevaluating/address/
}

...where { cannot be followed on the same line by a closing }, and a closing } cannot occur except at the start of a line. But if a contained command need not otherwise be followed by a \newline, then it need not be within the function either. So all of the above s///ubstitutions - and even the closing } brace - can be portably followed by ; semicolons and further commands.
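A runnable sketch of both cases, a negated addressed command and a braced function, might look like this. The sample input and the names last_only and group are assumptions for illustration only.

```shell
# $!d deletes every line that is NOT the last one, so only "c" survives.
last_only() { printf 'a\nb\nc\n' | sed '$!d'; }

# A braced function: /b/ is tested once per line, then both s commands
# run on a matching line. The empty regex in s//B/ reuses /b/, and the
# closing } sits at the start of its own line.
group() {
  printf 'a\nb\nc\n' | sed '/b/{ s//B/
      s/$/!/
}'
}

last_only   # prints: c
group       # prints a, B! and c on separate lines
```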

I keep talking about \newline delimiters but the question is instead about -expression statements, I know. But the two are really one and the same, and the key relation is that a script can be either a literal command-line argument or a file with either of -[ef], and that both are interpreted as text files (which are specified to end in a \newline) but neither need actually end in a \newline. By this I can reasonably (I hope) infer that a \0NUL delimited argument implies an ending \newline, and as all invocation arguments get (at least) a \0NUL delimiter anyway, then either should work fine.

In fact, in practice, in every case but one where the standard specifies a \backslash escaped newline should be required, I have portably found...

sed -e ... -e '...\' -e '...'

...to work just as well. And in every case - again, in practice - where a non-escaped \newline should be required...

sed -e '...' -e '...'

...has worked for me, too. The one exception I mention above is...

sed -e 's/.../...\' -e '.../'

...which does not work for any implementation in any of my tests. I'm fairly sure that falls back to the text file requirement and the fact that s/// comes with a delimiter and so there is no reason a single statement should span \0NUL delimited arguments.
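Where a replacement really does need an embedded newline, the form that stays inside a single script argument is the one to reach for. A minimal sketch, with split_line as a made-up helper name and a one-letter sample input:

```shell
# The backslash-escaped newline lives inside ONE script argument, so the
# s command is never split across -e fragments.
split_line() {
  printf 'a\n' | sed 's/a/b\
c/'
}

split_line   # prints b and c on separate lines
```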

So, in conclusion, here is a short rundown of portable ways to write several kinds of sed commands:

For any of [aic]:

...commands;[aic]\
text embedded newline\
delimiting newline
...more;commands...

...or...

sed -e '...commands;[aic]\' -e 'text embedded newline\' -e 'delimiting newline' -e '.;.;.'

For any of [:rwtb], the parameter is optional (for all but :) but the delimiting \newline is not. Note that I have never had a reason to try multiple line label parameters as would be used with [:tb], but that writing/reading to multiple lines in [rw]file parameters is usually accepted without question by seds I have tested so long as the embedded \newline is escaped w/ a \backslash. Still, the standard does not directly specify that label and [rw]file parameters should be parsed identically to text parameters and makes no mention of \newlines regarding the first two except as it delimits them.

...commands;[:trwb] parameter
...more;commands...

...or...

sed -e '[:trwb] parameter' -e '...'

...where the <space> above is optional for [:tb].
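As an illustration of the label case, here is a sketch of a familiar line-joining loop written so that every label and branch ends its own -e fragment. The helper name join_lines and the sample input are assumptions.

```shell
# Join all input lines with spaces. ':a' and '$!ba' each end their own
# -e fragment, so the label name cannot swallow the commands after it
# (compare b;n;:b, where a strict sed reads ";n;:b" as the label).
join_lines() {
  printf 'a\nb\nc\n' | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'
}

join_lines   # prints: a b c
```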

And last...

...;address[!]{ ...function;commands...
};...more;commands....

...or...

sed -e '...;address[!]{ ...function;commands...' -e '};...more;commands...'

...where any of the aforementioned commands (excepting :) also accepts at least one address, which can be either a /regexp/ or a line number and might be negated with !. But if more than one command is necessary for a single evaluation of address, then { function context } delimiting braces must be used. A function can contain even multiple \newline delimited commands, but each must be delimited within the braces as it would be otherwise.
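Putting the last pattern to work, a negated address with a braced function split across -e fragments could be sketched as follows. The helper name not_last and the x1/x2/x3 input are invented for the example.

```shell
# For every line except the last, run two commands under a single
# evaluation of the address. The closing } starts its own -e fragment,
# which stands in for "preceded by a newline".
not_last() {
  printf 'x1\nx2\nx3\n' |
    sed -e '$!{' -e 's/x/y/' -e 's/$/,/' -e '}'
}

not_last   # prints y1, and y2, then x3 unchanged
```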

And that's how to write portable sed scripts.

mikeserv