sed isn't the right tool for this task
… but that doesn't mean you can't abuse it to do your bidding:
sed -n -e "2,$ p" -e "/^~keyword~./ {s/$/§/;p}" in.txt | sed -n '/^~keyword~./,/^~[[:alpha:]]./ {/^~keyword~..*§$/ {s/§$//; b print}; /^~[[:alpha:]]./b; :print p}'
So, after lying down in a dark room for a bit to recuperate from that abomination, here's what it does:
What do we want to achieve?
Extract "blocks" from a file, where each "block" starts with a line matching regex R1 ("beginning lines") and ends with the line preceding the next occurence of regex R2 ("terminator lines").
So just use sed's pattern ranges, where's the problem?
R1 is a subset of R2, so every "beginning line" is also a "terminator line": the terminator that ends one block can be the beginning of the next. And sed doesn't support overlapping blocks.
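Here's what that does to the made-up in.txt from above (my trace, so take it with a grain of salt):
sed -n '/^~keyword~./,/^~[[:alpha:]]./p' in.txt
~keyword~ first block
some data
more data
~keyword~ second block
The ~keyword~ second block line is swallowed as the end of the first range, so it can't open a second range, and other data never gets printed.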
So build a regex that matches R2, but doesn't match R1.
That would require zero-length assertions, which sed doesn't have. (Remember how I said sed wasn't the right tool for this?)
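Just to illustrate what that would look like: with Perl-compatible regexes (grep -P or plain perl, i.e. not sed), a negative lookahead along the lines of
^~(?!keyword~.)[[:alpha:]].
would match the terminator lines without touching the beginning lines. sed has nothing of the sort.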
Solution: If looking for the "terminator line" swallows up the "beginning lines", just duplicate the "beginning lines".
That'll work, but we must not duplicate the first "beginning line", otherwise we'll just see each duplicate pair as a block.1
sed -n -e "2,$ p" -e "/^~keyword~./ {s/$/§/;p}" in.txt
= Print all lines starting at line number 2 (i.e. everything except line 1). Also print lines a second time if they match R1. I'll get to the s/$/§/ in a bit.
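On the made-up in.txt, this first pass should produce something like the following. Note how the second beginning line now exists twice, once plain and once with a §:
~keyword~ first block§
some data
more data
~keyword~ second block
~keyword~ second block§
other data
~note~ unrelated stuff
trailing data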
Now that we've got cleanly delimited blocks, use a pattern range to print all lines enclosed between block beginnings and terminators:
sed -n '/^~keyword~./,/^~[[:alpha:]]./p'
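Applied to the duplicated stream above, that should print roughly:
~keyword~ first block§
some data
more data
~keyword~ second block
~keyword~ second block§
other data
~note~ unrelated stuff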
Oh wait, that includes the terminator lines. Stack Overflow to the rescue.
But we can't just skip all lines matching R2 — remember that R1 ⊂ R2, so removing the terminator lines would remove the beginning lines as well.
"Luckily", sed
has branching. How about we print everything that matches R1, and only discard matches for R2 afterwards?
sed -n '/^~keyword~./,/^~[[:alpha:]]./ {/^~keyword~./b print; /^~[[:alpha:]]./b; :print p}'
Great, now we're printing our duplicated beginning lines twice whenever they happen to be a terminator line… If only there were a way to tell original beginning lines and their duplicates apart…
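Case in point, on the example stream (again my trace, not gospel):
~keyword~ first block§
some data
more data
~keyword~ second block
~keyword~ second block§
other data
The second block's header shows up twice, and we're leaking § markers into the output.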
This is why we had that s/$/§/: add a § at the end of every duplicated beginning line (note that the §'ed duplicated beginning lines will end up being the ones starting a block, and the un-§'ed beginning lines will be the ones terminating blocks that are immediately followed by another block).
Now we've got all information required to do some more fine-grained checking and branching:
For all lines within a block range …
- Check if the line matches R1 and has a trailing §. If it does, remove the § and jump to printing the line.
- Otherwise (i.e. if we didn't jump), remove all lines matching R2 by skipping all further commands (including the printing).
- Finally print the current line.
{/^~keyword~..*§$/ {s/§$//; b print}; /^~[[:alpha:]]./b; :print p}
End result:
sed -n -e "2,$ p" -e "/^~keyword~./ {s/$/§/;p}" in.txt | sed -n '/^~keyword~./,/^~[[:alpha:]]./ {/^~keyword~..*§$/ {s/§$//; b print}; /^~[[:alpha:]]./b; :print p}'
However, that assumes that the file's first beginning line (matching R1) is line 1 (remember, that's the only line we excluded when duplicating beginning lines). If it isn't, you'll get neat pairs, but no data:
~keyword~, ~output~.
~keyword~, ~output~.
You could probably add more matching and branching to work around that, but really… just use awk.
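If you need convincing, here's a rough awk equivalent (a sketch, untested against the original data; the flag name inblock is mine):
awk '/^~keyword~./ {inblock=1; print; next} /^~[[:alpha:]]./ {inblock=0; next} inblock' in.txt
A ~keyword~ line opens a block and gets printed, any other line matching R2 closes it, and data lines are printed while the flag is set. No special-casing of line 1, no § markers, no dark room required.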