
I want to use the sed -n "/START PATTERN/,/END PATTERN/p" file.txt pattern to search a file.

The content of file.txt is

~keyword~, ~output~.
~1.~ ~output~.
~2.~ ~output~.
~keyword~, ~output~.
~1.~ ~output~.
~2.~ ~output~.
~3.~ ~output~.
~keyword blablabla~, ~not the output~.
~1.~ ~not the output~.
~2.~ ~not the output~.
~keyword blablabla2~, ~not the output~.
~1.~ ~not the output~.
~2.~ ~not the output~.
~3.~ ~not the output~.
~4.~ ~not the output~.
~blablabla~, ~not the output~.
~1.~ ~not the output~.
~2.~ ~not the output~.
~3.~ ~not the output~.
~4.~ ~not the output~.

What I expect as output is

~keyword~, ~output~.
~1.~ ~output~.
~2.~ ~output~.
~keyword~, ~output~.
~1.~ ~output~.
~2.~ ~output~.
~3.~ ~output~.

So the start pattern is keyword between ~ characters, followed by any character (.), which gives /~keyword~./

The end pattern is ~ followed by an alphabetic character and then any character (.).

When I run sed -n "/~keyword~./,/[~][[:alpha:]]./p" file.txt the output is

~keyword~, ~output~.
~1.~ ~output~.
~keyword~, ~output~.
~1.~ ~output~.

The 2nd and 3rd lines are not printed in the output, so my question is: what is wrong with my approach? I was inspired by the solution provided here

I also tried sed "/~keyword~./,/[~][[:alpha:]]./!d;//d" file.txt which results in empty output (inspired by this question)

This question differs from the one marked as a duplicate because I asked specifically about using sed with regular expressions. If you still think it is a duplicate, please mark it as such.

Sadegh
  • Your range ends on the first occurrence of the second pattern ([~][[:alpha:]].). You'll probably need a pattern that doesn't match all the data lines you want to keep. ;) – n.st Apr 02 '16 at 16:10
  • Exactly. Isn't the first occurrence of the 2nd pattern ~keyword blablabla~, ~not the output~.? If it worked as you said, I should get what I expect, not what I actually get! – Sadegh Apr 02 '16 at 16:11
  • OP specified multiple lines in a reply for clarification. – Thomas Dickey Apr 02 '16 at 19:27

3 Answers


Let's see if sed is the right tool for this job:

sed '/^~[[:alpha:]].*/!{               # if line doesn't match this pattern
H                                      # append it to hold space
$!d                                    # and delete it if it's not the last line
b end                                  # else branch to label end
}
//b end                                # if line matches, branch to label end
: end                                  # label end
x                                      # exchange pattern space w. hold space
/^~keyword~.*/p                        # if pattern space matches, print it
d' infile                              # delete pattern space

With GNU sed you could write it as a one-liner:

sed '/^~[[:alpha:]].*/!{H;$!d;b end};//b end;: end;x;/^~keyword~.*/p;d' infile
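The same hold-space idea can be tried on a small input. The following is only a sketch, not the script above verbatim: it assumes the explicit branches to the label can be dropped (they all lead to the commands that follow anyway) and uses -n with p instead of a final d. The sample() helper and the result variable are made up for the demonstration, and the input is a shortened version of the file from the question.

```shell
# Hypothetical helper: a shortened version of file.txt from the question.
sample() {
cat <<'EOF'
~keyword~, ~output~.
~1.~ ~output~.
~keyword~, ~output~.
~1.~ ~output~.
~2.~ ~output~.
~keyword blablabla~, ~not the output~.
~1.~ ~not the output~.
~blablabla~, ~not the output~.
~1.~ ~not the output~.
EOF
}

result=$(sample | sed -n '
# Data lines: append to the hold space and start the next cycle,
# unless this is the last line of the input.
/^~[[:alpha:]]/!{
    H
    $!d
}
# Header line (or last line): swap the collected block into the pattern space;
# the current line becomes the start of the next collected block.
x
# Print the collected block only if it begins with the ~keyword~ header.
/^~keyword~/p
')

printf '%s\n' "$result"
```

With this input the two ~keyword~ blocks (the first five lines) should be printed and the other blocks dropped. As in the script above, a ~keyword~ header that happens to be the very last line of the file would not be printed.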
don_crissti

A pattern-delimited range /P1/,/P2/ like the one you're using starts at (and includes) the line matching /P1/ and ends at (and includes) the line matching /P2/.

Your patterns aren't anchored to the beginning of the line (you'd use a leading ^ in the regex for that), so they may match anywhere in the line.
Your "end" pattern /[~][[:alpha:]]./ matches the data lines you want to keep (specifically the "~output" part), so the range ends right at the first data line.
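A quick way to check this, as a minimal sketch: feed one of the data lines to sed with the end pattern used as a plain address, once unanchored and once with a leading ^. The variable names are only illustrative. Note also that sed starts looking for the end of a range on the line after the one that opened it, which is why the ~keyword~ line itself does not close the range even though it matches the end pattern too.

```shell
line='~1.~ ~output~.'

# Unanchored, as in the question: "~ou" inside "~output~" satisfies the pattern.
unanchored=$(printf '%s\n' "$line" | sed -n '/[~][[:alpha:]]./p')

# Anchored with ^: the line starts with "~1", so a data line no longer matches.
anchored=$(printf '%s\n' "$line" | sed -n '/^[~][[:alpha:]]./p')

printf 'unanchored: [%s]\n' "$unanchored"
printf 'anchored:   [%s]\n' "$anchored"
```

The unanchored variant prints the data line, the anchored one prints nothing.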

I was going to suggest letting your range end at the first line that doesn't match your data pattern, but since sed doesn't support overlapping ranges, that would make it impossible to print consecutive "blocks" (like block 1 and block 2 in your example). (The first block would include the first line of the second block.)
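To illustrate the problem with consecutive blocks, here is a small sketch that assumes the range is written with both patterns anchored and ends at the next header-like line. The sample() helper and the shortened input are made up for the demonstration.

```shell
# Hypothetical helper: two wanted blocks followed by an unwanted one.
sample() {
cat <<'EOF'
~keyword~, ~output~.
~1.~ ~output~.
~keyword~, ~output~.
~1.~ ~output~.
~keyword blablabla~, ~not the output~.
~1.~ ~not the output~.
EOF
}

# Range from a block header to the next line that looks like a header.
result=$(sample | sed -n '/^~keyword~/,/^~[[:alpha:]]/p')

printf '%s\n' "$result"
```

This prints the first block plus the header of the second block. That header is consumed as the end of the first range, so it cannot open a new one, and the data line after it is lost.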

Can I interest you in our lord and saviour awk instead? ;)

awk '
    BEGIN {
        inrange = 0
    }
    /^~[[:alpha:]]/ {
        inrange = 0
    }
    /^~keyword~/ {
        inrange = 1
    }
    {
        if (inrange) {
            print
        }
    }' file.txt

An explanation might be in order:

  • The awk script above parses the input (from a file or stdin) line by line, just like sed does.
  • At the very beginning (= before processing the first line), it sets a flag to "we shouldn't print the current line".
  • When the current line matches the pattern you've given for "first line after a block", it also sets the flag to "don't print".
  • When the current line matches the pattern you've given for "first line of a block", it sets the flag to "do print".
  • Depending on the flag, it either prints the current line or doesn't.

You could even exclude the "start of block" lines just by rearranging the order of checks (i.e. print/don't print first, check if the current line is a block start afterwards).
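A sketch of that rearranged variant, under the same assumptions about the input format. The sample() helper and the shortened input are invented for the demonstration; the rule that resets the flag still comes first, and only the print check and the block-start check swap places.

```shell
# Hypothetical helper: two wanted blocks followed by an unwanted one.
sample() {
cat <<'EOF'
~keyword~, ~output~.
~1.~ ~output~.
~keyword~, ~output~.
~1.~ ~output~.
~keyword blablabla~, ~not the output~.
~1.~ ~not the output~.
EOF
}

result=$(sample | awk '
    /^~[[:alpha:]]/ { inrange = 0 }   # any header-like line closes the current block
    inrange         { print }         # print first, while still inside a block
    /^~keyword~/    { inrange = 1 }   # only afterwards check for a block start
')

printf '%s\n' "$result"
```

With this input only the two ~1.~ ~output~. data lines are printed; the ~keyword~ header lines are skipped because the flag is still 0 when the print rule runs for them.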

The linebreaks in the awk script are also optional, but greatly improve readability.

n.st
  • Thanks for the answer. I tried to use awk for this example last week, but the problem was that it didn't print the characters after the first pattern (it only printed output with numbers at the beginning), which is why I tried sed instead. I am on Windows and need to run the code in a .bat file, so I had to remove the line breaks and ran the code as awk "BEGIN {inrange = 0};/^~[[:alpha:]]/{inrange = 0};/^~keyword~/;{inrange = 1};{if(inrange){print}}" file.txt The result is the whole text, so I think something is wrong. Did you try it yourself? – Sadegh Apr 02 '16 at 16:43
  • @Woeitg You introduced a nasty little typo when you removed the linebreaks: /^~keyword~/;{inrange = 1} — That separates the flag change from the condition, so the flag is always enabled unconditionally. Remove that ; and you should be good to go. ;) – n.st Apr 02 '16 at 16:52
  • You are right, and the answer works! I still hope to find an answer for this with sed; otherwise I will accept your answer ;-) Why sed? Because I am doing a university project with sed and have already written the introduction and many pages about sed. If I used awk as the solution, I would need to rewrite everything! – Sadegh Apr 02 '16 at 16:56

sed isn't the right tool for this task

… but that doesn't mean you can't abuse it to do your bidding:

sed -n -e "2,$ p" -e "/^~keyword~./ {s/$/§/;p}" in.txt | sed -n '/^~keyword~./,/^~[[:alpha:]]./ {/^~keyword~..*§$/ {s/§$//; b print}; /^~[[:alpha:]]./b; :print p}'

So, after lying down in a dark room for a bit to recuperate from that abomination, here's what it does:

What do we want to achieve?
Extract "blocks" from a file, where each "block" starts with a line matching regex R1 ("beginning lines") and ends with the line preceding the next occurrence of regex R2 ("terminator lines").

So just use sed's pattern ranges, where's the problem?
R1 is a subset of R2 (every line matching R1 also matches R2), so our "terminator lines" could be the beginnings of new blocks. sed doesn't support overlapping blocks.

So build a regex that matches R2, but doesn't match R1.
That would require zero-length assertions, which sed doesn't have. (Remember how I said sed wasn't the right tool for this?)

Solution: If looking for the "terminator line" swallows up the "beginning lines", just duplicate the "beginning lines".
That'll work, but we must not duplicate the first "beginning line", otherwise we'll just see each duplicate pair as a block.

sed -n -e "2,$ p" -e "/^~keyword~./ {s/$/§/;p}" in.txt

= Print all lines starting at line number 2 (i.e. everything except line 1). Also print lines a second time if they match R1. I'll get to the s/$/§/ in a bit.

Now that we've got cleanly delimited blocks, use a pattern range to print all lines enclosed between block beginnings and terminators: sed -n '/^~keyword~./,/^~[[:alpha:]]./p'

Oh wait, that includes the terminator lines. Stack Overflow to the rescue.
But we can't just skip all lines matching R2 — remember that R1 ⊂ R2, so removing the terminator lines would remove the beginning lines as well.

"Luckily", sed has branching. How about we print everything that matches R1, and only discard matches for R2 afterwards?

sed -n '/^~keyword~./,/^~[[:alpha:]]./ {/^~keyword~./b print; /^~[[:alpha:]]./b; :print p}'

Great, now we're printing our duplicated beginning lines when they happen to be a terminator line… If only there was a way to tell original beginning lines and their duplicates apart…

This is why we had that s/$/§/: add a § at the end of every duplicated beginning line (note that the §'ed duplicated beginning lines will end up being the ones starting a block, and the un-§'ed beginning lines will be the ones terminating blocks that are immediately followed by another block).

Now we've got all information required to do some more fine-grained checking and branching:

For all lines within a block range …

  • Check if the line matches R1 and has a trailing §.
    If it does, remove the § and jump to printing the line.
  • Otherwise (i.e. if we didn't jump), remove all lines matching R2 by skipping all further commands (including the printing).
  • Finally print the current line.
{/^~keyword~..*§$/ {s/§$//; b print}; /^~[[:alpha:]]./b; :print p}

End result:

sed -n -e "2,$ p" -e "/^~keyword~./ {s/$/§/;p}" in.txt | sed -n '/^~keyword~./,/^~[[:alpha:]]./ {/^~keyword~..*§$/ {s/§$//; b print}; /^~[[:alpha:]]./b; :print p}'

However, that assumes that the file's first beginning line (matching R1) is in line 1 (remember, that's the only line we excluded when duplicating beginning lines). If it isn't, you'll get neat pairs, but no data:

~keyword~, ~output~.
~keyword~, ~output~.

You could probably add more matching and branching to work around that, but really…

just use awk.

n.st