Multiline pattern match using sed, awk or grep

Question

Is it possible to do a multiline pattern match using sed, awk or grep? For example, I would like to get all the lines between { and }.

So it should be able to match

 1. {}
 2. {.....}
 3. {.....

     .....}

Initially the question used <p> as an example. I edited the question to use { and } instead.

afaik, you can do it with perl regex but not with sed/awk/grep. — forcefsck, Mar 28 '11 at 11:19
@forcefsck> You can do multiline pattern matching with 'sed' and 'awk', but in both cases you need more than a single command... — Peter.O, Mar 29 '11 at 15:03
don't ask like "is it possible to use sed to do ...." you can use sed to do anything within the area of text processing. LOL — zinking, Oct 06 '13 at 03:13
@CiroSantilli - there's nothing wrong with a similar Q showing up on the various SE sites, only if the original poster posted the identical Q on multiple sites. — slm, Sep 16 '14 at 01:34

score 24 · Accepted Answer · edited Oct 06 '13 at 02:48

While I agree with the advice above that you'll want a parser for anything more than tiny or completely ad-hoc jobs, it is (barely ;-) possible to match multi-line blocks between curly braces with sed.

Here's a debugging version of the sed code:

sed -n '/[{]/,/[}]/{
    p
    /[}]/a\
     end of block matching brace

    }' *.txt

Some notes:

-n means 'do not print lines by default as they are processed'.
'p' means print the current line now.
The construct /[{]/,/[}]/ is a range expression. sed scans until a line matches the first pattern (/[{]/), then applies the commands inside the { } of the sed script to that line and to every following line, up to and including the next line that matches the second pattern (/[}]/). In this case the commands are 'p' and the debugging code (not explained here; use it, modify it, or take it out as works best for you).
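To see the range expression on its own, here is a minimal sketch without the debugging line. The sample text and the variable names (sample, out) are made up for illustration; only the sed command comes from the answer.

```shell
# Sample input: one multi-line block surrounded by ordinary lines.
sample='fred
{block 2 line 1
 block 2 line 2}
barney'

# -n suppresses the default output; p prints only the lines inside the range.
out=$(printf '%s\n' "$sample" | sed -n '/[{]/,/[}]/p')
printf '%s\n' "$out"
```

With this input only the two block lines are printed; "fred" and "barney" fall outside the range.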

You can remove the /[}]/a\ end of block debugging when you prove to your satisfaction that the code is really matching blocks delimited by {,}.

This code sample will skip over anything not inside a curly brace pair. As noted by others above, it is easily confused if you have any extra { or } embedded in strings, regexps, etc., OR where the closing brace is on the same line as the opening brace (with thanks to fred.bear).
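The same-line caveat can be reproduced with a short sketch. The awk flag variant below is one possible workaround, not part of the original answer, and it still assumes no nested braces and no line where } comes before {. The names sample, range_out, flag_out and inside are made up for illustration.

```shell
sample='fred
{block 1}
betty
{block 2 line 1
 block 2 line 2}
barney'

# sed range: the end pattern is only tested from the line AFTER the start
# line, so "{block 1}" opens a range that runs on through "betty".
range_out=$(printf '%s\n' "$sample" | sed -n '/[{]/,/[}]/p')
printf '%s\n' "$range_out"

# awk flag: set on an opening brace, print while set, clear on a closing
# brace. All three rules see the same line, so "{block 1}" opens and closes.
flag_out=$(printf '%s\n' "$sample" | awk '
/[{]/ { inside = 1 }
inside { print }
/[}]/ { inside = 0 }')
printf '%s\n' "$flag_out"
```

The sed output includes the stray "betty" line, while the awk output contains only the three block lines.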

I hope this helps.

edited Oct 06 '13 at 02:48

terdon

242,166

answered Mar 28 '11 at 22:01

shellter

704

1

When a range expression matches the first pattern, it will only start searching for a match to the second pattern after it has finished processing that current line... This means that if { and } are on the same line, things are going to get messy... here is a test script which shows it: div==========; echo $div in; text="fred\n{block 1}\nbetty\n{block 2 line 1\n block 2 line 2}\nbarney"; echo -e "$text\n$div out"; echo -e "$text" |sed -n '/[{]/,/[}]/{'$'\n''p'$'\n''/[}]/a\'$'\n''end of block matching brace'$'\n''}'; echo "$div" ... The "betty" line shouldn't be there.
– Peter.O Mar 29 '11 at 13:54
@fred.bear :You're definitely right. I have extended my cautionary last paragraph to mention this. Thanks! – shellter Mar 29 '11 at 17:13
Do you have a link to docs explaining this "range expression"? Because I cannot find it anywhere, I only get stuff about e.g [1-9] type "range expressions" when searching for that term. EDIT: Ahh maybe here? https://www.pement.org/sed/sedfaq4.html#s4.23.1 – Ben Farmer Nov 12 '23 at 22:40

Cooper · Answer 2 · 2011-03-30T13:44:41.363

15

You can use the -M (multiline) option for pcregrep:

pcregrep -M '\{(\s*.*\s*)*\}' test.txt

\s is whitespace (including newlines), so this matches zero or more occurrences of (whitespace followed by .* followed by whitespace), all enclosed in braces.

Update:

This should do the non-greedy matching:

pcregrep -n -M '\{(\n*.*?\n*)*?\}' test.txt

edited Mar 30 '11 at 13:44

answered Mar 28 '11 at 21:05

Cooper

251
1
4

It seems like a handy tool... Yes, it is being greedy... Can you show how to invert the greedy nature? ... and I noticed in my Ubuntu man pcregrep: ...8K characters are available for forward matching, and 8K for previous matching...
– Peter.O Mar 29 '11 at 14:33
Adding a ? after a quantifier makes it non-greedy. (asdf)* is greedy, and (asdf)*? is non greedy. – Cooper Mar 30 '11 at 13:17
Thanks.. It's brilliant... It works as "advertised" and with (optional) line numbers! :)
– Peter.O Mar 30 '11 at 14:31
Thanks for mentioning pcregrep! This was the only tool with which I succeeded in eliminating arbitrary multiline patterns in multiline input strings (with pcregrep -v -M -F -- "$pattern") – stefanct Mar 02 '16 at 23:19

score 6 · Answer 3 · edited May 23 '17 at 12:39

XML-like expressions (infinitely recursive tags) are not a 'regular language' and therefore cannot be parsed with regular expressions (regex). Here's why:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/

http://www.perlmonks.org/?node_id=668353

https://stackoverflow.com/questions/1379524/textual-protocol-which-is-not-a-regular-language
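To make the nesting point concrete: matching a closing brace to its opening brace needs a depth counter, which a single classical regular expression does not have. A minimal awk sketch of such a counter follows. It is an illustration only; it does not understand strings or comments, and the names sample, nested_out, depth, out and c are made up.

```shell
sample='a {b {c} d} e
x {y
 z} w'

# Walk each line one character at a time, tracking the brace depth
# across lines, and keep only the text that sits inside some block.
nested_out=$(printf '%s\n' "$sample" | awk '{
    out = ""
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "{") depth++                 # entering a block
        if (depth > 0) out = out c            # inside at least one block
        if (c == "}" && depth > 0) depth--    # leaving a block
    }
    if (out != "") print out
}')
printf '%s\n' "$nested_out"
```

On this input the nested block {b {c} d} comes out whole, and the block that spans two lines is printed as the two partial lines "{y" and " z}".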

edited May 23 '17 at 12:39

Community

1

answered Mar 28 '11 at 12:30

Marcin

1,597

FYI, I used <p> just as an example. Let's change this to a C block {,}. Will I still be able to use pattern matching to extract a C block, which can span multiple lines? – Mar 28 '11 at 12:38
@johnsamuel: One problem is that unless you can fully parse a particular language, you can't tell if "{" (for example) is part of a comment or a quoted literal or is actually the start of your "block"... It only takes one misinterpretation to upset everything – Peter.O Mar 28 '11 at 13:08
1

(1) The slogan about non-regular languages concerns a technical notion of regular expressions that is more limited than most regex engines now in use. (2) The question was about what can be done with sed or awk, not with their regex engines specifically. And these languages are Turing complete. I'm not saying that writing anything beyond a trivial parser in them is going to be pretty or efficient. – dubiousjim Apr 19 '12 at 21:20

dubiousjim · Answer 4 · 2012-04-19T23:58:02.573

parser.awk:

#!/usr/bin/awk -f    
function die(msg) { print msg > "/dev/stderr"; exit 1 }
BEGIN {
  FS=opener
  if (mode=="l") linewise=1
  else if (mode=="i") trim_closer=length(closer)
  else if (mode!="a") die("mode must be one of: l,i,a")
}
{
  live=level
  for (f=1; f<=NF; f++) {
    if (f>1) {
      live=++level
      if (mode=="i" && level>1 || mode=="a") printf "%s", opener
    }
    cur=$f
    level-=gsub(closer, "", cur)
    if (level<0) die("Unbalanced")
    if (!linewise) {
      cur=$f
      if (sub(".*" closer, "", cur)) printf "%s", 
        substr($f, 1, length($f) - length(cur) - (level ? 0 : trim_closer))
      else if (live) printf "%s", $f
    }
  }
  if (live) {
    if (linewise) print
    else print ""
  }
}
END { if (level>0) die("Unbalanced") }

Call as awk -v'opener={' -v'closer=}' -v'mode=a' -f parser.awk. If mode is a, it prints the brackets and contents of all outermost, balanced {...}; if mode is i, it prints only their contents; if mode is l, it prints complete lines where an outermost {...} begins, remains open, or closes.

mtk358 · Answer 5 · 2011-03-28T20:32:33.440

1

Regular expressions cannot find matching nested parentheses.

If you are certain that there will be no pair of parentheses nested inside the one you are searching, you can search until the first closing one. For example:

sed -r 's#\{([^}]*)\}#\1#'

This replaces the first {...} on each line with the text between the braces; add the g flag to replace every occurrence on the line.
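A minimal sketch of the no-nesting approach, assuming the input really contains no nested braces. It uses the portable basic-regex form with [^}]* (zero or more non-} characters) and the g flag. Because sed normally sees one line at a time, the second command joins lines with N for as long as a { is still unclosed. The sample strings and the names one_line and multi_line are made up for illustration.

```shell
# Single line: strip the braces around every {...} group.
one_line=$(printf '%s\n' 'a {b} c {d}' | sed 's/{\([^}]*\)}/\1/g')
printf '%s\n' "$one_line"

# Multiple lines: while the pattern space holds a { with no } after it,
# append the next input line (N) and test again; then strip the braces.
# The $! guard stops the loop at the last line if a { is never closed.
multi_line=$(printf '%s\n' 'x {a' ' b} y {c}' 'z' | sed ':a
/{[^}]*$/{
    $!{
        N
        ba
    }
}
s/{\([^}]*\)}/\1/g')
printf '%s\n' "$multi_line"
```

The first command prints "a b c d". The second prints "x a", " b y c" and "z", keeping the line break that was inside the first block.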

edited Mar 28 '11 at 20:32

answered Mar 28 '11 at 20:16

mtk358

121

1

s#\{([^}])\}#\1# will only match a single non-} char... It needs a zero-to-many * wildcard after the closing square bracket ]*... and also, sed always operates only on a single line, unless you do some buffer hold of the line and sub-process subsequent lines until you find a matching '}'
– Peter.O Mar 29 '11 at 08:53
