33

Is it possible to do a multiline pattern match using sed, awk or grep? For example, I would like to get all the lines between { and }.

So it should be able to match

 1. {}
 2. {.....}
 3. {.....
.....}

Initially the question used <p> as an example; I have edited it to use { and } instead.

jasonwryan
  • 73,126

5 Answers

24

While I agree with the advice above that you'll want a real parser for anything more than a tiny or completely ad-hoc job, it is (barely ;-) possible to match multi-line blocks between curly braces with sed.

Here's a debugging version of the sed code:

sed -n '/[{]/,/[}]/{
    p
    /[}]/a\
     end of block matching brace

    }' *.txt

Some notes:

  • -n means 'no default print lines as processed'.
  • 'p' means now print the line.
  • The construct /[{]/,/[}]/ is a range expression. It selects every line from the first one that matches the first pattern (/[{]/) through the next one that matches the 2nd pattern (/[}]/), and runs the commands between the { } in the sed code on each of those lines. In this case that is 'p' plus the debugging code (not explained here; use it, mod it or take it out as works best for you).

You can remove the /[}]/a\ end-of-block debugging once you have satisfied yourself that the code really matches blocks delimited by {,}.

This code sample will skip over anything not inside a curly brace pair. As noted by others above, it is easily confused by extra {,} embedded in strings, regexps, etc., OR when the closing brace is on the same line as the opening one (with thanks to fred.bear).
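To see the same-line problem concretely, here is a minimal sketch using the sample input from the comments on this answer. The function names (sample, plain_range, guarded_range) are made up for illustration, and the workaround is only one possible approach under an assumption: any line containing { followed later by } counts as a complete block and is kept away from the range command.

```shell
# Sample input, taken from the test script in the comments on this answer.
sample() {
  printf '%s\n' 'fred' '{block 1}' 'betty' '{block 2 line 1' ' block 2 line 2}' 'barney'
}

# Plain range: the closing pattern is only looked for starting on the line
# AFTER the opening one, so "{block 1}" opens a range that swallows "betty".
plain_range() {
  sed -n '/{/,/}/p'
}

# Sketch of a workaround (an assumption, not part of the original answer):
# print lines that hold both braces directly, then use d to start the next
# cycle so the range command never sees them.
guarded_range() {
  sed -n -e '/{.*}/{' -e 'p' -e 'd' -e '}' -e '/{/,/}/p'
}
```

Running sample | plain_range prints the "betty" line along with both blocks, while sample | guarded_range prints only the three block lines. Nested braces and braces inside strings will still confuse both versions.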

I hope this helps.

terdon
  • 242,166
shellter
  • 704
  • 1

    When a range expression matches the first pattern, it will only start searching for a match to the second pattern after it has finished processing that current line... This means that if { and } are on the same line, things are going to get messy... here is a test script which shows it: div==========; echo $div in; text="fred\n{block 1}\nbetty\n{block 2 line 1\n block 2 line 2}\nbarney"; echo -e "$text\n$div out"; echo -e "$text" |sed -n '/[{]/,/[}]/{'$'\n''p'$'\n''/[}]/a\'$'\n''end of block matching brace'$'\n''}'; echo "$div" ... The "betty" line shouldn't be there.

    – Peter.O Mar 29 '11 at 13:54
  • @fred.bear :You're definitely right. I have extended my cautionary last paragraph to mention this. Thanks! – shellter Mar 29 '11 at 17:13
  • Do you have a link to docs explaining this "range expression"? Because I cannot find it anywhere, I only get stuff about e.g [1-9] type "range expressions" when searching for that term. EDIT: Ahh maybe here? https://www.pement.org/sed/sedfaq4.html#s4.23.1 – Ben Farmer Nov 12 '23 at 22:40
15

You can use the -M (multiline) option for pcregrep:

pcregrep -M '\{(\s*.*\s*)*\}' test.txt

\s is whitespace (including newlines), so this matches zero or more occurrences of (whitespace followed by .* followed by whitespace), all enclosed in braces.

Update:

This should do the non-greedy matching:

pcregrep -n -M '\{(\n*.*?\n*)*?\}' test.txt
Cooper
  • 251
  • 1
  • 4
  • It seems like a handy tool... Yes, it is being greedy... Can you show how to invert the greedy nature? ... and I noticed in my Ubuntu man pcregrep: ...8K characters are available for forward matching, and 8K for previous matching...

    – Peter.O Mar 29 '11 at 14:33
  • Adding a ? after a quantifier makes it non-greedy. (asdf)* is greedy, and (asdf)*? is non greedy. – Cooper Mar 30 '11 at 13:17
  • Thanks.. It's brilliant... It works as "advertised" and with (optional) line numbers! :)

    – Peter.O Mar 30 '11 at 14:31
  • Thanks for mentioning pcregrep! This was the only tool with which I succeeded in eliminating arbitrary multiline patterns in multiline input strings (with pcregrep -v -M -F -- "$pattern") – stefanct Mar 02 '16 at 23:19
6

XML-like expressions (infinitely recursive tags) are not a 'regular language' and therefore cannot be parsed with regular expressions (regex). Here's why:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/

http://www.perlmonks.org/?node_id=668353

https://stackoverflow.com/questions/1379524/textual-protocol-which-is-not-a-regular-language

Marcin
  • 1,597
  • FYI, I have used <p> just as an example. Let's change this to a C block {,}. Will I still be able to do a pattern match to extract a C block, which can span multiple lines?
    –  Mar 28 '11 at 12:38
  • @johnsamuel: One problem is that unless you can fully parse a particular language, you can't tell if "{" (for example) is part of a comment or a quoted literal or is actually the start of your "block"... It only takes one misinterpretation to upset everything – Peter.O Mar 28 '11 at 13:08
  • 1
    (1) The slogan about non-regular languages concerns a technical notion of regular expressions that is more limited than most regex engines now in use. (2) The question was about what can be done with sed or awk, not with their regex engines specifically. And these languages are Turing complete. I'm not saying that writing anything beyond a trivial parser in them is going to be pretty or efficient. – dubiousjim Apr 19 '12 at 21:20
6

parser.awk:

#!/usr/bin/awk -f
function die(msg) { print msg > "/dev/stderr"; exit 1 }
BEGIN {
  FS=opener
  if (mode=="l") linewise=1
  else if (mode=="i") trim_closer=length(closer)
  else if (mode!="a") die("mode must be one of: l,i,a")
}
{
  live=level
  for (f=1; f<=NF; f++) {
    if (f>1) {
      live=++level
      if (mode=="i" && level>1 || mode=="a") printf "%s", opener
    }
    cur=$f
    level-=gsub(closer, "", cur)
    if (level<0) die("Unbalanced")
    if (!linewise) {
      cur=$f
      if (sub(".*" closer, "", cur)) printf "%s", 
        substr($f, 1, length($f) - length(cur) - (level ? 0 : trim_closer))
      else if (live) printf "%s", $f
    }
  }
  if (live) {
    if (linewise) print
    else print ""
  }
}
END { if (level>0) die("Unbalanced") }

Call as awk -v'opener={' -v'closer=}' -v'mode=a' -f parser.awk. If mode is a, it prints the brackets and contents of all outermost, balanced {...}; if mode is i, it prints only their contents; if mode is l, it prints complete lines where an outermost {...} begins, remains open, or closes.
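If you only need the core idea, which is counting nesting depth, a much smaller sketch is possible. This is not the script above: extract_blocks is a hypothetical helper that hardcodes { and }, prints each outermost block as soon as it closes, and silently ignores a stray } found outside any block.

```shell
# Hypothetical helper: print every outermost {...} block, braces included.
# Reads the named files, or standard input when no file is given.
extract_blocks() {
  awk '
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "{") depth++              # entering a (possibly nested) block
        if (depth > 0) buf = buf c         # collect text only while inside a block
        if (c == "}" && depth > 0) {
          depth--
          if (depth == 0) { print buf; buf = "" }   # outermost block just closed
        }
      }
      if (depth > 0) buf = buf "\n"        # block continues on the next line
    }
    END { if (depth > 0) { print "Unbalanced" > "/dev/stderr"; exit 1 } }
  ' "$@"
}
```

For example, printf 'a {b {c} d} e\n{x\ny}\n' | extract_blocks prints {b {c} d} followed by the two-line block {x / y}. Like any approach that does not parse the language, it cannot tell a brace in a comment or string literal from a real one.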

dubiousjim
  • 2,698
1

Regular expressions cannot find matching nested parentheses.

If you are certain that there will be no pair of parentheses nested inside the one you are searching, you can search until the first closing one. For example:

sed -r 's#\{([^}]*)\}#\1#'

This will replace the first '{...}' on each line with the text between the braces, provided the block opens and closes on that same line.
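Because sed works one line at a time, a substitution like this only sees blocks that open and close on the same line. A rough sketch of one way around that is to keep appending lines with N while the pattern space ends inside an unclosed {, and only then substitute. The function name strip_braces is made up for illustration, and the sketch still assumes there are no nested braces.

```shell
# Hypothetical helper: remove non-nested { } pairs, even across lines.
# Uses basic regular expressions, where a bare { is an ordinary character.
strip_braces() {
  sed -e ':a' \
      -e '/{[^}]*$/{' \
      -e 'N' \
      -e 'ba' \
      -e '}' \
      -e 's/{\([^}]*\)}/\1/g' "$@"
}
```

With the input lines "a {b} c", "x {y" and "z} w", this prints "a b c", "x y" and "z w". The loop at label a pulls in the next line whenever a { has no } after it, so the final substitution sees the whole block at once.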

mtk358
  • 121
  • 1

    s#\{([^}])\}#\1# will only match a single non-} char... It needs a zero-to-many * wildcard after the closing square bracket ]*... and also, sed always operates only on a single line, unless you do some buffer hold of the line and sub-process subsequent lines until you find a matching '}'

    – Peter.O Mar 29 '11 at 08:53