51

It seems I am misusing grep/egrep.

I was trying to search for strings that span multiple lines and could not find a match, even though I know that what I'm looking for should match. Originally I thought my regexes were wrong, but I eventually read that these tools operate line by line (and my regexes were so trivial that they could not be the issue).

So which tool would one use to search for patterns across multiple lines?

slm
  • 369,824
Jim
  • 10,120
  • 1
    @CiroSantilli - I do not think that this Q and the one you linked to are duplicates. The other Q is asking how you'd do multi-line pattern match (i.e. what tool should/can I use to do this) while this one is asking how to do this with grep. They are tightly related but not dups, IMO. – slm Sep 16 '14 at 01:38
  • 1
    @slm those cases are hard to decide: I can see your point. I think this particular case is better as a duplicate because the user said "grep" suggesting the verb "to grep", and top answers, including accepted, don't use grep. – Ciro Santilli OurBigBook.com Sep 16 '14 at 06:32
  • 1
    There is no indication to show that a multi-line regular expression is needed here. Please consider showing an actual example with input data and expected output data, as well as your previous effort. – Kusalananda Sep 06 '20 at 16:50

11 Answers

45

Here's a sed one that will give you grep-like behavior across multiple lines:

sed -n '/foo/{:start /bar/!{N;b start};/your_regex/p}' your_file

How it works

  • -n suppresses the default behavior of printing every line
  • /foo/{} instructs it to match foo and do what comes inside the squigglies to the matching lines. Replace foo with the starting part of the pattern.
  • :start is a branching label to help us keep looping until we find the end to our regex.
  • /bar/!{} will execute what's in the squigglies to the lines that don't match bar. Replace bar with the ending part of the pattern.
  • N appends the next line to the active buffer (sed calls this the pattern space)
  • b start will unconditionally branch to the start label we created earlier so as to keep appending the next line as long as the pattern space doesn't contain bar.
  • /your_regex/p prints the pattern space if it matches your_regex. You should replace your_regex by the whole expression you want to match across multiple lines.
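Some sed implementations reject the one-liner with errors such as "unexpected EOF (pending }'s)" or "unterminated {". A likely cause, stated here as an assumption rather than something confirmed in this answer, is that they read a label up to the end of the line and expect each closing brace to end its own command. A minimal sketch of the same logic split into separate -e fragments, run against a hypothetical sample.txt, with bar as the end pattern and one\nmiddle\nbar standing in for your_regex:

```shell
# Hypothetical sample input: the block of interest starts at "foo" and ends at "bar".
workdir=$(mktemp -d)
printf '%s\n' 'header' 'foo one' 'middle' 'bar two' 'footer' > "$workdir/sample.txt"

# Same logic as the one-liner, one sed command per -e fragment so that the
# label and each closing brace end at a fragment boundary.
# 'one\nmiddle\nbar' stands in for your_regex; \n matches the embedded newlines.
multiline_match=$(sed -n \
  -e '/foo/{' \
  -e ':start' \
  -e '/bar/!{' \
  -e 'N' \
  -e 'b start' \
  -e '}' \
  -e '/one\nmiddle\nbar/p' \
  -e '}' "$workdir/sample.txt")

# Prints the three lines from "foo one" through "bar two".
printf '%s\n' "$multiline_match"
rm -rf "$workdir"
```

The header and footer lines are never printed because -n suppresses default output and only the accumulated pattern space reaches the p command.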
Joseph R.
  • 39,549
  • Note: On MacOS this gives sed: 1: "/foo/{:start /bar/!{N;b ...": unexpected EOF (pending }'s) – Stan James Sep 20 '18 at 16:32
  • 1
    Getting sed: unterminated { error – nomæd Apr 29 '19 at 14:07
  • @Nomaed Shot in the dark here, but does your regex happen to contain any "{" characters? If so, you'll need to backslash-escape them. – Joseph R. Apr 29 '19 at 16:37
  • Nope. I got it both on Mac and in a BusyBox image. Didn't try GNU Linux. – nomæd Apr 30 '19 at 17:30
  • Just ran it on CentOS and it worked fine though – nomæd May 01 '19 at 06:36
  • 1
    @Nomaed It seems that it has to do with the differences between sed implementations. I tried to follow the recommendations in that answer to make the above script standard-compliant but it told me that "start" was an undefined label. So I'm not sure whether this can be done in a standard-compliant way. If you do manage it, please feel free to edit my answer. – Joseph R. May 01 '19 at 17:24
  • IINM this doesn't work if foo and bar are the same. sed -n '/^_$/{:start /^_$/!{N;b start};/.*/p}' your_file just prints a bunch of underscores for me (on sed (GNU sed) 4.5). – August Janse May 06 '19 at 09:39
  • @JosephR.: How can I specify bar to be a simple regex. For example, my text is something like this "TRANSLATION The great sage Maitreya told Vidura: When the king entered his city, PURPORT". However, sometimes, it can be like this. "TRANSLATION Fragrant xxxx SB". In this case, the "bar", instead of being "PURPORT" is "SB". How can I tell sed that the "bar" can either be "PURPORT" or "SB". Thanks – anjanb Aug 31 '20 at 05:06
  • I tried using the command "sed -n '/TRANSLATION/{:start /PURPORT/!{N;b start};/.*/p}'" -- it works most of the time but sometimes, as I mentioned above, instead of TRANSLATION at the end of the text, it has "SB". So, if you could let me know how to do something like this "TRANSLATION|SB" or equivalent, that will be great. Thanks. – anjanb Aug 31 '20 at 05:08
  • @anjanb Have you tried PURPORT\|SB? – Joseph R. Aug 31 '20 at 13:59
  • @JosephR. : Thanks. I didn't earlier. I tried now but sed throws an error. I'm running on windows, if that helps and sed(v 4.4) comes from cygwin. – anjanb Aug 31 '20 at 14:41
26

I generally use a tool called pcregrep, which can be installed on most Linux distributions using yum or apt.

For example:

Suppose you have a file named testfile with this content:

abc blah
blah blah
def blah
blah blah

You can run the following command:

$ pcregrep -M  'abc.*(\n|.)*def' testfile

to do pattern matching across multiple lines.

Moreover, you can do the same with sed as well.

$ sed -e '/abc/,/def/!d' testfile
17

A normal grep that supports the Perl-regexp option -P can do this job.

$ echo 'abc blah
blah blah
def blah
blah blah' | grep -oPz  '(?s)abc.*?def'
abc blah
blah blah
def

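(?s) is the DOTALL modifier, which makes the dot in your regex match line breaks as well as ordinary characters. The -z option makes grep treat the input as NUL-separated records, so a text file is read as a single record instead of line by line, and -o prints only the matched part.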

Avinash Raj
  • 3,703
  • When I try this solution the output does not end at 'def' but goes to the end of the file 'blah' – buckley Oct 25 '19 at 14:43
  • 1
    maybe your grep does not support -P option – Avinash Raj Oct 26 '19 at 01:31
  • This was the only that worked for me - tried all sed suggestions, but didn't go as far as installing grep alternatives. – igorsantos07 Oct 28 '20 at 02:19
  • $ grep --version: grep (GNU grep) 3.1 in the Windows Git Bash has an option -P, --perl-regexp but (?s) doesn't seem to work there. It still shows the first line only. The same pattern with the same test string works on https://regex101.com/. Is there an alternative in the Git Bash? sed? (sed (GNU sed) 4.8 here) – Gerold Broser Nov 19 '20 at 19:30
  • Do you know how to add context to the output? grep -1 doesn't work here. – Leponzo Nov 28 '20 at 02:20
  • It is not the Perl -P which does the trick here, but the -z which lets grep operate on the entire input at once instead of line by line! Try grep -z 'abc.*def'. (Actually, it now operates on pieces separated by the zero-bytes, which is usually absent in text files.) – Robert Siemer Sep 26 '23 at 16:19
8

Here's a simpler approach using Perl:

perl -e '$f=join("",<>); print $& if $f=~/foo\nbar.*\n/m' file

or (since JosephR took the sed route, I'll shamelessly steal his suggestion)

perl -n000e 'print $& while /^foo.*\nbar.*\n/mg' file

Explanation

$f=join("",<>); : this reads the entire file and saves its contents (newlines and all) into the variable $f. We then attempt to match foo\nbar.*\n and print the match if there is one (the special variable $& holds the last match found). The match can span lines because the whole file is in one string and the regex contains an explicit \n; the /m modifier only makes ^ and $ match at line boundaries inside that string.

The -0 sets the input record separator. Setting this to 00 activates 'paragraph mode' where Perl will use consecutive newlines (\n\n) as the record separator. In cases where there are no consecutive newlines, the entire file is read (slurped) at once.

Warning:

Do not do this for large files: it loads the entire file into memory, and that may be a problem.

Jeff Schaller
  • 67,283
terdon
  • 242,166
  • I don't know much about Perl, but wouldn't it need to be my $f=join("",<>);, strictly speaking? – Sapphire_Brick Sep 06 '20 at 01:40
  • @Sapphire_Brick only if you are in strict mode (use strict;). It's a good habit to get into, especially when writing larger scripts, but it's overkill for a little one-liner like this one. – terdon Sep 06 '20 at 13:46
8

I solved this for myself using grep with the -A option, piped into another grep.

grep first_line_word -A 1 testfile | grep second_line_word

The -A 1 option prints 1 line after each matching line. Of course, whether this works depends on your file and word combination, but for me it was the fastest and most reliable solution.
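A minimal sketch of what the pipeline keeps and drops, using a hypothetical testfile whose contents are made up for illustration:

```shell
# Hypothetical sample file: only the "beta" line directly follows a first_line_word line.
workdir=$(mktemp -d)
printf '%s\n' 'alpha first_line_word' 'beta second_line_word' 'gamma' 'delta second_line_word' > "$workdir/testfile"

# The first grep keeps each matching line plus the one line after it (-A 1);
# the second grep then keeps only the lines containing second_line_word.
following_line=$(grep -A 1 'first_line_word' "$workdir/testfile" | grep 'second_line_word')

# Prints only "beta second_line_word".
printf '%s\n' "$following_line"
rm -rf "$workdir"
```

Only the second line survives the final grep, and the "delta" line is ignored because it does not directly follow a first_line_word line.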

mansur
  • 181
  • 1
    alias grepp='grep --color=auto -B10 -A20 -i ' then cat somefile | grepp blah | grepp foo | grepp bar ... yes those -A and -B are very handy ... you have the best answer – Scott Stensland Apr 10 '19 at 14:12
  • 1
    This isn't super deterministic and it ignores the entire pattern in favor of just getting a different single line (just based on its proximity to the first line). It's better to tell the program to go however far it needs to go in order to get to some sort of pattern you're absolutely sure is the end of the text you're trying to match. For example if testfile gets updated such that second_line_word is on the third line, then not only are you now missing the first line (due to your second grep) but you're now also missing the line that started showing up in between the two. – Bratchley Apr 09 '20 at 13:23
  • 1
    This would be a good enough MO for ad hoc commands where you really do just want a single line in output you already understood though. I don't think that's what the OP is after though and you could probably also just copy/paste at that point due to it being ad hoc. – Bratchley Apr 09 '20 at 13:25
6

Suppose we have the file test.txt containing:

blabla
blabla
foo
here
is the
text
to keep between the 2 patterns
bar
blabla
blabla

The following code can be used:

sed -n '/foo/,/bar/p' test.txt

It produces the following output:

foo
here
is the
text
to keep between the 2 patterns
bar
4

The grep alternative sift supports multiline matching (disclaimer: I am the author).

Suppose testfile contains:

<book>
  <title>Lorem Ipsum</title>
  <description>Lorem ipsum dolor sit amet, consectetur
  adipiscing elit, sed do eiusmod tempor incididunt ut
  labore  et dolore magna aliqua</description>
</book>


sift -m '<description>.*?</description>' (show the lines containing the description)

Result:

testfile:  <description>Lorem ipsum dolor sit amet, consectetur
testfile:  adipiscing elit, sed do eiusmod tempor incididunt ut
testfile:  labore  et dolore magna aliqua</description>


sift -m '<description>(.*?)</description>' --replace 'description="$1"' --no-filename (extract and reformat the description)

Result:

description="Lorem ipsum dolor sit amet, consectetur
  adipiscing elit, sed do eiusmod tempor incididunt ut
  labore  et dolore magna aliqua"
svent
  • 181
3

One way to do this is with Perl. For example, here are the contents of a file named foo:

foo line 1
bar line 2
foo
foo
foo line 5
foo
bar line 6

Now, here's some Perl which will match any line that begins with foo and is immediately followed by a line that begins with bar:

cat foo | perl -e 'while(<>){$all .= $_}
  while($all =~ /^(foo[^\n]*\nbar[^\n]*\n)/m) {
  print $1; $all =~ s/^(foo[^\n]*\nbar[^\n]*\n)//m;
}'

The Perl, broken down:

  • while(<>){$all .= $_} This loads the entire standard input into the variable $all
  • while($all =~ While the variable $all matches the regular expression...
  • /^(foo[^\n]*\nbar[^\n]*\n)/m The regex: foo at the beginning of a line, followed by any number of non-newline characters, followed by a newline, followed immediately by "bar" and the rest of that line. The /m at the end makes ^ match at the start of every line rather than only at the start of the string
  • print $1 Print the part of the match that was captured by the parentheses (in this case, the entire match)
  • s/^(foo[^\n]*\nbar[^\n]*\n)//m Erase the first match for the regex, so we can match multiple cases of the regex in the file in question

And the output:

foo line 1
bar line 2
foo
bar line 6
samiam
  • 3,686
  • 3
    Just dropped by to say your Perl can be shortened to the more idiomatic: perl -n0777E 'say $& while /^foo.*\nbar.*\n/mg' foo – Joseph R. Feb 02 '14 at 13:10
2

To get the text between the two patterns, excluding the patterns themselves:

Suppose we have the file test.txt containing:

blabla
blabla
foo
here
is the
text
to keep between the 2 patterns
bar
blabla
blabla

The following code can be used:

 sed -n '/foo/{
 n
 b gotoloop
 :loop
 N
 :gotoloop
 /bar/!{
 h
 b loop
 }
 /bar/{
 g
 p
 }
 }' test.txt

It produces the following output:

here
is the
text
to keep between the 2 patterns

How it works, step by step:

  1. /foo/{ is triggered when a line contains "foo"
  2. n replaces the pattern space with the next line, i.e. "here"
  3. b gotoloop branches to the label "gotoloop"
  4. :gotoloop defines the label "gotoloop"
  5. /bar/!{ runs if the pattern space doesn't contain "bar"
  6. h copies the pattern space to the hold space, so "here" is saved in the hold space
  7. b loop branches to the label "loop"
  8. :loop defines the label "loop"
  9. N appends the next input line to the pattern space.
    Now the pattern space contains:
    "here"
    "is the"
  10. :gotoloop We are back at step 4, and loop until the pattern space contains "bar"
  11. /bar/ the loop is finished: "bar" has been found and is the last line of the pattern space
  12. g replaces the pattern space with the hold space, which contains all the lines between "foo" and "bar" that were saved during the main loop
  13. p prints the pattern space to standard output

Done!
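For comparison, a shorter sketch that is often enough for this case: select the foo..bar range, then print only the lines inside it that match neither pattern. This is an alternative to the script above, not a rewrite of it, and it assumes that no line between the two markers itself contains foo or bar (such lines would be dropped as well).

```shell
# Recreate the test.txt from this answer in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'blabla' 'blabla' 'foo' 'here' 'is the' 'text' \
  'to keep between the 2 patterns' 'bar' 'blabla' 'blabla' > "$workdir/test.txt"

# /foo/,/bar/ selects the range including both marker lines;
# inside the range, print only lines that match neither foo nor bar.
between=$(sed -n '/foo/,/bar/{/foo/!{/bar/!p;};}' "$workdir/test.txt")

# Prints the four lines from "here" through "to keep between the 2 patterns".
printf '%s\n' "$between"
rm -rf "$workdir"
```

The hold-space version above does not have that restriction on the first line after foo, which is one reason to prefer it when the markers can also appear inside the text.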

  • Well done, +1. I usually avoid using these commands by tr'ing the newlines into SOH and performing normal sed commands then replace the newlines. – Adam D. Nov 08 '19 at 02:11
0
grep -E "pattern1|pattern2" 

lists all lines matching either <pattern1> or <pattern2>.

koders
  • 101
  • 1
    I think OP wants to match two or more lines containing those 2 patterns, not a single line containing either or both patterns. – Archemar Jul 22 '21 at 09:58
  • @Archemar Not only that but it's a useless use of cat. Also egrep is deprecated. It's only there for backward compatibility. Should use grep -E. Also < and > can in some contexts be important of course like word boundaries \< and \> so probably shouldn't use them in examples because some people might take them literally (so unless it's being searched for they should be left out - probably). – Pryftan Nov 19 '22 at 17:39
0

Use the -z switch to make grep split its input at \0 (NUL) bytes instead of newlines. Ordinary text files never contain NUL bytes, so grep treats the entire file as a single record.

I don't know how to reference a newline explicitly, but . matches every character, including a newline:

echo 'first
second
third' | grep -z 'second.third'

This will output the entire input, and highlight the two lines matching the pattern. If your system does not highlight the match in the output, add -o to see the match isolated from the rest.
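A minimal sketch of the -o variant, assuming GNU grep: with -z each output record ends with a NUL byte rather than a newline, so this example pipes the result through tr to make it readable.

```shell
# -z: records are NUL-terminated, so the three input lines form a single record.
# -o: print only the matched part; grep ends it with a NUL byte, which tr turns into a newline.
z_match=$(printf 'first\nsecond\nthird\n' | grep -z -o 'second.third' | tr '\0' '\n')

# Prints "second" and "third" on two lines, without "first".
printf '%s\n' "$z_match"
```

Here the single . is what matches the newline between "second" and "third".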