Get text between start pattern and end pattern based on pattern between start and end pattern

Question

I am trying to get everything between startStr and endStr for the case where bbb occurs. I understand how to get all occurrences between startStr and endStr using sed, but I do not see how to constrain it to just the one block in which bbb occurs.

Sample input:

fff
startStr
aaa
bbb
ccc
endStr
xxx
yyy
startStr
ddd
endStr
ddd
bbb

Required output:

startStr
aaa
bbb
ccc
endStr

This is what I have:

$ sed -n -e '/startStr/,/endStr/ p' sample.txt
startStr
aaa
bbb
ccc
endStr
startStr
ddd
endStr

No, anything similar would work as well - awk, grep, ... – Tomas Greif Jan 06 '17 at 22:49

score 7 · Accepted Answer · edited Jan 08 '17 at 03:55

To print the first startStr…endStr block that contains a line matching /bbb/:

sed -n '/startStr/ {:n; N; /endStr/ {/\n[^\n]*bbb[^\n]*\n/ {p; q}; b}; bn}'

or

sed -n '/startStr/ {:n; N; /endStr/ {/\nbbb\n/ {p; q}; b}; bn}'

if bbb is not a regular expression but the exact text of a whole line (from the beginning of the line to \n).

Explanation

For each line matching the address /startStr/, we:

set the label :n;
append the next input line to the pattern space with N;
check whether the pattern space now matches /endStr/;
- if it does, check the block read so far for /\nbbb\n/;
  - if it is present, run {p; q} to print the block and quit,
  - otherwise, run b to discard the block and start searching for the next one;
if it is not yet the end of the block, jump back to :n, i.e., continue reading.
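
For readers less familiar with sed, the same control flow can be sketched in Python. This is only an illustration of the steps above, not part of the original answer: the function name first_block_with and its parameters are made up here, and it uses plain substring and whole-line comparisons instead of regular expressions.

```python
import io

def first_block_with(lines, start="startStr", end="endStr", needle="bbb"):
    """Return the first start..end block with an inner line equal to `needle`, else None."""
    block = None  # None means we are outside a block (no /startStr/ seen yet)
    for raw in lines:
        line = raw.rstrip("\n")
        if block is None:
            if start in line:          # address /startStr/: begin accumulating
                block = [line]
            continue
        block.append(line)             # like N: append the next line to the pattern space
        if end in line:                # /endStr/ reached: inspect what was collected
            if needle in block[1:-1]:  # like /\nbbb\n/: a whole inner line equals needle
                return block           # like {p; q}: report the block and stop
            block = None               # like b: discard and look for the next block
    return None

sample = """fff
startStr
aaa
bbb
ccc
endStr
xxx
yyy
startStr
ddd
endStr
ddd
bbb
"""

result = first_block_with(io.StringIO(sample))
print("\n".join(result))
```

As in the sed script, reading stops at the first matching block, so later blocks are never examined.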

edited Jan 08 '17 at 03:55 by G-Man Says 'Reinstate Monica'

answered Jan 06 '17 at 23:14 by ValeriyKr

Did you notice that the question mentions bbb? – G-Man Says 'Reinstate Monica' Jan 07 '17 at 08:11
I looked at "Required output" section. – ValeriyKr Jan 07 '17 at 09:03
For this "Required output" it's ok, but if OP has different file, it might not work as they want. – Sergiy Kolodyazhnyy Jan 07 '17 at 11:08
I understand now, you're right. So I've fixed the script, thanks. – ValeriyKr Jan 07 '17 at 13:05
Much better. See this answer for a slightly streamlined version. – G-Man Says 'Reinstate Monica' Jan 08 '17 at 03:58

score 2 · Answer 2 · answered Jan 06 '17 at 22:58

I recommend pcregrep for this job:

pcregrep -M 'startStr(.|\n)*?bbb(.|\n)*?endStr' sample.txt

The -M option allows patterns to match across multiple lines, and *? is the non-greedy quantifier. The rest should be obvious.

answered Jan 06 '17 at 22:58 by jimmij

I don't have pcregrep, but I tried the possible equivalent grep -ozP '(?s)startStr(.|\n)*?bbb(.|\n)*?endStr\n' ... the problem is if there is a block of startStr...endStr without bbb before the block containing bbb, all lines from the first startStr to second endStr get filtered... just interchange order of the two blocks in OP's input to check it... – Sundeep Jan 07 '17 at 04:04
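
The behaviour described in this comment can be reproduced with Python's re module, whose lazy quantifiers behave the same way for this pattern. The sketch below is an illustration only: the shortened input and the "guarded" pattern with a negative lookahead are assumptions added here, not something proposed in the thread, and both patterns match bbb as a substring rather than as a whole line.

```python
import re

# A block without bbb comes first, as the comment suggests trying.
text = """startStr
foo
endStr
startStr
aaa
bbb
ccc
endStr
"""

# Same shape as the pattern in the answer; re.S lets '.' match newlines.
naive = re.compile(r"startStr.*?bbb.*?endStr", re.S)
naive_match = naive.search(text).group()
print(naive_match.count("startStr"))  # 2: the match runs across both blocks

# Hypothetical variant: refuse to step over an endStr before bbb is reached.
guarded = re.compile(r"startStr(?:(?!endStr).)*?bbb.*?endStr", re.S)
guarded_match = guarded.search(text).group()
print(guarded_match)
```

The non-greedy quantifier only makes the match as short as possible from a given starting point; it does not stop the match from starting at an earlier startStr, which is why the naive pattern swallows the first block.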

Sundeep · Answer 3 · 2017-01-07T04:29:11.223

Modified input sample to include block of startStr...endStr without bbb before matching block

$ cat ip.txt 
startStr
foo
bar
endStr
fff
baz
startStr
aaa
bbb
ccc
endStr
xxx
yyy
startStr
ddd
endStr
ddd
bbb

awk solution

awk '/startStr/{f=1; m=0; buf = $0; next}
     /bbb/ && f{m=1}
     f{buf = buf ORS $0}
     /endStr/ && f{f=0; if(m==1)print buf}
    ' ip.txt

/startStr/{f=1; m=0; buf = $0; next}: set the flag to indicate the start of a block, clear the match, initialize the buffer and move on to the next line
/bbb/ && f{m=1}: if the line contains bbb, set the match; f is used to avoid matching bbb outside of startStr...endStr
f{buf = buf ORS $0}: as long as the flag is set, accumulate input lines
/endStr/ && f{f=0; if(m==1)print buf}: at the end of a block, print the buffer if a match was found

as one-liner:

$ awk '/startStr/{f=1; m=0; buf = $0; next} /bbb/ && f{m=1} f{buf = buf ORS $0} /endStr/ && f{f=0; if(m==1)print buf}' ip.txt 
startStr
aaa
bbb
ccc
endStr

A simpler perl solution that slurps the entire input file. It assumes that there are no blocks like startStr...startStr...endStr (i.e., no endStr for the first startStr):

$ perl -0777 -ne '(@m) = /startStr.*?endStr\n/gs; print grep { /bbb/ } @m' ip.txt 
startStr
aaa
bbb
ccc
endStr
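
The same slurp-and-filter idea can be sketched in Python. This is a rough equivalent under the same assumption about well-formed blocks; the function name blocks_containing is made up for illustration, and it tests for bbb as a plain substring where the perl one-liner uses a regex match.

```python
import re

def blocks_containing(text, needle="bbb"):
    """Return every startStr...endStr block whose text contains `needle`."""
    # re.S makes '.' match newlines, like the /s flag in the perl one-liner.
    blocks = re.findall(r"startStr.*?endStr\n", text, flags=re.S)
    return [b for b in blocks if needle in b]

ip = """startStr
foo
bar
endStr
fff
baz
startStr
aaa
bbb
ccc
endStr
xxx
yyy
startStr
ddd
endStr
ddd
bbb
"""

print("".join(blocks_containing(ip)), end="")
```

Splitting the work into "extract every block" and "keep the blocks that mention bbb" avoids the cross-block matching that a single multi-line pattern can produce.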

+1 for identifying startStr...startStr...endStr as (another) ambiguity in the OP. – G-Man Says 'Reinstate Monica' Jan 08 '17 at 03:59

score 1 · Answer 4 · answered Jan 07 '17 at 08:38

Python solution:

$ ./find_bound_pattern.py < input.txt
startStr
aaa
bbb
ccc
endStr

Script itself:

#!/usr/bin/env python
from __future__ import print_function
import sys

flag = False  # True while we are inside a startStr...endStr block
group = []    # lines of the current block
for line in sys.stdin:
    line = line.strip()
    if line == 'startStr':
        flag = True  # mark beginning of the block
        group = [line]  # start fresh, dropping any unterminated block
        continue
    if flag:  # we are in a block, so record lines
        group.append(line)
    if flag and line == 'endStr':
        flag = False  # reached end of block, time to check
        if 'bbb' in group:  # true only if some whole line equals bbb
            print('\n'.join(group))
        group = []  # clear list after each block end

The way this works is quite simple: we mark the beginning of a block with the flag variable and unset it once we reach the end of the block. At the end of the block we check what we recorded for the presence of bbb, and if it is there we print all the recorded lines. The recorded list is cleared at the end of each block and the process repeats, so this is suitable for matching multiple blocks which may contain bbb.

The logic in this approach is simple and can be implemented in other languages such as C, Java, or Ruby, whichever your heart desires. Note that in this case this is more of a line-matching problem, but if more advanced pattern matching is needed, it can be implemented with the re module.
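
As a rough sketch of what a re-based variant might look like, the version below replaces the string comparisons with compiled patterns. The function name matching_blocks, its parameters and the sample data are hypothetical and not part of the script above.

```python
import re

def matching_blocks(lines, start=r"^startStr$", end=r"^endStr$", needle=r"bbb"):
    """Yield each start..end block (a list of lines) that has a line matching `needle`."""
    start_re, end_re, needle_re = (re.compile(p) for p in (start, end, needle))
    group = None  # None while outside a block
    for raw in lines:
        line = raw.rstrip("\n")
        if start_re.search(line):
            group = [line]  # begin a new block
            continue
        if group is None:
            continue  # ignore lines outside any block
        group.append(line)
        if end_re.search(line):
            if any(needle_re.search(ln) for ln in group):
                yield group
            group = None  # block finished

sample = ["fff", "startStr", "aaa", "xbbbx", "endStr", "startStr", "ddd", "endStr", "bbb"]
for block in matching_blocks(sample):
    print("\n".join(block))
```

Because the patterns are parameters, the same loop can select blocks by other criteria, for example needle=r"^d+$" to pick the block whose line consists only of the letter d.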

score -1 · Answer 5 · answered Jan 06 '17 at 22:47

sed -n -e '/startStr/,/bbb/p;/bbb/,/endStr/p' /path/to/input

answered Jan 06 '17 at 22:47 by DopeGhoti

To me this returns 11 rows. Looks like union of rows 2-4 and 4-13? – Tomas Greif Jan 06 '17 at 22:55
