5

I am trying to get everything between startStr and endStr for the block that contains bbb. I understand how to get all occurrences between startStr and endStr using sed, but I do not see how to constrain the match to the one block in which bbb occurs.

Sample input:

fff
startStr
aaa
bbb
ccc
endStr
xxx
yyy
startStr
ddd
endStr
ddd
bbb

Required output:

startStr
aaa
bbb
ccc
endStr

This is what I have:

$ sed -n -e '/startStr/,/endStr/ p' sample.txt
startStr
aaa
bbb
ccc
endStr
startStr
ddd
endStr

5 Answers

7

To print the first startStr...endStr block that contains a match for /bbb/:

 sed -n '/startStr/ {:n; N; /endStr/ {/\n[^\n]*bbb[^\n]*\n/ {p; q}; b}; bn}'

or

sed -n '/startStr/ {:n; N; /endStr/ {/\nbbb\n/ {p; q}; b}; bn}'

Use the second form if bbb is a fixed string rather than a regular expression and must make up the whole line (from the start of the line to \n).

Explanation

For each line matching the address /startStr/ we:

  • set the label :n;
  • append the next input line to the pattern space with N;
  • check whether the pattern space now matches /endStr/;
    • if it does, check whether the block read so far contains /\nbbb\n/;
      • if it does, run {p; q} for «print and quit»,
      • otherwise, run b for «discard this block and go on to search for the next one»;
  • if the end of the block has not been reached, jump back to :n, i.e., continue reading.
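For readers less used to sed's pattern space, the same accumulate-then-test loop can be sketched in Python. This is only an illustration of the steps above, not a replacement for the sed command; the function name first_block_containing and its parameters are made up for this example, and it tests for a substring on the inner lines, like the first sed variant.

```python
def first_block_containing(lines, start="startStr", end="endStr", needle="bbb"):
    """Return the first start..end block with an inner line containing `needle`."""
    block = None                      # None means "not inside a block"
    for line in lines:
        if block is None:
            if start in line:         # like the /startStr/ address
                block = [line]
            continue
        block.append(line)            # like N: append the next line
        if end in line:               # like /endStr/: the block is complete
            if any(needle in inner for inner in block[1:-1]):
                return block          # like {p; q}: report the block and stop
            block = None              # like b: drop the block, keep searching
    return None


sample = ["fff", "startStr", "aaa", "bbb", "ccc", "endStr", "xxx", "yyy",
          "startStr", "ddd", "endStr", "ddd", "bbb"]
print("\n".join(first_block_containing(sample)))
```

As in the sed version, only the lines strictly between the two markers are searched, so a bbb outside any block is ignored.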
ValeriyKr
2

I recommend pcregrep for this job:

pcregrep -M 'startStr(.|\n)*?bbb(.|\n)*?endStr' sample.txt

The -M option allows patterns to match across multiple lines, and *? is the non-greedy quantifier. The rest should be obvious.
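To see what the non-greedy quantifier changes, here is a small sketch that uses Python's re module as a stand-in for pcregrep. That substitution is an assumption: the two engines are not identical, although this particular pattern uses only constructs that both support. The input is a shortened version of the question's sample.

```python
import re

text = "startStr\naaa\nbbb\nccc\nendStr\nxxx\nstartStr\nddd\nendStr\n"

# Non-greedy, as in the pcregrep command above.
lazy = re.search(r"startStr(.|\n)*?bbb(.|\n)*?endStr", text)
# Greedy variant, shown only for comparison.
greedy = re.search(r"startStr(.|\n)*bbb(.|\n)*endStr", text)

print(lazy.group(0))    # stops at the first endStr after bbb
print("---")
print(greedy.group(0))  # runs on to the last endStr in the input
```

The lazy match ends at the endStr that closes the bbb block, while the greedy one also swallows the following startStr...endStr block.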

jimmij
    I don't have pcregrep, but I tried the possible equivalent grep -ozP '(?s)startStr(.|\n)*?bbb(.|\n)*?endStr\n' ... the problem is if there is a block of startStr...endStr without bbb before the block containing bbb, all lines from the first startStr to second endStr get filtered... just interchange order of the two blocks in OP's input to check it... – Sundeep Jan 07 '17 at 04:04
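The over-match described in this comment can be reproduced with Python's re module, again used here as an assumed stand-in for pcregrep or grep -P. The guarded pattern is a hypothetical workaround that is not part of the answer or the comment: it forbids any line beginning with endStr between startStr and the bbb line, and it assumes the end marker always starts its line.

```python
import re

# Same blocks as in the question, but the block without bbb now comes first.
reordered = "startStr\nddd\nendStr\nxxx\nstartStr\naaa\nbbb\nccc\nendStr\n"

# The original pattern: the match starts at the first startStr it can find.
naive = re.search(r"startStr(.|\n)*?bbb(.|\n)*?endStr", reordered)

# Hypothetical guard: lines before the bbb line must not begin with endStr.
guarded = re.search(
    r"startStr\n(?:(?!endStr).*\n)*?.*bbb(?:.|\n)*?endStr", reordered
)

print(naive.group(0).count("startStr"))  # number of start markers swallowed
print(guarded.group(0))
```

With the guard, the attempt that begins at the first startStr fails once it reaches endStr, so the regex engine moves on and matches only the block that really contains bbb.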
2

Modified input sample to include block of startStr...endStr without bbb before matching block

$ cat ip.txt 
startStr
foo
bar
endStr
fff
baz
startStr
aaa
bbb
ccc
endStr
xxx
yyy
startStr
ddd
endStr
ddd
bbb


awk solution

awk '/startStr/{f=1; m=0; buf = $0; next}
     /bbb/ && f{m=1}
     f{buf = buf ORS $0}
     /endStr/ && f{f=0; if(m==1)print buf}
    ' ip.txt
  • /startStr/{f=1; m=0; buf = $0; next} set flag to indicate start of block, clear match, initialize buffer and move on to next line
  • /bbb/ && f{m=1} if line contains bbb, set match. f is used to avoid matching bbb outside of startStr...endStr
  • f{buf = buf ORS $0} as long as flag is set, accumulate input lines
  • /endStr/ && f{f=0; if(m==1)print buf} at end of block, print buffer if match was found


as one-liner:

$ awk '/startStr/{f=1; m=0; buf = $0; next} /bbb/ && f{m=1} f{buf = buf ORS $0} /endStr/ && f{f=0; if(m==1)print buf}' ip.txt 
startStr
aaa
bbb
ccc
endStr


A simpler Perl solution that slurps the entire input file. It assumes that there are no blocks like startStr...startStr...endStr (i.e. no endStr for the first startStr).

$ perl -0777 -ne '(@m) = /startStr.*?endStr\n/gs; print grep { /bbb/ } @m' ip.txt 
startStr
aaa
bbb
ccc
endStr
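The same slurp, split and filter idea can be sketched in Python, under the same assumption that every startStr has its own endStr. The variable names are illustrative, and the input is the ip.txt content shown earlier, inlined as a string.

```python
import re

ip = ("startStr\nfoo\nbar\nendStr\nfff\nbaz\n"
      "startStr\naaa\nbbb\nccc\nendStr\nxxx\nyyy\n"
      "startStr\nddd\nendStr\nddd\nbbb\n")

# re.S lets . match newlines, like the /s modifier in the Perl one-liner.
blocks = re.findall(r"startStr.*?endStr\n", ip, flags=re.S)

# Keep only the blocks that mention bbb, like grep { /bbb/ } @m.
hits = [block for block in blocks if "bbb" in block]
print("".join(hits), end="")
```

Because each block is cut out before it is tested, a block without bbb that comes first cannot be merged into the result.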
Sundeep
1

Python solution:

$ ./find_bound_pattern.py < input.txt                                                                                    
startStr
aaa
bbb
ccc
endStr

Script itself:

#!/usr/bin/env python
from __future__ import print_function
import sys

flag = False
group = []
for line in sys.stdin:
    text = line.strip()
    if text == 'startStr':
        flag = True  # mark beginning of the block
        group = [text]  # start fresh so an unterminated block cannot leak in
        continue
    if flag:  # we are in a block, so record lines
        group.append(text)
        if text == 'endStr':
            flag = False  # reached end of block, time to check
            if 'bbb' in group:  # exact line match, not a substring test
                print('\n'.join(group))
            group = []  # clear list after each block end

The way this works is quite simple: we mark the beginning of a block by setting the flag variable and unset it once we reach the end of the block. At that point we check the recorded lines for the presence of bbb and, if it is there, print the whole block. The recorded list is cleared at the end of each block and the process repeats, so this is suitable for input with multiple blocks that may contain bbb.

The logic of this approach is simple and can be implemented in other languages such as C, Java, or Ruby, whichever your heart desires. Note that in this case this is more of a line-matching problem, but if more advanced pattern matching is needed, it can be implemented as well via the re module.
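As a rough sketch of that last point, the block scan can take regular expressions for the markers and for the line being searched. The function name blocks_matching and its parameters are invented for this illustration and are not part of the script above.

```python
import re


def blocks_matching(lines, needle, start=r"^startStr$", end=r"^endStr$"):
    """Yield each start..end block that has an inner line matching `needle`."""
    needle_re, start_re, end_re = (re.compile(p) for p in (needle, start, end))
    block = None                          # None means "not inside a block"
    for raw in lines:
        line = raw.rstrip("\n")
        if start_re.search(line):
            block = [line]                # a new start marker restarts the block
        elif block is not None:
            block.append(line)
            if end_re.search(line):
                if any(needle_re.search(inner) for inner in block[1:-1]):
                    yield block
                block = None


lines = ["startStr", "foo", "endStr", "startStr", "aaa", "bbb", "ccc",
         "endStr", "bbb"]
for block in blocks_matching(lines, r"^b+$"):
    print("\n".join(block))
```

Passing sys.stdin as lines would make it a drop-in filter; a generator is used so that every matching block is reported, not just the first.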

-1
sed -n -e '/startStr/,/bbb/p;/bbb/,/endStr/p' /path/to/input
DopeGhoti