9

Answers to this question:

How to grep lines between start and end pattern?

are not concerned with multiple sequences of lines which fall between the match patterns. Thus, for example, sed -n '/startpattern_here/,/endpattern_here/p' will print several sequences of lines which lie between occurrences of these patterns.

However, suppose I want to print only the last such sequence within a file. Can I do this with sed? If not, I guess probably awk? Something else?

Notes:

  • You may assume that these sequences are non-overlapping.
  • The starting and ending pattern lines should be included in the output.
  • Answers making assumptions of lower-complexity patterns are also valid (although not optimal).
einpoklum
  • 9,515
  • Would the start and end patterns be full regular expressions? If not, could they contain RegExp special characters? – AdminBee Sep 10 '20 at 15:59
  • @AdminBee: If you have a decent answer making the assumption they're not full-on regular expressions - that's already an interesting answer. – einpoklum Sep 10 '20 at 16:06
  • 4
    [edit] your question to include a MCVE with concise, testable sample input and expected output plus what you've tried so far so we can best help you. Include your requirements for what to do if the start or end "pattern" doesn't exist in your input. – Ed Morton Sep 10 '20 at 22:22
  • @EdMorton: Not necessary... 5 people have already gotten enough information to provide interesting answers. – einpoklum Sep 11 '20 at 06:38
  • 2
    I think the point detected by @AdminBee in his answer is not totally defined yet. What should happen if there is a last start pattern but no pairing end pattern? The proposed sed you give us would give one thing as a result, but the 2nd bullet point seems to contradict it. Could you please clarify that point? – Quasímodo Sep 11 '20 at 10:10
  • 5
    @einpoklum I'm glad you find the answers interesting but they're all making different assumptions about what you actually want in any but the sunny-day cases so I disagree that posting a MCVE and covering your rainy day requirements isn't necessary. See [ask]. – Ed Morton Sep 11 '20 at 15:35

12 Answers

7

This might work, assuming you want a full regular expression test:

awk '/startpattern_here/ {buf="";f=1}
     f{buf=buf $0 "\n"}
     /endpattern_here/ {f=0; lastcomplete=buf}
     END{printf("%s",lastcomplete)}' file.txt

This will ensure that only complete start-stop-patterns will be printed.

Test case:

irrelevant
irrelevant
irrelevant
startpattern_here
relevant_but_dont_show_1
relevant_but_dont_show_1
relevant_but_dont_show_1
endpattern_here

irrelevant
irrelevant
startpattern_here
relevant_but_dont_show_2
relevant_but_dont_show_2
relevant_but_dont_show_2
endpattern_here
irrelevant
irrelevant
startpattern_here
relevant_and_show
relevant_and_show
relevant_and_show
endpattern_here
irrelevant
startpattern_here
incomplete_dont_show

Result:

startpattern_here
relevant_and_show
relevant_and_show
relevant_and_show
endpattern_here

Note: If you want to suppress the output of the start and end patterns, just swap the rules /startpattern_here/ { ... } and /endpattern_here/ { ... }, i.e. place the "end pattern" rule first, and the "start pattern" rule just before the END rule.
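The rule swap described in the note can be checked on a small sample. This is only a sketch: the sample lines, the temporary file and the variable names (sample, inner) are made up for illustration.

```shell
# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' irrelevant \
    startpattern_here first_block endpattern_here \
    irrelevant \
    startpattern_here second_block_1 second_block_2 endpattern_here \
    startpattern_here incomplete > "$sample"

# End rule first, start rule last: the delimiter lines are never buffered.
inner=$(awk '/endpattern_here/ {f=0; lastcomplete=buf}
     f{buf=buf $0 "\n"}
     /startpattern_here/ {buf="";f=1}
     END{printf("%s",lastcomplete)}' "$sample")

printf '%s\n' "$inner"
rm -f "$sample"
```

With this input the output is the two second_block lines only, without the start and end pattern lines and without the trailing incomplete block.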

AdminBee
  • 22,803
6

Combination of tac and awk

tac file \
| awk '
   !p && /endpattern_here/   {p = 1}
    p                        {print}
    p && /startpattern_here/ {exit}
' \
| tac
einpoklum
  • 9,515
glenn jackman
  • 85,964
  • Thanks for the sed fix - but that should be a comment on my answer... also, given that we can do this with sed - is there a benefit to tac+awk+tac? Allowing myself an edit..., and +1 – einpoklum Sep 11 '20 at 07:55
  • 1
    Yes, should have been a comment. sed vs awk: I would be very surprised if there was any difference in performance, and IMO awk is far more flexible and readable. – glenn jackman Sep 11 '20 at 11:01
  • This will print from the last End to the start of the file if no Start is found (thus: an incomplete section). Awk is able to correct that. –  Sep 11 '20 at 17:29
  • 1
    Wait... what if there's an extra occurrence of the end pattern after the last pair of start-then-end patterns? I think this will get it wrong. – einpoklum Sep 13 '20 at 07:49
  • 1
    Indeed this would fetch startpattern \ntext\n endpattern \ntext\n endpattern from the end. To solve this, you can add at the end | awk '{print} /endpattern_here/{exit}' – thanasisp Sep 28 '20 at 14:54
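A sketch of the pipeline with the extra awk stage suggested in the last comment appended, run on a made-up input that has a stray end pattern after the last complete pair. The sample data and variable names are assumptions, and GNU tac is assumed to be available.

```shell
sample=$(mktemp)
# Hypothetical input: one complete pair, then a stray end pattern.
printf '%s\n' a startpattern_here b endpattern_here c endpattern_here d > "$sample"

last_block=$(tac "$sample" \
| awk '
   !p && /endpattern_here/   {p = 1}
    p                        {print}
    p && /startpattern_here/ {exit}
' \
| tac \
| awk '{print} /endpattern_here/{exit}')

printf '%s\n' "$last_block"
rm -f "$sample"
```

The final awk stops at the first end pattern after the start line, so the stray "c" and second endpattern_here are dropped. It does not address the other caveat raised in the comments (no start pattern before the last end pattern).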
6

With Ex (a POSIX editor) that is quite simple:

printf '%s\n' 1 '?END?' '?START?,.p' | ex -s file
  • 1 goes to the first line of the file. This is necessary in case END is the last line of the file.

  • ?END? seeks backward (wrapping around the end-of-file) for the END, thus finding its last occurrence in the file.

  • ?START?,.p prints all from the previous START up to the current address.

Below is an example with a here-document instead of printf, just for variety.

$ cat file
zdk
START
b12
END
kdn
START
000
111
END
START
ddd
$ ex -s file <<EOF
> 1
> ?END?
> ?START?,.p
> EOF
START
000
111
END
Quasímodo
  • 18,865
  • Ahh... so it works with ex - that's good to know – steeldriver Sep 10 '20 at 18:21
  • 1
    @steeldriver Yeah, but the 1 was missing in the previous version. However, Ed prints the line with 1, that is why it does not work with Ed. – Quasímodo Sep 10 '20 at 18:24
  • 1
    It is also possible to do it in ed: printf '%s\n' '1;#' '?END?;#' '?START?;/END/p' | ed -s file. A ;# sets the current line. A ?END? finds the last END in the file and then the START - END range gets printed by the next command. The first 1;# may be omitted if it is the first command of the ed script. Otherwise, it should be used for reliability. @steeldriver –  Sep 10 '20 at 19:19
  • With ed we can do as: ?START?;/END/p as the last line is the default current line when a file is first read in. – Rakesh Sharma Sep 11 '20 at 03:43
  • Why do I need the printf? Wouldn't echo -e "1 ?START?,?END?p\n" work? – einpoklum Sep 11 '20 at 08:00
  • @einpoklum You could use echo, but it fails in some edge cases and is not much portable. In my system, that would be echo -e '1\n?START?,?END?p' – Quasímodo Sep 11 '20 at 10:12
  • Nice @Quasímodo , I wouldve come up with something similar. Funny how often ed/ex work so well :) – D. Ben Knoble Sep 11 '20 at 16:12
  • @einpoklum Using ?start?,?end?p could lead to errors. Compare: printf '%s\n' '?^10?;?^14?p' | ed -s <(seq 15) which prints the intended lines with: printf '%s\n' '?^10?;?^1?p' | ed -s <(seq 15) which will fail (as the previous line that start with a 1 is before the start line and ranges where the start is after the end are not allowed). In short: An ending ?end? doesn't mean next match. –  Sep 11 '20 at 18:00
4

It seems I can just use tac:

tac | sed -n '/endpattern_here/,/startpattern_here/ {p; /startpattern_here/q;}' | tac

Thanks go to @glenn jackman and @Quasimodo for helping me get my sed invocation right.

einpoklum
  • 9,515
2

One way would be to simply store each set, overwrite it with the next one, and print whichever set you have kept once you get to the end:

awk '{ 
        if(/startpattern_here/){
            a=1; 
            lines=$0; 
            next
        } 
        if(a){
            lines=lines"\n"$0
        } 
        if(/end_pattern/){
            a=0
        }
    } 
    END{
        print lines
    }' file

For example, using this test file:

startpattern_here
line 1
line 2
line 3
end_pattern
startpattern_here
line 1b
line 2b
line 3b
end_pattern
startpattern_here
line 1c
line 2c
line 3c
end_pattern

I get:

$ awk '{ if(/startpattern_here/){a=1; lines=$0; next} if(a){lines=lines"\n"$0} if(/end_pattern/){a=0}} END{print lines}' file
startpattern_here
line 1c
line 2c
line 3c
end_pattern
terdon
  • 242,166
  • Isn't this about the same as @AdminBee's solution? – einpoklum Sep 11 '20 at 07:57
  • 1
    @einpoklum pretty much, yes. I posted mine about 5 minutes before. We must have been writing it at the same time. Great minds... :P – terdon Sep 11 '20 at 08:07
  • 1
    @einpoklum Yeah, I saw the notification about an existing answer when I was about to post mine, and to my surprise it turned out they are very similar in their approach (then again, terdon and I had similar correlations already when commenting on posts ...) – AdminBee Sep 11 '20 at 08:22
  • 1
    This will print the last incomplete section (one that has a start but not an end) if it exists. –  Sep 11 '20 at 17:14
2
  • You can grep out the last range using PCRE flavor of grep in slurp mode.

    grep -zoP '(?ms).*\K^start.*?\nend[^\n]*' file | tr '\0' '\n'
    
  • We use the range operator in awk to store the current range and reset it whenever a new range starts. This assumes there is no dangling start pattern line near the end of the file.

    awk '
      /^start/,/^end/ {
        t = (/^start/ ? "" : t ORS) $0
      }
      END { print t }
    ' file
    
  • Here we use the tac file to reverse it and then the m?? operator in Perl which matches just once.

    < file tac \
    | perl -lne 'print if m?end? .. m?start?' \
    | tac;
    
  • Other alternatives

    < file sed -ne '/start/=;/end/='  \
    | sed -ne 'N;s/\n/,/;$s/$/p/p' \
    | sed -nf - file
    
    < file \
    tac | sed -e '/start/q' |
    tac | sed -e '/end/q'
    
    sed -e '
      /start/,/end/H
      /start/h;g;$q;d
    ' file
    
AdminBee
  • 22,803
  • The use of -P (PCRE) with -z is said to be experimental in GNU grep manual. I don't believe that this is a good idea. –  Sep 11 '20 at 18:35
2

Most answers here either

  1. fail to handle the case where either the start or the end pattern does not exist, or where a line matches both the start and the end pattern.
  2. store whole ranges of lines in the memory (unscalable).
  3. use some editor like ed or ex which first loads the whole file in the memory.

For the case where the input file is a regular/seekable file (not pipe input), a dumb-simple solution which just gets the last offsets where the start and end patterns matched, and then seeks and reads from there, may be a better idea.

LC_ALL=C awk -v SP=start_pattern -v EP=end_pattern '
   {o+=length+1}
   $0~SP, q=($0~EP) { if(!p) p=o-length; if(q){ l=o+1-(s=p); p=0 } }
   END { if(s && l) system("tail -c +"s" "FILENAME" | head -c "l) }
' file

For the case where the input is from a pipe, you can use a simple pattern range and juggle two temporary files, using close(filename) to rewind them:

... | awk -v SP=start_pattern -v EP=end_pattern -v tmp="$(mktemp)" -v out="$(mktemp)" '
  $0~SP, q=($0~EP){
     print > tmp; if(q){ close(tmp); t=tmp; tmp=out; out=t; }
  }
  END { if(t) system("cat "out); system("rm -f " out " "tmp) }
'

Since any solution will have to parse the whole file before printing anyway (otherwise there's no way to know that it had printed the last range), it makes more sense not to print anything for a file where only the start pattern was found. This is obviously a debatable change from the behaviour of the range operator in sed, awk or perl.
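For reference, the range-operator behaviour in question can be seen with a small made-up input in which the final start pattern has no matching end pattern. This is a sketch; the input and the variable name are illustrative only.

```shell
# Made-up input: the second START has no matching END.
unterminated=$(printf '%s\n' a START 1 END b START 2 c | sed -n '/START/,/END/p')
printf '%s\n' "$unterminated"
```

The output includes the second START, 2 and c even though no END follows them, because an unterminated range simply runs to the end of input.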

Examples:

seq 1 107 > file
LC_ALL=C awk -v SP=9 -v EP=1 '
   {o+=length+1}
   $0~SP, q=($0~EP) { if(!p) p=o-length; if(q){ l=o+1-(s=p); p=0 } }
   END { if(s && l) system("tail -c +"s" "FILENAME" | head -c "l) }
' file
92
...
100

seq 1 107 | awk -v SP=9 -v EP=1 -v tmp="$(mktemp)" -v out="$(mktemp)" '
  $0~SP, q=($0~EP){
     print > tmp; if(q){ close(tmp); t=tmp; tmp=out; out=t; }
  }
  END { if(t) system("cat "out); system("rm -f " out " "tmp) }
'
92
...
100

  • Very efficient solution. The combination head-tail is very fast. As you say, for a complete solution, we have to search all the file, and an interesting attempt would be, for this part, one grep instead of awk, and a few lines of code to parse its output and decide the start-end lines, if any. – thanasisp Sep 28 '20 at 20:45
1
$ seq 20 > file
$ awk '/5/{rec=""; f=1} f{rec=rec $0 ORS; if (/8/) f=0} END{if (!f) printf "%s", rec}' file
15
16
17
18
Ed Morton
  • 31,617
  • The OP didn't specify if that can happen, but if there is an "incomplete" pattern (say, seq 16 > file), your approach would output nothing (as f=1 would still be set by the start pattern rule) ... – AdminBee Sep 11 '20 at 07:43
  • 2
    Right, I asked the OP to specify their requirements for cases like that but they said it's "not necessary" so idk if that can happen or not and if it can what they want to do. – Ed Morton Sep 11 '20 at 15:33
1
 perl -ne '$x = (/startpattern/../endpattern/ ? $x . $_ : ""); $y=$x if $x and /endpattern/; END { print $y }'

Or, more readably (i.e. not on one line):

#!/usr/bin/perl -n

# save a set; could be incomplete
$x = /startpattern/../endpattern/ ? $x . $_ : "" ;

# save last complete set seen
if ($x and /endpattern/) { $y = $x; }

# print last complete set seen, ignoring any incomplete sets that may have come after
END { print $y; }

Which you run as perl ./script < inputfile

1

Some possible solutions:

: sed -z 's/.*\(StartPattern.*EndPattern[^\n]*\n\).*/\1\n/' file
: printf '%s\n' '1;kx' '?^End?;kx' "?^Start?;'xp" | ed -s file
: printf '%s\n' '1' '?^End?' "?^Start?,.p" | ex file
: awk '/^Start/{s=1;section=""}
s{section=section $0 ORS}
/^End/{complete=section;s=0}
END{printf ("%s",complete)}' file
: tac file | sed -n '/^End/,/^Start/{p;/^Start/q}' | tac


regex sed

You can match the last occurrence of a pattern between start and end with a regex like:

.*START.*END.*

Then, you can extract the range including the delimiters with parentheses.

.*\(START.*END\).*

That will work in sed (as it may use the replace s///) but requires GNU sed to make the whole file one string (using the -z option):

sed -z 's/.*\(StartPattern.*EndPattern[^\n]*\n\).*/\1\n/' file    

ed

It is possible to search backwards in ed with ?regex?. So, we can search backwards for EndPattern (to ensure the pattern is complete and we are at the last one) and then search also backward to the previous StartPattern.

printf '%s\n' '?^End?;kx' '?^Start?;kx' '.;/End/p' | ed -s file

The ;kx is used to keep ed from printing the selected line.

That would fail if the last line is End. To avoid that, start on the first line and search backward for End.

And, since the limits are being marked, we can use a simpler range:

printf '%s\n' '1;ky' '?^End?;ky' '?^Start?;kx' "'x;'yp" | ed -s file

Or,

printf '%s\n' '1;kx' '?^End?;kx' "?^Start?;'xp" | ed -s file

That is assuming that at least one complete section of Start -- End exists. If there is none, the script will fail.

I have seen several uses of ?Start?,?End?. That may fail in several ways because it doesn't mean "find the next End after what was found by Start". Compare:

$ printf '%s\n' 1 '?START?,?END?p' | ex -s <(printf '%s\n' 111 START 222 END 333 END 444)
START
222
END
333
END

$ printf '%s\n' 1 '?START?,/END/p' | ex -s <(printf '%s\n' 111 START 222 END 333 END 444)
START
222
END

ex

The command from ed could be simplified to work in ex:

printf '%s\n' '1' '?^End?' '?^Start?,.p' | ex file

awk

We can store each complete section Start to End in one variable and print it at the end.

awk '
  /^Start/{s=1;section=""}       # If there is a start, mark a section.
  s{section=section $0 ORS}      # If inside a section, capture all lines.
  /^End/{complete=section;s=0}   # If a section ends, unmark it but store it.
  END{printf ("%s",complete)}    # Print a complete section (if one existed).
' file


tac

We can reverse the whole file (line by line) and then print only the first section that starts at End and ends at Start. Then reverse again:

tac file | sed -n '/^End/,/^Start/{p;/^Start/q}' | tac

The /^Start/q exits sed to ensure that only the first section is printed.

Note that this will print everything from the last End to the start of the file if there is no Start to be found (instead of just not printing).
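A quick check of that caveat, as a sketch with a made-up four-line input that has an End line but no Start line (GNU tac and sed assumed; the variable name is illustrative):

```shell
# Made-up input with an End line but no Start line anywhere before it.
no_start=$(printf '%s\n' a b End c \
  | tac | sed -n '/^End/,/^Start/{p;/^Start/q}' | tac)
printf '%s\n' "$no_start"
```

This prints a, b and End, i.e. everything from the top of the input down to the last End.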

test file

Tested with (at least) this file (and others):

$ cat file3
Don't print 1
Don't print 2
Don't print 3
StartPattern_here-1
Inside Pattern but Don't print 1-1
Inside Pattern but Don't print 1-2
Inside Pattern but Don't print 1-3
EndPattern_here-1

Lines between 1 and 2 - 1
Lines between 1 and 2 - 2
Lines between 1 and 2 - 3

StartPattern_here-2
Inside Pattern but Don't print 2-1
Inside Pattern but Don't print 2-2
Inside Pattern but Don't print 2-3
EndPattern_here-2

Lines between 2 and 3 - 1
Lines between 2 and 3 - 2
Lines between 2 and 3 - 3

StartPattern_here-3
Inside Pattern, Please Do print 3-1
Inside Pattern, Please Do print 3-2
Inside Pattern, Please Do print 3-3
EndPattern_here-3

Lines between 3 and 4 - 1
Lines between 3 and 4 - 2
Lines between 3 and 4 - 3

StartPattern_here-4
This section has an start but not an end, thus, incomplete.
Lines between 4 and $ - 1
Lines between 4 and $ - 2
Lines between 4 and $ - 3

1

Here is a solution that tries to handle all cases, including printing nothing when no block is found, while being efficient in memory and execution time. It does not write output line by line, process every line, or buffer lines.

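#!/bin/bash

sp="startpattern_here"
ep="endpattern_here"
f="file"

# Number the pattern lines from the end of the file; print "s e" for the
# first start line (seen from the end) that directly follows an end line.
range=$(tac "$f" | grep -n -e "$sp" -e "$ep" | awk -F: -v sp="$sp" -v ep="$ep" \
        '$2 ~ sp && prev ~ ep {s=$1; print s,e; exit} {prev=$2; e=$1}')

if [[ "$range" ]]; then
    s=${range% *} e=${range#* }
    # echo "Counting from the end => start: $s end: $e"
    # The block starts s lines before EOF and is s - e + 1 lines long.
    tail -n "$s" "$f" | head -n "$(( s - e + 1 ))"
else
    echo "No blocks found" 1>&2
fi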

Explanation and example:

> cat file
startpattern_here
text
endpattern_here
startpattern_here
text
startpattern_here
42
endpattern_here
text
endpattern_here

In the worst case scenario, we have to search the whole file for a complete answer, so we use the fast grep for that. We start searching from the end, so it will get something like this:

1:endpattern_here
3:endpattern_here
5:startpattern_here
7:startpattern_here
8:endpattern_here
10:startpattern_here

which is piped to awk to decide if there is a valid last block or not. Note that here awk is being used for simple programming, not for the actual text processing. For a large input, grep is faster than searching the file with awk or even more, writing line by line with awk or sed.

Also, if a block between the patterns is found close to the end of the file, awk exits and closes its pipe, so the upstream commands exit too, without searching the whole file.

This way, we get the range, counting from the end, and finally tail and head seek() to those line numbers and "cat" the content. In case of empty range, there is no standard output.
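As a worked example of the counting-from-the-end arithmetic, here is a sketch that hard-codes the two numbers awk reports for the sample file (5 for the start line, 3 for the end line, both counted from the end). The variable names and the explicit length computation are illustrative assumptions.

```shell
f=$(mktemp)
printf '%s\n' startpattern_here text endpattern_here startpattern_here text \
    startpattern_here 42 endpattern_here text endpattern_here > "$f"

# Line numbers counted from the end of the file, as in the listing above:
s=5   # last start pattern that is directly followed by an end pattern
e=3   # the end pattern that closes it

# The block begins s lines before EOF and is (s - e + 1) lines long.
block=$(tail -n "$s" "$f" | head -n "$(( s - e + 1 ))")
printf '%s\n' "$block"
rm -f "$f"
```

Here tail keeps the last 5 lines and head keeps the first 5 - 3 + 1 = 3 of them, which is exactly the block from the start line to the end line.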

startpattern_here
42
endpattern_here
thanasisp
  • 8,122
0

Fast and simple sed-only solution. Most other solutions are either wasting resources by double-tac-ing, or even worse, loading whole input into memory at once, or doing multiple pass processing in some way.

This processes text line by line, so we only require memory for one copy of the matched block, and we don't fork and exec other programs, which would do even more extra processing. As a bonus, it is quite readable and understandable (well, as far as any sed script can be).

Instead of your: sed -n '/startpattern_here/,/endpattern_here/p' you do this:

sed -n '/startpattern_here/,/endpattern_here/H; /startpattern_here/h; ${g;p}'

Explanation (note: anything after ; is independent of previous commands, unless grouped with { and }):

  • first part /startpattern_here/,/endpattern_here/H is mostly similar to one in your question, but instead of outright printing to stdout everything found between start and end patterns, it instead appends that text to "hold space" (H).

  • /startpattern_here/h notices when NEW match starts, and erases previous hold space by overwriting it (h) with current pattern space. Note that next line in file will of course start executing all of our commands from scratch, which will keep appending to hold space (see above point) - result being that we will always keep in hold space only last matched block.

  • ${g;p} - $ address matches only on last line in file, so anything between { and } is executed only when we're finished with processing file. Here we simply print content of hold space (by g - copying hold space to pattern space, and p - printing pattern space)

For example, to get the basic info of the last Debian package:

% sed -n '/^Package/,/^Section/H; /^Package/h; ${g;p}' /var/lib/dpkg/status

Package: zsh-common
Status: install ok installed
Priority: optional
Section: shells
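The hold-space bookkeeping can also be tried on a tiny made-up input. This is a sketch: the START/END markers and the variable name are illustrative, and the extra ; before the closing } is there only because some non-GNU sed implementations expect it.

```shell
# Made-up input with two complete blocks; only the second should survive.
last=$(printf '%s\n' a START 1 END b START 2 END c \
  | sed -n '/START/,/END/H; /START/h; ${g;p;}')
printf '%s\n' "$last"
```

Each START line overwrites the hold space, so by the last line it contains only START, 2 and END, which is what gets printed.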

Matija Nalis
  • 3,111
  • Wouldn't this script fail if the start pattern were found after the last occurrence of the end pattern? – einpoklum Sep 12 '20 at 22:49
  • 1
    @einpoklum It would behave in the same way as your original sed script would (with the obvious exception from requirement of only printing last result instead of all of them). Note that even your original sed /start/,/end/p will print lines from start pattern until end pattern or until end of file occurs, whichever happens first. See EdMorton comments to your question if that is not what you actually want, and edit question to better define problem. – Matija Nalis Sep 13 '20 at 00:19