
How can I print all the lines between two lines starting with one pattern for the first line and ending with another pattern for the last line?


Update
I guess it was a mistake to mention that this document is HTML. I seem to have touched a nerve, so forget that. I'm not trying to parse HTML or do anything with it other than print a section of a text document.


Consider this example:

aaa
bbb
pattern1
aaa pattern2
bbb
ccc
pattern2
ddd
eee
pattern1
fff
ggg

Now, I want to print everything between the first instance of pattern1 starting at the beginning of a line and pattern2 starting at the beginning of another line. I want to include the pattern1 and pattern2 lines in my output, but I don't want anything after the pattern2 line.

pattern2 is found in one of the lines of the section. I don't want to stop there, but that's easily remedied by indicating the start of the line with ^.

pattern1 appears on another line after pattern2, but I don't want to look at that at all. I'm just looking for everything between the first instance of pattern1 and the first instance of pattern2, inclusive.

I found something that almost gets me there using sed:

sed -n '/^pattern1/,/^pattern2/p' inputfile.txt

... but that starts printing again at the next instance of pattern1.

I can think of a method using grep -n ... | cut -f1 -d: twice to get the two line numbers then tail and head to get the section I want, but I'm hoping for a cleaner way. Maybe awk is a better tool for this task?

When I get this working, I hope to tie this into a git hook. I don't know how to do that yet, either, but I'm still reading and searching :)

Thank you.

Vince
  • Use an HTML parser. – heemayl Feb 22 '16 at 07:42
  • @heemayl Is there a particular HTML parser you recommend? How do I do this from the command line? Is there something available on Unix & Linux that you can recommend? Even if there is, I'd probably prefer to do something with the tools already available on the Linux platform. – Vince Feb 22 '16 at 07:54
  • Check why http://stackoverflow.com/q/1732348/3789550 – heemayl Feb 22 '16 at 07:55
  • Though for general HTML an HTML parser is the way to go, if you have a particular HTML document in which the format obeys certain rules, you can avoid HTML parsers and use regular expressions. I think in this case the question is whether you need to do something about the minified parts or whether it's OK to print them as-is. It also depends on whether you are certain that the patterns you want don't match anywhere else in the text. – RealSkeptic Feb 22 '16 at 07:57
  • You guys saw "HTML" and went crazy. This doesn't have anything to do with parsing HTML, so I updated my question. – Vince Feb 22 '16 at 08:57
  • @heemayl The prize-winning answer to the question you linked is effectively an emotional rant that IMHO doesn't belong on SO. I guess the moderators disagree because of its humor and creative abuse of SO's posting system. – Vince Feb 22 '16 at 09:01
  • @Vince You got one comment from one user suggesting you use a parser. That hardly seems like "going crazy". While I agree that your particular case can indeed be dealt with using simple regexes, the general rule about not parsing HTML with regular expressions (again, not in your case, which is indeed very simple) is a valid one. While you can do it, it is very hard and rarely worth it. See here for how to do it correctly. I'm sure you'll agree that is way more complex than using a parser. – terdon Feb 22 '16 at 09:31
  • @terdon The linked question in the third comment is pretty crazy as the accepted answer is more of a rant and is posted in such a way that it can't be read and understood. I also thought it was pretty crazy that my post wasn't about parsing HTML at all, but the commenters only addressed the subject of parsing HTML. – Vince Feb 22 '16 at 11:24
  • Fair enough, I was just pointing out that you only got one comment about that, the other was pointing out that in your case, regexes are fine. The rant was the culmination of years of questions about attempting to parse HTML with regular expressions which are simply not up to the job unless you're someone like Tom Christiansen. It's a bit of a cult classic around here :) – terdon Feb 22 '16 at 11:32
  • This answer might also be applicable: https://stackoverflow.com/a/48022994/2026975 – imriss Dec 29 '17 at 17:02

2 Answers


You can make sed quit at a pattern with sed '/pattern/q', so you just need your matches and then quit at the second pattern match:

sed -n '/^pattern1/,/^pattern2/{p;/^pattern2/q}'

That way only the first block will be shown. The use of a subcommand ensures that ^pattern2 can cause sed to quit only after a match for ^pattern1. The two ^pattern2 matches can be combined:

sed -n '/^pattern1/,${p;/^pattern2/q}'
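
As a quick check, here is a sketch that runs both commands against the sample input from the question. The temporary file `sample.txt` and the variable names are only for illustration, and the one-line brace syntax assumes GNU sed.

```shell
# Recreate the sample input from the question in a temporary file
# (the file name sample.txt is hypothetical).
tmpdir=$(mktemp -d)
printf '%s\n' aaa bbb pattern1 'aaa pattern2' bbb ccc pattern2 \
    ddd eee pattern1 fff ggg > "$tmpdir/sample.txt"

# Range form: print each line of the range, quit at the first ^pattern2 in it.
out_range=$(sed -n '/^pattern1/,/^pattern2/{p;/^pattern2/q}' "$tmpdir/sample.txt")

# Combined form: the range runs to end of file, and q cuts it short.
out_combined=$(sed -n '/^pattern1/,${p;/^pattern2/q}' "$tmpdir/sample.txt")

printf '%s\n' "$out_range"
rm -r "$tmpdir"
```

Both forms should print the five lines from the first pattern1 through the first line that starts with pattern2, and nothing after that.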
Stephen Kitt
FelixJN
  • Excellent! That's exactly what I was looking for. And, wouldn't you know, it works just as well on an HTML file despite the knee-jerk reaction of some commenters to the original version of my question :) Thank you. – Vince Feb 22 '16 at 09:15
  • @Vince Well, the infuriated discussion makes me believe there is good reason for mentioning the dangers of HTML and regex. I cannot comment on this as I never work on HTML. – FelixJN Feb 22 '16 at 10:22
  • The second pattern should be in a subcommand, a la sed -n '/A/,/B/{p;/B/q}'. Otherwise, if /B/ matches before /A/, sed will quit before it gets to print anything. Also, repetition of the second pattern can be avoided like this: sed -n '/A/,${p;/B/q}'. – SU3 Jun 10 '20 at 04:42
  • Is there a way to make this work if pattern1 and pattern2 are the same? – SystemParadox Jul 09 '21 at 15:13
  • @SystemParadox I suggest you make this a new question. That will also get you better answers. – FelixJN Jul 09 '21 at 19:11

As a general approach, with sed, it's easy to print lines from one match to another inclusively:

$ seq 1 100 > test
$ sed -n '/^12$/,/^15$/p' test
12
13
14
15

With awk, you can do the same thing like this:

$ awk '/^12$/{flag=1}/^15$/{print;flag=0}flag' test
12
13
14
15

You can make these non-inclusive like this:

$ awk '/^12$/{flag=1;next}/^15$/{flag=0}flag' test
13
14

$ sed -n '/^12$/,/^15$/p' test | sed '1d;$d'
13
14
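
If the start pattern can occur again later in the file, as in the question, one possible variant is to have awk exit as soon as the end pattern is seen inside the block. The following is only a sketch: the temporary file `test2` and the variable names are made up for illustration, and the input is built so that 12 appears a second time after 15.

```shell
# Build an input in which 12 occurs again after 15
# (the file name test2 is hypothetical).
tmpdir=$(mktemp -d)
{ seq 1 20; seq 10 20; } > "$tmpdir/test2"

# Set the flag at the start pattern, print while the flag is set,
# and exit on the first end pattern seen inside the block.
first_block=$(awk '/^12$/{flag=1} flag{print} flag&&/^15$/{exit}' "$tmpdir/test2")

printf '%s\n' "$first_block"
rm -r "$tmpdir"
```

Because awk exits at the first 15 after the flag is set, the second 12 is never read, so only the lines 12 through 15 of the first block are printed.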
Will
  • That's close, but using your example the problem in my case is that 12 appears on another line after 15 and these all start processing again after it gets to the second 12. – Vince Feb 22 '16 at 09:06