match specific line and print all lines except the following line

Question

I have a task similar to the one in the thread "Print Matching line and nth line from the matched line".

I need to match a specific line and print it, skip the line that immediately follows it, and then print everything else until the next match, and so on.

In other words, I need to remove only those </s> lines that directly follow a line starting with <doc.

My file:

<doc>
</s>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
</s>
...

My required output:

<doc>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
...

It's not a code-writing service. You have to show us that you tried to do something – mrc02_kr, Sep 07 '17 at 07:56
Of course the comment of @mrc02_kr is right. But +1 for giving a precise definition of the question, even with input and output. That's unusual. — Volker Siegel, Sep 09 '18 at 11:45

Philippos · Accepted Answer · 2017-09-11T07:11:39.900

2

This is not too hard to figure out with basic sed knowledge:

sed '/<doc>/{n;/<\/s>/d;}'

For lines with <doc>, print the line and read the next one with n; then, if this following line contains </s> (the slash needs to be escaped), delete it with d.

More verbose explanation: /expression/{command;command;...;} means to execute the commands only on lines that match the pattern, so all other lines simply get printed as they are, while for the <doc> line, n is executed. This command prints the current line and reads the next one, so the following commands are executed on the next line. Here comes another command (d) with an "address" (/<\/s>/), so the line is deleted only if it contains </s>; otherwise it is printed. In either case the script continues with the following line.
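To see the behaviour described above on a small input, here is a minimal sketch. The function name drop_after_doc is made up for this illustration; the sed script itself is the one from the answer.

```shell
#!/bin/sh
# Hypothetical wrapper name; the sed script is the one-liner from the answer.
drop_after_doc() {
    # On a <doc> line: n prints it and loads the next line into the buffer;
    # d then deletes that next line only if it contains </s>.
    sed '/<doc>/{n;/<\/s>/d;}' "$@"
}

# Shortened sample input in the shape of the question.
printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '</s>' '</doc>' | drop_after_doc
```

The </s> directly after <doc> should disappear, while the </s> that closes the <s> block stays, because it is never reached through the n command.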

edited Sep 11 '17 at 07:11

answered Sep 07 '17 at 08:54

Philippos

13,453

thanks, this works perfectly! I did not know about using "{n". – Rodrigo Sep 11 '17 at 06:46
Actually, {n is not a command. I added a more detailed explanation for you. If the script solves your question, please consider marking it as answered for future readers. – Philippos Sep 11 '17 at 07:13
thanks, for the detailed explanation, now I completely understand. – Rodrigo Sep 11 '17 at 08:33

αғsнιη · Answer 2 · 2017-09-08T09:07:45.950

1

With GNU sed:

sed -z -i 's:<doc>\n</s>:<doc>:g' infile.txt

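This replaces every <doc> that is directly followed by a </s> line with just <doc>. The -i flag edits the file in place, and the g flag replaces all occurrences. The -z option makes sed use NUL characters instead of newlines as line separators, so a text file without NUL bytes is read as a single record and \n in the pattern can match.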
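To try this without touching a file, the same substitution can be run on standard input by leaving out -i. This is only a sketch: it assumes GNU sed (for -z), and the function name drop_after_doc_z is made up for the illustration.

```shell
#!/bin/sh
# Hypothetical wrapper name; requires GNU sed because of -z.
drop_after_doc_z() {
    # With -z the whole NUL-free input is a single record, so the
    # newline between <doc> and </s> is visible to the pattern.
    sed -z 's:<doc>\n</s>:<doc>:g' "$@"
}

printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '</s>' '</doc>' | drop_after_doc_z
```

Note that the whole input is held in memory as one record, which may matter for very large files.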

edited Sep 08 '17 at 09:07

answered Sep 07 '17 at 07:59

αғsнιη

41,407

\n in the pattern is a POSIX requirement supported by all sed implementations I ever met. But as sed works linewise this can't ever match unless you join lines first or use -z option with GNU sed – Philippos Sep 07 '17 at 08:47
1) Making -z default would break 99% of all scripts! 2) Without -z your script doesn't work with GNU sed 4.4 (like it doesn't with any other version). 3) I suppose you are talking about \n in the replacement string. In the matching pattern I can't think of any version not supporting it. – Philippos Sep 07 '17 at 10:23

The second version will still not work on any sed version. – Philippos Sep 08 '17 at 05:44

@Philippos Correct, I removed that part from my answer, but may I ask you why sed doesn't recognize newline there in find part and again I should use -z with that?. thank you – αғsнιη Sep 08 '17 at 09:09

2

Because the default behaviour of sed is to read one line of input to the buffer ("pattern space"), apply the script on it and proceed with the next line. So there will never be a newline in the pattern space unless you use a command like N, which appends the next line to the pattern space, letting you have two lines with a newline in between. You can run N;P;D patterns to always have two consecutive lines in the buffer, but that's beyond what can be explained in a comment. – Philippos Sep 08 '17 at 09:17
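As a small illustration of the N command mentioned in this comment, here is a sketch that joins each <doc> line with its successor and then strips a trailing </s>. It is a simpler use of N, not the full N;P;D loop, and the function name drop_after_doc_N is made up for the illustration.

```shell
#!/bin/sh
# Hypothetical wrapper name; shows N putting two lines into the pattern space.
drop_after_doc_N() {
    # $!N appends the next line (unless we are on the last line), so the
    # buffer holds "<doc>", a newline, and the next line. The s command
    # then removes the second line only if it is exactly </s>.
    sed '/<doc>/{$!N;s/\n<\/s>$//;}' "$@"
}

printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '</s>' '</doc>' | drop_after_doc_N
```

Because the newline only exists in the pattern space after N has run, the \n in the pattern can match here without -z.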

RomanPerekhrest · Answer 3 · 2017-09-07T09:14:16.060

0

As you tagged the question shell_script, I would suggest an awk approach:

awk '/^<doc>/ && getline nl > 0 && nl!~/^<\/s>/{ print $0 RS nl; next }1' file

The output:

<doc>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
...
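For comparison, the same filtering can be sketched in awk without getline, by remembering whether the previous line was a <doc> line. This is an alternative written for illustration, not the answerer's script; the function name drop_after_doc_awk and the variable prev_doc are made up.

```shell
#!/bin/sh
# Hypothetical wrapper name; a getline-free variant using a flag.
drop_after_doc_awk() {
    awk '
        # Previous line was <doc> and this one contains </s>: drop it.
        prev_doc && /<\/s>/ { prev_doc = 0; next }
        # Remember whether the current line is a <doc> line.
        { prev_doc = ($0 ~ /<doc>/) }
        # Print every line that was not dropped.
        1
    ' "$@"
}

printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '</s>' '</doc>' | drop_after_doc_awk
```

Each input line is read by the main loop exactly once, so the result should not depend on whether the input comes from a file or a pipe.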

edited Sep 07 '17 at 09:14

answered Sep 07 '17 at 08:02

RomanPerekhrest

30,212

This is not the real output of your script. Your script also removes the last line like all other lines following <doc>, although the question says "I need to remove only lines with </s> which follows the line starting with <doc" – Philippos Sep 07 '17 at 08:58
@Philippos, see my update – RomanPerekhrest Sep 07 '17 at 09:14
I suppose this works (I can't verify this as some shell magic gives me !~/^: event not found error for your script), and it confirms me to prefer sed over awk for most tasks. (-; – Philippos Sep 07 '17 at 10:01
@Philippos, strange, what's your shell that giving error? – RomanPerekhrest Sep 07 '17 at 10:03
bash 4.3.30 as well as 4.4.12, almost no custom settings. But I've seen it messing with ! even inside single quotes before. It's rare, but nasty. Not related to your script though. – Philippos Sep 07 '17 at 10:15
Oh! I just realized this only happens if I echo the text and pipe it to your awk script. No problems when working on a file. But your script gives me a trailing <doc> line for some reason (-: – Philippos Sep 07 '17 at 10:32
@Philippos, you may look https://ibb.co/bsoLYa – RomanPerekhrest Sep 07 '17 at 10:35
Yes, see my last comment. It happens only with piped input. – Philippos Sep 07 '17 at 10:37
I did analyze the problem. I suspect it's a bug, see my question here – Philippos Sep 08 '17 at 05:45
@Philippos, yes, you have posted an interesting question. I wish it was successfully answered – RomanPerekhrest Sep 08 '17 at 07:43
