-1

My task is similar to the one in the thread "Print Matching line and nth line from the matched line".

I need to match a specific line and print it, remove the line that immediately follows it, and then print the rest unchanged until the next match, and so on.

In other words, I need to remove a line containing </s> only when it directly follows a line starting with <doc.

My file:

<doc>
</s>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
</s>
...

My required output:

<doc>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
...
Philippos
  • 13,453
Rodrigo
  • 39
  • 4
    It's not a code-writing service. You have to show us that you tried to do something – mrc02_kr Sep 07 '17 at 07:56
  • Of course the comment of @mrc02_kr is right. But +1 for giving a precise definition of the question, even with input and output. That's unusual. – Volker Siegel Sep 09 '18 at 11:45

3 Answers

2

This is not too hard to figure out with basic sed knowledge:

sed '/<doc>/{n;/<\/s>/d;}'

For lines containing <doc>, n prints the line and reads the next one; if this following line contains </s> (the slash needs to be escaped), d deletes it.

More verbose explanation: /expression/{command;command;...;} means to execute the commands only on lines that match the pattern, so all other lines simply get printed as they are, while for the <doc> line, n is executed. This command prints the current line and reads the next one, so the following commands are executed on the next line. Next comes another command (d) with an "address" (/<\/s>/), so the line is deleted only if it contains </s>; otherwise it is printed. In either case the script continues with the following line.
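
To try the behaviour described above, here is a small self-contained check. The input is a shortened version of the data from the question, and the variable names `input` and `result` are placeholders chosen for this sketch only.

```shell
# Shortened sample of the question's input (variable names are placeholders).
input=$(printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '.' '</s>' '</doc>' '<doc>' '</s>')

# On each <doc> line: n prints it and loads the next line;
# d then deletes that next line only if it contains </s>.
result=$(printf '%s\n' "$input" | sed '/<doc>/{n;/<\/s>/d;}')

printf '%s\n' "$result"
```

The </s> lines that close a sentence are kept, because they do not come right after a <doc> line.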

Philippos
  • 13,453
  • thanks, this works perfectly! I did not know about using "{n". – Rodrigo Sep 11 '17 at 06:46
  • Actually, {n is not a command. I added a more detailed explanation for you. If the script solves your question, please consider marking it as answered for future readers. – Philippos Sep 11 '17 at 07:13
  • thanks, for the detailed explanation, now I completely understand. – Rodrigo Sep 11 '17 at 08:33
1

With GNU sed:

sed -z -i 's:<doc>\n</s>:<doc>:g' infile.txt

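This replaces <doc> followed by </s> on the next line with just <doc>. The -i flag edits the file in place, and the g flag replaces all occurrences. The -z option makes sed separate records at NUL characters instead of newlines, so a file without NUL bytes is read as a single record and \n in the pattern can match.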
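
A quick way to check this without touching a file is to leave out -i and pipe a sample through the command. This sketch assumes GNU sed, since -z is a GNU extension; the variable names `input` and `result_z` are placeholders.

```shell
# Requires GNU sed: -z is a GNU extension.
input=$(printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '.' '</s>' '</doc>' '<doc>' '</s>')

# With -z the whole input (it has no NUL bytes) is one record,
# so \n in the pattern can match the newline between two lines.
result_z=$(printf '%s\n' "$input" | sed -z 's:<doc>\n</s>:<doc>:g')

printf '%s\n' "$result_z"
```

Note that the pattern is not anchored, so it would also match <doc> at the end of a longer line followed by a line that merely begins with </s>.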

αғsнιη
  • 41,407
  • \n in the pattern is a POSIX requirement supported by all sed implementations I ever met. But as sed works linewise this can't ever match unless you join lines first or use -z option with GNU sed – Philippos Sep 07 '17 at 08:47
  • 1) Making -z default would break 99% of all scripts! 2) Without -z your script doesn't work with GNU sed 4.4 (like it doesn't with any other version). 3) I suppose you are talking about \n in the replacement string. In the matching pattern I can't think of any version not supporting it. – Philippos Sep 07 '17 at 10:23
  • The second version will still not work on any sed version. – Philippos Sep 08 '17 at 05:44
  • @Philippos Correct, I removed that part from my answer. But may I ask why sed doesn't recognize a newline in the search pattern there, so that I have to use -z for it? Thank you – αғsнιη Sep 08 '17 at 09:09
  • 2
    Because the default behaviour of sed is to read one line of input to the buffer ("pattern space"), apply the script on it and proceed with the next line. So there will never be a newline in the pattern space unless you use a command like N, which appends the next line to the pattern space, letting you have two lines with a newline in between. You can run N;P;D patterns to always have two consecutive lines in the buffer, but that's beyond what can be explained in a comment. – Philippos Sep 08 '17 at 09:17
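
As a rough sketch of the N;P;D idea mentioned in the comment above (this script is not from the thread, and the variable names are placeholders), the following keeps two consecutive lines in the pattern space so that \n can match between them:

```shell
input=$(printf '%s\n' '<doc>' '</s>' '<s>' 'Bla' '.' '</s>' '</doc>' '<doc>' '</s>')

# $!N  append the next input line to the pattern space (skipped on the last line)
# s    if the pattern space is exactly <doc>, newline, </s>, keep only <doc>
# P    print the pattern space up to the first newline
# D    delete up to the first newline and restart without reading new input
result_npd=$(printf '%s\n' "$input" | sed '$!N;s/^\(<doc>\)\n<\/s>$/\1/;P;D')

printf '%s\n' "$result_npd"
```

Unlike the -z variant, this does not need to hold the whole file in memory, and it only matches when the two lines are exactly <doc> and </s>.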