Remove three lines from file until a match

Question

982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

984
01:25:15,390 --> 01:25:18,484
( MAJESTIC MUSIC )

I want to delete three lines below 984 (inclusive). I tried this, but it doesn't work:

perl -0777 -pe 's/.*\n.*\n\(.*\)//'

Do you want to delete the last three lines in the file, or the three specific lines starting with 984 regardless of whether they are last or not? – Kusalananda, Mar 05 '23 at 16:55
Advice to newcomers: If an answer solves your problem, please accept it by clicking the large check mark (✓) next to it and optionally also up-vote it (up-voting requires at least 15 reputation points). If you found other answers helpful, please up-vote them. Accepting and up-voting helps future readers. — Gilles Quénot, Mar 06 '23 at 10:14
I see you are working on text records. Generalizing your question: do you want to seamlessly eliminate record 984, or other such records of your choice? Or do you want to retain the top line of each record (e.g. 984) as a "title", and then delete only the "body" of the record? Such a method would leave all record numbers sequentially intact (982, 983, 984...). – jubilatious1, Mar 21 '23 at 19:16
FYI, perl has a Subtitles module for working with subtitles. From the debian package (libsubtitles-perl) description: This module provides means for simple loading, re-timing, and storing these subtitle files. The module supports srt, sub, smi subtitle formats. A command-line tool 'subs' for manipulation of subtitles files is included in this package. BTW, see also https://metacpan.org/pod/App::SubtitleUtils – cas, Mar 30 '23 at 11:14

score 3 · Answer 1 · edited Mar 07 '23 at 15:01

Using a sed that understands relative addresses (non-standard, but generally supported):

$ sed '/^984$/,+2d' input_file
982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

Or, with any sed:

sed '/^984$/{$!N;$!N;d;}' input_file

That is, on the match, append the Next two following lines (if they exist) and delete them altogether.
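As a small sketch of why the `$!` guards matter, here is the portable command run on a made-up three-line input (not the question's file) where only one line follows the match. The second `$!N` is skipped because the first one already reached the last line, and `d` still removes what was collected:

```shell
#!/bin/sh
# Illustrative input (an assumption, not from the question): the match sits
# one line before the end, so only one following line exists.
short=$(printf '%s\n' 'keep' '984' 'only one line follows' |
        sed '/^984$/{$!N;$!N;d;}')

# prints: keep
printf '%s\n' "$short"
```

Note that when the matched record is not the last one in the file, deleting exactly three lines leaves the blank separator line behind, so the output will contain two consecutive blank lines at that spot.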

edited Mar 07 '23 at 15:01 by Philippos (13,453) · answered Mar 05 '23 at 17:16 by sseLtaH (2,786)

Gilles Quénot · Answer 2 · 2023-03-05T19:38:44.880

1

Like this:

In paragraph mode:

$ perl -00 -ne 'print unless /^984\b/' file

line by line:

$ perl -anE 'if ($F[0] == 984) { last } else { print }' file

slurping the whole file using regex ¹:

$ perl -0777 -ne 'print $& if /.*(?=\n^984)/ms' file
$ perl -gne 'print $& if /.*(?=\n^984)/ms' file # perl >= 5.36

Output

982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

¹ The regular expression matches as follows:

Regex	Description
`.*`	Match 0 or more of any character (including newlines, because of the `s` modifier)
`(?=`	Start a positive lookahead assertion
`\n`	Match a newline character
`^`	Match the start of a line (because of the `m` modifier)
`984)`	Match the 3 characters `984` and close the lookahead

edited Mar 05 '23 at 19:38 · answered Mar 05 '23 at 14:57 by Gilles Quénot (33,867)

The Perl "paragraph mode" example deletes an internal record starting with 984, but the Perl "line-by-line" example deletes everything from the line containing 984 to the end of the file. – jubilatious1 Mar 22 '23 at 17:53

Ed Morton · Answer 3 · 2023-03-05T21:08:23.547

Using any awk in paragraph mode (activated by RS=<null> and used when input records are separated by blank lines):

$ awk -v RS= -v ORS='\n\n' '$1 != 984' file
982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

If you couldn't use paragraph mode for some reason (e.g. maybe those apparently empty lines between records actually contain non-printable chars we can't see) and you truly did just want to delete 4 lines starting from 984 then you could do this instead (but it's less robust, see below):

$ awk '$1 == "984"{c=4} !(c&&c--)' file
982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

See printing-with-sed-or-awk-a-line-following-a-matching-pattern for related awk idioms.
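The `!(c&&c--)` condition in the script above is terse, so here is a spelled-out equivalent as a sketch. The sample input, with a hypothetical record 985 after the match, is an assumption added for illustration; the countdown logic is meant to behave the same as the original one-liner:

```shell
#!/bin/sh
# Illustrative input: record 984 sits between two other records.
out=$(awk '
  $1 == "984" { c = 4 }        # start a 4-line countdown at the matching line
  c > 0       { c--; next }    # while the countdown runs, skip the current line
  { print }                    # everything else is printed unchanged
' <<'EOF'
982
01:25:09,473 --> 01:25:10,978
Stay with me.

984
01:25:15,390 --> 01:25:18,484
( MAJESTIC MUSIC )

985
01:25:18,485 --> 01:25:18,500
( END CREDITS )
EOF
)
printf '%s\n' "$out"
```

Because the countdown covers four lines (the number, the timestamp, the text and the blank separator), exactly one blank line remains between records 982 and 985.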

Note that the first script is the most robust as it will ONLY match on exactly 984 on the first line after an empty line. You should include cases like:

950
01:25:09,473 --> 01:25:10,984
this is bad

951
01:25:09,473 --> 01:25:10,978
984 here is also bad

9841
01:25:09,473 --> 01:25:10,978
this is also bad

in your sample input/output to flush out scripts that would falsely match on the 3rd line of the record instead of just on the first or would do a partial instead of full match on the number you're targeting.
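One way to try this out, as a rough sketch: feed the edge cases above, plus a genuine record 984 (appended here as an assumption for illustration), to the paragraph-mode script and check that only the genuine record disappears.

```shell
#!/bin/sh
# Hypothetical test input: the three "bad" records from above, followed by
# a real record 984 that should be the only one removed.
kept=$(awk -v RS= -v ORS='\n\n' '$1 != 984' <<'EOF'
950
01:25:09,473 --> 01:25:10,984
this is bad

951
01:25:09,473 --> 01:25:10,978
984 here is also bad

9841
01:25:09,473 --> 01:25:10,978
this is also bad

984
01:25:15,390 --> 01:25:18,484
( MAJESTIC MUSIC )
EOF
)
# In paragraph mode each blank-line-separated block is one record and $1 is
# its first field (the record number), so 950, 951 and 9841 all survive.
printf '%s\n' "$kept"
```

A script that matches `984` anywhere on a line, or as a prefix of `9841`, would remove more than the last record from this input.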

score 0 · Answer 4 · answered Mar 05 '23 at 15:53

With awk you can use a line like this:

awk '$0==984 {getline;getline;getline} {}1'  input_file >output_file

The idea is that the getline command reads the next line from the input stream. If we are talking about subtitles, you should skip this line and the next 3:

awk '$0==984 {getline;getline;getline;getline} {}1'  input_file >output_file

answered Mar 05 '23 at 15:53 by Romeo Ninov (17,484)

That chain of getlines would leave $0 set to the last successful call to getline so you'd get the 3rd or 4th line after 984 printed instead of deleted. YMMV with not testing the results of getline and obviously hard-coding as many calls to getline as lines you want to delete isn't ideal. – Ed Morton Mar 05 '23 at 16:19
@EdMorton, if these really are subtitles, the next line will be empty and should be deleted (it's not good to have more than one empty line, but it's not a catastrophe). You are right about hard-coding, but this is a quick and dirty (and sample) way to do the work :) – Romeo Ninov Mar 05 '23 at 16:23
I'm not sure what you mean about "subtitles", sorry. What I'm saying is that if you execute $0==984 {getline;getline;getline} {}1 then at 1, given the OP's sample input, $0 will be populated with ( MAJESTIC MUSIC ) and so that line will be printed. You can solve that and the extra blank line problem by adding a ;next after the last getline in each block. You could just do for (i=1;i<=4;i++) getline to avoid hard-coding 4 calls to getline and it's no less fast, just a bit less dirty :-) – Ed Morton Mar 05 '23 at 16:46
@EdMorton, I see your point about the code. To clarify, it works for me, and yes, a for loop would be a possible improvement. About subtitles: think of a movie and the text at the bottom of the screen that shows what the actors say (or describes any music or noise). – Romeo Ninov Mar 05 '23 at 17:05
Oh, yeah, I can see that now. I thought it was something like a track of songs with titles. – Ed Morton Mar 05 '23 at 17:12
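The fix suggested in the comments above (a loop plus `next`) might look like the following sketch. The variable `n`, the check on getline's return value and the sample input are illustrative additions, not part of the original answer. Here `n` counts the matching line itself, so `n=3` removes 984 and the two lines after it:

```shell
#!/bin/sh
# Sample input assumed for illustration: record 984 is the last record and
# the file has no trailing blank line, the case where the original fails.
out=$(awk -v n=3 '
  $0 == "984" {
      # The matching line is the first deleted line, so read (and discard)
      # up to n-1 more; stop early if getline hits end of input or an error.
      for (i = 1; i < n; i++)
          if ((getline) <= 0) break
      next    # do not print whatever is currently in $0
  }
  { print }
' <<'EOF'
982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

984
01:25:15,390 --> 01:25:18,484
( MAJESTIC MUSIC )
EOF
)
printf '%s\n' "$out"
```

With `next` in place, the line left in `$0` by the last getline is no longer printed, and the early `break` avoids relying on a getline that has already failed.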

jubilatious1 · Answer 5 · 2023-03-22T01:40:57.807

Using Raku (formerly known as Perl_6)

~$ raku -ne '.put unless /^ 984 $/ fff *.chars == 0 ;'  file
#OR
~$ raku -ne '.put unless /^ 984 $/ fff {.chars == 0} ;'  file

The above code uses Raku's fff "flip-flop" operator, which detects a record starting with 984 and ending on a blank line (.chars equal zero). Note the above code makes no attempt to detect a blank line before 984.

Sample Input:

982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

984
01:25:15,390 --> 01:25:18,484
( MAJESTIC MUSIC )

985
01:25:18,485 --> 01:25:18,500
( END CREDITS )

Sample Output (1):

982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

985
01:25:18,485 --> 01:25:18,500
( END CREDITS )

Raku provides a number of fff variants to leave either-or-both of the two recognition sequences in the return. They are ^fff or fff^ or ^fff^. This reduces the need to use lookaheads/lookbehinds. For example, simply change fff in the above code to ^fff^ to get the following return:

Sample Output (2):

982
01:25:09,473 --> 01:25:10,978
Stay with me.

983
01:25:09,473 --> 01:25:10,978
Stay with me.

984

985
01:25:18,485 --> 01:25:18,500
( END CREDITS )

If you want/need to separate records first, slurp the file in all at once and simply split on \n\n consecutive newlines. Then the remainder of the code simplifies to the following, but unfortunately adds two blank lines to the very end of the file:

~$ raku -e 'for slurp.split("\n\n") { put $_ ~ "\n"  unless /^984 / };' file

Not to worry, see the first link below to use Raku for removing blank lines at the beginning/end of a file.

https://unix.stackexchange.com/a/725227/227738
https://docs.raku.org
https://raku.org