1

Given a 425M sized text file with the following content:

--START--
Data=asdfasdf
Device=B
Lorem=Ipsum
--END--
--START--
Data=asdfasdf
Lorem=Ipsum
Device=A
--END--
--START--
Device=B
Data=asdfasdf
--END--
...

The sed task is to print everything between --START-- and --END--, where Device=A is included. There are two solutions provided here and here. There is huge execution time difference between both commands. The second command is quite faster, but needs more description for me how it works?

$ sed -n '/--START--/{:a;N;/--END--/!ba; /Device=A/p}' file
$ sed 'H;/--START--/h;/--END--/!d;x;/Device=A/!d' file

The description of the first command:

How it works:

/--START--/{...} Every time we reach a line that contains --START--, run the commands inside the braces {...}.

:a; Define a label "a".

N; Read the next line and add it to the pattern space.

/--END--/!ba Unless the pattern space now contains --END--, jump back to label a.

/Device=A/p If we get here, that means that the patterns space starts with --START-- and ends with --END--. If, in addition, the pattern space contains Device=A, then print (p) it.

Description of 2nd command:

sed 'H              #add line to hold space
     /--START--/h   #put START into hold space (substitute holded in)
     /--END--/!d    #clean pattern space (start next line) if not END
     x              #put hold space into pattern space
     /Device=A/!d   #clean pattern space if it have not "Device=A"
    ' file
Hölderlin
  • 1,186
  • Is Device=A always next line after --START--? – Cyrus Jun 06 '23 at 20:38
  • Why does it have to be sed? Have you tried awk or perl? – Cyrus Jun 06 '23 at 20:39
  • @Cyrus "Is Device=A always next line after --START--?" no. "... Have you tried awk or perl?" was no need till now, because the 2nd does job well. – Hölderlin Jun 06 '23 at 20:50
  • I had misunderstood your question. – Cyrus Jun 06 '23 at 21:38
  • An important point to note is that the d delete command stops execution of the rest of the script, and we read a new line and go back to the start of the script. This means there is a tight loop over the first 3 commands until we match the END line. This loop accumulates the data into the hold space. The match on START clears the hold space and puts just the start line in there. – meuh Jun 07 '23 at 06:20

1 Answers1

3

One thing to keep in mind is that regex matching is "expensive"... so the more stuff you have in the pattern buffer the slower the search gets.
In this particular case, sed has to find three patterns (let's number them as 1, 2 and 3): the range START (1), the range END (2) and a MATCH (3) (if any) in that range.

The main difference between the two solutions is the buffer used to store all the lines in a range which in turn determines how the end of range is detected.

The 1st solution works by searching for START (1) on each line and once it finds it, it starts appending lines to the pattern space and has to check for the END (2) of range at every iteration (i.o.w. every time it adds a new line to the data in the pattern space, it searches the whole buffer again for END so as to know when to stop). Once it finds it, it searches the entire pattern space for MATCH (3).

The 2nd solution works differently: it unconditionally accumulates lines in the hold space via H, it does pattern matching on every single line twice: to determine the START (1) and respectively the END (2) of range. This is very fast. Once it detects the END of range it exchanges the buffers (so now the pattern space contains all the lines that were accumulated in the hold space) and searches the entire pattern space for MATCH (3).

As you can see, (3) is identical in both cases: both sed scripts run a single search for MATCH, once the pattern space contains all the lines from START till END. So it's not the search for MATCH that separates the two solutions. The main difference here is caused by (2):
The second solution searches for END on every line - if the line doesn't contain END, it deletes it from the pattern space and restarts the cycle i.e. it pulls in another line and, again, tries to find the END and so on. Until it finds the END, there is never more than one line in the pattern space.
The first solution, by contrast, will execute a;N;/--END--/!ba over and over again on a increasingly bigger text buffer even though the difference from the previous run consists of a single line. This is never a good thing when working with large text files - imagine having START-END ranges spanning over thousands of lines...

In short: searching for the END of range is what slows it down to a crawl.


A good example of how slow the first technique is, compared to the second one, can be found here:

Turn list into single line with delimiter

As you can see in my tests there, the first solution couldn't even finish the test.

don_crissti
  • 82,805
  • I don't understand the difference of both procedure descriptions, with respect to run time or space requirements, when input size grows. For me both sounds similar to "search for --start--, then read lines till --end-- into buffer. If buffer not contains Device=A, then delete block else print. Search next start..." Is it possible to compare the key difference to any algorithm principal? – Hölderlin Jun 09 '23 at 02:24
  • @Hölderlin - fair enough, I hope my edit makes it easier to understand – don_crissti Jun 09 '23 at 16:25