7

I need to split the contents of a buffer into a list of strings. The null character is used to separate the items.

It the items were separated by newline characters, then I could use the same approach as process-lines:

(let (lines)
  (while (not (eobp))
    (setq lines (cons (buffer-substring-no-properties
               (line-beginning-position)
               (line-end-position))
              lines))
    (forward-line 1))
  (nreverse lines))

I assume forward-line is efficient, but the use of line-beginning-position and line-end-position is a bit suspicious. But since the null character is used I cannot do that anyway.

One way of doing it would be:

(split-string (buffer-string) "\0")

I was also considering this variation:

(split-string (buffer-substring-no-properties (point-min)
                                              (point-max))
              "\0")

Is that actually more efficient? The text in the buffer is not propertized, but I would imagine that looking for the non-existent properties would still add an overhead.

Instead of reading the buffer into a string and then splitting the string I would like to instead work on the buffer directly - again assuming that that actually is more efficient.

(let ((beg (point))
      items)
  (while (search-forward "\0" nil t)
    (push (buffer-substring-no-properties beg (1- (point))) items)
    (setq beg (point)))
  (nreverse items))

Does something like search-forward-char exist and would that be more efficient than search-forward?

I suppose I could use:

(while (not (= (char-after) ?\0)) (forward-char))

But I would expect that to be available as a function if it were more efficient than search-forward.

Drew
  • 75,699
  • 9
  • 109
  • 225
tarsius
  • 25,298
  • 4
  • 69
  • 109
  • 1
    `(skip-chars-forward "^\0")` should do the job. – Tobias Mar 06 '18 at 22:31
  • @Tobias Just beat me to it. :) It's almost three times as fast as `(search-forward "\0" nil t)` on my machine. – Basil Mar 06 '18 at 22:32
  • @Basil Nevertheless, the overall program needs to be profiled. Often pure c-functions beat byte compiled stuff. So maybe the `(split-string (buffer-substring-no-properties) "\0")` variant wins. Furthermore, the performance may depend on the structure of the text. (Are there many short tokens terminated by null-characters or are there large tokens with only a few null-characters.) – Tobias Mar 06 '18 at 22:35
  • @Tobias I know, I was going to do some further tests out of curiosity anyway. @tarsius Note that `char-after` can return `nil`. – Basil Mar 06 '18 at 22:37
  • FWIW, `buffer-substring-no-properties` is ever so negligibly faster than `buffer-{sub,}string` in unpropertised buffers because it avoids this branch: http://git.savannah.gnu.org/cgit/emacs.git/tree/src/editfns.c#n2808. If the buffer contains many text properties, e.g. fontification, it is obviously **significantly** faster. – Basil Mar 06 '18 at 22:51
  • Good question.. – Drew Mar 06 '18 at 22:54
  • 1
    Did you make sure that splitting the buffer string at `0` is really the bottleneck of your application before you dug that deep? – Tobias Mar 06 '18 at 23:23
  • FWIW in comments to http://irreal.org/blog/?p=6124 , when comparing `search-forward`+`insert` with using `mapconcat` on `split-string`, the latter was quicker (I think generally, but especially so when the split character was more numerous). – phils Mar 07 '18 at 01:15
  • I note that `skip-chars-forward` has its own byte-compiled opcode, which accounts for its speed over `search-forward`. https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/bytecomp.el#L679 (and credit to http://nullprogram.com/blog/2017/01/30/ ). – phils Mar 07 '18 at 01:26
  • @Tobias Maybe re-write your comment as answer? – Andreas Röhler Mar 07 '18 at 07:51
  • @Basil Please write up an answer. You did much more research on that topic than me (the timing for an instance and you spoke already about some other test). I think you have some hard facts that I don't have. – Tobias Mar 07 '18 at 07:55
  • Why do you need to keep substrings, if you don't erase the buffer? Can you keep positions instead? – wvxvw Mar 07 '18 at 12:44
  • No I did not run any benchmarks and don't actually expect this to make much of a difference but still feel like doing that right thing anyway. – tarsius Mar 07 '18 at 13:08
  • I am collecting strings because that is the purpose of the function I am writting - something like `process-lines`, but for null character separated items not lines. – tarsius Mar 07 '18 at 13:09
  • @Tobias Done, please review. – Basil Mar 07 '18 at 17:18

1 Answers1

10

I have run the following benchmarks on

GNU Emacs 27.0.50
(build 14, x86_64-pc-linux-gnu, X toolkit, Xaw3d scroll bars)
of 2018-02-21

without customisations, i.e. by starting Emacs with the -Q flag.

Is there a more efficient alternative to search-forward when searching for a single character?

[...]

Does something like search-forward-char exist and would that be more efficient than search-forward?

As @Tobias correctly points out in a comment, a faster alternative to search-forward when searching for a single character is skip-chars-forward. Some benchmarks follow.

Null character at buffer end

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Newline-terminated line of random printable ASCII
    (insert (make-string 200 (+ #x20 (random #x5e))) ?\n))
  ;; NUL
  (insert 0)
  (garbage-collect)
  (message "a: %s" (benchmark-run-compiled 1000
                     (goto-char (point-min))
                     (search-forward "\0")))
  (message "b: %s" (benchmark-run-compiled 1000
                     (goto-char (point-min))
                     (skip-chars-forward "^\0"))))

gives

a: (6.959186105 0 0.0)
b: (2.527484532 0 0.0)

Long null-terminated lines

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Null-terminated line of random printable ASCII
    (insert (make-string 200 (+ #x20 (random #x5e))) 0))
  (garbage-collect)
  (message "a: %s" (benchmark-run-compiled 1000
                     (goto-char (point-min))
                     (while (search-forward "\0" nil t))))
  (message "b: %s" (benchmark-run-compiled 1000
                     (goto-char (point-min))
                     (while (progn (skip-chars-forward "^\0")
                                   (not (eobp)))
                       (forward-char)))))

gives

a: (10.596461232 0 0.0)
b: (4.896477926  0 0.0)

Short null-terminated lines

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Null-terminated line of random printable ASCII
    (insert (make-string 4 (+ #x20 (random #x5e))) 0))
  (garbage-collect)
  (message "a: %s" (benchmark-run-compiled 1000
                     (goto-char (point-min))
                     (while (search-forward "\0" nil t))))
  (message "b: %s" (benchmark-run-compiled 1000
                     (goto-char (point-min))
                     (while (progn (skip-chars-forward "^\0")
                                   (not (eobp)))
                       (forward-char)))))

gives

a: (3.642238859 0 0.0)
b: (2.281851267 0 0.0)

Note that the smaller time difference with short lines is likely due to the higher loop complexity of test (b). Furthermore, inverting the direction of the search (i.e. using point-max, skip-chars-backward, bobp, and backward-char) makes no noticeable difference.

Is that actually more efficient? The text in the buffer is not propertized, but I would imagine that looking for the non-existent properties would still add an overhead.

Let's see:

(defun a ()
  (buffer-string))

(defun b ()
  (buffer-substring (point-min) (point-max)))

(defun c ()
  (buffer-substring-no-properties (point-min) (point-max)))

(dolist (f '(a b c))
  (byte-compile f))

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Random-length random-printable-ASCII newline-terminated line
    (dotimes (_ (random 200))
      (insert (+ #x20 (random #x5e))))
    (insert ?\n))
  (garbage-collect)
  (message "a: %s" (benchmark-run 1000 (a)))
  (garbage-collect)
  (message "b: %s" (benchmark-run 1000 (b)))
  (garbage-collect)
  (message "c: %s" (benchmark-run 1000 (c))))

gives

a: (7.069123577999999 1000 6.8170885259999885)
b: (7.072005507999999 1000 6.819331175000003)
c: (7.064939498999999 1000 6.812288113000008)

so no difference in an unpropertised buffer. Note that I had to place the call to buffer-string in a separate byte-compiled function, otherwise it would have been optimised to a constant under benchmark-run-compiled.

Instead of reading the buffer into a string and then splitting the string I would like to instead work on the buffer directly - again assuming that that actually is more efficient.

Let's check. The following three functions should give the same result:

(defun a ()
  (split-string (buffer-string) "\0"))

(defun b ()
  (goto-char (point-min))
  (let (l)
    (while (let ((p (point)))
             (push (buffer-substring-no-properties
                    p (+ p (skip-chars-forward "^\0")))
                   l)
             (not (eobp)))
      (forward-char))
    (nreverse l)))

(defun c ()
  (goto-char (point-max))
  (let (l)
    (while (let ((p (point)))
             (push (buffer-substring-no-properties
                    p (+ p (skip-chars-backward "^\0")))
                   l)
             (not (bobp)))
      (backward-char))
    l))

(dolist (f (a b c))
  (byte-compile f))

Null character at buffer end

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Newline-terminated line of random printable ASCII
    (insert (make-string 200 (+ #x20 (random #x5e))) ?\n))
  ;; NUL
  (insert 0)
  (garbage-collect)
  (message "a: %s" (benchmark-run 100 (a)))
  (garbage-collect)
  (message "b: %s" (benchmark-run 100 (b)))
  (garbage-collect)
  (message "c: %s" (benchmark-run 100 (c))))

gives

a: (2.46373737  200 1.5349787340000005)
b: (1.046089159 100 0.7499454190000003)
c: (1.040357797 100 0.7460460909999975)

Long null-terminated lines

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Null-terminated line of random printable ASCII
    (insert (make-string 200 (+ #x20 (random #x5e))) 0))
  (garbage-collect)
  (message "a: %s" (benchmark-run 100 (a)))
  (garbage-collect)
  (message "b: %s" (benchmark-run 100 (b)))
  (garbage-collect)
  (message "c: %s" (benchmark-run 100 (c))))

gives

a: (4.065745779999999  300 2.3008262569999927)
b: (2.787263217        274 2.097104968000009)
c: (2.7745770399999996 275 2.112500514999999)

Short null-terminated lines

(with-temp-buffer
  (dotimes (_ 10000)
    ;; Null-terminated line of random printable ASCII
    (insert (make-string 4 (+ #x20 (random #x5e))) 0))
  (garbage-collect)
  (message "a: %s" (benchmark-run 100 (a)))
  (garbage-collect)
  (message "b: %s" (benchmark-run 100 (b)))
  (garbage-collect)
  (message "c: %s" (benchmark-run 100 (c))))

gives

a: (1.346149274 85 0.640683847)
b: (1.010766266 80 0.6072433190000055)
c: (0.989048037 80 0.6078114269999908)

So, you can probably get a ~2x speedup by using skip-chars-{forward,backward}, but as @Tobias points out, is it worth the extra complexity?

Basil
  • 12,019
  • 43
  • 69