2

Given a multibyte string "x=π", how to get 4-th byte of the string, without creating string copy with something like string-as-unibyte?

Pseudo code: (string-nth-byte 4 "x=π").

Expected result: 128.

"x=π" = [120 61 207 128] <- bytes
         0   1  2   3    <- index

Any solution that is reasonably efficient is considered valid. O(1) run time complexity is preferred. If there is Emacs builtin for that, I will be very happy.

Notes:

  • Allocations must be avoided;
  • get-byte fails on multibyte chars;
  • I am aware of unibyte strings. They are not the answer;

If you ask: "Why on earth you need to do this?", the answer is this.

  • 1
    Uni- and multi-byte string implementations are very C-based in elisp, so you may wish to consider making (or requesting) such improvements upstream at C level. – phils Jul 06 '17 at 02:38
  • @phils, do you mean patching Emacs C code? I think it is risky option if one wants to distribute his program among other Emacs users. Emacs C `module` could help here because it is easier to deliver, but it will not provide good performance (calling C module function is very costly). Anyway, it *is* a solution actually. Probably I will try it out later. – Iskander Sharipov Jul 06 '17 at 08:01
  • 1
    Yes, I mean get it dealt with upstream in the Emacs C code. Doing so doesn't help people running older versions of Emacs, but at a glance it seems like the best (surely the most efficient) way to be introducing such a function. It doesn't prevent you from using a non-C fallback option on older versions of Emacs. – phils Jul 06 '17 at 08:34

1 Answers1

1

Solution presented in this answer is not perfect.

It meets O(1) criteria only for ASCII string inputs.

;;; -*- lexical-binding: t -*-

(defun utf8-decode-byte (ch n ch-size)
  "Returns integer (uint8) value of N-th byte of
utf-8 encoded char CH.
CH-SIZE is (utf8-char-width CH)."
  ;; Does not check for surrogates for simplicity.
  (pcase ch-size
    (`1 ch)
    (`2 (pcase n
          (`0 (logior #xC0 (lsh ch -6)))
          (_ (logior #x80 (logand #x3F ch)))))
    (`3 (pcase n
          (`0 (logior #xE0 (lsh ch -12)))
          (`1 (logior #x80 (logand #x3F (lsh ch -6))))
          (_ (logior #x80 (logand #x3F ch)))))
    (`4 (pcase n
          (`0 (logior #xF0 (lsh ch -18)))
          (`1 (logior #x80 (logand #x3F (lsh ch -12))))
          (`2 (logior #x80 (logand #x3F (lsh ch -6))))
          (_ (logior #x80 (logand #x3F ch)))))))

(defun mb-string-nth-byte (pos s)
  "Get byte of multi-byte string S at POS index."
  (let ((len (string-bytes s)))
    (if (<= pos (/ len 2))
        (mb-string--nth-byte-fwd pos s)
      (mb-string--nth-byte-rev len pos s))))

;; Part of `mb-string-nth-byte' implementation.
(defun mb-string--nth-byte-fwd (pos s)
  (let* ((i 1)
         (ch (aref s 0))
         (ch-size (utf8-char-width ch))
         (offset ch-size))
    (while (<= offset pos)
      (setq ch (aref s i)
            ch-size (utf8-char-width ch)
            offset (+ offset ch-size)
            i (1+ i)))
    (utf8-decode-byte ch (- ch-size (- offset pos)) ch-size)))

;; Part of `mb-string-nth-byte' implementation.
;; Like `mb-string--nth-byte-fwd', but string is
;; traversed in reverse order.
(defun mb-string--nth-byte-rev (byte-len pos s)
  (let* ((i (- (length s) 2))
         (ch (aref s (- (length s) 1)))
         (ch-size (utf8-char-width ch))
         (offset (- byte-len ch-size 1)))
    (while (>= offset pos)
      (setq ch (aref s i)
            ch-size (utf8-char-width ch)
            offset (- offset ch-size)
            i (1- i)))
    (utf8-decode-byte ch (- (- pos offset) 1) ch-size)))

(defun string-nth-byte (pos s)
  "Get byte of single-byte/multi-byte string S at POS index."
  (if (not (multibyte-string-p s))
      (aref s pos) ;; Fast path
    (mb-string-nth-byte pos s)))

This answer will not be accepted.