
In the GNU Emacs Lisp Reference Manual, 6.2 Arrays:

In practice, we always choose strings for such applications, for four reasons:

  • They occupy one-fourth the space of a vector of the same elements.

How should I understand this sentence in terms of how these two data structures are laid out in memory?

shynur

1 Answer


A vector of characters needs eight¹ bytes per character. Meanwhile a string will be encoded in UTF-8, and thus needs a variable number of bytes per character, from one to four. Most of the time the characters in your average string are ASCII, and thus take up only a single byte each. But it is a simplification to say that the ratio is always one particular value; you might have a string of characters that each need 3 bytes to encode, so the string will be ⅜ the size of an equivalent vector.

¹: A modern 64-bit build of Emacs will naturally use 64-bit words to store both integers and pointers.
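
As a rough check of the arithmetic above, here is a minimal sketch you can evaluate in `M-x ielm` or with `C-x C-e`. The helper `my/string-vs-vector-bytes` is a hypothetical name, not part of Emacs, and the 8 bytes per vector element is the assumption from the footnote (a 64-bit build). It compares only the payload and ignores the header that every string and vector object also carries.

```elisp
;; Hypothetical helper, not part of Emacs.  Compares payload sizes only;
;; per-object headers are ignored.
(defun my/string-vs-vector-bytes (str)
  "Return (CHARS STRING-BYTES VECTOR-BYTES) for STR.
VECTOR-BYTES assumes 8 bytes per element, i.e. a 64-bit build."
  (let ((word-size 8))                  ; assumption: 64-bit Lisp words
    (list (length str)                  ; number of characters
          (string-bytes str)            ; bytes used by the string's text
          ;; `vconcat' turns the string into a vector of character codes.
          (* word-size (length (vconcat str))))))

(my/string-vs-vector-bytes "hello")    ; ⇒ (5 5 40), ratio 1/8
(my/string-vs-vector-bytes "日本語")    ; ⇒ (3 9 24), ratio 3/8
```

The second call shows the ⅜ case: three characters at 3 bytes each in the string, against three 8-byte words in the vector.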

db48x
  • Anywhere from 1/4 to 1. – NickD Mar 01 '23 at 18:40
  • Do you mean that a vector stores the pointers to its elements and a string stores characters directly? But [6.2 Arrays](https://www.gnu.org/software/emacs/manual/html_node/elisp/Arrays.html) says “Any element of an array may be accessed in constant time.” and a string is a kind of array; so if a string stores characters directly, it's unable to access any element in O(1) time because of UTF-8. – shynur Mar 02 '23 at 02:55
  • No, I mean what I said. An array of characters holds characters. [Chapter 2.4.3 Character Type](https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html) defines what this means: “A ‘character’ in Emacs Lisp is nothing more than an integer. In other words, characters are represented by their character codes.” Unicode characters are numbered from 0 up to 0x10FFFF (not quite 2²¹-1) and thus a whole integer is used to store each one. Actually I just double-checked, and in a 64-bit build of Emacs integers are now 64 bits long, less type bits, so the ratio is usually 8:1. – db48x Mar 02 '23 at 04:08
  • Meanwhile strings use either one byte per character (unibyte strings) or a variable number of bytes per character (multibyte strings). 99% of the time you should be dealing with multibyte strings; there’s mostly no need to create unibyte strings outside of some odd circumstances. See [chapter 34.1 Text Representations](https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html). – db48x Mar 02 '23 at 04:14
  • @db48x: So in most cases a string is multibyte, and therefore cannot access its elements in constant time? – shynur Mar 02 '23 at 05:34
  • Most of the time it is still effectively constant time, because you will be iterating over the characters in the string (and thus accessing the characters in order), or because there is ancillary information such as the start or end offset of a text property that you are accessing. But no, Emacs no longer guarantees exactly constant access times to characters in strings. But because of backwards compatibility, the interface is the same as for arrays (you access individual codepoints with aref). – db48x Mar 02 '23 at 06:03
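
To make the unibyte/multibyte distinction and the `aref` interface from the comments concrete, here is a small sketch. The function name `my/describe-string` is hypothetical; the calls it uses (`multibyte-string-p`, `string-bytes`, `aref`, `encode-coding-string`) are standard Emacs Lisp functions. It assumes the string has at least three characters.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my/describe-string (str)
  "Return (MULTIBYTE-P CHARS BYTES THIRD-CHAR ENCODED-MULTIBYTE-P) for STR.
STR is assumed to contain at least three characters."
  (list (multibyte-string-p str)       ; t for a string with non-ASCII text
        (length str)                   ; length counts characters
        (string-bytes str)             ; string-bytes counts bytes
        (aref str 2)                   ; indexed by character, not by byte
        ;; Encoding to UTF-8 yields a unibyte string of raw bytes.
        (multibyte-string-p (encode-coding-string str 'utf-8))))

;; "a" takes 1 byte, "λ" takes 2, "日" takes 3.
(equal (my/describe-string "aλ日")
       (list t 3 6 ?日 nil))            ; ⇒ t
```

Note that `aref` takes a character index even though the characters occupy different numbers of bytes, which is why the lookup cannot be a plain offset computation on a multibyte string.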