3

This is really a question about implementation rather than anything a user of GNU Emacs would be interested in.

Emacs regexps are represented as strings. There is a nice rx DSL for constructing regexps out of s-expressions, but it produces a string not some other more fundamental data structure. I assume that the underlying C regexp engine uses something lower level than strings to represent a regexp matcher.

Does emacs parse / compile every regexp string every time a procedure is called, or is there some memoization going on there under the hood... e.g. an LRU of the regexp string to the underlying compiled data structure. Is it possible to customise any of the parameters for that cache as a user?

fommil
  • 1,750
  • 11
  • 24

1 Answers1

6

From a cursory look all regex matching functions end up using compile_pattern which first checks the built-in regex cache for a previously compiled one, compiling one when needed. The cache is a linked list with a hardcoded size of 20 items. New matches are put in the front of the list, essentially creating a LRU mechanism. See search.c for the details.

wasamasa
  • 21,803
  • 1
  • 65
  • 97
  • Thankyou! Great answer. 20 feels very small for modern computers, I might experiment with using 100 or 1000. – fommil Mar 30 '20 at 21:01
  • Some anecdotal data: my ert-runner test suite takes (consistently) 5% less time to run with a cache size of 256 vs the default of 20. My tests are primarily testing fontification and lexing, so making heavy use of regexps. It's enough for me to want to apply this patch and use it all the time. BTW I also got a 10% improvement just by compiling with -O3 and machine native optimisations so I'm quite happy generally and glad I undertook this exercise :-D – fommil Mar 31 '20 at 19:51
  • 1
    Hm, you might want to mention this on emacs-devel then, along with a trivial patch and benchmark code. – wasamasa Mar 31 '20 at 20:39