What are the differences between the regular expression engines such as emacs and posix-egrep?

Question

The GNU implementation of the find command uses "Emacs Regular Expressions" by default for its -regex predicate. This can be changed to options such as posix-egrep.

Besides syntax, how do these engines differ?

For example, do they differ in performance or simplicity?

Should a particular engine be used for specific scenarios (above and beyond personal preferences)?

The version of find is find (GNU findutils) 4.7.0-git

Related: Why does my regular expression work in X but not in Y? — Kusalananda, Jan 05 '20 at 17:37
@Kusalananda - The answer suggests that the only difference is syntax. The question I have posted is asking if there are other contrasts. — Ryan, Jan 05 '20 at 17:42
@Ryan The answer also states differences in capabilities (lookahead/lookbehind, for instance, which is not just a matter of syntax). As to performance, it's more a matter for specific implementation. — xenoid, Jan 05 '20 at 17:47
Which is why I did not mark it as a duplicate. If you are interested in specific implementations, you may want to mention exactly what find you are using. I'm assuming it's GNU find, but the libraries it's using may differ between versions and Unixes. — Kusalananda, Jan 05 '20 at 17:47
@Kusalananda - Thanks. I wasn't aware that there were variations of find. I have included the version in the question. — Ryan, Jan 05 '20 at 18:01
@xenoid - Does that mean that the only differences are capabilities and syntaxes? — Ryan, Jan 05 '20 at 18:02
@Ryan Yes. -regex and -regextype are totally non-standard. There are find implementations that do not have these. Also, I imagine that GNU find could potentially be made to behave differently depending on what libraries it is linked with, at least with regard to performance (the BSD regular expression implementation in the C library may perform differently from the same routines on Linux, for example, and could also have incompatibilities with the Linux implementation). — Kusalananda, Jan 05 '20 at 18:25
@Kusalananda - Are they considered non-standard only because they are not implemented in other variations of find? — Ryan, Jan 06 '20 at 07:07
@Ryan They are considered non-standard since they are not part of the POSIX specification of the find utility. — Kusalananda, Jan 06 '20 at 07:14
From what I know of regexes, the most important factor is the complexity of your expression: the more precise it is, the better. If you use a lot of .* in PCRE or * in BRE, you may experience performance issues. I recently ran a benchmark for a personal use case where a simple .* at the beginning of my regex more than doubled the processing time. The different implementations you're referring to are well tested and have long been optimized. I bet there are few differences between them; you could find one faster than another for a specific expression, and the reverse for a different one. — Kiwy, Jan 06 '20 at 13:03

jubilatious1 · Answer 1 · 2021-09-15T20:00:40.547

-1

You've asked a question on a daunting subject.

The best resource I can point you to is a PDF/video entitled, "Everything You Know About Regexes Is Wrong" by Damian Conway, former computer science professor (Monash University, Australia) and well-known Perl developer and author:

https://slides.yowconference.com/yowwest2015/Conway-EverythingYouKnowAboutRegexesIsWrong.pdf

https://youtu.be/ubvSjW6Nyqk

In his presentation/PDF, Conway states there are "six major dialects of regular expression syntax" including BRE, ERE, EMACS, VIM, PCRE, and PSIX (the last standing for "PERL6", recently renamed Raku).

As an example, on PDF page 13 Conway shows that a regex written in the VIM/EMACS editor dialects thusly:

/abc\|abx/

is actually written the following way in ERE, PCRE, and PERL6 (i.e. RAKU):

/abc|abx/
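Applied to the find command from the question, the same contrast can be sketched as follows. This is a minimal illustration, assuming GNU find (for -regextype and -printf); the temporary directory and the file names abc, abx, and abd are made up for the example, and the variable names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU find, which provides -regextype and -printf.
# The scratch directory and file names are invented for illustration.
tmp=$(mktemp -d)
touch "$tmp/abc" "$tmp/abx" "$tmp/abd"

# emacs syntax (find's default): grouping is \( \) and alternation is \|
emacs_out=$(find "$tmp" -regextype emacs -regex '.*/\(abc\|abx\)' -printf '%f\n' | sort)

# posix-egrep syntax: grouping is ( ) and alternation is |
egrep_out=$(find "$tmp" -regextype posix-egrep -regex '.*/(abc|abx)' -printf '%f\n' | sort)

# In emacs syntax an unescaped | is a literal character, so this pattern
# only matches a file whose name is literally "abc|abx".
literal_out=$(find "$tmp" -regextype emacs -regex '.*/abc|abx' -printf '%f\n' | sort)

printf 'emacs:\n%s\n' "$emacs_out"
printf 'posix-egrep:\n%s\n' "$egrep_out"
printf 'emacs with bare |:\n%s\n' "$literal_out"

rm -rf "$tmp"
```

Note that -regex matches against the whole path, which is why each pattern starts with .*/ here.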

A number of other differences are noted; see the presentation/PDF for details.

Caveat emptor: the link https://www.gnu.org/software/gnulib/manual/html_node/Regular-expression-syntaxes.html has been cited here on StackExchange as an authoritative regex reference. However, that HTML page makes no mention of either the PERL6 (aka RAKU) regex dialect or even the widely distributed PCRE regex dialect.

edited Sep 15 '21 at 20:00

answered Sep 15 '21 at 19:20


"now-standard" implies that it's in POSIX. Perhaps you meant widely-used. – Thomas Dickey Sep 15 '21 at 19:31
Sure: it has an emulation/whatever for POSIX, but its native API is not POSIX. – Thomas Dickey Sep 15 '21 at 19:44
If GNU find uses the gnulib regex engines, and the manual for it is indeed automatically generated, presumably from something that also generates the matching engines, then I would say that's pretty much as authoritative a source as one can get for those dialects. And the question was about GNU find, not Perl, PCRE, Python or Raku. – ilkkachu Sep 15 '21 at 22:21
I suggest an edit to the title, then. GNU_Find is not mentioned in the title and my reading of the question was about general Regex dialects. FYI this question went unanswered for ~18 months and I was trying to take a 'civic-minded' approach here on U&L. – jubilatious1 Sep 19 '21 at 12:58
