9

I recently got into a friendly argument with Ghoti about what constitutes a regular expression in the comments to my answer to this question. I claimed that the following is a regular expression:

`[Rr]eading[Tt]est[Dd]ata`

Ghoti disagreed, claiming it is a file glob instead. The glob page on wikipedia claims that (emphasis mine):

Globs do not include syntax for the Kleene star which allows multiple repetitions of the preceding part of the expression; thus they are not considered regular expressions, which can describe a larger set of regular languages over any given finite alphabet.

However, there is no citation for this claim, indicating that it is just a particular wikipedia editor's opinion.

The The Single UNIX ® Specification, Version 2, states that a Basic Regular Expression (BRE) can even be a single character:

An ordinary character is a BRE that matches itself: any character in the supported character set, except for the BRE special characters listed in BRE Special Characters .

So, what is the definition of a regular expression in the *nix world, and does that definition exclude file globs?

terdon
  • 242,166
  • 6
    In theoretical CS, a regular expression is a description of a regular language, which is one that can be recognized by a finite automaton. In the Unix world, it's much more complicated, and there is no single definition. There are 2 regex dialects in the POSIX spec: extended and basic, which are used by tools like grep, sed, and awk. Vim uses its own variety, as does Perl. – jw013 Aug 27 '12 at 14:42
  • So, by that definition, a file glob is a BRE right? – terdon Aug 27 '12 at 14:52
  • 2
    Nope, a file glob is NOT a BRE - what makes you think it is? If you read the POSIX description of BRE and the POSIX description of globbing, you'll notice that they are not the same. For instance, * has two different meanings in BRE and globs. Note: I don't think the term glob is used anywhere in the POSIX spec - it's called Pattern Matching instead and is described in the shell language chapter. – jw013 Aug 27 '12 at 14:58

2 Answers2

10

As lk- said, the -name option of find will treat the argument as a glob, not a regular expression.

Whether a string is interpreted as a glob or a regex or just a plain string depends on what is being used to do the interpreting. It's a matter of context. The string in your example, [Rr]eading[Tt]est[Dd]ata can be evaluated in a number of different ways, but what it is depends on how you are using it. Use it as a glob, it's a glob. Use it as a regex, it's a regex. In the case of the question where this originated, the OP described the string as a regex. Therefore we can assume he was planning to interpret it as a regex.

A single character can also be a regex, absolutely. It can also be a string, and it can also be a glob. It could be interpreted as a byte or a tinyint, if you like. It all depends on context.

There are a number of specifications for regular expressions in various forms. BRE and ERE are well documented. PCRE adds scads of functionality. Many regex interpreters will implement, for example, "all of ERE and some of PCRE". Or they'll do ERE minus some feature. If you go by formal specifications, many many tools claim regex-support that turns out to be incorrect or incomplete. Knowing the details lets you adapt your solutions to the collection of functionality available within whatever tool is evaluating your regex.

So ... if you're looking for definitions that "exclude" globs, you're looking at this from the wrong perspective. What it is is determined by how you use it.

ghoti
  • 6,602
6

[Rr]eading[Tt]est[Dd]ata appears to be valid as both a glob and a regular expression, and I believe has the same "meaning" in both interpretations. However, the -name option of find will treat the argument as a glob, not a regular expression.

This distinction will matter if you supply an argument such as foo*, which is both a valid glob and a valid regular expression, but has different meaning depending on the interpretation:

If interpreted as a glob pattern, this will match foo, foobar, foo123, etc.

If interpreted as a regular expression, this will match fo, foo, foooooo, etc.

lk-
  • 3,723
  • Thanks, I see the difference between a glob pattern and a regex. What's the formal definition of a regex though? – terdon Aug 27 '12 at 14:20
  • 1
    I don't know if there is a single definition for "regular expressions" as the term is commonly used. There are different syntax specifications, such as POSIX regular expressions or Perl regular expressions, which include other "features" such as backreferences or lookaheads. These may no longer be regular expressions in the strictest sense (in the context of regular formal languages) but are still referred to as such. – lk- Aug 27 '12 at 14:38