66

In the Wikipedia article on Regular expressions, it seems that [[:digit:]] = [0-9] = \d.

Under what circumstances are they not equivalent? What is the difference?

After some research, I think one difference is that the bracket expression [:expr:] is locale-dependent.

muru
harbinn
  • 4
    Doesn't the Wikipedia article that you linked to answer your question? Different regular expression processors/engines support different syntaxes for character classes (among other things). – igal Jan 02 '18 at 03:34
  • @igal wiki says there is difference but doesn't give much detail. I'm asking the detail, something like isaac, thrig said. I'm pretty interested in their difference in grep, sed, awk... whether GNU version or not. – harbinn Jan 02 '18 at 07:01

4 Answers

75

Yes, [[:digit:]] ~ [0-9] ~ \d (where ~ means "approximately equal").
In most programming languages that support it:

\d ≡ `[[:digit:]]`            # \d is identical to, and shorthand for, [[:digit:]]

\d is available in fewer places than [[:digit:]]: it works in grep -P, for example, but is not part of POSIX.

Unicode digits

There are many digits in Unicode, for example:

0123456789 # Hindu-Arabic numerals (ASCII)
٠١٢٣٤٥٦٧٨٩ # ARABIC-INDIC
۰۱۲۳۴۵۶۷۸۹ # EXTENDED ARABIC-INDIC/PERSIAN
߀߁߂߃߄߅߆߇߈߉ # NKO DIGIT
०१२३४५६७८९ # DEVANAGARI

All of these may be matched by [[:digit:]] or \d, and in some cases even by [0-9].


POSIX

For the specific POSIX BRE or ERE:
\d is not supported (it is not in POSIX, although GNU grep -P has it). [[:digit:]] is required by POSIX to correspond to the digit character class, which in turn is required by ISO C to be the characters 0 through 9 and nothing else. So only in the C locale do [0-9], [0123456789], \d and [[:digit:]] all mean exactly the same thing. [0123456789] cannot be misinterpreted; [[:digit:]] is available in more utilities and in some cases means only [0123456789]; \d is supported by few utilities.

As for [0-9], the meaning of range expressions is only defined by POSIX in the C locale; in other locales it might be different (might be codepoint order or collation order or something else).
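One way to take the locale out of the picture is to pin the command to the C locale, where POSIX does define what [0-9] means. The following is only a sketch using sed: the helper name strip_matches is made up for this illustration, and it assumes the pattern contains no "/".

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): delete everything that the
# regex "$2" matches in the string "$1", with sed forced into the C locale.
# Assumes "$2" contains no "/" (the s/// delimiter).
strip_matches() {
    printf '%s\n' "$1" | LC_ALL=C sed "s/$2//g"
}

str='0123456789 ٠١٢٣٤٥٦٧٨٩ ०१२३४५६७८९'

# In the C locale both forms should remove only the ASCII digits and
# leave the Arabic-Indic and Devanagari digits untouched.
strip_matches "$str" '[0-9]'
strip_matches "$str" '[[:digit:]]'
```

The price is that, in the C locale, sed works on bytes rather than characters, so this suits ASCII-only patterns like these and not patterns that must match whole multibyte characters.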

[0123456789]

The most basic option for all ASCII digits.
Always valid; as far as I can tell, there is no known instance where it fails.

It matches only the ASCII digits: 0123456789.

[0-9]

It is generally believed that [0-9] matches only the ASCII digits 0123456789.
That is painfully false in some instances, for example on Linux systems (as of June 2020) in a locale other than "C":

Assume:

str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

Try grep and discover that it matches most of them:

$ echo "$str" | grep -o '[0-9]\+'
0123456789
٠١٢٣٤٥٦٧٨
۰۱۲۳۴۵۶۷۸
߀߁߂߃߄߅߆߇߈
०१२३४५६७८

sed has the same trouble. It should remove only 0123456789 but removes almost all the digits. That means it accepts most digits but not some of the nines (???):

$ echo "$str" | sed 's/[0-9]\{1,\}//g'
 ٩ ۹ ߉ ९

Even expr suffers from the same issue as sed:

$ expr "$str" : '\([0-9 ]*\)'             # also matching spaces.
0123456789 ٠١٢٣٤٥٦٧٨

And also ed:

$ printf '%s\n' 's/[0-9]/x/g' '1,p' Q | ed -v <(echo "$str")
105
xxxxxxxxxx xxxxxxxxx٩ xxxxxxxxx۹ xxxxxxxxx߉ xxxxxxxxx९
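To check whether a particular system and locale behave this way, a small probe can be sketched as below. The function name and the choice of ٣ (ARABIC-INDIC DIGIT THREE) as the test character are my own. If the named locale is not installed, grep will typically run in the C locale instead, so a negative answer only means something for locales listed by `locale -a`.

```shell
#!/bin/sh
# Hypothetical probe: succeed if, in locale "$1", the range [0-9]
# matches a digit that is not one of 0123456789.
range_leaks_in() {
    printf '%s\n' '٣' | LC_ALL=$1 grep -q '[0-9]'
}

# en_US.UTF-8 is just an example locale name; substitute your own.
for loc in C en_US.UTF-8; do
    if range_leaks_in "$loc"; then
        echo "$loc: [0-9] also matches non-ASCII digits"
    else
        echo "$loc: [0-9] did not match the non-ASCII test digit"
    fi
done
```

What it prints for anything other than C depends on the locale data and the grep version, which is exactly the problem described above.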

[[:digit:]]

In many languages (Perl, Java, Python, C), [[:digit:]] (and \d) can take on an extended meaning. For example, this Perl code will match all the digits from above:

$ str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

$ echo "$str" | perl -C -pe 's/[^\d]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

This is equivalent to selecting all characters that have the Unicode property Nd (Number, decimal digit):

$ echo "$str" | perl -C -pe 's/[^\p{Nd}]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

grep -P can reproduce this (though the specific version of PCRE may have a different internal list of numeric code points than Perl):

$ echo "$str" | grep -oP '\p{Nd}+'
0123456789
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९
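The same holds for [[:digit:]] inside Perl, and Perl's /a modifier (Perl 5.14 or later) gives the narrow meaning back. A minimal sketch, assuming perl is installed: the function names are hypothetical, and -CSD is used instead of -C so that the UTF-8 handling of stdin and stdout does not depend on the locale variables.

```shell
#!/bin/sh
# Hypothetical helpers: keep only the characters Perl counts as digits.
# -CSD: treat stdin/stdout (and opened files) as UTF-8 regardless of locale.

# Unicode semantics: [[:digit:]] matches decimal digits of any script.
keep_unicode_digits() {
    printf '%s' "$1" | perl -CSD -pe 's/[^[:digit:]]//g'
}

# /a modifier: [[:digit:]] (and \d) are restricted to 0-9.
keep_ascii_digits() {
    printf '%s' "$1" | perl -CSD -pe 's/[^[:digit:]]//ag'
}

str='12 ٣٤ ५६'
keep_unicode_digits "$str"; echo
keep_ascii_digits "$str"; echo
```

The first call keeps the digits of all three scripts, the second keeps only 12.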

shells

Some implementations, ksh93 for example, may interpret a range as something other than plain ASCII order. When tested in May 2018 with version sh (AT&T Research) 93u+ 2012-08-01:

$ LC_ALL=en_US.utf8 ksh -c 'echo "${1//[0-9]}"' sh "$str"
  ۹ ߀߁߂߃߄߅߆߇߈߉ ९

Now (June 2020), the same ksh93 package from Debian (same version, sh (AT&T Research) 93u+ 2012-08-01) gives:

$ LC_ALL=en_US.utf8 ksh -c 'echo "${1//[0-9]}"' sh "$str"

٩ ۹ ߉ ९

That seems to me a sure source of bugs waiting to happen.
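For the common shell task of checking that a variable holds a plain decimal number, one way to avoid the range altogether is to spell the digits out in a case pattern. A sketch (the function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: succeed only if "$1" is a non-empty string made up
# exclusively of the characters 0123456789, whatever the locale says
# about ranges such as [0-9].
is_ascii_uint() {
    case $1 in
        '' | *[!0123456789]*) return 1 ;;   # empty, or contains some other character
        *) return 0 ;;
    esac
}

for candidate in 2020 '٢٠٢٠' 20x20 ''; do
    if is_ascii_uint "$candidate"; then
        echo "'$candidate': ASCII digits only"
    else
        echo "'$candidate': rejected"
    fi
done
```

Only 2020 is accepted. The explicit list is longer to type than [0-9], but it does not depend on how the shell orders characters in the current locale.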

  • In practice on POSIX systems, iswctype() and BRE/ERE/wildcards in POSIX utilities, [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard – Stéphane Chazelas May 15 '18 at 19:39
  • I wasn't aware that perl's \d in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see (*UCP) as in GNU grep -Po '(*UCP)\d' or grep -Po '(*UCP)[[:digit:]]' for classes to be based on Unicode properties. – Stéphane Chazelas May 15 '18 at 19:46
  • I agree that the [:digit:] syntax would suggest that you want to use localization, that is whatever the user considers as being a digit. I never use [:digit:] because in practice that's the same as [0-9] and in any case, invariably I want to match on 0123456789, I never mean to match on ٠١٢٣٤٥٦٧٨٩, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about [:blank:] on the zsh ML. Those character classes are a bit of a mess. – Stéphane Chazelas May 15 '18 at 20:38
  • 4
    Replying to my 2 year old comment above, largely thanks to Isaac's finding here, I now no longer use [0-9] (except in zsh/perl or in the C locale where I know that works as expected) as what that matches is more or less random, and use [[:digit:]] in POSIX utilities, or [0123456789] when I can't be sure. The situation is even worse with [a-z]. – Stéphane Chazelas Jun 17 '20 at 07:42
  • does [:digit:] or [0-9] work with perl syntax? – samshers Sep 14 '20 at 10:09
  • @samshers If you are asking if they will work inside perl, then, yes, they will and they will match other languages digits. –  Sep 16 '20 at 05:35
  • "it ([0-9]) accepts most digits but not some nine's" - I think you already hint at the reason, collation order or something to that effect. Say your collation orders letters aAbBcC, then the letter range [a-c] would not include C, even though it contains A and B. Same goes for digits: (hindi eight) comes before 9, but (hindi nine) comes after (for some collations), so it's not inside the [0-9] range. – Silly Freak Jun 10 '21 at 15:56
  • @SillyFreak Sure, yes, you are correct. –  Jun 10 '21 at 17:10
16

This depends on how you define a digit. [0-9] tends to be just the ASCII ones (or possibly, as with EBCDIC, an encoding that is neither ASCII nor a superset of ASCII but has the same 10 digits with different bit representations). \d, on the other hand, could either be just the plain digits (old versions of Perl, or modern versions of Perl with the /a regular expression flag enabled) or it could be a Unicode match of \p{Digit}, which is a rather larger set of digits than what [0-9] or /\d/a matches.

$ perl -E 'say "match" if 42 =~ m/\d/'
match
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/\d/'
match
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/\d/a'
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/[0-9]/'
$ 

See perldoc perlrecharclass for more information, or consult the documentation for the language in question to see how it behaves.

But wait, there's more! The locale may also vary what \d matches, so \d could match fewer digits than the complete Unicode set, while (hopefully, usually) still including [0-9]. This is similar to the difference in C between isdigit(3) ([0-9]) and isnumber(3) ([0-9] plus whatever else from the locale).

There may be calls that can be made to obtain the value of the digit, even if it is not [0-9]:

$ perl -MUnicode::UCD=num -E 'say num(4)'
4
$ perl -MUnicode::UCD=num -E 'say num("\N{U+09EA}")'
4
$ 
thrig
6

The different meanings of [0-9], [[:digit:]] and \d are covered in the other answers. Here I would like to add how support differs between regex engine implementations.

            [[:digit:]]    \d
grep -E               ✓     ×
grep -P               ✓     ✓
sed                   ✓     ×
sed -E                ✓     ×

So [[:digit:]] always works, while \d depends on the tool. grep's manual mentions that [[:digit:]] is just 0-9 in the C locale.

PS1: If you know more, please expand the table.

PS2: GNU grep 3.1 and GNU sed 4.4 were used for testing.
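To extend the table for the grep variants on your own system, a small probe along these lines might help. It is only a sketch: the function name is made up, and "supported" here just means that the pattern matches the line 5 and does not match the line d.

```shell
#!/bin/sh
# Hypothetical probe: probe_class PATTERN COMMAND [ARGS...]
# Succeeds if COMMAND (a grep-like tool reading stdin) matches "5" with
# PATTERN and does not match "d". Errors and warnings are discarded.
probe_class() {
    pattern=$1
    shift
    printf '5\n' | "$@" -q -- "$pattern" 2>/dev/null &&
        ! printf 'd\n' | "$@" -q -- "$pattern" 2>/dev/null
}

for tool in 'grep' 'grep -E' 'grep -P'; do
    for pattern in '[[:digit:]]' '\d'; do
        # $tool is deliberately unquoted so that "grep -E" splits into two words.
        if probe_class "$pattern" $tool; then mark=yes; else mark=no; fi
        printf '%-8s %-12s %s\n' "$tool" "$pattern" "$mark"
    done
done
```

The second check exists because some grep versions treat an unsupported \d as a plain d instead of rejecting it.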

harbinn
  • 2
    There are many versions of grep and sed, with the biggest difference probably between the GNU versions vs. others. This answer might be more useful if it mentioned which version of grep and sed it refers to. Or what the source of that table is, for that matter. 2) that table might as well be transcribed to text, since it doesn't contain anything that requires it to be an image – ilkkachu Jan 02 '18 at 14:01
  • @ilkkachu 1) latest GNU grep 3.1 and GNU 4.4 is used for test. 2) I don't how to create table. It seems that @ muru has converted the table to a pretty text form. – harbinn Jan 02 '18 at 14:39
  • @harbinn Please edit that into your answer. – Dan D. Jan 03 '18 at 04:56
  • @DanD. the version info added. thx for attention – harbinn Jan 04 '18 at 00:43
  • 1
    Note that the Python built-in re module does not support [[:digit:]], but the add-on library regex does, so I would niggle a little at the "always works". It always works in POSIX-compliant situations. – Steve Barnes Jan 05 '18 at 19:08