Difference between [0-9], [[:digit:]] and \d

Question

In the Wikipedia article on Regular expressions, it seems that [[:digit:]] = [0-9] = \d.

Under what circumstances are they not equivalent? What is the difference?

After some research, I think one difference is that bracket expression [:expr:] is locale dependent.

Doesn't the Wikipedia article that you linked to answer your question? Different regular expression processors/engines support different syntaxes for character classes (among other things). — igal, Jan 02 '18 at 03:34
@igal wiki says there is difference but doesn't give much detail. I'm asking the detail, something like isaac, thrig said. I'm pretty interested in their difference in grep, sed, awk... whether GNU version or not. — harbinn, Jan 02 '18 at 07:01

score 75 · Answer 1 · 2021-06-10T03:25:30.387

Yes, it is [[:digit:]] ~ [0-9] ~ \d (where ~ means "approximately equal").
In most programming languages where it is supported:

\d ≡ `[[:digit:]]`            # (is identical to; it is shorthand for it).

\d is available in fewer tools than [[:digit:]] (it is available in grep -P but is not part of POSIX).

Unicode digits

There are many digits in Unicode, for example:

0123456789 # Hindu-Arabic numerals (ASCII)
٠١٢٣٤٥٦٧٨٩ # ARABIC-INDIC
۰۱۲۳۴۵۶۷۸۹ # EXTENDED ARABIC-INDIC/PERSIAN
߀߁߂߃߄߅߆߇߈߉ # NKO DIGIT
०१२३४५६७८९ # DEVANAGARI

Any of these may be matched by [[:digit:]] or \d, and in some cases even by [0-9].
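
As a rough cross-check of the scripts listed above, here is a minimal Python sketch. Python is an assumption here, since the answer itself uses shell tools and Perl, and the helper name describe is made up for illustration. It uses the standard unicodedata module to look up the zero digit of each script and checks whether Python's \d, which is Unicode-aware for str patterns by default, matches it.

```python
import re
import unicodedata

# The zero digit of each script listed above, written as escapes:
# ASCII, Arabic-Indic, Extended Arabic-Indic, NKo, Devanagari.
samples = ["0", "\u0660", "\u06f0", "\u07c0", "\u0966"]

def describe(ch):
    """Return (Unicode name, general category, whether \\d matches ch)."""
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            re.fullmatch(r"\d", ch) is not None)

for ch in samples:
    print(describe(ch))
```

Every sample reports the general category Nd (decimal digit), which is the property that Unicode-aware \d implementations are built on.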

POSIX

For the specific case of POSIX BRE or ERE:

\d is not supported (it is not in POSIX, but it is in GNU grep -P). [[:digit:]] is required by POSIX to correspond to the digit character class, which in turn is required by ISO C to be the characters 0 through 9 and nothing else. So only in the C locale are [0-9], [0123456789], \d (where supported) and [[:digit:]] guaranteed to mean exactly the same thing.

[0123456789] cannot be misinterpreted. [[:digit:]] is available in more utilities, and in some of them it means only [0123456789]. \d is supported by few utilities.

As for [0-9], the meaning of range expressions is only defined by POSIX in the C locale; in other locales it might be different (might be codepoint order or collation order or something else).

[0123456789]

The most basic option for all ASCII digits.
Always valid; (AFAICT) there is no known instance where it fails.

It matches only the ASCII digits: 0123456789.

[0-9]

It is generally believed that [0-9] matches only the ASCII digits 0123456789.
That is painfully false in some cases, for example on Linux systems (as of June 2020) in a locale other than "C":

Assume:

str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

Try grep to discover that it allows most of them:

$ echo "$str" | grep -o '[0-9]\+'
0123456789
٠١٢٣٤٥٦٧٨
۰۱۲۳۴۵۶۷۸
߀߁߂߃߄߅߆߇߈
०१२३४५६७८

sed has the same trouble: it should remove only 0123456789, but it removes almost all of the digits. That means it accepts most digits but not some of the nines (???):

$ echo "$str" | sed 's/[0-9]\{1,\}//g'
 ٩ ۹ ߉ ९

Even expr suffers from the same issue as sed:

$ expr "$str" : '\([0-9 ]*\)'             # also matching spaces.
0123456789 ٠١٢٣٤٥٦٧٨

And so does ed:

$ printf '%s\n' 's/[0-9]/x/g' '1,p' Q | ed -v <(echo "$str")
105
xxxxxxxxxx xxxxxxxxx٩ xxxxxxxxx۹ xxxxxxxxx߉ xxxxxxxxx९
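
One possible explanation for the missing nines, offered only as a simplified model and not as a description of what the C library actually does: if the locale's collation order sorts decimal digits by numeric value first and by script second, then the range 0-9 ends at ASCII 9, and every other script's nine sorts after it. The Python sketch below uses a made-up sort key (collation_key and in_range are hypothetical names) to reproduce the pattern seen in the output above.

```python
import unicodedata

def collation_key(ch):
    # Hypothetical collation: numeric value first, then code point.
    return (unicodedata.decimal(ch), ord(ch))

def in_range(ch, lo="0", hi="9"):
    """True if ch falls inside [lo-hi] under the hypothetical collation."""
    return collation_key(lo) <= collation_key(ch) <= collation_key(hi)

# Code points of the zero digit in each script used in $str:
# ASCII, Arabic-Indic, Extended Arabic-Indic, NKo, Devanagari.
zeros = [0x30, 0x660, 0x6F0, 0x7C0, 0x966]
digits = [chr(z + n) for z in zeros for n in range(10)]

# Digits that the modelled [0-9] range would NOT match.
left_over = [ch for ch in digits if not in_range(ch)]
print(" ".join(left_over))
```

Under this model the only digits left outside the range are the four non-ASCII nines, which is the same set that sed leaves behind.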

[[:digit:]]

In many languages (Perl, Java, Python, C), [[:digit:]] (and \d) can take on an extended meaning. For example, this Perl code matches all the digits from above:

$ str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
$ echo "$str" | perl -C -pe 's/[^\d]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

This is equivalent to selecting all characters that have the Unicode general category Nd (Number, decimal digit):

$ echo "$str" | perl -C -pe 's/[^\p{Nd}]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९

grep -P can reproduce this (the specific version of PCRE may have a different internal list of numeric code points than Perl):

$ echo "$str" | grep -oP '\p{Nd}+'
0123456789
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९

shells

Some shells may interpret a range as something other than plain ASCII order. For example, ksh93 (version (AT&T Research) 93u+ 2012-08-01, tested in May 2018):

$ LC_ALL=en_US.utf8 ksh -c 'echo "${1//[0-9]}"' sh "$str"
  ۹ ߀߁߂߃߄߅߆߇߈߉ ९

Now (June 2020), the same ksh93 package from Debian (same version: sh (AT&T Research) 93u+ 2012-08-01) gives:

$ LC_ALL=en_US.utf8 ksh -c 'echo "${1//[0-9]}"' sh "$str"
٩ ۹ ߉ ९

That seems to me a sure source of bugs waiting to happen.

In practice on POSIX systems, iswctype() and BRE/ERE/wildcards in POSIX utilities, [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard — Stéphane Chazelas, May 15 '18 at 19:39
I wasn't aware that perl's \d in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see (*UCP) as in GNU grep -Po '(*UCP)\d' or grep -Po '(*UCP)[[:digit:]] for classes to be based on Unicode properties. — Stéphane Chazelas, May 15 '18 at 19:46
I agree that the [:digit:] syntax would suggest that you want to use localization, that is whatever the user considers as being a digit. I never use [:digit:] because in practice that's the same as [0-9] and in any case, invariably I want to match on 0123456789, I never mean to match on ٠١٢٣٤٥٦٧٨٩, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about [:blank:] on the zsh ML. Those character classes are a bit of a mess. — Stéphane Chazelas, May 15 '18 at 20:38
Replying to my 2 year old comment above, largely thanks to Isaac's finding here, I now no longer use [0-9] (except in zsh/perl or in the C locale where I know that works as expected) as what that matches is more or less random, and use [[:digit:]] in POSIX utilities, or [0123456789] when I can't be sure. The situation is even worse with [a-z]. — Stéphane Chazelas, Jun 17 '20 at 07:42
@samshers If you are asking if they will work inside perl, then, yes, they will and they will match other languages digits. — , Sep 16 '20 at 05:35
"it ([0-9]) accepts most digits but not some nine's" - I think you already hint at the reason, collation order or something to that effect. Say your collation orders letters aAbBcC, then the letter range [a-c] would not include C, even though it contains A and B. Same goes for digits: ८ (hindi eight) comes before 9, but ९ (hindi nine) comes after (for some collations), so it's not inside the [0-9] range. — Silly Freak, Jun 10 '21 at 15:56

thrig · Answer 2 · 2018-01-02T15:18:39.033

This depends on how you define a digit. [0-9] tends to be just the ASCII ones (or possibly something else that is neither ASCII nor a superset of ASCII but has the same 10 digits with different bit representations, such as EBCDIC). \d, on the other hand, could either be just the plain digits (old versions of Perl, or modern versions of Perl with the /a regular expression flag enabled), or it could be a Unicode match of \p{Digit}, which is a rather larger set of digits than what [0-9] or /\d/a matches.

$ perl -E 'say "match" if 42 =~ m/\d/'
match
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/\d/'
match
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/\d/a'
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/[0-9]/'
$

See perldoc perlrecharclass for more information, or consult the documentation for the language in question to see how it behaves.
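
As a point of comparison outside Perl, here is a small Python sketch of the same four checks. It assumes Python 3, where \d on str patterns is Unicode-aware by default and the re.ASCII flag plays a role similar to Perl's /a. The helper name matches is made up for illustration.

```python
import re

BENGALI_FOUR = "\u09ea"  # BENGALI DIGIT FOUR, the character used in the Perl examples

def matches(pattern, text, flags=0):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text, flags) is not None

print(matches(r"\d", "42"))                    # ASCII digits, Unicode-aware \d
print(matches(r"\d", BENGALI_FOUR))            # non-ASCII digit, Unicode-aware \d
print(matches(r"\d", BENGALI_FOUR, re.ASCII))  # \d restricted to ASCII
print(matches(r"[0-9]", BENGALI_FOUR))         # explicit code point range
```

The first two calls print True and the last two print False, mirroring the Perl session above.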

But wait, there's more! The locale may also vary what \d matches, so \d could match fewer digits than the complete Unicode set, while (hopefully, usually) still including [0-9]. This is similar to the difference in C between isdigit(3) ([0-9]) and isnumber(3) ([0-9] plus whatever else the locale provides).

There may be calls that can be made to obtain the value of the digit, even if it is not [0-9]:

$ perl -MUnicode::UCD=num -E 'say num(4)'
4
$ perl -MUnicode::UCD=num -E 'say num("\N{U+09EA}")'
4
$
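
A comparable lookup exists in Python's standard library, shown here as a sketch and not as an equivalent of Unicode::UCD: unicodedata.decimal() returns the value of a single decimal digit, and int() accepts strings made of decimal digits from any script.

```python
import unicodedata

BENGALI_FOUR = "\u09ea"  # BENGALI DIGIT FOUR, as in the Perl example

print(unicodedata.decimal("4"))           # value of an ASCII digit
print(unicodedata.decimal(BENGALI_FOUR))  # value of a non-ASCII digit
print(int(BENGALI_FOUR * 2))              # int() parses such digits too
```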

I think isnumber() is a BSD thing, at least based on the man page it seems so — ilkkachu, Jan 02 '18 at 18:06
The /a flag is an specific limiter to reduce the list of Unicode digits to match only …the /a modifier can be used to force \d to match just the ASCII 0 through 9.. As such, it is forcing to match exactly the same and only [0-9]. — , Jun 04 '18 at 22:16

harbinn · Answer 3 · 2018-01-04T00:40:53.550

The different meanings of [0-9], [[:digit:]] and \d are covered in the other answers. Here I would like to add the differences in support between regex engine implementations.

            [[:digit:]]    \d
grep -E               ✓     ×
grep -P               ✓     ✓
sed                   ✓     ×
sed -E                ✓     ×

So [[:digit:]] always works, while \d depends on the tool. grep's manual mentions that [[:digit:]] is just 0-9 in the C locale.

PS1: If you know more, please expand the table.

PS2: GNU grep 3.1 and GNU sed 4.4 were used for testing.

edited Jan 04 '18 at 00:40

answered Jan 02 '18 at 13:45

1) There are many versions of grep and sed, with the biggest difference probably between the GNU versions vs. others. This answer might be more useful if it mentioned which version of grep and sed it refers to. Or what the source of that table is, for that matter. 2) That table might as well be transcribed to text, since it doesn't contain anything that requires it to be an image. – ilkkachu Jan 02 '18 at 14:01

@ilkkachu 1) The latest GNU grep 3.1 and GNU sed 4.4 were used for testing. 2) I don't know how to create a table. It seems that @muru has converted the table to a pretty text form. – harbinn Jan 02 '18 at 14:39

@harbinn Please edit that into your answer. – Dan D. Jan 03 '18 at 04:56

@DanD. the version info added. thx for attention – harbinn Jan 04 '18 at 00:43

Note that the Python built-in re module does not support [[:digit:]], but the add-on library regex does support it, so I would niggle a little at the "always works". It always works in POSIX-compliant situations. – Steve Barnes Jan 05 '18 at 19:08

does [:digit:] or [0-9] work with perl syntax? – samshers Sep 14 '20 at 10:09

score 6 · Answer 4 · answered Jan 03 '18 at 07:18

The theoretical differences have already been pretty well explained in the other answers, so it remains to explain the practical differences.

Here are some of the more common use cases for matching a digit:

One-shot data extraction

Often, when you want to crunch some numbers, the numbers themselves are in an awkwardly formatted text file. You want to extract them for use in your program. You can probably tell the number format (by looking at the file) and your current locale, so it's ok to use any of the forms, as long as it gets the job done. \d requires the fewest keystrokes, so it's very commonly used.

Input sanitizing

You have some untrusted user input (maybe from a web form), and you need to make certain it doesn't contain any surprises. Maybe you want to store it in a numeric field in a database, or use it as a parameter to a shell command to run on a server. In this case, you really want [0-9], since it's the most restrictive and predictable one.
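
A minimal sketch of that kind of check, written in Python as an assumption, since this answer names no language. The helper name is_ascii_number is hypothetical. It spells out the ten digits and uses fullmatch, so neither digits from other scripts nor a trailing newline slip through, whereas str.isdigit() on its own would accept non-ASCII digits.

```python
import re

# Explicit enumeration: no range, so no dependence on ordering rules.
ASCII_NUMBER = re.compile(r"[0123456789]+")

def is_ascii_number(text):
    """True only if text consists entirely of the digits 0123456789."""
    return ASCII_NUMBER.fullmatch(text) is not None

# "20\u0664" ends in ARABIC-INDIC DIGIT FOUR; "42\n" has a trailing newline.
for candidate in ["2024", "20\u0664", "42\n", ""]:
    print(repr(candidate), is_ascii_number(candidate), candidate.isdigit())
```

In Python, [0-9] compares code points and would behave the same here. The enumerated form is used only because it carries over unchanged to tools where ranges are locale-dependent.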

Data validation

You have a bit of data that you are not going to use for anything "dangerous", but it would be nice to know if it's a number. For example, your program allows the user to input an address, and you want to highlight a possible typo if the input doesn't contain a house number. In this case, you probably want to be as broad as possible, so [[:digit:]] is the way to go.

Those would seem to be the three most common use cases for digit matching. If you think I missed an important one, please drop a comment.

nice job, Is security problem related, such as ReDoS or others — frams, Jan 04 '18 at 00:56