10

Will grep [0-9] work in the same way as grep [:digit:]?

Jeff Schaller
D.Mar

4 Answers

17

No, [0-9] is not the same as [:digit:].

[0-9] matches the numerals 0 to 9.

[:digit:] matches 0 to 9, and numerals from non-Western writing systems as well (e.g. Eastern Arabic).
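One way to check this on your own system is sketched below. It assumes a POSIX shell and a grep that supports bracket classes; the variable names are only for illustration. The sample is U+0663 (ARABIC-INDIC DIGIT THREE) written as UTF-8 bytes. The [[:digit:]] result is deliberately left unstated, since it depends on the locale and the grep implementation in use.

```shell
# U+0663 ARABIC-INDIC DIGIT THREE as UTF-8 bytes (octal escapes for 0xD9 0xA3).
sample=$(printf '\331\243')

# grep -c prints the number of matching lines. It exits non-zero when
# nothing matches, so "|| true" keeps the script going in that case.
range_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[0-9]' || true)

# This one runs in whatever locale the shell currently has.
class_hits=$(printf '%s\n' "$sample" | grep -c '[[:digit:]]' || true)

printf '[0-9] in the C locale:         %s matching line(s)\n' "$range_hits"
printf '[[:digit:]] in current locale: %s matching line(s)\n' "$class_hits"
```

If the second count is 1, your locale treats that character as a digit; if it is 0, [[:digit:]] and [0-9] behave the same for this input on your system.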

Guido
  • Please, could you provide a specific name for, or a link (with a table) to, the (preferably) shortest such character set? – agc Apr 13 '16 at 17:50
  • Also are you sure that '[0-9]' always matches only 0 to 9? Perhaps in some other character set, '[0-9]' pulls in other chars. – agc Apr 13 '16 at 17:55
  • 3
    @agc [0-9] matches the literal ASCII characters "0", "1", …, "9", the same way that [A-Z] matches ASCII characters "A" through "Z". These patterns are restricted to the ASCII character set by definition. On the other hand, [:digit:] designates a broader character class, which also includes Unicode characters for digits in other languages. – Guido Apr 13 '16 at 18:07
  • 4
    [charX-charY] implies low level library code that looks up the character code for charX in the current character set, and counts from there up to the code for charY. [A-Z] on an ASCII system matches 26 codes: {65,...90}. EBCDIC matches 41 codes: {193,...233}. [:upper:] would always match only 26 codes on ASCII, EBCDIC, etc. Thankfully, [0-9] matches 10 codes with ASCII or EBCDIC codes -- what's uncertain is whether any character sets exist (or shall exist) for which [0-9] matches more than or less than 10 codes. If such a set exists, [:digit:] is useful. – agc Apr 13 '16 at 21:11
  • Abligh's answer below says it so much better, so "never mind..." – agc Apr 13 '16 at 21:20
  • 4
    @Guido Actually [A-Z] matches more than ASCII letters sometimes: http://unix.stackexchange.com/questions/15980/does-should-lc-collate-affect-character-ranges On the other hand, I think all extant locales have [0-9] match only the ASCII digits. – Gilles 'SO- stop being evil' Apr 13 '16 at 22:33
3

You can change [[:digit:]] to [0-9]; note that [:digit:] has to appear inside […]. Whether the two are equivalent depends on the encoding of the input. If it is ASCII, I don't think there will be a problem. With other encodings, the digits may not be contiguous, or the byte range may be different. You could also miss special numbers in other writing systems.
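A minimal sketch of the swap, assuming ASCII input and forcing the C locale so the result does not depend on your settings (the sample text and variable names are made up for illustration):

```shell
input='abc
room 42
x9y'

# The class goes inside a bracket expression: [[:digit:]], not [:digit:].
with_class=$(printf '%s\n' "$input" | LC_ALL=C grep '[[:digit:]]')

# The range form selects the same lines for this ASCII input.
with_range=$(printf '%s\n' "$input" | LC_ALL=C grep '[0-9]')

printf 'class:\n%s\n' "$with_class"
printf 'range:\n%s\n' "$with_range"
```

Both commands select the lines "room 42" and "x9y" here.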

muru
3

To be precise, [0-9] is only guaranteed to be equivalent to [:digit:] if:

  • the regexp parser supports [:digit:] (if it does not, then the existing [:digit:] probably doesn't do what you think it does), and

  • the input character set is one such as ASCII, where the only digits are the characters 0 - 9 and they are adjacent. This might not be true in (e.g.) Unicode, where the digits may include characters other than 0 - 9, or even in other 8-bit character sets where 0 - 9 may not be adjacent (as it happens, in EBCDIC the digits 0 - 9 are adjacent).
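The adjacency condition can be checked with a short sketch like the following. It assumes a POSIX shell whose printf accepts the leading-quote form ('0 gives the numeric code of the character 0); the variable names are illustrative only.

```shell
prev=
adjacent=yes

for d in 0 1 2 3 4 5 6 7 8 9; do
    # A leading single quote makes printf %d print the character's numeric code.
    code=$(printf '%d' "'$d")
    # Each digit's code should be exactly one more than the previous digit's.
    if [ -n "$prev" ] && [ "$code" -ne $((prev + 1)) ]; then
        adjacent=no
    fi
    prev=$code
done

printf 'digits adjacent: %s (code of 9 is %s)\n' "$adjacent" "$prev"
```

On an ASCII-based system the codes run from 48 to 57, so the loop reports that the digits are adjacent.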

Examples of Unicode exceptions are shown here. As you can see, the set of Unicode characters in the category 'Number, Decimal Digit' includes rather more than the 10 ASCII digits matched by [0-9]; it includes Arabic-Indic, Extended Arabic-Indic, N'Ko, etc.

More information on numerals in Unicode can be found here.

abligh
  • The ASCII digits are adjacent and in order in all extant locales, so [0-9] is a safe way to match them. – Gilles 'SO- stop being evil' Apr 13 '16 at 22:34
  • @Gilles they are adjacent in ASCII but not in all 8 bit character sets. And it is not true that all digits are adjacent in Unicode (as there are digits other than [0-9]) – abligh Apr 14 '16 at 05:10
1

'[:digit:]' is theoretically more portable, the advantage being that it does not depend on the local character set placing all the digits next to each other.

Related example: with '[:upper:]' vs '[A-Z]' there's no difference in ASCII, but there is a difference on an old IBM EBCDIC system, where '[A-Z]' would span 41 chars, not 26 (EBCDIC codes 193-233), and would therefore match EBCDIC "}\" et al.
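The counts above are just the sizes of the two code ranges. A small sketch, assuming an ASCII-based system for the grep part (an EBCDIC machine is not reproduced here, and the sample text is invented):

```shell
# Size of each code range: last - first + 1.
ascii_span=$((90 - 65 + 1))       # ASCII codes for A through Z
ebcdic_span=$((233 - 193 + 1))    # EBCDIC codes quoted above

printf 'ASCII span: %s, EBCDIC span: %s\n' "$ascii_span" "$ebcdic_span"

# In the C locale on an ASCII system, the range and the class agree.
sample='Hello
world
}'
range_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[A-Z]' || true)
class_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[[:upper:]]' || true)

printf '[A-Z]: %s line(s), [[:upper:]]: %s line(s)\n' "$range_count" "$class_count"
```

Only "Hello" is selected by either pattern here; the "}" line would only come into play where the range covers non-letter codes, as in the EBCDIC case described above.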

agc