5

How do I grep for a range of unicode characters?

I've seen an example for one character. How can I grep for characters by their unicode value?

I'm interested in a method other than the shell substitution method

because shell substitution seems to be a bit limited, e.g. it doesn't seem to work for a non-graphical unicode character like codepoint \u80.

I can get that method to work for a range, but only to an extent, since it won't cover non-graphical characters like \u80 (unicode codepoint 80)

$ echo grep [$'\u41'-$'\u45']
grep [A-E]

$ echo 4142434445|xxd -r -p
ABCDE

$ echo 4142434445|xxd -r -p | grep [$'\u41'-$'\u45']
ABCDE

The $'...' method uses substitution at the shell level, so it won't work to e.g. look for characters from \u0080-\uFFFF or \u0080 upwards, because if the shell can't produce a character, it won't work.

ugrep is available through apt-get on Debian, though not for the Ubuntu version I have on my VPS. And I have to test it some more.

NOTE: It turns out the shell substitution method does work for control characters, so it would even work for a range of them, or any unicode characters, as no doubt would ugrep.

Initially, when I tried shell substitution with grep, I was unknowingly feeding in the wrong bytes. For example, echo 418042 | xxd -r -p displays A▒B, so I thought "great, that worked", and I was trying to grep on that. So I was piping the wrong data to grep: 80 is not the UTF-8 encoding of \u80.

An echo of high characters, e.g. £, clearly shows the shell is outputting UTF-8:

$ echo £ | xxd -p
c2a30a

c2a3 is the UTF-8 encoding of £. When I fed in the correct bytes it worked, e.g. c280 is \u80, and even echo $'\u80' worked. This page is good for showing UTF-8 mappings to unicode codepoints: https://www.utf8-chartable.de/
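As a sketch of the corrected experiment (the byte values are taken from the UTF-8 table linked above): feed grep the two-byte UTF-8 encoding c2 80 of U+0080, rather than the lone byte 80, and the match works. Octal escapes are used so it runs with any POSIX printf.

```shell
# 'A', then c2 80 (the UTF-8 bytes for U+0080), then 'B'
# grep -c counts matching lines; a count of 1 confirms the match
printf 'A\302\200B\n' | grep -c "$(printf '\302\200')"
```

A count of 1 here corresponds to the working c280 case described above.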

While shell substitution does work, I'm glad to have an answer that uses a method other than shell substitution, as having an alternative is good.

barlop
  • egrep '[\u1600-\u16ff]' would be sort of the obvious thing to try, but not sure if will work – Barry Carter Jul 26 '22 at 14:47
  • "won't work to e.g. look for characters from \u0080-\uFFFF or \u0080 upwards, 'cos if the shell can't display a character, it won't work" -- hmm, what shell are you using? Seems to me Bash can interpret and print e.g. $'\u00e4' or $'\u2667' just fine, giving ä and ♧. – ilkkachu Jul 27 '22 at 06:36
  • 1
    @ilkkachu thanks.. I think I had control characters in mind. – barlop Jul 27 '22 at 09:57
  • @ilkkachu More to the point, the shell does not display characters. This is left up to the terminal emulator and depends on the fonts available to the terminal. – doneal24 Jul 27 '22 at 18:42
  • @doneal24, yes, the shell isn't involved if we actually display something. I think I was mostly thinking about the possibility that the shell might croak on having to just handle some characters, for whatever reason not related to actual displaying. – ilkkachu Jul 27 '22 at 20:17

3 Answers

5

In gnu-grep and similar you can use PCRE option -P and use \x{HHHH} syntax

$ grep -o -P '[\x{0410}-\x{042F}]+' # same as: grep -o -P '[А-Я]+'
абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕ

=> АБВГДЕ
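The \x{HHHH} syntax also works for plain ASCII ranges, which makes for an easy self-check (my own hypothetical example, not from the answer; it assumes a GNU grep built with PCRE support):

```shell
# [\x{0041}-\x{0043}] is the same class as [A-C]; -o prints only the matching run
printf 'abcABC\n' | grep -oP '[\x{0041}-\x{0043}]+'
```

This prints ABC, since only the uppercase run falls in the U+0041 to U+0043 range.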

JJoao
2

I spoke to some perl experts and got perl one-liner equivalents of grepping for a range of unicode characters.

$ echo £
£

So there's this concept of an ordinal, which is a numeric representation of a character (regardless of whether it's an encoding or a codepoint). The word ordinal is useful for describing what follows \x, which, depending on options, could be an encoding (i.e. stored bytes) or a unicode codepoint (a number, not encoded for storing in / writing to memory).

It's in bytes/octets, which can be represented in various bases.

%v is a format specifier of perl's printf:

$ perl -e 'printf "%vx\n",A'
41

$ perl -e 'printf "%vx\n",4'
34

%vd would give 52 (the decimal numeric representation of the character '4'); %vx is the hex representation.

UTF-8 encoding for £ is c2a3 https://www.utf8-chartable.de/

$ echo £ | xxd -p
c2a30a

When using \x with more than two digits, you have to use curly braces with it. \x{..}

$ echo £ | perl -CIO -ne 'print if /[\x0A]/'
£

$ echo £ | perl -CIO -ne 'print if /[\x{0080}-\x{FFFF}]/'
£

The -CIO converted the ordinal from a UTF-8 representation (c2a3) to a unicode codepoint representation (a3). Hence when using -CIO with \x, what follows \x should be a unicode codepoint representation.

The one below matches anything \u0080 and upwards, not stopping at \uFFFF. Just a regex thing there.

$ echo £ | perl -CIO -ne 'print if /[^\x00-\x7f]/'
£

If you remove the -CIO then you'd be matching UTF-8 bytes rather than unicode codepoint bytes, because without -CIO it's not converting/interpreting/decoding the UTF-8 encoded bytes to a unicode codepoint.

$ echo £ | perl -ne 'print if /\xc2/'
£

$ echo £ | perl -ne 'print if /\xa3/'
£

So in summary

$ echo £ | perl -CIO -ne 'print if /[\x{0080}-\x{FFFF}]/'
£

$ echo £ | perl -CIO -ne 'print if /[^\x00-\x7f]/'
£

$ echo £ | perl -CIO -ne 'print if /[^\x{00}-\x{7f}]/'
£

$ echo £ | perl -CIO -ne 'print if /[^\x{0000}-\x{007f}]/'
£

The perl -CIO option is documented by perldoc perlrun:

 -C [*number/list*]
         The -C flag controls some of the Perl Unicode features.
         ...
         I     1   STDIN is assumed to be in UTF-8
         O     2   STDOUT will be in UTF-8

and perldoc perlunicode and perldoc perlre mention \x{...}

barlop
2

Setting LC_COLLATE to C on GNU systems at least should guarantee the order be based on the Unicode code point in locales where the charmap is multibyte (such as UTF-8, GB18030), and on the byte value otherwise (which in locales using ASCII or ISO-8859-1 should also match the unicode code point order).

So:

LC_COLLATE=C grep $'[\u1111-\uaaaa]'

Should find lines that contain at least one character whose unicode code point is between U+1111 and U+AAAA (encoded as per the locale's charmap as indicated by the LC_CTYPE setting). That assumes $LC_ALL is not otherwise set (as it would take precedence over $LC_COLLATE).
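As a hypothetical check, U+1234 lies between U+1111 and U+AAAA; writing the UTF-8 bytes as octal printf escapes avoids depending on the shell's $'\u...' support:

```shell
# input: U+1234 (UTF-8 bytes e1 88 b4)
# range: U+1111 (e1 84 91) to U+AAAA (ea aa aa)
# grep -c prints the number of matching lines
printf '\341\210\264\n' | LC_COLLATE=C grep -c "$(printf '[\341\204\221-\352\252\252]')"
```

A count of 1 shows the code-point-ordered range matched the character.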

I would advise against letting the range span the invalid U+D800 - U+DFFF range. Codepoints in that range are reserved for UTF-16 surrogates, are not valid characters, and have otherwise been used by some tools to encode invalid bytes. Keep the bounds between U+0001 and U+D7FF, or between U+E000 and U+10FFFF.

You'll also want to make sure the bounds of your range correspond to valid characters in your locale. The behaviour for $'\uxxxx', where U+xxxx is not a character in the locale's charset varies between shells that support that $'\u...' operator. In some shells (including ksh93 where $'...' comes from $'\u...' being from zsh), $'\u...' only works in locales that use UTF-8 as the charmap (see output of locale charmap).