5

How do I grep for a range of unicode characters?

I've seen an example for one character. How can I grep for characters by their unicode value?

I'm interested in a method other than the shell substitution method

because shell substitution seems to be a bit limited, e.g. it doesn't seem to work for a non-graphical unicode character like codepoint \u80.

I can get that method to work for a range, but only to an extent, since it won't cover non-graphical characters like \u80 (unicode codepoint 80)

$ echo grep [$'\u41'-$'\u45']
grep [A-E]

$ echo 4142434445|xxd -r -p
ABCDE

$ echo 4142434445|xxd -r -p | grep [$'\u41'-$'\u45']
ABCDE

The $'...' method uses substitution at the shell level, so it won't work to e.g. look for characters from \u0080-\uFFFF or \u0080 upwards, because if the shell can't produce a character, it won't work.

ugrep is available through apt-get on Debian, though not for the Ubuntu version I have on my VPS. And I have to test it some more.

NOTE: It turns out the shell substitution method does work for control characters, so it would even work for a range of them, or any unicode characters, as no doubt would ugrep.

Initially, when I tried shell substitution with grep, I was unknowingly feeding in the wrong bytes. For example, echo 418042 | xxd -r -p displays A▒B, so I thought "great, that worked", and I was trying to grep on that. So I was piping the wrong data to grep: 80 is not the UTF-8 encoding of \u80.

An echo of high characters, e.g. £, clearly shows the shell is outputting UTF-8:

$ echo £ | xxd -p
c2a30a

c2a3 is the UTF-8 encoding of £. When I fed in the correct bytes it worked, e.g. c280 is \u80, and even echo $'\u80' worked. This page is good for showing UTF-8 mappings to unicode codepoints: https://www.utf8-chartable.de/
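As a sketch of the corrected experiment (the byte values are taken from the UTF-8 table linked above): feed grep the two-byte UTF-8 encoding c2 80 of U+0080, rather than the lone byte 80, and the match works. Octal escapes are used so it runs with any POSIX printf.

```shell
# 'A', then c2 80 (the UTF-8 bytes for U+0080), then 'B'
# grep -c counts matching lines; a count of 1 confirms the match
printf 'A\302\200B\n' | grep -c "$(printf '\302\200')"
```

A count of 1 here corresponds to the working c280 case described above.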

While shell substitution does work, I'm glad to have an answer that uses a method other than shell substitution, as having an alternative is good.

barlop
  • egrep '[\u1600-\u16ff]' would be sort of the obvious thing to try, but not sure if will work – Barry Carter Jul 26 '22 at 14:47
  • "won't work to e.g. look for characters from \u0080-\uFFFF or \u0080 upwards, 'cos if the shell can't display a character, it won't work" -- hmm, what shell are you using? Seems to me Bash can interpret and print e.g. $'\u00e4' or $'\u2667' just fine, giving ä and ♧. – ilkkachu Jul 27 '22 at 06:36
  • 1
    @ilkkachu thanks.. I think I had control characters in mind. – barlop Jul 27 '22 at 09:57
  • @ilkkachu More to the point, the shell does not display characters. This is left up to the terminal emulator and depends on the fonts available to the terminal. – doneal24 Jul 27 '22 at 18:42
  • @doneal24, yes, the shell isn't involved if we actually display something. I think I was mostly thinking about the possibility that the shell might croak on having to just handle some characters, for whatever reason not related to actual displaying. – ilkkachu Jul 27 '22 at 20:17

3 Answers

5

In gnu-grep and similar you can use PCRE option -P and use \x{HHHH} syntax

$ grep -o -P '[\x{0410}-\x{042F}]+' # same as: grep -o -P '[А-Я]+'
абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕ

=> АБВГДЕ
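The \x{HHHH} syntax also works for plain ASCII ranges, which makes for an easy self-check (my own hypothetical example, not from the answer; it assumes a GNU grep built with PCRE support):

```shell
# [\x{0041}-\x{0043}] is the same class as [A-C]; -o prints only the matching run
printf 'abcABC\n' | grep -oP '[\x{0041}-\x{0043}]+'
```

This prints ABC, since only the uppercase run falls in the U+0041 to U+0043 range.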

JJoao
2

I spoke to some perl experts and got perl one-liner equivalents of grepping for a range of unicode characters.

$ echo £
£

So there's this concept of an ordinal, which is a numeric representation of a character (regardless of whether it's an encoding or a codepoint). The word ordinal is useful for describing what follows \x, which, depending on options, could be an encoding (i.e. stored bytes) or a unicode codepoint (a number, not encoded for storing in / writing to memory).

It's in bytes/octets, which can be represented in various bases.

%v is a format specifier of perl's printf:

$ perl -e 'printf "%vx\n",A'
41

$ perl -e 'printf "%vx\n",4'
34

%vd would give 52 (the decimal numeric representation of the character '4'); %vx is the hex representation.

UTF-8 encoding for £ is c2a3 https://www.utf8-chartable.de/

$ echo £ | xxd -p
c2a30a

When using \x with more than two digits, you have to use curly braces with it. \x{..}

$ echo £ | perl -CIO -ne 'print if /[\x0A]/'
£

$ echo £ | perl -CIO -ne 'print if /[\x{0080}-\x{FFFF}]/'
£

The -CIO converted the ordinal from a UTF-8 representation (c2a3) to a unicode codepoint representation (a3). Hence when using -CIO with \x, what follows \x should be a unicode codepoint representation.

The one below matches anything \u0080 and upwards, not stopping at \uFFFF. Just a regex thing there.

$ echo £ | perl -CIO -ne 'print if /[^\x00-\x7f]/'
£

If you remove the -CIO then you'd be matching UTF-8 bytes rather than unicode codepoint bytes, because without -CIO it's not converting/interpreting/decoding the UTF-8 encoded bytes to a unicode codepoint.

$ echo £ | perl -ne 'print if /\xc2/'
£

$ echo £ | perl -ne 'print if /\xa3/'
£

So in summary

$ echo £ | perl -CIO -ne 'print if /[\x{0080}-\x{FFFF}]/'
£

$ echo £ | perl -CIO -ne 'print if /[^\x00-\x7f]/'
£

$ echo £ | perl -CIO -ne 'print if /[^\x{00}-\x{7f}]/'
£

$ echo £ | perl -CIO -ne 'print if /[^\x{0000}-\x{007f}]/'
£

The perl -CIO option is documented by perldoc perlrun:

 -C [*number/list*]
         The -C flag controls some of the Perl Unicode features.
         ...
         I     1   STDIN is assumed to be in UTF-8
         O     2   STDOUT will be in UTF-8

and perldoc perlunicode and perldoc perlre mention \x{...}

barlop
2

Setting LC_COLLATE to C on GNU systems at least should guarantee the order be based on the Unicode code point in locales where the charmap is multibyte (such as UTF-8, GB18030), and on the byte value otherwise (which in locales using ASCII or ISO-8859-1 should also match the unicode code point order).

So:

LC_COLLATE=C grep $'[\u1111-\uaaaa]'

Should find lines that contain at least one character whose unicode code point is between U+1111 and U+AAAA (encoded as per the locale's charmap as indicated by the LC_CTYPE setting). That assumes $LC_ALL is not otherwise set (as it would take precedence over $LC_COLLATE).
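As a hypothetical check, U+1234 lies between U+1111 and U+AAAA; writing the UTF-8 bytes as octal printf escapes avoids depending on the shell's $'\u...' support:

```shell
# input: U+1234 (UTF-8 bytes e1 88 b4)
# range: U+1111 (e1 84 91) to U+AAAA (ea aa aa)
# grep -c prints the number of matching lines
printf '\341\210\264\n' | LC_COLLATE=C grep -c "$(printf '[\341\204\221-\352\252\252]')"
```

A count of 1 shows the code-point-ordered range matched the character.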

I would advise against letting the range span the invalid U+D800 - U+DFFF range. Codepoints in that range are reserved for UTF-16 surrogates, are not valid characters, and have otherwise been used by some tools to encode invalid bytes. Keep the bounds between U+0001 and U+D7FF, or between U+E000 and U+10FFFF.

You'll also want to make sure the bounds of your range correspond to valid characters in your locale. The behaviour for $'\uxxxx', where U+xxxx is not a character in the locale's charset varies between shells that support that $'\u...' operator. In some shells (including ksh93 where $'...' comes from $'\u...' being from zsh), $'\u...' only works in locales that use UTF-8 as the charmap (see output of locale charmap).