
I have the Unicode character ᚠ, represented by its Unicode code point U+16A0, in a text file (the text file is encoded as UTF-8).

When I do grep '\u16A0' test.txt I get no result. How do I grep that character?

    @pLumo Yes, that worked. Thanks. What does the $ character before the regular expression do? – Stupid Jun 06 '19 at 14:39

4 Answers


You can use ANSI-C quoting provided by your shell, to replace backslash-escaped characters as specified by the ANSI C standard. This should work for any command, not just grep, in shells like Bash and Zsh:

grep $'\u16A0'

For some more complex examples, you might refer to this related question and its answers.

    Note that it's not ANSI C; the C language standard does not specify the functionality of shells. This quoting mechanism was invented by David Korn for the Korn shell. https://unix.stackexchange.com/a/65819/5132 – JdeBP Jun 06 '19 at 18:40
    Except this ruins the regex if there is a control character. – törzsmókus Jul 07 '21 at 08:59

You could use ugrep as a drop-in replacement of grep to match Unicode code point U+16A0:

ugrep '\x{16A0}' test.txt

It takes the same options as grep but offers vastly more features, such as:

ugrep searches UTF-8/16/32 input and other formats. Option -Q permits many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.

ugrep matches Unicode patterns by default (disabled with option -U). The regular expression pattern syntax is POSIX ERE compliant extended with PCRE-like syntax. Option -P may also be used for Perl matching with Unicode patterns.

See ugrep on GitHub for details.


Using Raku (formerly known as Perl_6)

Sample Input:

https://www.cogsci.ed.ac.uk/~richard/unicode-sample-3-2.html


Match the character, print the line:

~$ raku -ne '.put if m/ \x[16A0] /;'  file
ᚠ ᚡ ᚢ ᚣ ᚤ ᚥ ᚦ ᚧ ᚨ ᚩ ᚪ ᚫ ᚬ ᚭ ᚮ ᚯ ᚰ ᚱ ᚲ ᚳ ᚴ ᚵ ᚶ ᚷ ᚸ ᚹ ᚺ ᚻ ᚼ ᚽ ᚾ ᚿ ᛀ ᛁ ᛂ ᛃ ᛄ ᛅ ᛆ ᛇ ᛈ ᛉ ᛊ ᛋ ᛌ ᛍ ᛎ ᛏ ᛐ ᛑ ᛒ ᛓ ᛔ ᛕ ᛖ ᛗ ᛘ ᛙ ᛚ ᛛ ᛜ ᛝ ᛞ ᛟ ᛠ ᛡ ᛢ ᛣ ᛤ ᛥ ᛦ ᛧ ᛨ ᛩ ᛪ ᛫ ᛬ ᛭ ᛮ ᛯ ᛰ

#OR:

~$ raku -e 'lines.grep(/ \x[16A0] /).put;' file
ᚠ ᚡ ᚢ ᚣ ᚤ ᚥ ᚦ ᚧ ᚨ ᚩ ᚪ ᚫ ᚬ ᚭ ᚮ ᚯ ᚰ ᚱ ᚲ ᚳ ᚴ ᚵ ᚶ ᚷ ᚸ ᚹ ᚺ ᚻ ᚼ ᚽ ᚾ ᚿ ᛀ ᛁ ᛂ ᛃ ᛄ ᛅ ᛆ ᛇ ᛈ ᛉ ᛊ ᛋ ᛌ ᛍ ᛎ ᛏ ᛐ ᛑ ᛒ ᛓ ᛔ ᛕ ᛖ ᛗ ᛘ ᛙ ᛚ ᛛ ᛜ ᛝ ᛞ ᛟ ᛠ ᛡ ᛢ ᛣ ᛤ ᛥ ᛦ ᛧ ᛨ ᛩ ᛪ ᛫ ᛬ ᛭ ᛮ ᛯ ᛰ

Match the character, print the (whitespace-separated) "word":

~$ raku -ne 'for .words() { .put if m/ \x[16A0] / };'  file
ᚠ

#OR:

~$ raku -e 'words.grep(/ \x[16A0] /).put;' file
ᚠ

Match the character, print the (exact) matches:

~$ raku -e 'given slurp() { put m:g/  \x[16A0]  / };'   file
ᚠ

#OR:

~$ raku -e 'slurp.match(:global,/ \x[16A0] /).put;' file
ᚠ

Match the character, count the matches and print the count:

~$ raku -e 'given slurp() { put m:g/  \x[16A0]  /.Int };'   file
1

#OR:

~$ raku -e 'slurp.match(:global,/ \x[16A0] /).elems.put;' file
1


NOTES:

  1. In Raku you can just as easily match against a Unicode name like \c[RUNIC LETTER FEHU FEOH FE F], which gives the same results as matching against \x[16A0] above.

  2. In Raku you can just as easily match against a Unicode character like ᚠ, which gives the same results as matching against \x[16A0] or \c[RUNIC LETTER FEHU FEOH FE F] above.

  3. In Raku, you can use Unicode variables (as well as Unicode operators). So this works:

~$ raku -e 'my $ᚠ = slurp.match(/  \x[16A0]  /); say $ᚠ.Str.raku;'   file
"ᚠ"

https://docs.raku.org/language/regexes#Unicode_properties
https://docs.raku.org/language/unicode
https://docs.raku.org
https://raku.org


With perl, pcre2grep, pcregrep or any grep implementation that uses or can use PCRE or PCRE2, you can use \x{16A0} to match the character with value 0x16A0 (or just \xE9, for example, for values <= 0xFF).

For that value to be the Unicode code point, we need to tell them that the input is to be decoded from UTF-8. In PCRE/PCRE2, that's done by putting (*UTF) at the start of the pattern (mostly equivalent to passing the PCRE_UTF8/PCRE2_UTF option to the regexp engine), though recent versions of GNU grep at least do that automatically when invoked in a locale that uses UTF-8 as its charmap. With pcregrep and pcre2grep, that can also be enabled with the -u option (see also -U in pcre2grep).

In perl, that's via the -C ("big char") option or the PERL_UNICODE environment variable. -C alone, short for -CSDL, decodes/encodes input/output as UTF-8 only if the locale uses UTF-8 (like GNU grep does); -CSD does it unconditionally. You can also explicitly decode from any encoding using, for instance, the Encode module.

In perl and PCRE2 (as used by pcre2grep or recent versions of GNU grep), you can also use \N{U+16A0}. And in perl its Unicode name: \N{RUNIC LETTER FEHU FEOH FE F}.

So:

perl -C -ne 'print if /\x{16A0}/'
perl -C -ne 'print if /\N{U+16A0}/'
perl -C -ne 'print if /\N{RUNIC LETTER FEHU FEOH FE F}/'
PERL_UNICODE=SD perl -ne 'print if /\x{16A0}/'
pcregrep -u '\x{16A0}'
pcregrep '(*UTF)\x{16A0}'
pcre2grep -u '\x{16A0}'
pcre2grep '(*UTF)\x{16A0}'
pcre2grep -U '\x{16A0}'
pcre2grep -u '\N{U+16A0}'
grep -P '\x{16A0}'

To match a character based on its Unicode value in input that is not encoded in UTF-8, those won't work, as they only work in UTF-8. In single-byte charsets, \xHH will match the byte with that value, i.e. the character's code point in the corresponding charset, not in Unicode.

For instance, in the en_GB.iso885915 locale, where the euro sign (U+20AC) is at 0xA4:

$ LC_ALL=en_GB.iso885915 luit
$ locale charmap
ISO-8859-15
$ printf %s € | od -An -vtx1
 a4
$ echo € | grep -P '\x{20ac}'
grep: character code point value in \x{} or \o{} is too large
$ echo € | grep -P '\N{U+20ac}'
grep: \N{U+dddd} is supported only in Unicode (UTF) mode
$ echo € | grep -P '(*UTF)\N{U+20ac}'
$ echo € | grep -P '\xA4'
€

So the options would be to convert the text to UTF-8:

$ echo € | iconv -t utf-8 | LC_ALL=C.UTF-8 grep -P '\x{20ac}' | iconv -f utf-8
€

Or, if using perl, use -Mopen=locale instead of -C to tell it to decode/encode the input/output as per the locale's charset rather than as UTF-8:

$ echo € | perl -Mopen=locale -ne 'print if /\N{U+20ac}/'
€

Or do no decoding at all, and instead match on the byte value of that character in the locale.

For instance, with GNU printf, or the printf builtin of zsh or recent versions of bash:

$ locale charmap
ISO-8859-15
$ printf '\u20ac' | od -An -vtx1
 a4
$ echo € | grep -F -- "$(printf '\u20ac')"
€

In zsh, you can also use $'\u20ac' which will expand to the encoding of the character in the current locale at the time (and report an error if there's no such character in that locale).

$ echo € | grep -F -- $'\u20ac'
€

Several other shells have since copied that $'\uHHHH' from zsh, including ksh93, bash, mksh and a few ash-based shells, but with some unfortunate differences: in ksh, it's expanded in UTF-8 regardless of the locale, and in bash, it's expanded in the locale in effect at the time the code is read, as opposed to the time the code is run. So for instance, in bash:

LC_CTYPE=C.UTF-8
{
  LC_CTYPE=en_GB.iso885915
  printf '\xA4\n' | grep -F -- $'\u20ac'
}

Or:

LC_CTYPE=C.UTF-8
euro() {
  grep -F -- $'\u20ac'
}
LC_CTYPE=en_GB.iso885915
printf '\xa4\n' | euro

Neither works, because in both cases $'\u20ac' was expanded to its UTF-8 encoding: the LC_CTYPE=en_GB.iso885915 assignment hadn't yet been run by the time the $'\u20ac' was parsed by the shell.
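A sketch of a workaround for the bash pitfall, relying on the printf builtin's run-time expansion of \uHHHH shown earlier (the euro function name is just an illustration):

```shell
# Defer the expansion to run time: the printf builtin encodes \u20ac
# in the locale in effect when the function is *called*, not when its
# definition was parsed by the shell.
euro() {
  grep -F -- "$(printf '\u20ac')"
}

LC_CTYPE=en_GB.iso885915
printf '\xa4\n' | euro    # printf now emits the 0xA4 byte, so this matches
```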