I have the Unicode character ᚠ, represented by its Unicode code point 16A0, in a text file (the text file is encoded(?) as utf-8).
When I do grep '\u16A0' test.txt
I get no result. How do I grep that character?
You can use the ANSI-C quoting provided by your shell to replace backslash-escaped characters as specified by the ANSI C standard. This should work for any command, not just grep, in shells like Bash and Zsh:
grep $'\u16A0'
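For instance, since $'…' is ordinary shell quoting, the same string works with other tools, and if your shell's $'…' doesn't support \u you can spell out the raw UTF-8 bytes of U+16A0 (0xE1 0x9A 0xA0) instead. A quick sketch, assuming a UTF-8 locale and the test.txt from the question:
sed -n $'/\u16A0/p' test.txt      # print lines containing ᚠ, with sed
grep $'\xe1\x9a\xa0' test.txt     # match the raw UTF-8 bytes of U+16A0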
For some more complex examples, you might refer to this related question and its answers.
You could use ugrep as a drop-in replacement for grep to match the Unicode code point U+16A0:
ugrep '\x{16A0}' test.txt
It takes the same options as grep but offers vastly more features, such as:
ugrep searches UTF-8/16/32 input and other formats. Option -Q permits many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.
ugrep matches Unicode patterns by default (disabled with option -U). The regular expression pattern syntax is POSIX ERE compliant, extended with PCRE-like syntax. Option -P may also be used for Perl matching with Unicode patterns. (Both options are sketched below.)
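A brief sketch of those options (the file latin9.txt is hypothetical, and newer ugrep releases also spell the encoding option --encoding=FORMAT):
ugrep -U '\xe1\x9a\xa0' test.txt             # -U: match the raw UTF-8 bytes of U+16A0 rather than the code point
ugrep -P '\x{16A0}' test.txt                 # -P: Perl-compatible matching with Unicode patterns
ugrep -Q ISO-8859-15 '\x{20AC}' latin9.txt   # -Q: decode non-UTF-8 input before matching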
See ugrep on GitHub for details.
Using Raku (formerly known as Perl_6)
Sample Input:
https://www.cogsci.ed.ac.uk/~richard/unicode-sample-3-2.html
Match the character, print the line:
~$ raku -ne '.put if m/ \x[16A0] /;' file
ᚠ ᚡ ᚢ ᚣ ᚤ ᚥ ᚦ ᚧ ᚨ ᚩ ᚪ ᚫ ᚬ ᚭ ᚮ ᚯ ᚰ ᚱ ᚲ ᚳ ᚴ ᚵ ᚶ ᚷ ᚸ ᚹ ᚺ ᚻ ᚼ ᚽ ᚾ ᚿ ᛀ ᛁ ᛂ ᛃ ᛄ ᛅ ᛆ ᛇ ᛈ ᛉ ᛊ ᛋ ᛌ ᛍ ᛎ ᛏ ᛐ ᛑ ᛒ ᛓ ᛔ ᛕ ᛖ ᛗ ᛘ ᛙ ᛚ ᛛ ᛜ ᛝ ᛞ ᛟ ᛠ ᛡ ᛢ ᛣ ᛤ ᛥ ᛦ ᛧ ᛨ ᛩ ᛪ ᛫ ᛬ ᛭ ᛮ ᛯ ᛰ
#OR:
~$ raku -e 'lines.grep(/ \x[16A0] /).put;' file
ᚠ ᚡ ᚢ ᚣ ᚤ ᚥ ᚦ ᚧ ᚨ ᚩ ᚪ ᚫ ᚬ ᚭ ᚮ ᚯ ᚰ ᚱ ᚲ ᚳ ᚴ ᚵ ᚶ ᚷ ᚸ ᚹ ᚺ ᚻ ᚼ ᚽ ᚾ ᚿ ᛀ ᛁ ᛂ ᛃ ᛄ ᛅ ᛆ ᛇ ᛈ ᛉ ᛊ ᛋ ᛌ ᛍ ᛎ ᛏ ᛐ ᛑ ᛒ ᛓ ᛔ ᛕ ᛖ ᛗ ᛘ ᛙ ᛚ ᛛ ᛜ ᛝ ᛞ ᛟ ᛠ ᛡ ᛢ ᛣ ᛤ ᛥ ᛦ ᛧ ᛨ ᛩ ᛪ ᛫ ᛬ ᛭ ᛮ ᛯ ᛰ
Match the character, print the (whitespace-separated) "word":
~$ raku -ne 'for .words() { .put if m/ \x[16A0] / };' file
ᚠ
#OR:
~$ raku -e 'words.grep(/ \x[16A0] /).put;' file
ᚠ
Match the character, print the (exact) matches:
~$ raku -e 'given slurp() { put m:g/ \x[16A0] / };' file
ᚠ
#OR:
~$ raku -e 'slurp.match(:global,/ \x[16A0] /).put;' file
ᚠ
Match the character, count the matches, and print the count:
~$ raku -e 'given slurp() { put m:g/ \x[16A0] /.Int };' file
1
#OR:
~$ raku -e 'slurp.match(:global,/ \x[16A0] /).elems.put;' file
1
NOTES:
In Raku you can just as easily match against a Unicode name like \c[RUNIC LETTER FEHU FEOH FE F], which gives the same results as matching against \x[16A0] above.
In Raku you can just as easily match against a Unicode character like ᚠ, which gives the same results as matching against \x[16A0] or \c[RUNIC LETTER FEHU FEOH FE F] above.
In Raku, you can use Unicode variables (as well as Unicode operators). So this works:
~$ raku -e 'my $ᚠ = slurp.match(/ \x[16A0] /); say $ᚠ.Str.raku;' file
"ᚠ"
https://docs.raku.org/language/regexes#Unicode_properties
https://docs.raku.org/language/unicode
https://docs.raku.org
https://raku.org
With perl, pcre2grep, pcregrep, or any grep implementation that uses or can use PCRE or PCRE2, you can use \x{16A0} to match the character with value 0x16A0 (or just \xe9 for values <= 0xff).
For that value to be the Unicode code point, we need to tell them that the input has to be decoded from UTF-8. In PCRE/PCRE2, that's done by putting (*UTF) at the start of the pattern (mostly equivalent to passing PCRE_UTF8 to the regexp engine), though recent versions of GNU grep at least do that automatically when invoked in a locale that uses UTF-8 as its charmap. With pcregrep and pcre2grep, that can also be enabled with the -u option (see also -U in pcre2grep).
In perl, that's done via the -C (big char) option or the PERL_UNICODE environment variable: -C alone, short for -CSDL, decodes/encodes input/output as UTF-8 if the locale uses UTF-8 (like GNU grep does), while -CSD does so unconditionally; you can also explicitly decode from any encoding using, for instance, the Encode module.
In perl and PCRE2 (as used by pcre2grep or recent versions of GNU grep), you can also use \N{U+16A0}, and in perl its Unicode name: \N{RUNIC LETTER FEHU FEOH FE F}.
So:
perl -C -ne 'print if /\x{16A0}/'
perl -C -ne 'print if /\N{U+16A0}/'
perl -C -ne 'print if /\N{RUNIC LETTER FEHU FEOH FE F}/'
PERL_UNICODE=SD perl -ne 'print if /\x{16A0}/'
pcregrep -u '\x{16A0}'
pcregrep '(*UTF)\x{16A0}'
pcre2grep -u '\x{16A0}'
pcre2grep '(*UTF)\x{16A0}'
pcre2grep -U '\x{16A0}'
pcre2grep -u '\N{U+16A0}'
grep -P '\x{16A0}'
To match a character based on its Unicode value in input that is not encoded in UTF-8, those won't work, as they only work in UTF-8. In single-byte charsets, \xHH matches on the byte value (the code point in the corresponding charset, not in Unicode). For instance, in the en_GB.iso885915 locale, the Euro sign (U+20AC) is at 0xA4:
$ LC_ALL=en_GB.iso885915 luit
$ locale charmap
ISO-8859-15
$ printf %s € | od -An -vtx1
a4
$ echo € | grep -P '\x{20ac}'
grep: character code point value in \x{} or \o{} is too large
$ echo € | grep -P '\N{U+20ac}'
grep: \N{U+dddd} is supported only in Unicode (UTF) mode
$ echo € | grep -P '(*UTF)\N{U+20ac}'
$ echo € | grep -P '\xA4'
€
So the options would be to convert the text to UTF-8:
$ echo € | iconv -t utf-8 | LC_ALL=C.UTF-8 grep -P '\x{20ac}' | iconv -f utf-8
€
Or, if using perl, use -Mopen=locale instead of -C to tell it to decode/encode the input/output as per the locale's charset rather than as UTF-8:
$ echo € | perl -Mopen=locale -ne 'print if /\N{U+20ac}/'
€
Or do no decoding at all and instead match on the byte value of that character in the locale. For instance, with GNU printf, zsh's printf, or recent versions of bash's printf:
$ locale charmap
ISO-8859-15
$ printf '\u20ac' | od -An -vtx1
a4
$ echo € | grep -F -- "$(printf '\u20ac')"
€
In zsh, you can also use $'\u20ac', which will expand to the encoding of the character in the current locale at the time (and report an error if there's no such character in that locale).
$ echo € | grep -F -- $'\u20ac'
€
Several other shells have since copied that $'\uHHHH' from zsh, including ksh93, bash, mksh and a few ash-based shells, but with some unfortunate differences: in ksh, it is expanded in UTF-8 regardless of the locale, and in bash, it is expanded in the locale in effect at the time the code is read, as opposed to the time the code is run. So, for instance, in bash:
LC_CTYPE=C.UTF-8
{
  LC_CTYPE=en_GB.iso885915
  printf '\xA4\n' | grep -F -- $'\u20ac'
}
Or:
LC_CTYPE=C.UTF-8
euro() {
  grep -F -- $'\u20ac'
}
LC_CTYPE=en_GB.iso885915
printf '\xa4\n' | euro
Neither will work, because in both cases the $'\u20ac' is expanded to its UTF-8 encoding: the LC_CTYPE=en_GB.iso885915 assignment hadn't been run by the time the shell parsed the $'\u20ac'.
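One way around that in bash, as a sketch building on the printf approach above, is to let printf expand the \u20ac at run time, once the target locale is already in effect:
LC_CTYPE=C.UTF-8
{
  LC_CTYPE=en_GB.iso885915
  # printf expands \u20ac when it runs, so it emits the locale's 0xA4 byte
  printf '\xA4\n' | grep -F -- "$(printf '\u20ac')"
}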