How do I grep for a range of unicode characters?
I've seen an example for a single character: "How to grep characters with their unicode value?"
I'm interested in a method other than the shell substitution method, because shell substitution seems a bit limited: it doesn't seem to work for non-graphical Unicode characters such as code point U+0080 (\u80).
I can get that method to work for a range, but only to an extent, since it won't cover non-graphical characters like \u80:
$ echo grep [$'\u41'-$'\u45']
grep [A-E]
$ echo 4142434445|xxd -r -p
ABCDE
$ echo 4142434445|xxd -r -p | grep [$'\u41'-$'\u45']
ABCDE
The $'...' method substitutes at the shell level, so it won't work to, e.g., look for characters in the range \u0080-\uFFFF, or from \u0080 upwards, because if the shell can't display a character, it won't work.
ugrep is available through apt-get on Debian, though not for the Ubuntu version on my VPS, and I still have to test it some more.
NOTE: It turns out the shell substitution method does work for control characters, so it would work for a range of them or for any Unicode characters, as no doubt would ugrep. Initially, when I tried shell substitution with grep, I was unknowingly feeding in the wrong bytes. For example, echo 418042 | xxd -r -p displays A▒B, so I thought that had worked, and I was trying to grep on that. So I was piping the wrong data to grep: the byte 80 is not the UTF-8 encoding of \u80. Echoing a high character such as £ clearly shows that the output is UTF-8: echo £ | xxd -p shows c2a30a, and c2a3 is UTF-8 for £ (the trailing 0a is the newline added by echo). When I fed in the correct bytes it worked: c280 is \u80, and even echo $'\u80' worked. This page is good for showing UTF-8 mappings to Unicode code points: https://www.utf8-chartable.de/
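The byte mix-up described in the note can be reproduced with a short sketch. It assumes bash 4.2 or later, GNU grep, and that the C.UTF-8 locale is installed; hex_of is a hypothetical helper introduced only for this illustration, and od is used in place of xxd.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash 4.2+, GNU grep, and an installed C.UTF-8 locale.
export LC_ALL=C.UTF-8

# hex_of is a hypothetical helper: print the bytes of its argument as hex.
hex_of() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; echo; }

u80=$(printf '\u0080')        # U+0080, encoded by printf for the current locale
hex_of "$u80"                 # c280 in a UTF-8 locale
hex_of "$(printf '\u00a3')"   # c2a3, the pound sign
hex_of "$(printf '\x80')"     # 80, a lone byte that is not valid UTF-8

# Correctly encoded input: the line contains U+0080, so grep counts 1 match.
printf 'A\xc2\x80B\n' | grep -c "$u80"

# The bytes produced by "echo 418042 | xxd -r -p": there is no U+0080 here.
# grep exits with status 1 when nothing matches, hence the "|| true".
printf 'A\x80B\n' | grep -c "$u80" || true
```

The pattern is built with printf at run time rather than with $'\u80', so that the encoding follows the locale set at the top of the script.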
While shell substitution does work, I'm glad I have an answer that uses a method other than shell substitution, as having an alternative is good.
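One possible alternative, not necessarily the one from the answer mentioned above, is to let the regex engine interpret the code points instead of the shell. The sketch below assumes GNU grep built with PCRE support (the -P option) and an installed C.UTF-8 locale; sample is a hypothetical helper that only generates test input.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU grep with PCRE support (-P) and an installed C.UTF-8 locale.
export LC_ALL=C.UTF-8

# sample is a hypothetical helper that prints three test lines:
# plain ASCII, one containing U+0080 (bytes c2 80), one containing £ (bytes c2 a3).
sample() { printf 'ABC\nA\xc2\x80B\ncost \xc2\xa3 5\n'; }

# Count the lines containing any code point from U+0080 to U+FFFF.
sample | grep -c -P '[\x{0080}-\x{FFFF}]'   # 2

# Print the lines containing a code point in a narrower range, U+00A0 to U+00FF.
sample | grep -P '[\x{00A0}-\x{00FF}]'
```

Because the pattern is single-quoted, the shell passes \x{0080} through untouched and grep resolves it, so nothing depends on what the shell or terminal can display.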
egrep '[\u1600-\u16ff]' would be sort of the obvious thing to try, but not sure if will work – Barry Carter Jul 26 '22 at 14:47
$'\u00e4' or $'\u2667' just fine, giving ä and ♧. – ilkkachu Jul 27 '22 at 06:36