39

I am using the following command to grep a character set range, and I would like to write it with hexadecimal code 0900 (instead of अ) to 097F (instead of व). How can I use hexadecimal code in place of अ and व?

bzcat archive.bz2 | grep -v '<[अ-व]*\s' | tr '[:punct:][:blank:][:digit:]' '\n' | uniq | grep -o '^[अ-व]*$' | sort -f | uniq -c | sort -nr | head -50000 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > hindi.xml

I get the following output:

    <w f="399651">और</w>
    <w f="264423">एक</w>
    <w f="213707">पर</w>
    <w f="74728">कर</w>
    <w f="44281">तक</w>
    <w f="35125">कई</w>
    <w f="26628">द</w>
    <w f="23981">इन</w>
    <w f="22861">जब</w> 
    ...

I just want to use hexadecimal code instead of अ and व in the above command.

If using hexadecimal code is not possible at all, can I use Unicode instead of hexadecimal code for the character set ('अ-व')?

I am using Ubuntu 10.04.

Jeff Schaller

4 Answers

30

Look at grep: Find all lines that contain Japanese kanjis.

Text is usually encoded in UTF-8, so you have to use the hex values of the bytes used in the UTF-8 encoding.

grep "["$'\xe0\xa4\x85'"-"$'\xe0\xa4\xb5'"]"

and

grep '[अ-व]'

are equivalent, and both perform locale-based bracket-expression matching. That is, matching depends on the sorting rules of the Devanagari script: it is NOT "any char between \u0905 and \u0935" but instead "anything sorting between Devanagari A and Devanagari VA". The two may differ.

($'...' is the "ANSI-C escape string" syntax for bash, ksh, and zsh. It is just an easier way to type the characters. You can also use the \uXXXX and \UXXXXXXXX escapes to directly ask for code points in bash and zsh.)

On the other hand, you have this (note -P):

grep -P "\xe0\xa4[\x85-\xb5]"

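that will do a binary matching with those byte values. (In a UTF-8 locale, -P reads \xHH as a code point rather than a byte, so prefix the command with LC_ALL=C to get byte matching.)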
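The byte-level idea can also be sketched without -P. This is only an illustration, assuming GNU grep: the sample lines and the variable names are made up, and the bytes are written as octal escapes so that any POSIX printf can produce them. Forcing LC_ALL=C makes grep treat each byte as a character, so the bracket range compares raw byte values instead of collation order.

```shell
# Sample input (assumed for illustration): अ (U+0905), व (U+0935),
# ह (U+0939) and a plain ASCII line, written as UTF-8 bytes in octal.
sample=$(printf '\340\244\205\n\340\244\265\n\340\244\271\nabc\n')

# Bytes e0 a4 followed by one byte in the range 85..b5.
pattern=$(printf '\340\244[\205-\265]')

# Count matching lines; LC_ALL=C forces byte-wise matching.
matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -c "$pattern")
echo "$matches"   # prints 2: अ and व match, ह (last byte b9) does not
```

Note that this only covers code points whose UTF-8 form starts with e0 a4; a range that crosses into e0 a5 (up to U+097F) would need a second alternative.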

Mr R
7

If shell escaping is enough you can use the $'\xHH' syntax like this:

grep -v "<["$'\x09\x00'"-"$'\x09\x7F'"]*\s"

Is that enough for your use case?

  • echo 'अ-व' | hd gives me e0 a4 85 - e0 a4 b5 – enzotib Aug 26 '11 at 14:30
  • Indeed the OP gave unicode values, not hexadecimal dumps in UTF-8 encoding :-/ Since grep is not linked with any lib, I guess it's not possible to have the range conversion be performed by grep :-/ – Stéphane Gimenez Aug 26 '11 at 14:48
  • Btw, zsh is able to interpret "\u0900" and "\u097F", but the behavior will rely on the UTF-8 encoded range being continuous (probably it is). – Stéphane Gimenez Aug 26 '11 at 14:49
  • No, grep -v "<["$'\x09\x00'"-"$'\x09\x7F'"]*\s" gives the following output: x F FF FFFFFF FFFF xx FFF xxx .... This is not expected. :( Can I use unicode instead of hexadecimal code or character set ('अ-व')? – Dhrubo Bhattacharjee Aug 28 '11 at 02:54
7

The "hexadecimal" value 0x0900 you wrote is exactly the value of the UNICODE code point which is also in hexadecimal.

hexadecimal code 0900 (instead of अ)

I believe that what you mean to say is the hexadecimal UNICODE code point U+0905.

The character at U+0900 is not the one you used: ऀ.
The character you used, अ, is U+0905, part of this Unicode page, or listed at this page.

In bash (installed by default in Ubuntu), or directly with the program at /usr/bin/printf (but not with sh printf), a Unicode character can be produced with:

$ printf '\u0905'
अ
$ /usr/bin/printf '\u0905'
अ

However, that character, which comes from a code point number, can be represented by several byte streams depending on which encoding is used.
It should be obvious that U+0905 is 0x09 0x05 in big-endian UTF-16 (UCS-2, etc.)
and 0x00 0x00 0x09 0x05 in big-endian UTF-32.
It may not be obvious, but in UTF-8 it is represented by 0xe0 0xa4 0x85:

$ /usr/bin/printf '\u0905' | od -vAn -tx1
e0 a4 85

That is the output provided the locale of your console is something similar to en_US.UTF-8.
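The arithmetic behind those three bytes can be sketched in plain shell. The helper name utf8_hex3 is hypothetical, and the sketch only handles code points from U+0800 to U+FFFF (which includes the Devanagari block): the 16 bits of the code point are split 4/6/6 and placed behind the UTF-8 prefixes 1110, 10 and 10.

```shell
# utf8_hex3 (hypothetical helper): print the three UTF-8 bytes, in hex,
# of a code point in the range U+0800..U+FFFF.
utf8_hex3() {
    cp=$(( $1 ))
    b1=$(( 0xE0 | (cp >> 12) ))          # 1110xxxx: top 4 bits
    b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))  # 10xxxxxx: middle 6 bits
    b3=$(( 0x80 | (cp & 0x3F) ))         # 10xxxxxx: low 6 bits
    printf '%02x %02x %02x\n' "$b1" "$b2" "$b3"
}

utf8_hex3 0x0905   # अ -> e0 a4 85
utf8_hex3 0x0935   # व -> e0 a4 b5
```

Unlike printf '\u0905', this does not depend on the locale, because it never asks the shell to encode a character; it only computes byte values.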

And I am talking about the shell because it is the one that transforms a string into what the application receives. This:

grep "$(printf '\u0905')" file

makes grep "see" the character you need.
To understand the line above you may use echo:

$ echo grep "$(printf '\u0905')" file
grep अ file

Then, we can build a character range, as you request:

$ echo grep "$(printf '[\u0905-\u097f]')" file
grep [अ-ॿ] file

That answers your question:

How can I use hexadecimal code in place of अ and व?

  • This is by far the best answer---it clearly addresses the issue of unicode points' representations in the shell and shows how to go back and forth between them hex codes. – stefano Mar 11 '19 at 16:09
4

We wanted to convert the non-ASCII open and close double quotes to regular double quotes ("), and the non-ASCII single quote to a regular single quote (').

To see them in the file (Ubuntu bash shell):

$ grep -P "\x92" infile.txt  (single)
$ grep -P "\x93" infile.txt  (open double)
$ grep -P "\x94" infile.txt  (close double)

To translate them:

$ /bin/sed "s/\x92/'/g" a.txt > b.txt
$ /bin/sed 's/\x93/"/g' b.txt > c.txt
$ /bin/sed 's/\x94/"/g' c.txt > d.txt
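The three passes can probably be folded into one. The sketch below swaps sed for tr, which is a different tool from the one used above, and it assumes the file is in a single-byte encoding such as Windows-1252, where 0x92, 0x93 and 0x94 are the curly quotes. The sample string is made up, and the bytes are given as octal escapes (\222 \223 \224) because POSIX tr understands those.

```shell
# Sample input (assumed): It<0x92>s <0x93>ok<0x94>
# Map byte 0x92 to ' and bytes 0x93/0x94 to " in a single pass.
fixed=$(printf 'It\222s \223ok\224\n' | LC_ALL=C tr '\222\223\224' "'\"\"")
echo "$fixed"   # prints: It's "ok"
```

On a real file this would read `LC_ALL=C tr '\222\223\224' "'\"\"" < a.txt > d.txt`, using the file names from the commands above. It should not be run on UTF-8 text, where these byte values occur as parts of multi-byte characters.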
slm