4

Some code I am working with has a bunch of comments written in Japanese and I am working on translating them to English. Is there some way to "grep" for all lines containing Japanese characters or at least any non-ascii characters?

hugomg

4 Answers

2

Grepping for non-ASCII characters is easy: set a locale where only ASCII characters are valid, then search for characters outside the printable set.

LC_CTYPE=C grep '[^[:print:]]' myfile
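As a quick illustration (the file name and contents here are made up), the command flags the line that contains multibyte characters:

```shell
# Made-up demo file: one ASCII-only comment, one Japanese comment.
printf 'int x = 0; /* counter */\nint y = 0; /* カウンタ */\n' > demo.c

# In the C locale, [:print:] covers only printable ASCII, so the bytes
# of any multibyte (e.g. UTF-8) character fall outside it and match.
LC_ALL=C grep '[^[:print:]]' demo.c
```

LC_ALL=C is used here instead of LC_CTYPE=C so that an LC_ALL already set in your environment can't override the choice.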

If you want to search for Japanese characters specifically, it's a bit more complicated. With grep, you'll need to make sure that your LC_CTYPE locale setting matches the encoding of the files. You'll also need your LC_COLLATE setting to be Japanese if you want to use a character range expression. For example, on Linux (I determined the first and last characters that are considered Japanese by looking at the LC_COLLATE section of /usr/share/i18n/locales/ja_JP):

LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=ja_JP.UTF-8 egrep '[。-龥]' myfile

or if you want to stick to ASCII in your script

LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=ja_JP.UTF-8 egrep $'[\uff61-\u9fa5]' myfile

This range also includes a few punctuation characters that are used in English text, such as ×.

Perl has built-in features to classify characters. You can use the \p character class to match characters based on Unicode properties. Pass the command line switch -CSD to tell Perl that everything is in Unicode with the UTF-8 encoding.

perl -CSD -ne 'print if /\p{Hiragana}|\p{Katakana}/' myfile

If your files aren't encoded in UTF-8, you'll have to call binmode (or Encode's decode) explicitly to tell Perl about their encoding; that gets into more advanced perllocale territory. Alternatively, you can first recode the file into UTF-8.

Alternatively, in Perl, you can use numerical character ranges. For example, to search for characters in the Hiragana and Katakana Unicode blocks:

perl -CSD -ne 'print if /[\x{3040}-\x{30ff}]/' myfile
  • 1
    The grep [^[:print:]] version is also printing tab characters. Is there a way to avoid that? BTW you were right about the file encodings, turns out it was actually EUCJP – hugomg Apr 02 '15 at 01:00
  • 1
    @hugomg Add a tab inside the outer brackets: grep '[^[:print:]TAB]' myfile or grep '[^TAB[:print:]]' myfile or grep $'[^[:print:]\t]' myfile or grep $'[^\t[:print:]]' myfile (with an actual tab character instead of TAB). – Gilles 'SO- stop being evil' Apr 02 '15 at 01:11
  • In this answer @janis suggests using grep '[^[:print:][:space:]]' to handle tab and space characters. – Christian Long Aug 30 '16 at 23:29
1

Try this:

grep '[^[:print:][:space:]]'

(Depending on your locale settings, you may need to prefix the command with LANG=C.)
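A quick sanity check with a made-up file: a tab no longer triggers a match, while a multibyte character still does:

```shell
# Made-up sample: a line with a tab, and a line with UTF-8 あ
# (U+3042, bytes \343\201\202 in octal).
printf 'tab\there\nhas \343\201\202 here\n' > check.txt

# [:space:] absorbs tabs and other whitespace, so only bytes that are
# neither printable ASCII nor whitespace are flagged. LC_ALL=C is used
# here so no other locale variable can override it.
LC_ALL=C grep '[^[:print:][:space:]]' check.txt
```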

Janis
1

If you don't mind using perl, it has more extensive Unicode support in the form of character classes such as \p{Katakana} and \p{Hiragana}, which I don't think are currently available even in those versions of grep that provide some PCRE support. However, it does appear to require explicit UTF-8 decoding, e.g.

perl -MEncode -ne 'print if decode("UTF-8",$_) =~ /\p{Hiragana}/' somefile

To traverse directories like grep's -R, you could use the find command, something like

find -type f -exec perl -MEncode -ne 'print if decode("UTF-8",$_) =~ /\p{Hiragana}/' {} \;

or to mimic recursive grep's default filename:match labeled output format,

find -type f -exec perl -MEncode -lne 'printf "%s:%s\n",$ARGV,$_ if decode("UTF-8",$_) =~ /\p{Hiragana}/' {} \;
steeldriver
  • Sadly, none of these worked for me, maybe because the file is encoded in iso-8859-1 (though fiddling with LC_CTYPE and the parameter to decode didn't seem to help). I managed to find a solution to my problem in the thread you linked to though :) – hugomg Apr 01 '15 at 03:17
  • @hugomg A file encoded in ISO 8859-1 cannot contain any Japanese characters. It's probably UTF-8, EUCJP or a JIS variant. – Gilles 'SO- stop being evil' Apr 01 '15 at 23:40
0

My files were encoded in ISO 8859-1, so anything that tried to read the input in my default locale's encoding (UTF-8) would not recognize the Japanese characters. In the end I managed to solve my problem with the following command:

env LC_CTYPE=iso-8859-1 grep -nP '[\x80-\xff]' ./*

-P enables Perl-like syntax for character ranges (\x80, \xff).
-n prints line numbers next to the matching lines.

\x80 to \xff cover all of the "non-ASCII" byte values.

Setting the LC_CTYPE environment variable to iso-8859-1 makes grep read my files byte by byte and lets me detect any "extended ASCII" bytes as possible Japanese characters. If I use the default system encoding of UTF-8, grep exits with an "invalid UTF-8 byte sequence in input" error.
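To illustrate with a fabricated file containing a single high byte (this assumes GNU grep built with PCRE support for -P; the C locale serves the same byte-at-a-time purpose if no iso-8859-1 locale is installed):

```shell
# Fabricated file with one byte above 0x7f (octal \244 = 0xa4).
printf 'plain line\nhigh \244 byte\n' > bytes.txt

# With a single-byte locale, the PCRE range \x80-\xff matches any
# non-ASCII byte regardless of the file's real encoding.
env LC_CTYPE=C grep -nP '[\x80-\xff]' bytes.txt
```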

hugomg