4

Some code I am working with has a bunch of comments written in Japanese and I am working on translating them to English. Is there some way to "grep" for all lines containing Japanese characters or at least any non-ascii characters?

hugomg

4 Answers

2

Grepping for non-ASCII characters is easy: set a locale where only ASCII characters are valid, then search for characters outside the printable set.

LC_CTYPE=C grep '[^[:print:]]' myfile
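As a quick illustration (the file name and contents here are made up), the command flags the line that contains multibyte characters:

```shell
# Made-up demo file: one ASCII-only comment, one Japanese comment.
printf 'int x = 0; /* counter */\nint y = 0; /* カウンタ */\n' > demo.c

# In the C locale, [:print:] covers only printable ASCII, so the bytes
# of any multibyte (e.g. UTF-8) character fall outside it and match.
LC_ALL=C grep '[^[:print:]]' demo.c
```

LC_ALL=C is used here instead of LC_CTYPE=C so that an LC_ALL already set in your environment can't override the choice.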

If you want to search for Japanese characters specifically, it's a bit more complicated. With grep, you'll need to make sure that your LC_CTYPE locale setting matches the encoding of the files. You'll also need your LC_COLLATE setting to be Japanese if you want to use a character range expression. For example, on Linux (I determined the first and last characters that are considered Japanese by looking at the LC_COLLATE section of /usr/share/i18n/locales/ja_JP):

LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=ja_JP.UTF-8 egrep '[。-龥]' myfile

or if you want to stick to ASCII in your script

LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=ja_JP.UTF-8 egrep $'[\uff61-\u9fa5]' myfile

This range also includes a few punctuation characters that are used in English text, such as ×.

Perl has built-in features to classify characters. You can use the \p character class to match characters based on Unicode properties. Pass the command line switch -CSD to tell Perl that everything is in Unicode with the UTF-8 encoding.

perl -CSD -ne 'print if /\p{Hiragana}|\p{Katakana}/' myfile

If your files aren't encoded in UTF-8, you'll have to call binmode (or Encode's decode) explicitly to tell Perl about their encoding; that gets into more advanced perllocale territory. Alternatively, you can first recode the file into UTF-8.

Alternatively, in Perl, you can use numerical character ranges. For example, to search for characters in the Hiragana and Katakana Unicode blocks:

perl -CSD -ne 'print if /[\x{3040}-\x{30ff}]/' myfile
  • 1
    The grep [^[:print:]] version is also printing tab characters. Is there a way to avoid that? BTW you were right about the file encodings, turns out it was actually EUCJP – hugomg Apr 02 '15 at 01:00
  • 1
    @hugomg Add a tab inside the outer brackets: grep '[^[:print:]TAB]' myfile or grep '[^TAB[:print:]]' myfile or grep $'[^[:print:]\t]' myfile or grep $'[^\t[:print:]]' myfile (with an actual tab character instead of TAB). – Gilles 'SO- stop being evil' Apr 02 '15 at 01:11
  • In this answer @janis suggests using grep '[^[:print:][:space:]]' to handle tab and space characters. – Christian Long Aug 30 '16 at 23:29
1

Try this:

grep '[^[:print:][:space:]]'

(Depending on your locale settings, you may need to prefix the command with LANG=C.)
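A quick sanity check with a made-up file: a tab no longer triggers a match, while a multibyte character still does:

```shell
# Made-up sample: a line with a tab, and a line with UTF-8 あ
# (U+3042, bytes \343\201\202 in octal).
printf 'tab\there\nhas \343\201\202 here\n' > check.txt

# [:space:] absorbs tabs and other whitespace, so only bytes that are
# neither printable ASCII nor whitespace are flagged. LC_ALL=C is used
# here so no other locale variable can override it.
LC_ALL=C grep '[^[:print:][:space:]]' check.txt
```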

Janis
1

If you don't mind using perl, it has more extensive Unicode support in the form of character classes such as \p{Katakana} and \p{Hiragana}, which I don't think are currently available even in those versions of grep that provide some PCRE support. However, it does appear to require explicit UTF-8 decoding, e.g.

perl -MEncode -ne 'print if decode("UTF-8",$_) =~ /\p{Hiragana}/' somefile

To traverse directories like grep's -R, you could use the find command, something like

find -type f -exec perl -MEncode -ne 'print if decode("UTF-8",$_) =~ /\p{Hiragana}/' {} \;

or to mimic recursive grep's default filename:match labeled output format,

find -type f -exec perl -MEncode -lne 'printf "%s:%s\n",$ARGV,$_ if decode("UTF-8",$_) =~ /\p{Hiragana}/' {} \;
steeldriver
  • Sadly, none of these worked for me, maybe because the file is encoded in iso-8859-1 (though fiddling with LC_CTYPE and the parameter to decode didn't seem to help). I managed to find a solution to my problem in the thread you linked to though :) – hugomg Apr 01 '15 at 03:17
  • @hugomg A file encoded in ISO 8859-1 cannot contain any Japanese characters. It's probably UTF-8, EUCJP or a JIS variant. – Gilles 'SO- stop being evil' Apr 01 '15 at 23:40
0

My files were encoded in ISO 8859-1, so anything that tried to read the input in my default locale's encoding (UTF-8) would not recognize the Japanese characters. In the end I managed to solve my problem with the following command:

env LC_CTYPE=iso-8859-1 grep -nP '[\x80-\xff]' ./*

-P enables Perl-like syntax for character ranges (\x80, \xff).
-n prints line numbers next to the matching lines.

\x80 to \xff cover all of the "non-ASCII" byte values.

Setting the LC_CTYPE environment variable to iso-8859-1 makes grep read my files byte by byte and lets me detect any "extended ASCII" bytes as possible Japanese characters. If I use the default system encoding of UTF-8, grep exits with an "invalid UTF-8 byte sequence in input" error.
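To illustrate with a fabricated file containing a single high byte (this assumes GNU grep built with PCRE support for -P; the C locale serves the same byte-at-a-time purpose if no iso-8859-1 locale is installed):

```shell
# Fabricated file with one byte above 0x7f (octal \244 = 0xa4).
printf 'plain line\nhigh \244 byte\n' > bytes.txt

# With a single-byte locale, the PCRE range \x80-\xff matches any
# non-ASCII byte regardless of the file's real encoding.
env LC_CTYPE=C grep -nP '[\x80-\xff]' bytes.txt
```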

hugomg