Why is an explicit LANG=C required when searching for hex representations of characters in grep?

Question

When I want to recursively search TeX files for characters unsupported by my font, I typically start with a search for non-breaking spaces and zero-width spaces. These are difficult to type on the terminal command line, so I use their UTF-8 hexadecimal representations.

env LANG=C grep -obUaP "\xc2\xa0" $(find -name '*.tex')
env LANG=C grep -obUaP "\xe2\x80\x8b" $(find -name '*.tex')

Why do I need to explicitly set the LANG environment variable to C, as shown above with env LANG=C?

Notes

Using -U and -a simultaneously may seem erroneous, but this version of the manual states:

When type is ‘binary’, grep may treat non-text bytes as line terminators even without the -z (--null-data) option.

-a forces only line terminators to be line terminators (not so clear).

http://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html

@StephenKitt Good question. The -U changes the behavior of grep to allow for a bytewise search. As far as -a goes, the man page does not elaborate on the matter, but it says that -a makes grep process bytes as text, which is correct in this case. I am open for criticism, suggestions, or reasoning to improve my knowledge here. — Jonathan Komar, Oct 04 '17 at 13:12
I don’t know the answer in detail, I was just under the impression that -U (--binary) and -a (--text) are contradictory ;-). Or were you thinking of -u (--unix-byte-offsets)? — Stephen Kitt, Oct 04 '17 at 13:28
@StephenKitt The reason is that when reading in binary using -U, any non-text bytes including line terminators are treated as line terminators. -a forces only line terminators to be treated as line terminators. So using -a is merely a precaution to ensure that the line numbering is correct in the output when you know that that input is supposed to be text. — Jonathan Komar, Oct 04 '17 at 15:26

Jonathan Komar · Accepted Answer · 2017-10-05T07:07:23.507

My version of the grep manual does not include this, but the grep 3.0 manual elaborates on the topic:

Warning: The -a (--binary-files=text) option might output binary garbage, which can have nasty side effects if the output is a terminal and if the terminal driver interprets some of it as commands. On the other hand, when reading files whose text encodings are unknown, it can be helpful to use -a or to set ‘LC_ALL='C'’ in the environment, in order to find more matches even if the matches are unsafe for direct display.

From this answer: https://unix.stackexchange.com/a/87763/33386

In the C locale, characters are single bytes, the charset is ASCII [...]

This is probably why it helps when scanning text files of unknown encoding: it forces a single-byte ASCII character set.
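To make the byte-wise behaviour concrete, here is a minimal sketch, assuming GNU grep and a POSIX shell. The temporary file and the variable names (sample, nbsp, hits) are made up for illustration. It uses LC_ALL rather than LANG, because LC_ALL takes precedence over LANG and LC_CTYPE, so LANG=C alone has no effect if one of those is already set. As far as I understand, in a UTF-8 locale grep -P reads \xc2 as the code point U+00C2 rather than as the raw byte c2, which would explain why the original pattern only matches in the C locale. The sketch avoids that question by building the needle with printf and searching for it as a fixed string.

```shell
# Sample file with a no-break space (U+00A0, UTF-8 bytes c2 a0) between "a" and "b".
sample=$(mktemp)
printf 'a\302\240b\n' > "$sample"

# Build the needle from octal escapes so the raw bytes never have to be typed.
nbsp=$(printf '\302\240')

# LC_ALL=C makes grep compare single bytes.
# -F: fixed string, -o: print only the match, -b: prefix its byte offset,
# -a: treat the file as text even if it looks binary.
hits=$(LC_ALL=C grep -obaF -- "$nbsp" "$sample" | cut -d: -f1)
echo "byte offset of match: $hits"

rm -f "$sample"
```

The same needle can be passed to a recursive search over the .tex files. Because it is a fixed string made of real bytes, it does not depend on how -P interprets \x escapes.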
