When I want to recursively search TeX files for characters unsupported by my font, I typically start by searching for non-breaking spaces and zero-width spaces. These are difficult to type on the command line, so I use their UTF-8 hexadecimal representations.
env LANG=C grep -obUaP "\xc2\xa0" $(find -name '*.tex')
env LANG=C grep -obUaP "\xe2\x80\x8b" $(find -name '*.tex')
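For anyone who wants to reproduce this, here is a minimal sketch of the same two searches against a throwaway file. The scratch directory, the file name `sample.tex`, and the variable names are assumptions made up for illustration. It also swaps in `find -exec ... {} +` so that file names containing spaces survive, adds `-H` so the file name is always printed, and uses `LC_ALL=C` because `LC_ALL` takes precedence over `LANG` when both are set. It assumes a GNU grep built with `-P` support.

```shell
# Hypothetical scratch directory and file name, for illustration only.
tmpdir=$(mktemp -d)

# One line containing a no-break space (U+00A0, bytes C2 A0) and a
# zero-width space (U+200B, bytes E2 80 8B). Octal escapes are used
# because POSIX printf does not guarantee \x escapes.
printf 'a\302\240b\342\200\213c\n' > "$tmpdir/sample.tex"

# Same grep options as above, but the files are handed over by find itself.
nbsp_hits=$(find "$tmpdir" -name '*.tex' \
    -exec env LC_ALL=C grep -HobUaP '\xc2\xa0' {} +)
zwsp_hits=$(find "$tmpdir" -name '*.tex' \
    -exec env LC_ALL=C grep -HobUaP '\xe2\x80\x8b' {} +)

# Each hit has the form file:byte-offset:matched-bytes.
printf '%s\n' "$nbsp_hits" "$zwsp_hits"

rm -rf "$tmpdir"
```

The byte offsets printed by `-b` together with `-o` are zero-based and point at the match itself, so the no-break space in this sample should be reported at offset 1 and the zero-width space at offset 4.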
Why do I need to explicitly set the LANG environment variable to C, as shown above (env LANG=C)?
Notes
Using -U and -a simultaneously may seem erroneous, but this version of the manual states that:
When type is ‘binary’, grep may treat non-text bytes as line terminators even without the -z (--null-data) option.
-a forces only line terminators to be line terminators (not so clear).
http://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html
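To make the effect of -a a little more concrete, here is a small sketch under stated assumptions: the temporary file, its contents, and the variable names are invented for illustration. It only shows that -a (equivalent to --binary-files=text) makes grep print a matching line from a file it would otherwise classify as binary. It does not demonstrate the line-terminator behavior quoted above.

```shell
# Hypothetical sample file; the NUL byte on the first line is what
# makes grep classify the whole file as binary.
tmpfile=$(mktemp)
printf 'x\000y\nfoo bar\n' > "$tmpfile"

# Without -a, grep only reports that a binary file matches. Depending on
# the grep version, that notice goes to stdout or to stderr.
without_a=$(LC_ALL=C grep 'foo' "$tmpfile" 2>/dev/null)

# With -a, the file is processed as text and the matching line is printed.
with_a=$(LC_ALL=C grep -a 'foo' "$tmpfile")

printf 'without -a: %s\n' "$without_a"
printf 'with -a:    %s\n' "$with_a"

rm -f "$tmpfile"
```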
-U and -a? – Stephen Kitt Oct 04 '17 at 13:07
-U changes the behavior of grep to allow for a bytewise search. As far as -a goes, the man page does not elaborate on the matter, but it says that -a makes grep process bytes as text, which is correct in this case. I am open for criticism, suggestions, or reasoning to improve my knowledge here. – Jonathan Komar Oct 04 '17 at 13:12
-U (--binary) and -a (--text) are contradictory ;-). Or were you thinking of -u (--unix-byte-offsets)? – Stephen Kitt Oct 04 '17 at 13:28
-U, any non-text bytes including line terminators are treated as line terminators. -a forces only line terminators to be treated as line terminators. So using -a is merely a precaution to ensure that the line numbering is correct in the output when you know that that input is supposed to be text. – Jonathan Komar Oct 04 '17 at 15:26