242

I have some database dumps from a Windows system on my box, and I'm using Cygwin to grep through them. They appear to be plain text files: I open them with text editors such as Notepad and WordPad and they look legible. However, when I run grep on them, it says Binary file foo.txt matches.

I have noticed that the files contain some ASCII NUL characters, which I believe are artifacts of the database dump.

So what makes grep consider these files to be binary? The NUL character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?

user394

10 Answers

174

If there is a NUL character anywhere in the file, grep considers it a binary file.

One workaround is a pipeline like cat file | tr -d '\000' | yourgrep, which first eliminates all NULs and then searches through the result.
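For example (file and pattern names here are illustrative):

```shell
# Create a sample "text" file with an embedded NUL byte
printf 'hello\0world\nsecond line\n' > dump.txt

# Plain grep refuses to print the line and reports a binary match instead
grep world dump.txt

# Strip the NULs first, then grep works normally
tr -d '\000' < dump.txt | grep world
# → helloworld
```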

bbaja42
    ... or use -a/--text, at least with GNU grep. – derobert Nov 26 '12 at 20:44
  • 1
    @derobert: actually, on some (older) systems, grep sees the lines, but its output will truncate each matching line at the first NUL (probably because it calls C's printf and gives it the matched line?). On such a system a grep cmd .sh_history will return as many empty lines as there are lines matching 'cmd', as each line of sh_history has a specific format with a NUL at the beginning of each line. (but your comment "at least on GNU grep" probably comes true. I don't have one at hand right now to test, but I expect they handle this nicely) – Olivier Dulac Nov 25 '13 at 11:46
  • +1 for using -a / --text with GNU grep, because you can mix this easily with recursive search, e.g.

    egrep -r -a mystring .

    Thanks @derobert

    – phil_w Jul 14 '15 at 19:08
  • 6
    Is the presence of a NUL character the only criterion? I doubt it. It's probably smarter than that. Anything falling outside the ASCII 32-126 range would be my guess, but we'd have to look at the source code to be sure. – Michael Martinez Aug 14 '15 at 16:58
  • 3
    My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs. – bbaja42 Aug 18 '15 at 22:31
  • 3
    I had a file which grep on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete. – cp.engr Feb 15 '16 at 16:15
  • Thanks for the comment. I hadn't realized that different platforms would handle the issue differently. – bbaja42 Mar 02 '16 at 15:56
  • Not quite: grep does process the file line by line, but it stops after the first NUL it encounters. Take a look at https://stackoverflow.com/questions/50992292/grep-not-parsing-the-whole-file – Miguel Ortiz Jun 22 '18 at 21:01
  • MiguelOrtiz and cp.engr could be right. In Windows Subsystem for Linux, grep treats any file that contains Chinese characters as a binary file, while grep in MobaXterm considers Chinese characters plain text. Both treat files containing NUL (\0) as binary. – Weekend Jul 23 '18 at 03:08
  • I have used grep with UTF-8, so it can handle a long dash in UTF-8. It may depend on the locale. It is definitely not a filesystem flag. – ctrl-alt-delor Oct 23 '18 at 20:03
  • 2
    BSD grep (which is available on MacOS) also supports -a / --text – Nathan Long Oct 30 '18 at 15:21
  • doing cat, pipe, tr, pipe AGAIN... seems like a whole lot of wasted resources... when you can just use grep --text option... and not use up lots of extra cpu and memory (two processes, two pipes). – Trevor Boyd Smith Feb 08 '19 at 20:59
  • This answer is the only one that worked on my Ubuntu 20 for a seemingly innocent log file that had NUL characters. Also, as mentioned by other comments to this answer, grep --text -i <search string> <file name> works great too. – Binita Bharati Jun 23 '21 at 14:51
  • @MichaelMartinez : Anything falling outside the Ascii 32-126 range would be my guess …….. ehhh….. you do realize the typical newline \n (or \r\n in Windows) already falls out of your range criteria ? (and the horizontal tab ( 0x9 :: \11 :: \t ) too) – RARE Kpop Manifesto Feb 22 '24 at 07:45
  • @RAREKpopManifesto you get the gist of my comment, right? In any case this is easily testable: echo Ascii > ./testfile; grep A ./testfile; dd if=/dev/zero of=./testfile bs=1 seek=5 count=2; grep A ./testfile – Michael Martinez Feb 28 '24 at 15:13
  • @MichaelMartinez : i'm not saying \0 is a false criteria. If it were I'd be more concerned about grep itself. I'm merely saying your suggested filtering criteria might be overly broad, even without looking at grep source, and that plenty of ASCII bytes, even with no consideration of Unicode, would exist in any properly formatted "text" file – RARE Kpop Manifesto Mar 05 '24 at 23:11
  • (I put text in quotes cuz I'll let others philosophically debate whether something like a .mp4 file comprised of just a single stream of subtitles using only ASCII 32-126 chars constitute a text file or not, cuz it's easily extractable with a text-oriented parser despite being wrapped in a binary container) – RARE Kpop Manifesto Mar 05 '24 at 23:24
170

grep -a worked for me:

$ grep --help
[...]
 -a, --text                equivalent to --binary-files=text
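A quick sanity check (the file name is illustrative):

```shell
# A file with a NUL byte, which grep would normally report as binary
printf 'foo\0bar\n' > sample.txt

# -a / --text forces grep to treat the file as text and print the
# matching line (the NUL byte is passed through raw)
grep -a bar sample.txt
```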
Plouff
28

You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.
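For instance (file name is illustrative):

```shell
# strings extracts runs of printable characters; NULs and other
# binary bytes are dropped, so the output is always safe for grep
printf 'alpha\0beta\ngamma\n' > mixed.txt
strings mixed.txt | grep alpha
# Note: strings splits at NUL bytes and drops short runs (under 4
# characters by default), so line numbers and offsets are lost
```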

holgero
25

GNU grep 2.24 RTFS

Conclusion: there are 2 and only 2 cases:

  • NUL, e.g. printf 'a\0' | grep 'a'

  • encoding error according to the C99 mbrlen(), e.g.:

    export LC_CTYPE='en_US.UTF-8'
    printf 'a\x80' | grep 'a'
    

    because \x80 cannot be the first byte of a UTF-8 code point: UTF-8 - Description | en.wikipedia.org

Those checks are only done up to the Nth byte of the input, where N = TODO (32KiB on one test system). If the check would only fail after the Nth byte, the file is still considered a text file (mentioned by Stéphane Chazelas).

Only up to the first buffer read

So if a NUL or encoding error happens in the middle of a very large file, it might be grepped anyway.

I imagine this is for performance reasons.

E.g.: this prints the line:

printf '%10000000s\n\x80a' | grep 'a'

but this does not:

printf '%10s\n\x80a' | grep 'a'

The actual buffer size depends on how the file is read. E.g. compare:

export LC_CTYPE='en_US.UTF-8'
(printf '\n\x80a') | grep 'a'
(printf '\n'; sleep 1; printf '\x80a') | grep 'a'

With the sleep, the first line gets passed to grep even if it is only 1 byte long because the process goes to sleep, and the second read does not check if the file is binary.

RTFS

git clone git://git.savannah.gnu.org/grep.git 
cd grep
git checkout v2.24

Find where the stderr error message is encoded:

git grep 'Binary file'

Leads us to src/grep.c:

if (!out_quiet && (encoding_error_output
                   || (0 <= nlines_first_null && nlines_first_null < nlines)))
  {
    printf (_("Binary file %s matches\n"), filename);

Assuming those variables are well named, we have basically reached the conclusion.

encoding_error_output

Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:

clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
  return true;

then just man mbrlen.

nlines_first_null and nlines

Initialized as:

intmax_t nlines_first_null = -1;
/* removed for brevity */
nlines = 0;

so when a NUL is found, 0 <= nlines_first_null becomes true.

TODO when can nlines_first_null < nlines ever be false? I got lazy.

POSIX

Does not specify binary-file handling for grep, and GNU grep does not document its heuristic, so RTFS is the only way.

Ciro Santilli OurBigBook.com
8

One of my text files was suddenly being seen as binary by grep:

$ file foo.txt
foo.txt: ISO-8859 text

Solution was to convert it by using iconv:

iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
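To see why the conversion helps (file names and content are illustrative):

```shell
# 0xFC is "ü" in ISO-8859-1 but an invalid lead byte in UTF-8,
# so grep under a UTF-8 locale may flag the file as binary
printf 'M\374ller\n' > foo.txt

# After converting to UTF-8, grep treats the file as plain text
iconv -f ISO-8859-1 -t UTF-8 foo.txt > foo_new.txt
grep ller foo_new.txt
```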
kenorb
zzapper
  • 1
    This happened to me as well. In particular, the cause was an ISO-8859-1-encoded non-breaking space, which I had to replace with a regular space in order to get grep to search in the file. – Gallaecio Jun 09 '15 at 13:50
  • 4
    grep 2.21 treats ISO-8859 text files as if they were binary; add export LC_ALL=C before the grep command. – netawater Aug 17 '15 at 02:52
  • @netawater Thanks! This is e.g. the case if you have something like Müller in a text file. That's 0xFC in hexadecimal, so outside the range grep would expect for UTF-8 (single bytes only go up to 0x7F). Check with printf 'a\x80' | grep 'a' as Ciro describes above. – Anne van Rossum Nov 26 '16 at 16:51
5

The file /etc/magic or /usr/share/misc/magic has a list of magic sequences that the file command uses to determine the file type.

Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.

grep on Linux has some options for handling binary files, such as --binary-files or -U / --binary.
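With GNU grep, --binary-files selects the behavior explicitly (the file name is illustrative):

```shell
printf 'key\0value\n' > data.bin

# --binary-files=text (same as -a): print matching lines anyway
grep --binary-files=text key data.bin

# --binary-files=without-match (same as -I): silently skip binary
# files; useful when grepping a mixed directory tree
grep -I key data.bin || echo "no match reported"
```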

fduff
2

Actually answering the question "What makes grep consider a file to be binary?", you can use iconv:

$ iconv < myfile.java
iconv: (stdin):267:70: cannot convert

In my case there were Spanish characters that showed up correctly in text editors, but grep considered the file binary; the iconv output pointed me to the line and column numbers of those characters.

In the case of NUL characters, iconv considers them normal and will not print that kind of output, so this method is not suitable for them.
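The NUL limitation is easy to verify:

```shell
# iconv reports invalid byte sequences (the exact wording varies
# by platform)...
printf 'a\x80\n' | iconv -f UTF-8 -t UTF-8

# ...but NUL is perfectly valid UTF-8, so it passes through silently
printf 'a\0b\n' | iconv -f UTF-8 -t UTF-8 | od -c
```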

golimar
2

One of my students had this problem. There is a bug in grep under Cygwin: if a file contains non-ASCII characters, grep and egrep treat it as binary.

TPS
1

I had the same problem. I used vi -b [filename] to see the added characters and found the control characters ^@ (NUL) and ^M (carriage return). Then, in vi, type :1,$s/^@//g to remove the ^@ characters, and repeat the command for ^M.

Warning: To enter the (highlighted) control characters, press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.
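The same cleanup can be done non-interactively (a sketch; dump.txt is a placeholder name):

```shell
# Delete both NUL (^@) and carriage-return (^M) bytes in one pass
tr -d '\000\r' < dump.txt > dump_clean.txt
```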

kenorb
1

I also had this problem, but in my case it was triggered when a matched line was too long.

file myfile.txt
myfile.txt: UTF-8 Unicode text, with very long lines

grep would run through the entire file fine with many patterns, but when a pattern matched a "very long line" it stopped with Binary file myfile.txt matches.

Adding -a also solves this problem, but pre-parsing the file for NUL or other invalid characters would have no effect (there are none; otherwise grep would not have completed for the other patterns). In this case the offending line had over 25,000 characters!

What I don't understand is why it only happens when grep tries to return the line, and not while it is scanning the file for other patterns.

Martin