A colleague created a build tree (via gradle :dependencies > dependencies.txt) and emailed it to me. I wanted to know the version of one library, so I ran:

grep log4j dependencies.txt

but got zero matches; my shell just printed a new prompt. Since it was a long file and I trusted grep, I didn't open it to check. After a lot of back-and-forth discussion, I was told that the file had been created on a Windows machine. Even then I was surprised that grep wouldn't work, since the search string isn't interrupted by newlines. But after executing:

dos2unix dependencies.txt

grep started showing the matches I wanted.

Obviously my understanding of how grep works was incorrect. Why would grep not behave the same way on file contents on different operating systems when the search term occurs without any newlines in between?

Further info

  • file dependencies.txt returns dependencies.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
  • LC_ALL=C grep log4j dependencies.txt returns nothing
  • grep o dependencies.txt returned Binary file dependencies.txt matches
  • grep --text log4j dependencies.txt returned nothing
Sridhar Sarnobat

1 Answer

UTF-16 text consists of 16-bit code units, so each character is stored in at least two bytes. If the text is all ASCII characters, every other byte is a zero byte (a NUL byte, \0, not the digit zero). The tools on your Mac are very likely not set up to deal with that.
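
To see that layout for yourself, here is a minimal sketch. It assumes iconv and od are installed (neither appears in the question) and uses the little-endian variant that file reported:

```shell
# Encode the ASCII string "log4j" as UTF-16LE and dump it one byte at a time.
# od -c prints each NUL byte as \0, so every letter shows up followed by \0.
printf 'log4j' | iconv -f UTF-8 -t UTF-16LE | od -An -c

# Five ASCII characters become ten bytes.
printf 'log4j' | iconv -f UTF-8 -t UTF-16LE | wc -c
```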

In particular, the NUL bytes are taken as string terminators in C, so many tools may not be able to deal with them at all. Even if they could deal with them, they might take each NUL as a distinct character, so you'd need something like l.o.g.4.j to match that string.
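
The effect on a literal search can be reproduced on a throwaway file. This is only a sketch: the sample line is made up, and iconv is assumed to be available. Decoding to UTF-8 before searching removes the interleaved NULs:

```shell
# Create a small UTF-16LE sample (hypothetical content, not the real file).
sample=$(mktemp)
printf 'org.apache.logging.log4j:log4j-core\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Raw search: the bytes l, o, g, 4, j are never adjacent, so no line is counted.
# grep exits non-zero when nothing matches, hence the "|| true".
raw_count=$(LC_ALL=C grep -c 'log4j' "$sample" || true)

# Decode to UTF-8 first: the same search now counts the line.
decoded_count=$(iconv -f UTF-16LE -t UTF-8 "$sample" | grep -c 'log4j')

echo "raw: $raw_count, decoded: $decoded_count"   # prints: raw: 0, decoded: 1
rm -f "$sample"
```

For the real file, the -f value would need to match its actual encoding, which file reported as little-endian UTF-16.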

The funny thing is that NUL bytes aren't visible when printed, so if you were to, e.g., cat the file to the terminal, it might look perfectly normal.

The NULs are also the reason grep considers the file binary.

See also: What makes grep consider a file to be binary?
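
A rough way to check whether a file contains NUL bytes before trusting a grep result is to count them. This is a sketch with a made-up sample; count_nuls is a hypothetical helper name, and the exact heuristic grep applies may differ between implementations:

```shell
# count_nuls (hypothetical helper): print how many NUL bytes a file contains.
count_nuls() {
    # tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
    LC_ALL=C tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

sample=$(mktemp)

# UTF-16LE: one NUL per ASCII character, the newline included.
printf 'log4j\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"
utf16_nuls=$(count_nuls "$sample")

# Plain ASCII/UTF-8 text contains no NUL bytes.
printf 'log4j\n' > "$sample"
utf8_nuls=$(count_nuls "$sample")

echo "UTF-16LE: $utf16_nuls NULs, UTF-8: $utf8_nuls NULs"
rm -f "$sample"
```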

ilkkachu
  • Thanks for the answer. FYI, cat does indeed print the contents normally, but piping that output to grep doesn't work. Also, opening with vim and /-searching works. I was a bit embarrassed when the vim part happened. – Sridhar Sarnobat Mar 03 '21 at 19:14
  • @SridharSarnobat, yes, cat doesn't change the contents, so it doesn't matter if you do cat file | grep, or just grep file (or grep < file). The NULs come to your terminal, the terminal ignores them. less shows that as U^@T^@F^@-^@1^@6^@ ^@t^@e^@x^@t^@ ^@c^@o^@n^@s^@i^@s^@t^@s etc. with the ^@ in inverse color marking the NULs. – ilkkachu Mar 03 '21 at 19:16
  • Indeed more shows the padding characters (though not less, curiously). – Sridhar Sarnobat Mar 03 '21 at 19:17