52

Standard Unix utilities like grep and diff use some heuristic to classify files as "text" or "binary". (E.g. grep's output may include lines like Binary file frobozz matches.)

Is there a convenient test one can apply in a zsh script to perform a similar "text/binary" classification? (Other than something like grep '' somefile | grep -q Binary.)

(I realize that any such test would necessarily be heuristic, and therefore imperfect.)

kjo
  • 10
    file is a standard utility and can run through the file magic for determining file types to the best of its abilities. It can tell most text formats and does a pretty decent job on binary formats. If all you're trying to do is find out if a file is text or not, that's the command you're interested in. – Bratchley Apr 10 '16 at 16:37
  • @Bratchley: some versions of file will print, e.g. shell script, for some files I would like classified as "text". Is there a way to get file to print just text or binary? – kjo Apr 10 '16 at 16:48
  • The reason I wrote it as a comment was because I didn't technically provide a full solution I just nudged you into a particular direction. you may have to play around with the options to get what you want – Bratchley Apr 10 '16 at 17:06
  • 1
    @don_crissti That question is about someone trying to get people to debug his bash script. Detecting text is just what the script is supposed to do. They ended up having an issue in one of their cut commands. – Bratchley Apr 10 '16 at 17:18
  • 1
    @don_crissti The fact that there's an answer on question A that works for question B does not always make A a duplicate of B. Consider someone who is looking for a way to classify files as text or binary. Which is more useful: a “debug my script” question which happens to have a generic answer buried among other answers that are specific to that script, or a generic “how do I classify files as text or binary?”? – Gilles 'SO- stop being evil' Apr 10 '16 at 21:05
  • 1
    @Gilles - depends on how you read it. I actually see the question there as a typical case of XY problem: OP there wants to check if a file is a text file - and thinks piping file output to cut is the solution - sure, there's a missing space which makes it fail and that has made most people there address the Y instead of the X but Stéphane's comments and answer show the proper way to determine whether the file is text or not. – don_crissti Apr 10 '16 at 21:15
  • # more vi yields ******** vi: Not a text file ******** – AbraCadaver Apr 11 '16 at 15:06
  • With GNU grep I would use "grep -I '' " am I missing anything? – bjoerng Feb 17 '21 at 20:43

10 Answers

37

If you ask file for just the mime-type you'll get many different ones like text/x-shellscript, and application/x-executable etc, but I imagine if you just check for the "text" part you should get good results. Eg (-b for no filename in output):

file -b --mime-type filename | sed 's|/.*||'
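As a rough sketch of that idea, the check can be wrapped in a function that accepts text/* and also a short allowlist of text-based types that live outside text/*. The name mime_is_text and the allowlist entries are illustrative assumptions, not an exhaustive list. The function only inspects a MIME type string, so it is meant to be fed the output of file -b --mime-type.

```shell
# mime_is_text TYPE: succeed if the MIME type string looks text-based.
# Hypothetical helper; extend the allowlist to suit your own files.
mime_is_text() {
  case $1 in
    text/*) return 0 ;;
    application/json|application/xml|application/ecmascript) return 0 ;;
    *+xml|*+json) return 0 ;;   # structured-syntax suffixes, e.g. image/svg+xml
    *) return 1 ;;
  esac
}

# Possible usage (assumes file(1) is installed):
#   mime_is_text "$(file -b --mime-type -- "$f")" && echo text || echo binary
```

Keeping the classification separate from the call to file makes it easy to adjust the allowlist without touching the rest of a script.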
heemayl
meuh
  • 26
    Just remember, depending on your file, that you might miss some text formats: application/xml (and similar like RSS), application/ecmascript, application/json, image/svg+xml, ... You'd have to whitelist those. – Boldewyn Apr 11 '16 at 07:38
  • @Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems. – meuh Apr 11 '16 at 07:49
  • Yes, that's the gist of my answer below. Only problem is, that that solution has to look at the whole file... – Boldewyn Apr 11 '16 at 08:19
  • 7
    @Boldewyn In principle, application/* types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both a text/xml and an application/xml. So the question whether to consider them as text depends on the OP's needs. – Tobia Apr 11 '16 at 08:46
  • You would have better luck grepping the output of file -b <filename>. – Anthony Rutledge Oct 18 '21 at 02:07
26

Another approach would be to use isutf8 from the moreutils collection.

It exits with 0 if the file is valid UTF-8 or ASCII, or short circuits, prints an error message (silence with -q) and exits with 1 otherwise.

techraf
Wander Nauta
  • 5
    Nice suggestion. I just noticed that giving a directory as arg makes it return 0. I would have preferred 1 at least. But then, garbage in, garbage out. – meuh Apr 11 '16 at 14:07
18

If you like the heuristic used by GNU grep, you could use it:

isbinary() {
  LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
}

It searches for NUL bytes in the first buffer read from the file (a few kilo-bytes for a regular file, but could be a lot less for a pipe or socket or some devices like /dev/random). In UTF-8 locales, it also flags on byte sequences that don't form valid UTF-8 characters. It assumes LC_ALL is not set to something where the language is not English.

The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:

ls -ld -- *(.+isbinary)

would list the binary files.
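If depending on the wording of grep's message is a concern, the NUL-byte part of the heuristic can be approximated directly. This is only a sketch: the name isbinary_nul and the 8000-byte window are assumptions, it does not attempt the invalid-UTF-8 check, and head -c is widely available but not specified by POSIX.

```shell
# isbinary_nul FILE: succeed if a NUL byte occurs in the first 8000 bytes.
# Falls back to $REPLY so it can also serve as a zsh glob qualifier.
isbinary_nul() {
  # tr -cd '\000' deletes everything except NUL bytes; wc -c counts what is left.
  nul_count=$(head -c 8000 -- "${1-$REPLY}" 2>/dev/null | tr -cd '\000' | wc -c)
  # Arithmetic expansion strips any padding that wc may add around the number.
  [ "$((nul_count))" -gt 0 ]
}
```

An unreadable file yields a count of zero here, so it is reported as not binary; add an explicit readability test if that matters.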

9

You could try determining whether iconv can read the file. This is slower than file (which reads only the beginning of the file), but will give you more reliable results:

ENCODING=utf-8
if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
    echo text
else
    echo binary
fi

This makes iconv basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will barf and exit.
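The same check could be packaged as a function, sketched here with the short -f and -t options and with unreadable files reported separately rather than lumped in with binary ones. The function name classify_utf8 and the "unreadable" label are assumptions made for illustration. Note that an empty file still counts as text, and so does a file containing NUL bytes, since NUL is valid UTF-8.

```shell
# classify_utf8 FILE: print "text", "binary", or "unreadable".
# Hypothetical helper; "text" here means "decodes as valid UTF-8".
classify_utf8() {
  if [ ! -f "$1" ] || [ ! -r "$1" ]; then
    echo unreadable
    return 2
  fi
  # Converting UTF-8 to UTF-8 is a no-op that fails on invalid input.
  if iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
    echo text
  else
    echo binary
  fi
}
```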

Boldewyn
  • 4
    Using -f and -t instead of the GNU long options would make it more portable. Note that it will call "binary" the files it can't open. It will call empty files "text". – Stéphane Chazelas Apr 11 '16 at 09:12
  • Agreed. I used the long forms for ad hoc documentation, for people who don't know iconv. But -f and -t are usually better. – Boldewyn Apr 11 '16 at 10:54
8

You can write a script that calls file, and use a case-statement to check for the cases you are interested in.

For example

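#!/bin/sh
# Classify "$1" as text or binary from file's description.
# -b omits the filename, so a name containing "text" cannot cause a false match.
case $(file -b -- "$1") in
(*script*|*\ text|*\ text\ *|*\ text,*)
    # e.g. "ASCII text", "Lua script text executable",
    # "ASCII text, with CRLF line terminators"
    echo text
    ;;
(*)
    echo binary
    ;;
esac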

though of course there may be many special cases which are of interest. Just checking strings on a copy of libmagic, I see about 200 cases, e.g.,

Konqueror cookie text
Korn shell script text executable
LaTeX 2e document text
LaTeX document text
Linux Software Map entry text
Linux Software Map entry text (new format)
Linux kernel symbol map text
Lisp/Scheme program text
Lua script text executable
LyX document text
M3U playlist text
M4 macro processor script text

Some use the string "text" as part of a different type, e.g.,

SoftQuad troff Context intermediate   
SoftQuad troff Context intermediate for AT&T 495 laser printer
SoftQuad troff Context intermediate for HP LaserJet

likewise script could be part of a word, but I see no problems in this case. But a script should check for "text" as a word, not a substring.
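One way to do the whole-word match, sketched under the assumption that words in the description are separated by spaces and optionally followed by a comma: pad the description with spaces and match " text " or " script " inside it. The name describes_text is hypothetical, and the function works on a description string rather than calling file itself.

```shell
# describes_text DESCRIPTION: succeed if "text" or "script" appears as a word.
# Hypothetical helper; pass it the output of: file -b -- "$somefile"
describes_text() {
  # Padding with spaces lets one pattern cover the start, middle, and end.
  case " $1 " in
    *" text "*|*" text,"*|*" script "*|*" script,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this, "SoftQuad troff Context intermediate" is rejected because "text" only occurs inside "Context", while "Korn shell script text executable" is accepted.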

As a reminder, file output does not use a precise description which would always have "script" or "text". Special cases are something to consider. A followup commented that the --mime-type works while this approach would not, for .svg files. However, in a test I see these results for svg-files:

$ ls -l *.svg
-r--r--r-- 1 tom users  6679 Jul 26  2012 pumpkin_48x48.svg
-r--r--r-- 1 tom users 17372 Jul 30  2012 sink_48x48.svg
-r--r--r-- 1 tom users  5929 Jul 25  2012 vile_48x48.svg
-r--r--r-- 1 tom users  3553 Jul 28  2012 vile-mini.svg
$ file *.svg
pumpkin_48x48.svg: SVG Scalable Vector Graphics image
sink_48x48.svg:    SVG Scalable Vector Graphics image
vile-mini.svg:     SVG Scalable Vector Graphics image
vile_48x48.svg:    SVG Scalable Vector Graphics image
$ file --mime-type *.svg
pumpkin_48x48.svg: image/svg+xml
sink_48x48.svg:    image/svg+xml
vile-mini.svg:     image/svg+xml
vile_48x48.svg:    image/svg+xml

which I selected after seeing a thousand files show only 6 with "text" in the mime-type output. Arguably, matching the "xml" on the end of the mime-type output could be more useful, say, than matching "SVG", but using a script to do that takes you back to the suggestion made here.

The output of file requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them "data").

There is more than one implementation of file. The one most commonly used does its work in libmagic, which can be used from different programs (perhaps not directly from zsh, though python can).

According to File test comparison table for shell, Perl, Ruby, and Python, Perl has a -T option that it can use to provide this information, but the table lists no comparable feature for zsh.

Thomas Dickey
  • Unfortunately GNU file's output for svg files: SVG Scalable Vector Graphics image doesn't contain the word text. I thought this approach would be better than the accepted answer of checking the mime-type, but it still misses some types. – Peter Cordes Apr 11 '16 at 23:34
  • It still misses, with the mime-type; for xterm's svg file I get image/svg+xml. Actually - just checked a 1000-file sample, only 6 came out as "text" according to the mime-type alone. I'll stick with a script, which at least can be made to work as needed. – Thomas Dickey Apr 11 '16 at 23:39
8

file has an option --mime-encoding that attempts to detect the encoding of a file.

$ file --mime-encoding Documents/poster2.pdf
Documents/poster2.pdf: binary
$ file --mime-encoding projects/linux/history-torvalds/Makefile
projects/linux/history-torvalds/Makefile: us-ascii
$ file --mime-encoding graphe.tex
graphe.tex: us-ascii
$ file --mime-encoding software.tex
software.tex: utf-8

You can use file --mime-encoding somefile | grep binary to detect whether a file is binary. It works well in practice, although a single invalid character in a long text file can confuse it.

For example, I alias cat to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:

#! /bin/sh -

[ ! -t 1 ] && exec /bin/cat "$@"
for i
do
    if file --mime-encoding -- "$i" | grep -q binary
    then
        hexdump -C -- "$i"
    else
        /bin/cat -- "$i"
    fi
done
lgeorget
4

Categories are arbitrary. Before answering how to classify, you need a (strict) definition, and to have a definition you need a purpose.

So, what do you want to do with that classification?

  • If you want to select ASCII or binary mode in FTP, it's important not to transfer a binary file as ASCII (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or a similar format. When in doubt, select binary. You may also want to test that the file contains only a subset of bytes such as 0x0A, 0x0D, and 0x20-0x7F.
  • If you want to transfer the file over some protocol (POP3, SMTP), you need a test to choose between base64 encoding and plain text. In this case, you should test whether there are unsupported characters.
  • Any other case… may have any other definition.
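For the byte-subset definition, a check along these lines might do. It is a sketch with assumptions: the name is_plain_ascii is made up, and the allowed set is LF, CR, and 0x20 through 0x7E, leaving out DEL (0x7F) on the grounds that it is a control character. Adjust the tr set to match whatever definition your purpose calls for.

```shell
# is_plain_ascii FILE: succeed if FILE holds only LF, CR, and bytes 0x20-0x7E.
# Hypothetical helper; the allowed set is written in octal for tr.
is_plain_ascii() {
  [ -f "$1" ] && [ -r "$1" ] || return 2
  # Delete every allowed byte; anything left over is outside the subset.
  leftover=$(LC_ALL=C tr -d '\012\015\040-\176' < "$1" | wc -c)
  [ "$((leftover))" -eq 0 ]
}
```

Unlike the first-buffer heuristics, this reads the whole file, so it is slower on large inputs but cannot be fooled by a stray byte near the end.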
ESL
3
perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'

will do it. See documentation for -B and -T (search in that page for the string The -T and -B switches work as follows).

msh210
  • perl -le 'print -B $ARGV[0] ? "binary" : "text"' -- might be clearer. Or even perl -le 'print -B $_ ? "binary" : "text", @ARGV > 1 ? "\t$_" : "" for @ARGV' -- – jrw32982 Apr 21 '17 at 12:20
1

I contributed to https://github.com/audreyr/binaryornot. It does not have a command-line wrapper (yet), but it is a simple Python library that is easy enough to call even from the CLI. It uses a fairly efficient heuristic to determine whether a file is text or binary.

1

I know this question is a bit old, but I think my friend taught me a great "hack" to do this.

You use the diff command and check your file against a test text file:

$ diff filetocheck testfile.txt

Now if filetocheck is a binary file, the output would be:

Binary files filetocheck and testfile.txt differ

This way you could leverage the diff command and e.g. write a function which does the check in a script.
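Such a function might look like the sketch below. It compares against /dev/null instead of a separate test file, which is an assumption of this sketch rather than part of the original trick, and it relies on diff printing a line that starts with "Binary files" in the C locale, as GNU diff does. The name isbinary_diff is hypothetical.

```shell
# isbinary_diff FILE: succeed if diff reports FILE as binary.
# Assumes a diff that prints "Binary files A and B differ" (e.g. GNU diff).
isbinary_diff() {
  # LC_ALL=C keeps the message in English regardless of the user's locale.
  LC_ALL=C diff -- "$1" /dev/null 2>/dev/null | grep -q '^Binary files'
}
```

For a text file, diff prints ordinary change lines (starting with a line number or "<"), so the pattern does not match and the function returns non-zero.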

Rui F Ribeiro