All files are binary under the hood: they are stored as a sequence of bits.
The bits of files are actually grouped in bytes. Every file consists of an integer number of bytes. All unix systems, and in fact almost all computers, have bytes composed of 8 bits (known as octets in networking terminology). There is a natural way to interpret bytes as 8-bit numbers, i.e. numbers between 0 and 2^8 - 1 = 255.
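As a quick illustration of the byte-as-number view, here is a minimal Python sketch. The three byte values are arbitrary examples of mine, not taken from any particular file.

```python
# Every byte is an integer between 0 and 2**8 - 1 = 255.
data = bytes([0, 65, 255])   # three arbitrary example bytes
values = list(data)          # iterating over a bytes object yields integers
largest = 2**8 - 1           # the largest value that fits in 8 bits

print(values)
print(largest)

# A value that does not fit in 8 bits cannot be stored in a byte.
try:
    bytes([256])
    rejected = False
except ValueError:
    rejected = True
print("256 rejected:", rejected)
```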
To see them as binary, you need a tool that writes them out in binary notation. Humans aren't well suited to binary notation: it takes far too long to write anything. It is more common to use hexadecimal notation, with 16 different digits. For example, 41 (sixty-five in hexadecimal) is more comfortable to read than 01000001 (sixty-five in binary). You can use a command such as od (“octal dump”), hexdump or hd to list a file with octal or hexadecimal notation for each byte (od -t x1 switches to hexadecimal).
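To make the notations concrete, here is a small Python sketch that writes the same byte in binary, octal and hexadecimal, plus a toy hexadecimal lister. The function name dump and its row layout are my own invention; it only loosely imitates what od -t x1 prints and is not a reimplementation of it.

```python
def dump(data: bytes, width: int = 16) -> str:
    """Return a hexadecimal listing of data, `width` bytes per row.

    Hypothetical helper: each row is an 8-digit hex offset followed by
    the bytes of that row as two-digit hex numbers.
    """
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        rows.append(f"{offset:08x}  {hex_part}")
    return "\n".join(rows)


byte = 65
binary = format(byte, "08b")   # eight binary digits
octal = format(byte, "03o")    # three octal digits
hexa = format(byte, "02x")     # two hexadecimal digits
print(binary, octal, hexa)     # 01000001 101 41

print(dump(b"Aa0\n"))          # 00000000  41 61 30 0a
```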
Bytes can represent characters. There are several character encodings used in the unix world. They are all based on ASCII, which defines the interpretation of bytes between 0 and 127. Notice that this only defines a meaning for half of the possible byte values. For example, 65 represents the capital letter A, 97 represents the lowercase letter a, 48 represents the digit 0, and so on. Some character encodings represent each character by one byte; for example, in the latin-1 encoding, 163 represents £, 241 represents ñ, and so on. The maximum number of characters that one can represent this way is 256, which isn't much; therefore, there are other encodings which use more than one byte per character. The de facto standard encoding in the unix world nowadays is UTF-8, which is a variable-length encoding (different characters take up different numbers of bytes) for the Unicode character set.
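You can check these byte values yourself. The following Python sketch encodes a few characters with different encodings; byte_values is a hypothetical helper name, and the euro sign is an extra example I picked to show a three-byte UTF-8 sequence.

```python
def byte_values(text: str, encoding: str) -> list:
    """Return the byte values (0..255) that encode `text` in `encoding`."""
    return list(text.encode(encoding))


print(byte_values("A", "ascii"))     # [65]
print(byte_values("a", "ascii"))     # [97]
print(byte_values("£", "latin-1"))   # [163]: one byte per character
print(byte_values("ñ", "latin-1"))   # [241]
print(byte_values("£", "utf-8"))     # [194, 163]: two bytes in UTF-8
print(byte_values("€", "utf-8"))     # [226, 130, 172]: three bytes

# ASCII characters keep their single-byte value in UTF-8.
print(byte_values("A", "utf-8"))     # [65]
```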
A text file is a binary file that happens to contain intelligible text. In fact, for unix programs, a file is a text file as long as it respects two conditions:
- A text file may not contain any null byte (a byte with a numerical value of 0). This byte does not represent any character and is used as a special marker internally in many text manipulation programs.
- A text file consists of a sequence of lines, and each line is terminated by a newline character (which has the numerical value 10).
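The two conditions above translate almost directly into code. This is only a sketch: looks_like_text is a name I made up, treating an empty file as text (zero lines) is my assumption, and real tools such as file use more elaborate heuristics.

```python
def looks_like_text(data: bytes) -> bool:
    """Check the two conditions: no null byte, every line newline-terminated."""
    if b"\x00" in data:
        return False                 # condition 1: no byte with value 0
    # Condition 2: if the content is non-empty, the last byte must be a
    # newline (value 10), so that the final line is terminated too.
    return data == b"" or data.endswith(b"\n")


print(looks_like_text(b"hello\nworld\n"))   # True
print(looks_like_text(b"hello\nworld"))     # False: last line unterminated
print(looks_like_text(b"hel\x00lo\n"))      # False: contains a null byte
```

To try it on a real file, read it in binary mode first, e.g. open(path, "rb").read(), with path being whatever file you want to inspect.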
Machine executables are a particular kind of binary file. If you run the cat command on them, you'll see garbage with the occasional bit of text. These files may coincidentally contain commands for your terminal, too. You can use the program strings to see all the text fragments in a binary file, leaving out the non-printable characters.
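The idea behind strings can be approximated in a few lines of Python. This is a rough sketch, not the real program: text_fragments is a hypothetical name, I only consider printable ASCII (byte values 32 to 126), and the default minimum run length of 4 is an assumption on my part.

```python
import re


def text_fragments(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least `min_len` printable ASCII bytes found in data."""
    # [\x20-\x7e] matches the printable ASCII range, space through tilde.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % (min_len,))
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up sample: some non-printable bytes around two short runs and one long one.
sample = b"\x7fELF\x01\x00Hello world\x00ab\x00"
print(text_fragments(sample))   # ['Hello world']
```

In the sample, ELF and ab are dropped because they are shorter than four characters, which is why only one fragment comes out.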
Machine executables aren't exactly a sequence of machine instructions: they also contain a little extra information that tells the operating system how to load the file into memory, usually also some data used by the program, and optionally debugging information. Most unix systems use the ELF format for machine executables. This format specifies how a file containing machine code is divided into sections, and that part is independent of the machine architecture; some sections contain code, and the meaning of that code is specific to a particular machine architecture.
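To give an idea of that "little extra information", here is a hedged Python sketch that reads a few fields from the start of an ELF file: the magic bytes, the word size, the byte order, the file type and the machine architecture number. The field offsets follow the ELF specification as far as I know; read_elf_ident and the dictionary keys are names I made up.

```python
import struct


def read_elf_ident(header: bytes):
    """Decode a few fields from the first 20 bytes of an ELF file.

    Returns None if the data does not look like an ELF header.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None                                  # missing magic number
    bits = {1: 32, 2: 64}.get(header[4])             # EI_CLASS
    byte_order = {1: "<", 2: ">"}.get(header[5])     # EI_DATA
    if bits is None or byte_order is None:
        return None
    # e_type and e_machine are two 16-bit fields starting at offset 16.
    e_type, e_machine = struct.unpack_from(byte_order + "HH", header, 16)
    return {
        "bits": bits,
        "little_endian": byte_order == "<",
        "type": e_type,        # e.g. 2 = executable, 3 = shared object
        "machine": e_machine,  # e.g. 3 = x86, 62 = x86-64
    }


# Synthetic header built by hand for the demonstration (not a real program):
# 64-bit, little-endian, type 2 (executable), machine 62 (x86-64).
fake = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 62)
print(read_elf_ident(fake))
print(read_elf_ident(b"#!/bin/sh\n" + bytes(20)))   # None: not an ELF file
```

On a real system you would pass the first 20 bytes of an executable read in binary mode; which path to use depends on your machine.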
You can use the command objdump -D /path/to/machine-executable to display a listing of the executable in a human-readable form: assembly language. Well, readable by a trained human anyway. Assembly language is specific to a processor architecture and maps directly to machine instructions.
It is possible to write a complete program in assembly language, but this is rarely done for non-trivial programs, because it takes a long time. If you're really crazy, you might write your program directly in binary. Some people have tried to come up with the shortest possible program that prints Hello world; Ryan Henszey explains how to write a 142-byte ELF executable for PC processors; Brian Raiter analyzed the ELF format and came up with a 45-byte program that Linux is willing to execute (that program prints nothing).
There are also executables that are not binary files; they are known as scripts. And conversely, there are many binary files that are not executable: images, videos, compressed files, word processor documents, code libraries without an entry point, executables for other processor architectures, …
xxd -b file. – Emanuel Berg Jul 10 '12 at 16:47