Mystery of binary files

Question

This is about files straight from the compiler, say g++, and the -o (outfile) flag.

If they are binary, shouldn't they just be a bunch of 0's and 1's?

When you cat them, you get unintelligible output but also intact words.

If you run file on them, you get the answer immediately; there seems to be no computation involved. Do binary files in fact have headers containing this kind of information?

I thought a binary executable was just the compiled program, in the form of machine instructions that your CPU can instantly and unambiguously understand. If so, aren't those instructions just bit patterns? But then, what's all the other stuff in the binaries? And how do you display the bits?

Also, if you somehow get hold of the manual of your processor, could you write a binary manually, one machine instruction at a time? That would be terribly inefficient, but very fascinating if you got it to work even for a "Hello World!" demo.

The key point is that each 1 and each 0 in the phrase "It's only ones and zeros" refers to bits, not bytes. Typically, programs present text/data by the byte. Even this 1 and this 0 are each a representation of an 8-bit byte, which holds the underlying true 1's and 0's. — Peter.O, Jul 10 '12 at 06:38

score 18 · Accepted Answer · edited Mar 20 '17 at 10:04

The Super User question "Why don't you see binary code when you open a binary file with text editor?" addresses your first point quite well.

Binary and text data aren't separate things: they are simply data. Only the interpretation makes them one or the other. If you open binary data (such as an image file) in a text editor, much of it won't make sense, because it does not fit your chosen interpretation (as text).

Files are stored as zeros and ones (e.g. voltage/no voltage in memory, magnetization/no magnetization on a hard drive). You don't see zeros and ones when you cat the files because raw 0/1 sequences wouldn't be of much use to a human; characters make more sense, and a hex dump is better for most purposes (try hexdump on a file).
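If you do want to look at the bits themselves, a few lines of code are enough. The following is a minimal sketch in Python (chosen only for illustration; the function name to_bits is made up for this example) that shows the same two bytes once as text and once as bits.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


if __name__ == "__main__":
    data = b"Hi"
    print(data.decode("ascii"))  # the two bytes interpreted as text
    print(to_bits(data))         # the same two bytes written out as bits
```

Nothing about the data changes between the two print calls; only the chosen interpretation does.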

Executable files do have a header that describes parameters such as the architecture for which the program was built, and what sections of the file are code and data. This is what file uses to identify the characteristics of your binary file.
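As a rough illustration of how little work is needed to read such a header, here is a sketch in Python that inspects the first 20 bytes of a file. It assumes the ELF format used on Linux; describe_elf and the small MACHINES table are names invented for this example, and the real file command consults a much larger database of patterns.

```python
import struct

# A few e_machine values from the ELF specification (far from exhaustive).
MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def describe_elf(header: bytes) -> str:
    """Describe a file from its first 20 bytes, a bit like a tiny `file`."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not an ELF file"
    # Byte 4 is the class (1 = 32-bit, 2 = 64-bit), byte 5 the byte order.
    bits = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown class")
    order = {1: "<", 2: ">"}.get(header[5])
    if order is None:
        return "ELF with unknown byte order"
    # e_type and e_machine are two 16-bit fields after the 16-byte e_ident.
    e_type, e_machine = struct.unpack(order + "HH", header[16:20])
    endian = "little-endian" if order == "<" else "big-endian"
    machine = MACHINES.get(e_machine, f"machine {e_machine}")
    return f"ELF {bits} {endian}, {machine}"


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as f:
            print(describe_elf(f.read(20)))
```

Because the answer sits in a handful of fixed positions at the start of the file, no analysis of the program itself is required, which is why file responds immediately.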

Finally: yes, you can write programs in assembly language using CPU opcodes directly. Take a look at Introduction to UNIX assembly programming and the Intel x86 documentation for a starting point.

score 10 · Answer 2 · answered Jul 10 '12 at 00:35

All files are stored as 1's and 0's; cat just tries to interpret each byte (8 bits) as a character, which is why you see the unintelligible characters.

mikhailvs

score 6 · Answer 3 · edited Apr 13 '17 at 12:38

All files are binary under the hood: they are stored as a sequence of bits.

The bits of files are actually grouped in bytes. Every file consists of an integer number of bytes. All unix systems, and in fact almost all computers, have bytes composed of 8 bits (known as octets in networking terminology). There is a natural way to interpret bytes as 8-bit numbers, i.e. numbers between 0 and 2⁸-1 = 255.

To see them as binary, you need a tool that writes them out in binary notation. Humans aren't well suited to binary notation: it takes far too long to write anything. It is more common to use hexadecimal notation, with 16 different digits. For example, 41 (sixty-five in hexadecimal) is more comfortable to read than 01000001 (sixty-five in binary). You can use a command such as od (“octal dump”) or hexdump or hd to list a file with octal or hexadecimal notation for each byte (od -t x1 switches to hexadecimal).
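The three notations can be compared side by side with a short Python sketch. The helper names notations and hex_dump are made up for this example; hex_dump only loosely imitates what od -t x1 prints and omits the offsets.

```python
def notations(value: int) -> dict:
    """Write one byte value (0-255) in binary, octal and hexadecimal."""
    return {
        "binary": f"{value:08b}",
        "octal": f"{value:03o}",
        "hex": f"{value:02x}",
    }


def hex_dump(data: bytes) -> str:
    """One hexadecimal pair per byte, similar in spirit to `od -t x1`."""
    return " ".join(f"{byte:02x}" for byte in data)


if __name__ == "__main__":
    print(notations(65))
    print(hex_dump(b"ABC\n"))
```

Each hexadecimal digit stands for exactly four bits, so two digits always cover one byte; that is the main reason hexadecimal is preferred over decimal for dumps.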

Bytes can represent characters. There are several character encodings used in the unix world. They are all based on ASCII, which defines the interpretation of bytes between 0 and 127. Notice that this only defines a meaning for half of the possible byte values. For example, 65 represents the capital letter A, 97 represents the lowercase letter a, 48 represents the digit 0, and so on. Some character encodings represent each character by one byte; for example, in the latin-1 encoding, 163 represents £, 241 represents ñ and so on. The maximum number of characters that one can represent this way is 256, which isn't much; therefore, there are other encodings which use more than one byte per character. The de facto standard encoding in the unix world nowadays is UTF-8, which is a variable-length encoding (different characters take up different numbers of bytes) for the Unicode character set.
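The mapping between characters and byte values can be checked directly. This is a small Python sketch; encoded_values is a name invented here, and the euro sign is an extra example that does not appear in the text above.

```python
def encoded_values(char: str, encoding: str) -> list:
    """Return the byte values (numbers 0-255) that encode one character."""
    return list(char.encode(encoding))


if __name__ == "__main__":
    print(encoded_values("A", "ascii"))    # a single byte
    print(encoded_values("0", "ascii"))    # 48 in decimal, 30 in hexadecimal
    print(encoded_values("£", "latin-1"))  # one byte in latin-1
    print(encoded_values("£", "utf-8"))    # two bytes in UTF-8
    print(encoded_values("€", "utf-8"))    # three bytes in UTF-8
```

The same character can thus occupy one, two or three bytes depending on the encoding, which is why a program has to know (or guess) the encoding before it can turn bytes back into text.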

A text file is a binary file that happens to contain intelligible text. In fact, for unix programs, a file is a text file as long as it respects two conditions:

A text file may not contain any null byte (a byte with a numerical value of 0). This byte does not represent any character and is used as a special marker internally in many text manipulation programs.
A text file consists of a sequence of lines, and each line is terminated by a newline character (which has the numerical value 10).
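The two conditions translate almost word for word into code. Here is a minimal Python sketch; the function name looks_like_text is hypothetical, and treating an empty file as text is an assumption of this example rather than something stated above.

```python
def looks_like_text(data: bytes) -> bool:
    """Check the two conditions: no null byte, every line newline-terminated."""
    if b"\x00" in data:
        return False
    # An empty file has no lines, so no line is left unterminated.
    return data == b"" or data.endswith(b"\n")


if __name__ == "__main__":
    print(looks_like_text(b"hello\nworld\n"))
    print(looks_like_text(b"\x7fELF\x02\x01\x01\x00"))
```

Note that the check says nothing about whether the bytes form intelligible words in any encoding; it only tests the two structural rules.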

Machine executables are a particular kind of binary file. If you run the cat command on them, you'll see garbage with the occasional bit of text. These files may coincidentally contain commands for your terminal, too. You can use the program strings to see all the text fragments in a binary file, leaving out the non-printable characters.
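The idea behind strings can be sketched in a few lines of Python: look for runs of printable ASCII bytes of some minimum length. The name printable_runs and the default of four characters are choices made for this example; the real program has more options and slightly different rules.

```python
import re


def printable_runs(data: bytes, min_length: int = 4) -> list:
    """Return runs of at least min_length printable ASCII bytes as strings."""
    # Bytes 0x20 (space) through 0x7e (~) are the printable ASCII characters.
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [run.decode("ascii") for run in re.findall(pattern, data)]


if __name__ == "__main__":
    sample = b"\x7fELF\x02\x01\x00Hello, world!\x00ab\x00/lib64/ld"
    print(printable_runs(sample))
```

The minimum length is what filters out most of the accidental matches, since short sequences of bytes that happen to look like letters are common in machine code.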

Machine executables aren't exactly a sequence of machine instructions: they also contain a little extra information that tells the operating system how to load the file into memory, usually also some data used by the program, and optionally debugging information. Most unix systems use the ELF format for machine executables. This format specifies how a file containing machine code is divided into sections, and that part is independent of the machine architecture; some sections contain code, and the meaning of that code is specific to a particular machine architecture.

You can use the command objdump -D /path/to/machine-executable to display a listing of the executable in a human-readable form: assembly language. Well, readable by a trained human anyway. Assembly language is specific to a processor architecture and maps directly to machine instructions.

It is possible to write a complete program in assembly language, but this is rarely done for non-trivial programs, because it takes a long time. If you're really crazy, you might write your program directly in binary. Some people have tried to come up with the shortest possible program that prints Hello world; Ryan Henszey explains how to write a 142-byte ELF executable for PC processors; Brian Raiter analyzed the ELF format and came up with a 45-byte program that Linux is willing to execute (that program prints nothing).
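To give a feel for what writing "directly in binary" means, the sketch below builds the bytes of three machine instructions by hand in Python. It assumes an x86-64 processor running Linux, where system call number 60 is exit; the function names are invented for this example. The result is only the instruction bytes, not a runnable file: a real executable would also need the ELF header described earlier.

```python
import struct


def mov_imm32(opcode: int, value: int) -> bytes:
    """Encode `mov r32, imm32`: one opcode byte plus a little-endian 32-bit value."""
    return bytes([opcode]) + struct.pack("<I", value)


def exit_program(status: int) -> bytes:
    """Machine code for: mov eax, 60 ; mov edi, status ; syscall."""
    MOV_EAX = 0xB8         # opcode byte for mov eax, imm32
    MOV_EDI = 0xBF         # opcode byte for mov edi, imm32
    SYSCALL = b"\x0f\x05"  # the two-byte syscall instruction
    return mov_imm32(MOV_EAX, 60) + mov_imm32(MOV_EDI, status) + SYSCALL


if __name__ == "__main__":
    print(exit_program(42).hex(" "))
```

Every byte here was looked up in the instruction set documentation rather than produced by an assembler, which is exactly the tedious process the question asks about.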

There are also executables that are not binary files; they are known as scripts. And conversely, there are many binary files that are not executable: images, videos, compressed files, word processor documents, code libraries without an entry point, executables for other processor architectures, …
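One common way a script announces which program should interpret it is a first line beginning with the two characters #! followed by the interpreter's path. That detail is not spelled out above, so take the following Python sketch as an illustration under that assumption; script_interpreter is a made-up name.

```python
def script_interpreter(first_line: bytes):
    """Return the interpreter path named on a `#!` line, or None if absent."""
    if not first_line.startswith(b"#!"):
        return None
    parts = first_line[2:].split()
    return parts[0].decode("utf-8", "replace") if parts else None


if __name__ == "__main__":
    print(script_interpreter(b"#!/bin/sh\n"))
    print(script_interpreter(b"\x7fELF\x02\x01\x01\x00"))
```

A script is therefore an ordinary text file; it is executable only because another program, itself a machine executable, reads and runs it.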
