
A common way to loop over a handful of files is (and don't hit me for this):

for f in $(ls); do …

Now, to be safe against files with spaces or other strange characters, a naive way would be to do:

find . -type f -print0 | while IFS= read -r -d '' file; …

Here, -d '' is shorthand for setting the delimiter to the ASCII NUL character, as in -d $'\0'.
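For reference, here is a minimal, runnable sketch of the NUL-delimited loop described above. The scratch directory and file names are made up for illustration, and it assumes a find that supports -print0 (GNU and BSD find do). It feeds the loop through process substitution rather than a pipe, so that variables set inside the loop are still visible afterwards.

```bash
#!/usr/bin/env bash
# Sketch only: the directory and file names below are hypothetical test data.
tmpdir=$(mktemp -d)
touch "$tmpdir/plain.txt" \
      "$tmpdir/with space.txt" \
      "$tmpdir/with"$'\n'"newline.txt"

count=0
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
# -d '' makes read stop at each NUL byte that find -print0 emits.
while IFS= read -r -d '' file; do
    count=$((count + 1))
    printf 'found: %q\n' "$file"
done < <(find "$tmpdir" -type f -print0)

echo "files seen: $count"   # all three files, including the one with a newline
rm -rf "$tmpdir"
```

With a plain pipe (find … | while …) the loop body would run in a subshell in bash's default configuration, and count would still be 0 after the loop.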

But why is that so? Why are '' and $'\0' the same? Is that due to the C roots of Bash with an empty string always being null-terminated?

slhck
  • 473
  • Referring to the "naïve" way, is there a better way of doing this? – iruvar Jan 12 '13 at 15:44
  • By the way, if you want to do safe operations iterating over a set of files, use for f in * instead of parsing ls. –  Jan 12 '13 at 16:01
  • @htor I know for i in $(ls) is terribly stupid—I'm almost ashamed I used it as a bad example here. – slhck Jan 12 '13 at 16:04
  • @ChandraRavoori Yes, for example by using find … -exec instead of looping around files, which works for most cases where you'd use such a for loop instead. Here, find takes care of everything for you. – slhck Jan 12 '13 at 16:06
  • @slhck, thanks. What about situations involving multi-step operations on each file where a loop may be preferable for readability reasons? Is there a better loop option than the "naïve way" above? – iruvar Jan 12 '13 at 16:13
  • @ChandraRavoori In that case, use find … -exec sh -c '…' {} ';'. Here, within sh -c you can call the file as the argument and even use multiple lines. See Gilles' answer here for more: http://unix.stackexchange.com/a/9500/5893 – slhck Jan 12 '13 at 16:15

2 Answers

10

The man page of bash reads:

          -d delim
                 The first character of delim is  used  to  terminate  the
                 input line, rather than newline.

Because strings are usually null-terminated, the first character of an empty string is the null byte. Makes sense to me. :)

The source reads:

static unsigned char delim;
[...]
    case 'd':
      delim = *list_optarg;
      break;

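The expression *list_optarg reads the first byte of the option argument. For an empty string, that byte is the terminating null, so delim is simply the null byte.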
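A quick way to check this from the shell is to read the same NUL-separated input with both spellings of the delimiter and compare the results. This is only a sketch: read_all is a hypothetical helper name, and the two records are invented test data.

```bash
#!/usr/bin/env bash
# Hypothetical helper: $1 is passed verbatim as the argument to read -d.
read_all() {
    local item out=()
    while IFS= read -r -d "$1" item; do
        out+=("$item")
    done < <(printf 'one\0two words\0')
    # Join the records with '|' so they can be compared as one string.
    printf '%s|' "${out[@]}"
}

with_empty=$(read_all '')       # read -d ''
with_nul=$(read_all $'\0')      # read -d $'\0'

echo "$with_empty"
echo "$with_nul"
```

Both calls should print one|two words|, since in each case read ends up with the null byte as its delimiter.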

Volker Siegel
  • 17,283
michas
  • 21,510
  • When you say "strings are usually null terminated", is that not the case somewhere in a POSIX environment? From the days when I was learning C for school, of course it makes sense to assume so; I was just checking. – slhck Jan 12 '13 at 08:45
  • But one could regard any string as containing arbitrarily many empty strings, e.g. if you concatenate '' and "X" you get "X". So you could argue that the first substring bash encounters is the empty string. For example, if you use the empty string in JavaScript's split(), it will split between each character. I suspect a "for historical reasons" may be the best explanation we can get. – donothingsuccessfully Jan 12 '13 at 08:48
  • Well, not quite because "concatenating" a C-style '\0' with 'X\0' should give you 'X\0', if done right. This doesn't have much to do with high-level functions in languages such as JavaScript @don – slhck Jan 12 '13 at 09:06
  • Thanks, michas, for adding the source. delim = *list_optarg; makes it clear why it's that way. – slhck Jan 12 '13 at 09:08
  • @slhck: Sorry, I didn't make myself clear. You asked "why are '' and $'\0' the same?", michas gave the proximate explanation of "that's what the code does". I outlined an alternative way of handling the empty string that I saw as equally reasonable and suggested that choosing one or the other was simply a matter of convention or happenstance. – donothingsuccessfully Jan 12 '13 at 12:16
6

There are two deficiencies in bash that compensate each other.

When you write $'\0', that is internally treated identically to the empty string. For example:

$ a=$'\0'; echo ${#a}
0

That's because internally bash stores all strings as C strings, which are null-terminated — a null byte marks the end of the string. Bash silently truncates the string to the first null byte (which is not part of the string!).

$ a=$'foo\0bar'; echo "$a"; echo ${#a}
foo
3

When you pass a string as an argument to the -d option of the read builtin, bash only looks at the first byte of the string. But it doesn't actually check that the string is not empty. Internally, an empty string is represented as a 1-element byte array that contains just a null byte. So instead of reading the first byte of the string, bash reads this null byte.

Then, internally, the machinery behind the read builtin works well with null bytes; it keeps reading byte by byte until it finds the delimiter.
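The "first byte only" behavior is easy to observe with a multi-character delimiter argument. In this small sketch (the input string and variable names are arbitrary), only the x in 'xyz' acts as the delimiter; the y and z are ignored.

```bash
#!/usr/bin/env bash
# read -d takes only the first character of its argument, here 'x'.
first=$(printf 'fooxbarybaz' | { IFS= read -r -d 'xyz' v; printf '%s' "$v"; })
echo "$first"   # prints: foo
```

If bash treated the whole argument as the delimiter, read would run to end of input without finding it; instead it stops at the first x.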

Other shells behave differently. For example, ash and ksh ignore null bytes when they read input. With ksh, read -d "" reads until a newline. Shells are designed to cope well with text, not with binary data. Zsh is an exception: it uses a string representation that copes with arbitrary bytes, including null bytes; in zsh, $'\0' is a string of length 1 (but read -d '', oddly, behaves like read -d $'\0').

  • The behavior of read changed in bash 4.3 so that it now skips null bytes. For example read x< <(printf a\\0a) sets x to aa instead of a. – Lri Jun 09 '14 at 02:45