
Now we're all familiar with not using:

find . -print | xargs cmd

but using

find . -print0 | xargs -0 cmd

to cope with filenames containing, e.g., newlines. But what about this line I have in a script:

find $@ -type f -print | while read filename

Well, I assumed it would be something like:

find $@ -type f -print0 | while read -d"\0" filename

And if I'd simply done:

 find $@ -type f -print0 | while read filename

I'd be seeing the NULs?

But no: the while loop exits after zero iterations (in both cases). I assume that's because read returned a failure status, which I also assume is because it read a NUL (\0).
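The zero-iteration behaviour can be reproduced without find at all; in this sketch, printf stands in for find -print0 and the file names are made up:

```shell
#!/usr/bin/env bash
# bash cannot store NUL bytes in variables, so read silently drops
# them and keeps waiting for a newline.  A -print0 stream contains
# no newlines, so read hits EOF, returns non-zero, and the loop
# body never runs.
iterations=0
while read -r filename; do
    iterations=$((iterations + 1))
done < <(printf '%s\0' 'a.txt' 'b.txt')
echo "iterations: $iterations"
```

This prints `iterations: 0`, matching the observation above.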

It feels like bash's read should sport a "-0" option.

Have I misread what's happening or is there a different way to frame this?

For this example I may well have to recode it to use xargs, but that's a whole heap of new processes I didn't want to fork.

GraemeV

1 Answer


When using read, you can use just -d '' to read up to the next null character.

From the bash manual, regarding the read built-in utility:

-d delim
The first character of delim is used to terminate the input line, rather than newline. If delim is the empty string, read will terminate a line when it reads a NUL character.
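That "first character" rule can be checked directly; the delimiter string XY and the sample input below are arbitrary:

```shell
#!/usr/bin/env bash
# Only the first character of the -d argument is used as the
# delimiter, so -d 'XY' stops reading at the first 'X', not 'Y'.
IFS= read -r -d 'XY' part <<< 'fooXbarYbaz'
echo "$part"
```

This prints foo, since reading stopped at the first X.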

You probably also want to set IFS to an empty string to stop read from trimming leading and trailing whitespace from the data, and to use read with -r to correctly read strings containing backslashes. You also need to double-quote the expansion of $@ if you want your script or shell function to support search paths containing newlines, spaces, filename globbing characters, etc.:

find "$@" -type f -print0 |
while IFS= read -r -d '' pathname; do
    # use "$pathname" to do something
done
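To see the loop in action without creating any files, you can feed it NUL-terminated strings from printf; the names below are made up and stand in for find's output:

```shell
#!/usr/bin/env bash
# printf '%s\0' emits each argument followed by a NUL byte, just
# like find -print0 does for each pathname it matches.
count=0
while IFS= read -r -d '' pathname; do
    count=$((count + 1))
    printf 'got: %s\n' "$pathname"
done < <(printf '%s\0' $'with\nnewline.txt' 'with space.txt')
echo "count: $count"
```

Both names survive intact, including the one with an embedded newline, and the loop runs exactly twice.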

Personally, I would not pass pathnames out of find at all unless it's truly necessary, but would instead perform the needed operations via -exec, e.g.,

find "$@" -type f -exec sh -c '
    for pathname do
        # use "$pathname" to do something
    done' sh {} +
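As a rough check of the -exec variant, the sketch below builds a throwaway directory (the names are arbitrary) and prints each regular file it finds; {} + hands the pathnames to a single sh -c invocation in batches:

```shell
#!/usr/bin/env bash
# Build a small tree, then let find pass all pathnames to sh at once.
tmp=$(mktemp -d)
touch "$tmp/one.txt" "$tmp/two.txt" "$tmp/with space.txt"
n=$(find "$tmp" -type f -exec sh -c '
    for pathname do
        printf "%s\n" "$pathname"
    done' sh {} + | wc -l)
rm -rf "$tmp"
echo "files found: $n"
```

All three files, including the one with a space, are reported, and the inner loop ran inside one sh process rather than one per file.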


Kusalananda
  • @they Thanks for that; it works, as I guess you know.

    I was a little confused how it could work, so I grabbed a copy of the bash source:

    ~/src/bash/bash-5.1/builtins/read.def : Line 319

    case 'd': delim = *list_optarg; break;

    For the non-C folks out there: strings in C are NUL-terminated, so "foo" would be an array of char of length 4: 'f','o','o','\0'. If list_optarg held this, then *list_optarg would yield list_optarg[0], i.e. 'f'. If you passed '', the array would have length 1 and list_optarg[0] would be '\0' (a single char holding zero).

    – GraemeV Nov 15 '21 at 14:22
  • An earlier version of the first response used $'\0' (which would create 2 NULs and use the 1st one); see:

    https://stackoverflow.com/questions/55135775/meaning-of-read-and-d-and-0-in-bash?noredirect=1&lq=1

    I feel $'\0' aids readability ... it makes it clear you're looking for NULs

    – GraemeV Nov 15 '21 at 14:22
  • @GraemeV That was an error on my part. The use of $'\0' is nonsensical, as it is the same as '' in the bash shell. You can see that by using echo $'\0' | hexdump -C and comparing that with echo '' | hexdump -C. You will also notice that the result is the same as with $'\0hello'. This does not hold true for the zsh shell. In my view, it is clearer with -d '' as that is explicitly mentioned in the manual. – Kusalananda Nov 15 '21 at 14:27
  • BTW, the reason I use read rather than -exec or xargs: consider, this is recursing down a tree of a million files:

    -exec will fork off a million new processes, xargs will fork off several hundred thousand processes, whereas read will fork no additional processes

    – GraemeV Nov 15 '21 at 14:30
  • While both '' and $'\0' obviously work, the "verbose" form stands out (looks odd) and so people are likely to see it and pause for thought. Similarly, the code in bash would be clearer were it coded as:

    list_optarg[0]

    rather than

    *list_optarg

    It helps with self-documentation.

    – GraemeV Nov 15 '21 at 14:34
  • @GraemeV Look at my find command at the end of my answer. That will invoke sh -c for batches of as many pathnames as possible at once, not once per pathname. This is due to the {} + at the end (rather than {} \;). This is probably at least as good as xargs would do. For millions of files, the overhead of launching sh -c, even thousands of times, is negligible. Using {} + is supported by the POSIX standard. – Kusalananda Nov 15 '21 at 14:35
  • 1
    @GraemeV, I have the vague understanding that the fact read -d '' works like that is originally an accident due to the way C strings work and it wasn't even documented to begin with. (Well, the Bash 4.4 man page on my system doesn't document it.) – ilkkachu Nov 15 '21 at 19:12