
I wrote the following script to diff the corresponding files of two directories that contain the same set of files:

#!/bin/bash

for file in `find . -name "*.csv"`  
do
     echo "file = $file";
     diff $file /some/other/path/$file;
     read char;
done

I know there are other ways to achieve this. Curiously though, this script fails when the file names have spaces in them. How can I deal with this?

Example output of find:

./zQuery - abc - Do Not Prompt for Date.csv
Amir Afghani
    I disagree that this would be a duplicate. The accepted answer answers how to loop over filenames with spaces; that has nothing to do with "why is looping over find's output bad practise". I found this question (not the other) because I need to loop over filenames with spaces, as in: for file in $LIST_OF_FILES; do ... where $LIST_OF_FILES is not the output of find; it's just a list of filenames (separated by newlines). – Carlo Wood Feb 14 '18 at 01:58
    @CarloWood - file names can include newlines, so your question is rather unique: looping over a list of filenames that can contain spaces but not newlines. I think you're going to have to use the IFS technique, to indicate that the break occurs at '\n' – Diagon Nov 28 '18 at 05:52
  • @Diagon- woah, I never realized that file names are allowed to contain newlines. I use mostly (only) linux/UNIX and there even spaces are rare; I certainly never in my entire life saw newlines being used :p. They might as well forbid that imho. – Carlo Wood Nov 30 '18 at 20:16
  • @CarloWood - filenames end in a null ('\0', same as ''). Anything else is acceptable. – Diagon Jan 02 '19 at 23:00
  • @CarloWood You have to remember that people vote first and read second... – code_dredd Jan 29 '21 at 10:34
  • Oops! Just now, I would like to add my answer below, but I see that this question has been (mis-)closed. I am working on a similar problem as the OP and found a solution I want to share. My suggested approach involves rewriting the script without using find and for. That's why I mentioned that this is not a duplicate question. I cannot use the solutions provided below because I am adhering to POSIX, which means no read -d, and I am sourcing a file in the script, which means no spawning a subshell. Using find does not meet my requirements, as find cannot take find . -name '*/*.csv'. – midnite Dec 18 '23 at 14:54
  • find is not usable when it comes to */*. for is a pain in axxhxxx when the string contains $IFS. I forgo them both. It is important to understand where and how globbing occurs. Consider why $ ls */*.csv works in the shell. Globbing occurs right at the place where it is unquoted. $ ls */*.csv becomes $ ls 'dir name/file one.csv' 'dir name/file two.csv' 'dir name/file three.csv'. ls takes three arguments, with spaces, without any problems. for kinda overdo things, smashes this array of three arguments into one plain string, then re-splits it with $IFS. Very bad. – midnite Dec 18 '23 at 15:12
  • My solution: fname_pattern='*/*.csv' ; process() { while [ $# -gt 0 ]; do echo "file = $1" ; diff "$1" "/some/other/path/$1" ; read -r _ ; shift ; done ; } ; process $fname_pattern. Rewrite using a function. Call the function process with unquoted variable which let the globbing occurs. The function takes an array of (whatever number of) elements. Using while ... shift to process each element. Benefits: (1) No subshell, allows source files in my case; (2) Won't break for any weird characters; (3) Allows / in the pattern, which is not possible by find. – midnite Dec 18 '23 at 15:25
  • Benefit (4) POSIX compliant, as no read -d. Remark: If one wants to match pattern */with 3 spaces*.csv, one should properly quote it, fname_pattern="*/'with 3 spaces'*.csv". – midnite Dec 18 '23 at 15:52

9 Answers


Short answer (closest to your answer, but handles spaces)

OIFS="$IFS"
IFS=$'\n'
for file in `find . -type f -name "*.csv"`  
do
     echo "file = $file"
     diff "$file" "/some/other/path/$file"
     read line
done
IFS="$OIFS"

Better answer (also handles wildcards and newlines in file names)

find . -type f -name "*.csv" -print0 | while IFS= read -r -d '' file; do
    echo "file = $file"
    diff "$file" "/some/other/path/$file"
    read line </dev/tty
done

Best answer (based on Gilles' answer)

find . -type f -name '*.csv' -exec sh -c '
  file="$0"
  echo "$file"
  diff "$file" "/some/other/path/$file"
  read line </dev/tty
' exec-sh {} ';'

Or even better, to avoid running one sh per file:

find . -type f -name '*.csv' -exec sh -c '
  for file do
    echo "$file"
    diff "$file" "/some/other/path/$file"
    read line </dev/tty
  done
' exec-sh {} +

Long answer

You have three problems:

  1. By default, the shell splits the output of a command on spaces, tabs, and newlines
  2. Filenames could contain wildcard characters which would get expanded
  3. What if there is a directory whose name ends in .csv?

1. Splitting only on newlines

To figure out what to set file to, the shell has to take the output of find and interpret it somehow, otherwise file would just be the entire output of find.

The shell reads the IFS variable, which is set to <space><tab><newline> by default.

Then it looks at each character in the output of find. As soon as it sees any character that's in IFS, it thinks that marks the end of the file name, so it sets file to whatever characters it saw until now and runs the loop. Then it starts where it left off to get the next file name, and runs the next loop, etc., until it reaches the end of output.

So it's effectively doing this:

for file in "./zQuery" "-" "abc" ...

To tell it to only split the input on newlines, you need to do

IFS=$'\n'

before your for ... find command.

That sets IFS to a single newline, so it only splits on newlines, and not spaces and tabs as well.

If you are using sh or dash instead of ksh93, bash or zsh, you need to write IFS=$'\n' like this instead:

IFS='
'

That is probably enough to get your script working, but if you're interested to handle some other corner cases properly, read on...

2. Expanding $file without wildcards

Inside the loop where you do

diff $file /some/other/path/$file

the shell tries to expand $file (again!).

It could contain spaces, but since we already set IFS above, that won't be a problem here.

But it could also contain wildcard characters such as * or ?, which would lead to unpredictable behavior. (Thanks to Gilles for pointing this out.)

To tell the shell not to expand wildcard characters, put the variable inside double quotes, e.g.

diff "$file" "/some/other/path/$file"

The same problem could also bite us in

for file in `find . -name "*.csv"`

For example, if you had these three files

file1.csv
file2.csv
*.csv

(very unlikely, but still possible)

It would be as if you had run

for file in file1.csv file2.csv *.csv

which will get expanded to

for file in file1.csv file2.csv *.csv file1.csv file2.csv

causing file1.csv and file2.csv to be processed twice.
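You can reproduce the double processing in a throwaway directory (hypothetical file names, including a file literally named *.csv):

```shell
cd "$(mktemp -d)"
touch file1.csv file2.csv '*.csv'

count=0
for file in `find . -name "*.csv"`; do
    count=$((count + 1))
done
echo "$count"   # prints 5: the word ./*.csv re-expanded to all three files
```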

Instead, we have to do

find . -name "*.csv" -print | while IFS= read -r file; do
    echo "file = $file"
    diff "$file" "/some/other/path/$file"
    read line </dev/tty
done

read reads lines from standard input, splits the line into words according to IFS and stores them in the variable names that you specify.

Here, we're telling it not to split the line into words, and to store the line in $file.

Also note that read line has changed to read line </dev/tty.

This is because inside the loop, standard input is coming from find via the pipeline.

If we just did read, it would be consuming part or all of a file name, and some files would be skipped.

/dev/tty is the terminal where the user is running the script from. Note that this will cause an error if the script is run via cron, but I assume this is not important in this case.
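Here is the bug in miniature, with printf standing in for find; the inner bare read silently eats every second name:

```shell
printf '%s\n' a.csv b.csv c.csv |
while IFS= read -r file; do
    echo "processing $file"
    read answer     # meant for the user, but actually consumes b.csv
done
# prints:
#   processing a.csv
#   processing c.csv
```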

Then, what if a file name contains newlines?

We can handle that by changing -print to -print0 and using read -d '' on the end of a pipeline:

find . -name "*.csv" -print0 | while IFS= read -r -d '' file; do
    echo "file = $file"
    diff "$file" "/some/other/path/$file"
    read char </dev/tty
done

This makes find put a null byte at the end of each file name. Null bytes are the only characters not allowed in file names, so this should handle all possible file names, no matter how weird.

To get the file name on the other side, we use IFS= read -r -d ''.

Where we used read above, we used the default line delimiter of newline, but now, find is using null as the line delimiter. In bash, you can't pass a NUL character in an argument to a command (even builtin ones), but bash understands -d '' as meaning NUL-delimited. So we use -d '' to make read use the same line delimiter as find. Note that -d $'\0' incidentally works as well, because bash, which doesn't support NUL bytes in strings, treats it as the empty string.

To be correct, we also add -r, which says don't handle backslashes in file names specially. For example, without -r, \<newline> are removed, and \n is converted into n.
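A quick illustration of what -r prevents (using a made-up name):

```shell
printf 'back\\slash.csv\n' | { read file; echo "$file"; }
# prints: backslash.csv   (the backslash was treated as an escape)

printf 'back\\slash.csv\n' | { read -r file; echo "$file"; }
# prints: back\slash.csv  (taken literally)
```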

A more portable way of writing this that doesn't require bash or zsh or remembering all the above rules about null bytes (again, thanks to Gilles):

find . -name '*.csv' -exec sh -c '
  file="$0"
  echo "$file"
  diff "$file" "/some/other/path/$file"
  read char </dev/tty
' exec-sh {} ';'

3. Skipping directories whose names end in .csv

find . -name "*.csv"

will also match directories that are called something.csv.

To avoid this, add -type f to the find command.

find . -type f -name '*.csv' -exec sh -c '
  file="$0"
  echo "$file"
  diff "$file" "/some/other/path/$file"
  read line </dev/tty
' exec-sh {} ';'

As glenn jackman points out, in both of these examples, the commands to execute for each file are being run in a subshell, so if you change any variables inside the loop, they will be forgotten.

If you need to set variables and have them still set at the end of the loop, you can rewrite it to use process substitution like this:

i=0
while IFS= read -r -d '' file; do
    echo "file = $file"
    diff "$file" "/some/other/path/$file"
    read line </dev/tty
    i=$((i+1))
done < <(find . -type f -name '*.csv' -print0)
echo "$i files processed"

Note that if you try copying and pasting this at the command line, read line will consume the echo "$i files processed", so that command won't get run.

To avoid this, you could remove read line </dev/tty and send the result to a pager like less.


NOTES

I removed the semi-colons (;) inside the loop. You can put them back if you want, but they are not needed.

These days, $(command) is more common than `command`. This is mainly because it's easier to write $(command1 $(command2)) than `command1 \`command2\``.
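For example, both lines below print the same thing, but only the backquote version needs the backslashes:

```shell
echo `echo outer \`echo inner\``    # prints: outer inner
echo $(echo outer $(echo inner))    # prints: outer inner
```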

read char doesn't really read a character. It reads a whole line so I changed it to read line.

mivk
Mikel
    putting while in a pipeline can create issues with the subshell created (variables in the loop block not visible after the command completes for example). With bash, I would use input redirection and process substitution: while read -r -d $'\0' file; do ...; done < <(find ... -print0) – glenn jackman Mar 18 '11 at 01:23
  • Sure, or using a heredoc: while read; do; done <<EOF "$(find)" EOF. Not so easy to read however. – Mikel Mar 18 '11 at 01:41
  • @glenn jackman: I tried to add more explanation just now. Did I just make it better or worse? – Mikel Mar 18 '11 at 02:36
  • You don't need IFS, -print0, while and read if you handle find to its full, as shown below in my solution. – user unknown Mar 19 '11 at 23:10
    Your first solution will cope with any character except newline if you also turn off globbing with set -f. – Gilles 'SO- stop being evil' Apr 04 '11 at 19:28
  • Yes, but then we'd have to restore it at the end of the loop. The first solution was intended to be simple, so I'm reluctant to change it. Now you made this comment, at least it's on record. Thanks. :-) – Mikel Apr 04 '11 at 21:05
  • tldr; IFS=$'\n' – Ken Sharp Jan 08 '17 at 20:58
  • Thank you very much for IFS=$'\n' - this was crazy, handling a single file list (from file) with spaces in filenames in for/while was nearly impossible without it... – antivirtel Feb 27 '17 at 22:47
    the "best" answer is relative, and i would say whatever is most understandable/maintainable by the scripter. for me, that is a slight modification to the first one. rather than saving/restoring IFS, you can use a subshell: (IFS=$'\n'; for file in ... ) – Jayen Dec 16 '17 at 22:56
  • May I ask why not just temporarily setting $IFS='' to empty string and let the unquoted $pattern expand? pattern='*.csv' ; old_IFS=${IFS} ; IFS='' ; for f in ${pattern} ; do IFS=${old_IFS:-"${IFS}"} ; unset old_IFS ; : play around here ; done. I have tested (1) it works with dir and fname with spaces, (2) it works with globbing in both dir and fname, (3) works with dir and fname with globbing characters * too. It is kind of too good to be true. Are there any bugs that I have overlooked? @Jayen - In my case I cannot do it in a sub-shell. – midnite Jan 09 '24 at 13:47

This script fails if any file name contains spaces or the shell globbing characters \ [ ? *. The find command outputs one file name per line. Then the command substitution `find …` is evaluated by the shell as follows:

  1. Execute the find command, grab its output.
  2. Split the find output into separate words. Any whitespace character is a word separator.
  3. For each word, if it is a globbing pattern, expand it to the list of files it matches.

For example, suppose there are three files in the current directory, called foo* bar.csv, foo 1.txt and foo 2.txt.

  1. The find command returns ./foo* bar.csv.
  2. The shell splits this string at the space, producing two words: ./foo* and bar.csv.
  3. Since ./foo* contains a globbing metacharacter, it's expanded to the list of matching files: ./foo 1.txt and ./foo 2.txt.
  4. Therefore the for loop is executed successively with ./foo 1.txt, ./foo 2.txt and bar.csv.

You can avoid most problems at this stage by toning down word splitting and turning off globbing. To tone down word splitting, set the IFS variable to a single newline character; this way the output of find will only be split at newlines and spaces will remain. To turn off globbing, run set -f. Then this part of the code will work as long as no file name contains a newline character.

IFS='
'
set -f
for file in $(find . -name "*.csv"); do …

(This isn't part of your problem, but I recommend using $(…) over `…`. They have the same meaning, but the backquote version has weird quoting rules.)
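The effect of set -f is easy to check: with globbing turned off, a pattern word is passed through literally (a throwaway snippet; no matching files need to exist):

```shell
set -f
for word in ./foo*; do
    echo "$word"    # prints the literal ./foo*, no expansion
done
set +f
```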

There's another problem below: diff $file /some/other/path/$file should be

diff "$file" "/some/other/path/$file"

Otherwise, the value of $file is split into words and the words are treated as glob patterns, like with the command substitution above. If you must remember one thing about shell programming, remember this: always use double quotes around variable expansions ($foo) and command substitutions ($(bar)), unless you know you want to split. (Above, we knew we wanted to split the find output into lines.)

A reliable way of calling find is telling it to run a command for each file it finds:

find . -name '*.csv' -exec sh -c '
  echo "$0"
  diff "$0" "/some/other/path/$0"
' {} ';'

In this case, another approach is to compare the two directories, though you have to explicitly exclude all the “boring” files.

diff -r -x '*.txt' -x '*.ods' -x '*.pdf' … . /some/other/path
  • I'd forgotten about wildcards as another reason to quote properly. Thanks! :-) – Mikel Mar 18 '11 at 02:34
  • instead of find -exec sh -c 'cmd 1; cmd 2' ";", you should use find -exec cmd 1 {} ";" -exec cmd 2 {} ";", because the shell needs to mask the parameters, but find doesn't. In the special case here, echo "$0" doesn't need to be a part of the script, just append -print after the ';'. You didn't include a question to proceed, but even that can be done by find, as shown below in my soulution. ;) – user unknown Mar 19 '11 at 23:25
    @userunknown: The use of {} as a substring of a parameter in find -exec is not portable, that's why the shell is needed. I don't understand what you mean by “the shell needs to mask the parameters”; if it's about quoting, my solution is properly quoted. You're right that the echo part could be performed by -print instead. -okdir is a fairly recent GNU find extension, it's not available everywhere. I didn't include the wait to proceed because I consider that extremely poor UI and the asker can easily put read in the shell snippet if he wants. – Gilles 'SO- stop being evil' Mar 19 '11 at 23:59
  • Quoting is a form of masking, isn't it? I don't understand your remark about what is portable, and what not. Your example (2nd from bottom) uses -exec to invoke sh and uses {} - so where is my example (beside -okdir) less portable? find . -name "*.csv" -exec diff {} /some/other/path/{} ";" -print – user unknown Mar 20 '11 at 01:05
    “Masking” isn't common terminology in shell literature, so you'll have to explain what you mean if you want to be understood. My example uses {} only once and in a separate argument; other cases (used twice or as a substring) are not portable. “Portable” means that it'll work on all unix systems; a good guideline is the POSIX/Single Unix specification. – Gilles 'SO- stop being evil' Mar 20 '11 at 01:15
  • May I ask why not just temporarily setting $IFS='' to empty string and let the unquoted $pattern expand? pattern='*.csv' ; old_IFS=${IFS} ; IFS='' ; for f in ${pattern} ; do IFS=${old_IFS:-"${IFS}"} ; unset old_IFS ; : play around here ; done. I have tested (1) it works with dir and fname with spaces, (2) it works with globbing in both dir and fname, (3) works with dir and fname with globbing characters * too. Is there any bug that I have overlooked? – midnite Jan 09 '24 at 16:51
  • @midnite If you have a pattern in a variable, you don't need to do anything with IFS: pattern='*.csv'; for f in $pattern; do process -- "$f"; done works fine. But that's not related to the question which is about find output. – Gilles 'SO- stop being evil' Jan 09 '24 at 18:59
  • Thanks @Gilles'SO-stopbeingevil' - If pattern='my file *.csv' contains spaces, we need to unset IFS. OP's question is to loop through all *.csv files with spaces. OP attempted to solve it by using find, but I do not think find is doing any good here. It makes things more complicated which is not necessary. – midnite Jan 09 '24 at 19:06
  • @midnite If you want to have spaces inside the pattern, the natural way to do it is pattern='my\ file\ *.csv'. The value of pattern here is a space (more generally IFS) separated list. You can avoid quoting all non-wildcard characters by unsetting IFS, but that just makes using the variable more complicated. But anyway, none of that helps with the question which is about generating a list of files with find. find is recursive and allows more ways to select files. The shell pattern *.csv is only equivalent in the special case when there are no subdirectories containing .csv files. – Gilles 'SO- stop being evil' Jan 09 '24 at 19:18
  • Thanks @Gilles'SO-stopbeingevil' - You gave a very good point that find will get files in subdirectories also. I need to glob expand and loop through file names with spaces also. I just find the solution is very simple: IFS='' and unquote. It is kind of "too good to be true", as others are using find, arrays, etc. My code is further illustrated here: https://unix.stackexchange.com/a/766527/150246 . Are there any faults or bugs that I have overlooked? – midnite Jan 09 '24 at 19:48

I'm surprised not to see readarray mentioned. It makes this very easy when used in combination with the <<< operator:

$ touch oneword "two words"

$ readarray -t files <<<"$(ls)"

$ for file in "${files[@]}"; do echo "|$file|"; done
|oneword|
|two words|

Using the <<<"$expansion" construct also allows you to split variables containing newlines into arrays, like:

$ string=$(dmesg)
$ readarray -t lines <<<"$string"
$ echo "${lines[0]}"
[    0.000000] Initializing cgroup subsys cpuset

readarray has been in Bash for years now, so this should probably be the canonical way to do this in Bash.


I'm surprised nobody mentioned the obvious zsh solution here yet:

for file (**/*.csv(ND.)) {
  do-something-with $file
}

((D) to also include hidden files, (N) to avoid the error if there's no match, (.) to restrict to regular files.)

bash4.3 and above now supports it partially as well:

shopt -s globstar nullglob dotglob
for file in **/*.csv; do
  [ -f "$file" ] || continue
  [ -L "$file" ] && continue
  do-something-with "$file"
done

AFAIK, find has all you need.

find . -okdir diff {} /some/other/path/{} ";"

find itself takes care of calling the program safely. -okdir will prompt you before running the diff (are you sure? yes/no).

No shell involved, no globbing, jokers, pi, pa, po.

As a sidenote: If you combine find with for/while/do/xargs, in most cases, you're doing it wrong. :)

user unknown
  • Thanks for the answer. Why are you doing it wrong if you combine find with for/while/do/xargs? – Amir Afghani Mar 18 '11 at 14:56
    Find already iterates over a subset of files. Most people who show up with questions could just use one of the actions (-ok(dir), -exec(dir), -delete) in combination with ";" or + (the latter for parallel invocation). The main reason to do so is that you don't have to fiddle around with file parameters, masking them for the shell. Not that important: you don't spawn new processes all the time, less memory, more speed, shorter program. – user unknown Mar 18 '11 at 21:05
  • Not here to crush your spirit, but compare: time find -type f -exec cat "{}" \; with time find -type f -print0 | xargs -0 -I stuff cat stuff. The xargs version was faster by 11 seconds when processing 10000 empty files. Be careful when asserting that in most cases combining find with other utilities is wrong. -print0 and -0 are there to deal with spaces in the file names by using a zero byte as the item separator rather than a space. – Jonathan Komar Jul 05 '17 at 11:00
  • @JonathanKomar: Your find/exec command took 11.7 s on my system with 10.000 files, the xargs version 9.7 s, time find -type f -exec cat {} + as suggested in my previous comment took 0.1 s. Note the subtle difference between "it is wrong" and "you're doing it wrong", especially when decorated with a smilie. Did you, for instance, do it wrong? ;) BTW, spaces in the filename are no problem for the above command and find in general. Cargo cult programmer? And by the way, combining find with other tools is fine, just xargs is most of the time superfluous. – user unknown Jul 05 '17 at 12:48
    @userunknown I explained how my code deals with spaces for posterity (education of future viewers), and was not implying that your code does not. The + for parallel calls is very fast, as you mentioned. I would not say cargo cult programmer, because this ability to use xargs in this way comes in handy on numerous occasions. I agree more with the Unix philosophy: do one thing and do it well (use programs separately or in combination to get a job done). find is walking a fine line there. – Jonathan Komar Jul 06 '17 at 07:21

Loop through any files (any special character included) with the completely safe find (see the link for documentation):

exec 9< <( find "$absolute_dir_path" -type f -print0 )
while IFS= read -r -d '' -u 9
do
    file_path="$(readlink -fn -- "$REPLY"; echo x)"
    file_path="${file_path%x}"
    echo "START${file_path}END"
done
l0b0

File names with spaces in them look like multiple names on the command line if they're not quoted. If your file is named "Hello World.txt", the diff line expands to:

diff Hello World.txt /some/other/path/Hello World.txt

which looks like four file names. Just put quotes around the arguments:

diff "$file" "/some/other/path/$file"
  • This helps but it doesn't solve my problem. I still see cases where the file is being split up into multiple tokens. – Amir Afghani Mar 18 '11 at 00:37
  • This answer is misleading. The problem is the for file in `find . -name "*.csv"` command. If there is a file called Hello World.csv, file will be set to ./Hello and then to World.csv. Quoting $file won't help. – G-Man Says 'Reinstate Monica' Mar 04 '15 at 19:11

With bash4, you can also use the builtin mapfile function to fill an array with each line of output and iterate over this array.

$ tree 
.
├── a
│   ├── a 1
│   └── a 2
├── b
│   ├── b 1
│   └── b 2
└── c
    ├── c 1
    └── c 2

3 directories, 6 files
$ mapfile -t files < <(find -type f)
$ for file in "${files[@]}"; do
> echo "file: $file"
> done
file: ./a/a 2
file: ./a/a 1
file: ./b/b 2
file: ./b/b 1
file: ./c/c 2
file: ./c/c 1
jfgiraud

Double quoting is your friend.

diff "$file" "/some/other/path/$file"

Otherwise the variable's contents get word-split.
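The difference is easy to see with printf, which formats each argument it receives separately (made-up file name):

```shell
file='Hello World.txt'
printf '<%s>\n' $file      # two arguments: <Hello> and <World.txt>
printf '<%s>\n' "$file"    # one argument:  <Hello World.txt>
```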

geekosaur
    This is misleading. The problem is the for file in `find . -name "*.csv"` command. If there is a file called Hello World.csv, file will be set to ./Hello and then to World.csv. Quoting $file won't help. – G-Man Says 'Reinstate Monica' Mar 04 '15 at 19:11