Short answer (closest to your answer, but handles spaces)
OIFS="$IFS"
IFS=$'\n'
for file in `find . -type f -name "*.csv"`
do
echo "file = $file"
diff "$file" "/some/other/path/$file"
read line
done
IFS="$OIFS"
Better answer (also handles wildcards and newlines in file names)
find . -type f -name "*.csv" -print0 | while IFS= read -r -d '' file; do
echo "file = $file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
done
Best answer (based on Gilles' answer)
find . -type f -name '*.csv' -exec sh -c '
file="$1"
echo "$file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
' exec-sh {} ';'
Or even better, to avoid running one sh per file:
find . -type f -name '*.csv' -exec sh -c '
for file do
echo "$file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
done
' exec-sh {} +
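To make the difference between the two -exec forms concrete, here is a minimal sketch that counts how many times sh gets started. The scratch directory and the three file names are made up for illustration, and it assumes the whole list fits into one batch.

```shell
# Hypothetical scratch directory with three CSV files (names are illustrative).
dir=$(mktemp -d)
touch "$dir/a.csv" "$dir/b.csv" "$dir/c d.csv"

# With ';' find starts one sh per file, so "started" is printed once per file.
per_file=$(find "$dir" -type f -name '*.csv' -exec sh -c 'echo started' exec-sh {} ';' | wc -l | tr -d ' ')

# With '+' find hands as many file names as fit to a single sh.
batched=$(find "$dir" -type f -name '*.csv' -exec sh -c 'echo started' exec-sh {} + | wc -l | tr -d ' ')

echo "sh started $per_file times with ';' and $batched time(s) with '+'"
rm -rf "$dir"
```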
Long answer
You have three problems:
- By default, the shell splits the output of a command on spaces, tabs, and newlines
- Filenames could contain wildcard characters which would get expanded
- What if there is a directory whose name matches *.csv?
1. Splitting only on newlines
To figure out what to set file to, the shell has to take the output of find and interpret it somehow, otherwise file would just be the entire output of find.
The shell reads the IFS variable, which is set to <space><tab><newline> by default.
Then it looks at each character in the output of find. As soon as it sees any character that's in IFS, it thinks that marks the end of the file name, so it sets file to whatever characters it saw until now and runs the loop. Then it starts where it left off to get the next file name, and runs the next loop, etc., until it reaches the end of output.
So it's effectively doing this:
for file in "zquery" "-" "abc" ...
To tell it to only split the input on newlines, you need to do
IFS=$'\n'
before your for ... find command.
That sets IFS to a single newline, so it only splits on newlines, and not spaces and tabs as well.
If you are using sh or dash instead of ksh93, bash or zsh, you need to write IFS=$'\n' like this instead:
IFS='
'
That is probably enough to get your script working, but if you're interested in handling some other corner cases properly, read on...
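As a rough illustration of the splitting described in this section, the sketch below counts the words the shell produces from the same text under the default IFS and under a newline-only IFS. The listing and the count_words helper are invented for the example, and it assumes IFS starts out at its default value.

```shell
# A hypothetical two-line listing, shaped like what find might print.
listing='./my file.csv
./other.csv'

# Illustrative helper: reports how many words it was given.
count_words() { echo "$#"; }

# Default IFS (space, tab, newline): "./my file.csv" is cut in two.
default_split=$(count_words $listing)

# Newline-only IFS, written the portable way: one word per line.
OIFS="$IFS"
IFS='
'
newline_split=$(count_words $listing)
IFS="$OIFS"

echo "default IFS: $default_split words, newline IFS: $newline_split words"
```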
2. Expanding $file without wildcards
Inside the loop where you do
diff $file /some/other/path/$file
the shell tries to expand $file (again!).
It could contain spaces, but since we already set IFS above, that won't be a problem here.
But it could also contain wildcard characters such as * or ?, which would lead to unpredictable behavior. (Thanks to Gilles for pointing this out.)
To tell the shell not to expand wildcard characters, put the variable inside double quotes, e.g.
diff "$file" "/some/other/path/$file"
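A small sketch of what the quotes change, assuming a scratch directory whose path contains no spaces or wildcard characters. The file names and the count_args helper are hypothetical.

```shell
# Hypothetical scratch directory; one file is literally named "*.csv".
dir=$(mktemp -d)
touch "$dir/file1.csv" "$dir/file2.csv" "$dir/*.csv"

# Illustrative helper: reports how many arguments it was given.
count_args() { echo "$#"; }

file="$dir/*.csv"

# Unquoted: the shell treats the value as a pattern and expands it.
unquoted=$(count_args $file)

# Quoted: the value is passed through as one literal argument.
quoted=$(count_args "$file")

echo "unquoted: $unquoted arguments, quoted: $quoted argument"
rm -rf "$dir"
```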
The same problem could also bite us in
for file in `find . -name "*.csv"`
For example, if you had these three files
file1.csv
file2.csv
*.csv
(very unlikely, but still possible)
It would be as if you had run
for file in file1.csv file2.csv *.csv
which will get expanded to
for file in file1.csv file2.csv *.csv file1.csv file2.csv
causing file1.csv and file2.csv to be processed twice.
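The double processing can be reproduced with the three files from the example. This is only a sketch: the scratch directory is hypothetical, and the second count uses a read loop on a pipe as one way of avoiding the expansion.

```shell
# Hypothetical scratch directory holding the three files from the example.
dir=$(mktemp -d)
touch "$dir/file1.csv" "$dir/file2.csv" "$dir/*.csv"

# Unquoted command substitution: the word ./*.csv is expanded again by the
# shell, so the loop body runs five times for three files.
looped=$(cd "$dir" && for file in $(find . -name '*.csv'); do echo x; done | wc -l | tr -d ' ')

# Reading the names from a pipe involves no expansion: three iterations.
piped=$(cd "$dir" && find . -name '*.csv' -print | while IFS= read -r file; do echo x; done | wc -l | tr -d ' ')

echo "for loop iterations: $looped, read loop iterations: $piped"
rm -rf "$dir"
```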
Instead, we have to do
find . -name "*.csv" -print | while IFS= read -r file; do
echo "file = $file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
done
read reads lines from standard input, splits the line into words according to IFS and stores them in the variable names that you specify.
Here, we're telling it not to split the line into words, and to store the line in $file.
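The effect of the empty IFS is easiest to see with a name that has leading and trailing spaces. A minimal sketch with a made-up name; the brackets are only there to make the whitespace visible.

```shell
# With the default IFS, read trims leading and trailing whitespace.
stripped=$(printf '  spaced name.csv  \n' | { read -r name; printf '[%s]' "$name"; })

# With IFS set to empty for this one command, the line is stored untouched.
kept=$(printf '  spaced name.csv  \n' | { IFS= read -r name; printf '[%s]' "$name"; })

echo "default IFS: $stripped"
echo "empty IFS:   $kept"
```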
Also note that read line has changed to read line </dev/tty.
This is because inside the loop, standard input is coming from find via the pipeline.
If we just did read, it would be consuming part or all of a file name, and some files would be skipped.
/dev/tty is the terminal where the user is running the script from. Note that this will cause an error if the script is run via cron, but I assume this is not important in this case.
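Here is a sketch of the skipped-files effect. The four names are invented and stand in for the output of find, and /dev/null stands in for /dev/tty only so that the example runs without a terminal.

```shell
# Four made-up names stand in for the output of find.
names() { printf '%s\n' a.csv b.csv c.csv d.csv; }

# The inner read shares the pipe with the loop, so it swallows every
# second name.
stolen=$(names | while IFS= read -r file; do
  printf '%s,' "$file"
  read -r line
done)

# Redirecting the inner read leaves the pipe to the loop alone.
intact=$(names | while IFS= read -r file; do
  printf '%s,' "$file"
  read -r line </dev/null || true   # end of file on /dev/null is expected
done)

echo "shared stdin:     $stolen"
echo "redirected stdin: $intact"
```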
Then, what if a file name contains newlines?
We can handle that by changing -print to -print0 and using read -d '' on the end of a pipeline:
find . -name "*.csv" -print0 | while IFS= read -r -d '' file; do
echo "file = $file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
done
This makes find put a null byte at the end of each file name. Null bytes are the only characters not allowed in file names, so this should handle all possible file names, no matter how weird.
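To see why the terminator matters, the sketch below creates one file with a newline in its name and counts terminators in the output of find. It assumes a find that supports -print0 (GNU and BSD find do), and the file name is hypothetical.

```shell
# Hypothetical scratch directory with ONE file whose name contains a newline.
dir=$(mktemp -d)
touch "$dir/first
second.csv"

# -print ends each name with a newline, so this single file spans two lines.
lines=$(find "$dir" -name '*.csv' -print | wc -l | tr -d ' ')

# -print0 ends each name with a NUL byte; keeping only the NULs and counting
# them gives the real number of files.
nuls=$(find "$dir" -name '*.csv' -print0 | tr -cd '\000' | wc -c | tr -d ' ')

echo "lines from -print: $lines, names from -print0: $nuls"
rm -rf "$dir"
```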
To get the file name on the other side, we use IFS= read -r -d ''.
Where we used read above, we used the default line delimiter of newline, but now, find is using null as the line delimiter. In bash, you can't pass a NUL character in an argument to a command (even builtin ones), but bash understands -d '' as meaning NUL delimited. So we use -d '' to make read use the same line delimiter as find. Note that -d $'\0', incidentally, works as well, because bash not supporting NUL bytes treats it as the empty string.
To be correct, we also add -r, which says don't handle backslashes in file names specially. For example, without -r, a backslash followed by a newline (\<newline>) is removed, and \n is converted into n.
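A quick sketch of the backslash handling, using a made-up name that contains a literal backslash followed by n:

```shell
# A made-up file name containing a literal backslash followed by "n".
name='back\nslash.csv'

# Without -r, read treats the backslash as an escape character and drops it.
without_r=$(printf '%s\n' "$name" | { IFS= read file; printf '%s' "$file"; })

# With -r, the backslash is kept as an ordinary character.
with_r=$(printf '%s\n' "$name" | { IFS= read -r file; printf '%s' "$file"; })

printf 'without -r: %s\n' "$without_r"
printf 'with -r:    %s\n' "$with_r"
```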
A more portable way of writing this that doesn't require bash or zsh or remembering all the above rules about null bytes (again, thanks to Gilles):
find . -name '*.csv' -exec sh -c '
file="$1"
echo "$file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
' exec-sh {} ';'
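When sh -c is given extra words after the script, the first one becomes $0 (the name the shell uses for itself in error messages) and the following ones become $1, $2 and so on. That is why a placeholder such as exec-sh comes before {}. A minimal sketch, with a made-up file name standing in for what find would substitute for {}:

```shell
# The first word after the script is $0, the second is $1.
out=$(sh -c 'printf "%s|%s" "$0" "$1"' exec-sh 'my file.csv')

echo "$out"
```

The file name arrives as a single argument with its space intact, because find passes it directly to sh without any word splitting.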
3. Skipping directories whose names end in .csv
find . -name "*.csv" will also match directories that are called something.csv.
To avoid this, add -type f to the find command.
find . -type f -name '*.csv' -exec sh -c '
file="$1"
echo "$file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
' exec-sh {} ';'
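The effect of -type f can be checked on a scratch directory that holds one regular file and one directory, both ending in .csv. The names below are hypothetical.

```shell
# Hypothetical scratch directory: one directory and one regular file,
# both with names ending in .csv.
dir=$(mktemp -d)
mkdir "$dir/archive.csv"
touch "$dir/data.csv"

# Without -type f, the directory is matched as well.
all=$(find "$dir" -name '*.csv' | wc -l | tr -d ' ')

# With -type f, only the regular file is left.
files_only=$(find "$dir" -type f -name '*.csv' | wc -l | tr -d ' ')

echo "all matches: $all, regular files only: $files_only"
rm -rf "$dir"
```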
As glenn jackman points out, in both of these examples, the commands to execute for each file are being run in a subshell, so if you change any variables inside the loop, they will be forgotten.
If you need to set variables and have them still set at the end of the loop, you can rewrite it to use process substitution like this:
i=0
while IFS= read -r -d '' file; do
echo "file = $file"
diff "$file" "/some/other/path/$file"
read line </dev/tty
i=$((i+1))
done < <(find . -type f -name '*.csv' -print0)
echo "$i files processed"
Note that if you try copying and pasting this at the command line, read line will consume the echo "$i files processed", so that command won't get run.
To avoid this, you could remove read line </dev/tty and send the result to a pager like less.
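The lost variable is easy to reproduce without find. In the sketch below, three made-up names are counted twice: once through a pipe, and once through a redirect from a temporary file. The temporary file is a swapped-in, portable stand-in for process substitution, not the method shown above, and it only works for names without newlines. The pipe result assumes bash or dash; ksh93 and zsh run the last stage of a pipeline in the current shell and keep the value.

```shell
# Counting through a pipe: in bash and dash the loop runs in a subshell,
# so the increments are lost when the pipeline ends.
count=0
printf '%s\n' a.csv b.csv c.csv | while IFS= read -r file; do
  count=$((count + 1))
done
after_pipe=$count

# Counting through a redirect: the loop runs in the current shell.
# (Hypothetical temp file used instead of process substitution.)
list=$(mktemp)
printf '%s\n' a.csv b.csv c.csv >"$list"
count=0
while IFS= read -r file; do
  count=$((count + 1))
done <"$list"
after_redirect=$count
rm -f "$list"

echo "after pipe: $after_pipe, after redirect: $after_redirect"
```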
NOTES
I removed the semi-colons (;) inside the loop. You can put them back if you want, but they are not needed.
These days, $(command) is more common than `command`. This is mainly because it's easier to write $(command1 $(command2)) than `command1 \`command2\``.
read char doesn't really read a character. It reads a whole line so I changed it to read line.
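On the note about $(command) versus backticks, here is the same nested substitution written both ways. The path is the placeholder used throughout this answer, and basename and dirname are just convenient commands to nest.

```shell
# Nested $(...) needs no escaping.
nested_dollar=$(basename "$(dirname /some/other/path/file.csv)")

# Nested backticks need the inner pair escaped with backslashes.
nested_backtick=`basename \`dirname /some/other/path/file.csv\``

echo "$nested_dollar $nested_backtick"
```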
find and for. That's why I mentioned that this is not a duplicate question. I cannot use the solutions provided below because I am adhering to POSIX, which means no read -d, and I am sourcing a file in the script, which means no spawning a subshell. Using find does not meet my requirements, as find cannot take find . -name '*/*.csv'. – midnite Dec 18 '23 at 14:54
find is not usable when it comes to */*. for is a pain in axxhxxx when the string contains $IFS. I forgo them both. It is important to understand where and how globbing occurs. Consider why $ ls */*.csv works in the shell. Globbing occurs right at the place where it is unquoted. $ ls */*.csv becomes $ ls 'dir name/file one.csv' 'dir name/file two.csv' 'dir name/file three.csv'. ls takes three arguments, with spaces, without any problems. for kinda overdo things, smashes this array of three arguments into one plain string, then re-splits it with $IFS. Very bad. – midnite Dec 18 '23 at 15:12
fname_pattern='*/*.csv' ; process() { while [ $# -gt 0 ]; do echo "file = $1" ; diff "$1" "/some/other/path/$1" ; read -r _ ; shift ; done ; } ; process $fname_pattern. Rewrite using a function. Call the function process with unquoted variable which let the globbing occurs. The function takes an array of (whatever number of) elements. Using while ... shift to process each element. Benefits: (1) No subshell, allows source files in my case; (2) Won't break for any weird characters; (3) Allows / in the pattern, which is not possible by find. – midnite Dec 18 '23 at 15:25
read -d. Remark: If one wants to match pattern */with 3 spaces*.csv, one should properly quote it, fname_pattern="*/'with 3 spaces'*.csv". – midnite Dec 18 '23 at 15:52