1

I have come across some odd behaviour while using the find command that I cannot find an explanation for.

I have a .txt file with 1 filename per line and I am using the find command to recursively search through my database for that file. When I use the command like this:

for filename in `cat filelist.csv`; do
find /location*/time*/ -name *${filename}*txt
done

I get the expected output of 1 output per line. However when I then use the same command but set the output as a variable (which I ultimately need to do):

for filename in `cat filelist.csv`; do
out=`find /location*/time*/ -name *${filename}*txt`
echo ${out}
done

The find command seems to print all the matching files in a folder on the same line. I have 2 questions:

  1. What is causing this behaviour?
  2. How can I get find to output each matching file (even if there are many matching files in a folder) to a new line as a variable?

Cheers!

AdminBee
Isaac
  • 1
    It's not the find command - it's the shell's expansion of the unquoted ${out}. See for example New line in bash variables – steeldriver Sep 07 '21 at 15:02
  • 1
    and the glob on find's command line is not quoted, so the shell will expand it before find, if there are any matching files in the current directory. Also the unquoted expansions will break if there are any filenames with whitespace or embedded glob characters... – ilkkachu Sep 07 '21 at 15:19
  • This question seems relevant, close enough to perhaps be useful to you but probably not close enough to be a dupe: find: How to efficiently search for big list of filenames. In short, either use an array to construct a list of -name args for find, or pipe find's output into grep (use find -print0 and grep -z for NUL separated filenames) – cas Sep 07 '21 at 17:37

3 Answers

1

This happens because the shell word-splits an unquoted variable expansion: the newlines in the value act as separators, and echo then prints the resulting words joined by single spaces. So if your out variable contains newlines, ${out} effectively turns them into spaces, whereas "${out}" preserves them.
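A minimal sketch of the difference, using a hard-coded two-line value in place of real find output (the paths are made up for illustration, and the behaviour shown assumes a shell such as bash or dash rather than zsh):

```shell
# Hypothetical value standing in for multi-line find output.
out='/location1/time1/a.txt
/location1/time1/b.txt'

# Unquoted: the shell splits the value on whitespace (including newlines)
# and echo joins the resulting words with single spaces.
unquoted=$(echo ${out})

# Quoted: the value is passed to echo as one argument, newlines intact.
quoted=$(echo "${out}")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

The unquoted form also undergoes globbing, so a value containing * or ? could be altered further.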

raj
0

If your filelist.csv contained exact matches for the files, you could just use something like find ... -print0 | grep -z -F -f filelist.csv | xargs -0r...but it seems you want to match partial filenames listed in that file (with any characters before the filename and ".txt" appended). To do that, the easiest way is to use a regular expression.

You can use process substitution with sed to transform the partial filenames in filelist.csv into appropriate regular expressions as the file is being read by grep.

BTW, unless you use sed's -i option (don't do that for this particular task), this transformation is not permanent: it won't affect the original filelist.csv file, only the stream of text being fed into grep -f.

Alternatively, you could pipe the output of find . -name '*.txt' into grep. That way, the input that grep sees is already filtered for filenames ending with .txt, so sed isn't needed to modify the regular expressions.

Anyway, try something like this:

First, some setup stuff for this experiment:

$ cat filelist.csv 
test
foo

$ touch test test.txt foo foo.txt footest footest.txt

$ ls -l
total 4
-rw-r--r-- 1 cas cas 10 Sep 8 04:01 filelist.csv
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 foo
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 footest
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 footest.txt
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 foo.txt
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 test
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 test.txt

Then use the bash built-in mapfile to populate an array called out with the output of find and grep:

$ mapfile -d '' out < \
    <(find . -type f -print0 |
        grep -z -f <(sed -e 's/^\(.*\)/.*\1\\.txt$/' filelist.csv))

or:

$ mapfile -d '' out < \
    <(find . -type f -name '*.txt' -print0 |
        grep -z -f filelist.csv )

The results:

$ declare -p out
declare -a out=([0]="./foo.txt" [1]="./footest.txt" [2]="./test.txt")

$ ls -l "${out[@]}"
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 ./footest.txt
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 ./foo.txt
-rw-r--r-- 1 cas cas 0 Sep 8 04:01 ./test.txt

Note how the out array only contains foo.txt, footest.txt, and test.txt, but not foo or test or footest.

BTW, you can iterate over the filenames in $out with something like:

for f in "${out[@]}"; do
  echo "$f"
  do-something-else-with "$f"
done

or, to iterate over the indices of the array (0, 1, 2) rather than the values (sometimes that is more useful, e.g. when you have two or more arrays with the same indices that you want to use together in some way, or when you need to use the index for some other purpose):

for i in "${!out[@]}"; do
   echo "${out[$i]}"
done 
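
As a rough illustration of the "two arrays with the same indices" case, here is a small bash sketch. The names and sizes arrays and their contents are hypothetical sample data, not something from the question:

```shell
#!/usr/bin/env bash
# Two parallel arrays sharing the same indices (hypothetical sample data).
names=("foo.txt" "footest.txt" "test.txt")
sizes=(0 12 7)

report=()
for i in "${!names[@]}"; do
  # The same index selects the matching element from each array.
  report+=("${names[$i]}: ${sizes[$i]} bytes")
done

# Print one report entry per line.
printf '%s\n' "${report[@]}"
```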

Remember:

  1. double-quote your variables (i.e. type "$var", not just $var) when you don't want shell to word-split them or expand globs or act on shell metacharacters like ; or & that may be in them. This is almost always. Rule of thumb: if you don't know exactly WHY you need to use a variable without double-quoting it in any particular situation, then double-quote it. Not quoting $out or $filename was the proximate cause of your initial problem.

  2. never assume filenames won't have annoying characters like spaces and newlines in them - these are perfectly valid characters for filenames in unix, so your scripts will have to deal with them. In fact, the only character which can't be in a path/filename is a NUL.

  3. always use NUL as the separator between arbitrary or unknown filenames. It's the only separator which will work with any filename.

  4. there are lots of exceptions but: most of the time, you should use an array when you want a variable to hold multiple values, not a white-space separated string or similar method of "faking/emulating an array". Especially when those values are filenames or other things where your separator is a valid character in one of the values.
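
To make points 2 to 4 concrete, here is a sketch of reading NUL-separated find output into an array with a while/read loop, which may be useful where mapfile -d is unavailable (it needs bash 4.4 or later). The scratch directory and the file names in it are invented purely for the demonstration:

```shell
#!/usr/bin/env bash
# Scratch directory with awkward names (space, newline), for illustration only.
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/with space.txt" "$dir/with"$'\n'"newline.txt"

files=()
# -print0 ends each name with NUL; read -d '' consumes one NUL-terminated record.
# IFS= and -r stop read from trimming whitespace or interpreting backslashes.
while IFS= read -r -d '' f; do
  files+=("$f")
done < <(find "$dir" -type f -name '*.txt' -print0)

printf 'found %d files\n' "${#files[@]}"
for f in "${files[@]}"; do
  printf '%q\n' "$f"   # %q prints each name in shell-quoted form, one per line
done

# Clean up the scratch directory.
rm -rf "$dir"
```

Each array element holds exactly one filename, even the one with an embedded newline, which a newline-separated list could not represent.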

cas
0

To find all the files whose name ends in txt and contains any of the contents of the lines in filelist.csv, in the zsh shell you could do:

print -rC1 -- **/*(${(j[|])~${(fb)"$(<filelist.csv)"}})*txt(ND)

Or broken down one step at a time:

csv_contents=$(<filelist.csv)
non_empty_lines_of_csv=(${(f)csv_contents})
lines_with_wildcards_escaped=(${(b)non_empty_lines_of_csv})
ored_patterns=${(j[|])lines_with_wildcards_escaped}
filename_pattern="*($ored_patterns)*txt"

print -rC1 -- **/$~filename_pattern(ND)

If filelist.csv contained:

???
foo bar
baz

That would eventually expand a recursive glob such as:

**/*(\?\?\?|foo bar|baz)*txt(ND)

As to the problems with your:

for filename in `cat filelist.csv`; do
out=`find /location*/time*/ -name *${filename}*txt`
echo ${out}
done

  • for var in `cmd` doesn't loop over the lines of the output of cmd. `cmd` captures the output of cmd, removes the trailing newline characters, then performs split+glob on it (split only in zsh): it splits on the characters of $IFS (by default space, tab and newline) and then expands globs in the resulting words. So if cmd output a* b*, it would not loop once with var being a* b*; it would loop over the filenames in the current directory starting with a and then those starting with b.
  • in -name *${filename}*txt, as those * and ${filename} are not quoted, shell globbing happens again. So if ${filename} is abc, *abc*txt will be expanded by the shell to the list of matching files in the current directory, if there are any. If there's a file called xabcytxt in the current directory, that becomes -name xabcytxt. Also, even if you corrected it to -name "*${filename}*txt", if $filename was *, that would become -name '***txt' and find would return all filenames ending in txt, not just those also containing a *.
  • in echo ${out}, $out being unquoted again means split+glob (except in zsh). Also, echo should be avoided for outputting arbitrary data, as its handling of a variety of corner cases varies from implementation to implementation.
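
Putting those three points together, one possible bash rewrite of the loop might look like the sketch below. The scratch tree under $root is a made-up stand-in for /location*/time*/, and the file names are invented for the demonstration:

```shell
#!/usr/bin/env bash
# Scratch tree standing in for /location*/time*/ (hypothetical layout).
root=$(mktemp -d)
mkdir -p "$root/location1/time1"
touch "$root/location1/time1/xabcy.txt" "$root/location1/time1/abc2.txt" \
      "$root/location1/time1/other.txt"
printf '%s\n' abc > "$root/filelist.csv"

# Read the list line by line; IFS= and -r keep each line exactly as written.
while IFS= read -r filename; do
  # Quoting the pattern hands the wildcards to find instead of the shell.
  out=$(find "$root"/location*/time*/ -name "*${filename}*txt")
  # Quoting "$out" keeps one match per line; printf avoids echo's corner cases.
  printf '%s\n' "$out"
done < "$root/filelist.csv"

# Clean up the scratch tree.
rm -rf "$root"
```

This still treats wildcard characters in a list entry as find patterns and still cannot represent filenames containing newlines, so it is only a partial fix.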

See also: