
I need to find a few hundred files, where the base names are provided by some list (let's call it baseNames). I then need to search for these base names + three given extensions.

Example: Assume one of the base names extracted from the input list is FOO and the given extensions are .txt, .csv, .py. Then I need to find FOO.txt, FOO.csv, FOO.py.

The current approach in my bash script is as follows:

for bn in ${baseNames}; do
  find ${searchDir} '(' -name "$bn.txt" -o -name "$bn.csv" -o -name "$bn.py" ')'
done

This works but is quite inefficient. For every base name I need to run find on the whole searchDir again, which contains quite a lot of files and hence takes a while.

Is there a way to provide a list of files that find should search for, via options or pipes?

Obviously I am aware of -name ... -or, but it's also obvious that this approach is unrealistic if I have a couple of hundred files. For simplicity, you may also leave aside the extensions. Just assume I have a huge list of files that I want to search for as input to find.

andreee

3 Answers


Use an array. e.g.

#!/bin/bash

baseNames=(FOO BAR BAZ)

findNames=('(' '-name')
for bn in "${baseNames[@]}"; do
  for ext in txt csv py; do
    findNames+=("$bn.$ext" '-o' '-name')
  done
done

replace the final '-o' and '-name' in the array with a close parenthesis

unset 'findNames[-1]'
findNames[-1]=')'

If using a version of bash before v4.3, use:

#unset 'findNames[${#findNames[@]}-1]'
#findNames[${#findNames[@]}-1]=')'

declare -p findNames

Output from the declare -p is (some line feeds and spaces added to break it up and make it more readable):

declare -a findNames=(
  [0]="("
    [1]="-name" [2]="FOO.txt" [3]="-o" [4]="-name" [5]="FOO.csv"
    [6]="-o" [7]="-name" [8]="FOO.py" [9]="-o" [10]="-name" [11]="BAR.txt"
    [12]="-o" [13]="-name" [14]="BAR.csv" [15]="-o" [16]="-name" [17]="BAR.py"
    [18]="-o" [19]="-name" [20]="BAZ.txt" [21]="-o" [22]="-name" [23]="BAZ.csv"
    [24]="-o" [25]="-name" [26]="BAZ.py"
  [27]=")"
)

To use the array with find, you'd do something like:

searchDir="./"
find "$searchDir" "${findNames[@]}"

This would result in the following find command being executed (line breaks added for readability):

find ./ ( -name FOO.txt -o -name FOO.csv -o -name FOO.py \
  -o -name BAR.txt -o -name BAR.csv -o -name BAR.py \
  -o -name BAZ.txt -o -name BAZ.csv -o -name BAZ.py )

The ( and ) don't need to be escaped here because the shell treats them as just literal arguments (the array has already been expanded by bash), not an instruction to start a sub-shell. If you were typing them into a shell, you'd have to escape or quote them.
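For comparison, if you typed the equivalent command directly at a shell prompt, the parentheses would need to be escaped or quoted so that find receives them as arguments, e.g.:

find ./ \( -name FOO.txt -o -name FOO.csv -o -name FOO.py \)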

cas
  • Alternatively, use grep. e.g. find "$searchDir" -type f | grep -E '/(FOO|BAR|BAZ)\.(txt|csv|py)$'. or GNU find's -regex or -iregex options: find "$searchDir" -type f -regextype awk -regex '/(FOO|BAR|BAZ)\.(txt|csv|py)$' – cas Aug 12 '21 at 08:26
  • just noticed a typo in my comment. The -regex pattern should start with '.*/, not just '/. – cas Aug 12 '21 at 09:18
  • I had to use the grep solution, because the system I have to work on uses bash 4.1.2, which doesn't recognize findNames[-1], unfortunately. But I tried your solution on a newer bash version, where it works as expected. Looks like I need to brush up my knowledge about arrays in bash, though. – andreee Aug 12 '21 at 09:52
  • For compatibility with bash before v4.3, you can use unset 'findNames[${#findNames[@]}-1]' followed by findNames[${#findNames[@]}-1]=')'. This still works in newer versions of bash, the much shorter arrayname[-1] from 4.3+ is convenient, but not essential. See Remove the last element from an array for more details. – cas Aug 12 '21 at 10:18

The following sh script reads base names from a file called names, which contains one name per line (names should be quoted if they contain spaces etc.), and calls an in-line sh -c script with batches of these (50 at a time). I'm dividing the input into batches just in case the data expands to a list that is too long for a single invocation of find (we need to construct commands with a total combined length of more than n times the length of the input data, where n is the number of filename suffixes to look for).

The in-line script constructs an "OR list" of -name tests for find from the given base names. Each base name is entered into the list with variations for the three filename suffixes .txt, .csv, and .py.

The list is maintained in the list of positional parameters, "$@".
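As a quick illustration of how set -- builds up that list (a toy example, not part of the script below):

set -- -name 'FOO.txt'            # the list now holds two arguments
set -- "$@" -o -name 'FOO.csv'    # append three more, keeping the existing ones
printf '%s\n' "$@"                # prints all five accumulated arguments, one per line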

Once the list is complete, find is called to find the regular files matching these names, in or under some directory $topdir.

topdir=$HOME

<names xargs -L 50 sh -c '
    topdir=$1; shift

    for name do
            for suffix in txt csv py; do
                    set -- "$@" -o -name "$name.$suffix"
            done
            shift  # shift off current base name
    done
    shift  # shift off the initial "-o"

    find "$topdir" -type f \( "$@" \) -print

' sh "$topdir"

Run with a lower number than 50 and with sh -x -c in place of sh -c to see what commands the in-line script is actually executing.


If you want to use named arrays and the bash shell:

topdir=$HOME

<names xargs -L 50 bash -c '
    topdir=$1; shift
    unset tests

    for name do
            for suffix in txt csv py; do
                    tests+=( -o -name "$name.$suffix" )
            done
    done

    find "$topdir" -type f \( "${tests[@]:1}" \) -print

' bash "$topdir"

Here, an array, tests, is used instead of the list of positional parameters. The strange-looking "${tests[@]:1}" expands to the list of elements of the array, except for the first element (which will be -o).
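A toy demonstration of that slice (made-up values, not part of the answer's script):

tests=(-o -name 'FOO.txt' -o -name 'FOO.csv')
printf '%s\n' "${tests[@]:1}"    # prints every element except the leading -o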

Although, if you're using bash, you may just as well use its globbing facilities (originally inherited from the ksh shell):

shopt -s extglob globstar dotglob nullglob

topdir=$HOME

printf -v pattern '%s/**/@(%s).@(txt|csv|py)' "$topdir" "$(paste -s -d '|' - <names)"

eval "pathnames=( $pattern )"

The following loop is only for illustration:

for pathname in "${pathnames[@]}"; do
    printf '%s\n' "$pathname"
done

If you really just wanted to list the names, you could use printf '%s\n' "${pathnames[@]}" directly.

This constructs an extended globbing pattern from the contents of the names file. This pattern may end up looking like

/home/myself/**/@(name1|name2|name3).@(txt|csv|py)

... which would match the names that you presumably are interested in. Note though that you would have to do any file type tests yourself in the loop (to sort out regular files from directories etc.)
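For instance, a minimal sketch of such a test, keeping only regular files (a straightforward extension of the illustration loop above, not part of the original code):

for pathname in "${pathnames[@]}"; do
    [[ -f $pathname ]] || continue    # keep only paths that resolve to regular files
    printf '%s\n' "$pathname"
done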

The shell options set at the top of the script enable the use of the extended pattern @(...|...) (extglob) and the use of ** to match down into subdirectories (globstar), and allow us to match names that are hidden or located in hidden subdirectories (dotglob). We also set nullglob to make the pattern expand to nothing if there are no matches at all.
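As a quick sanity check of what the @(...) patterns match, here is a toy example you can run on its own:

shopt -s extglob
[[ FOO.csv == @(FOO|BAR|BAZ).@(txt|csv|py) ]] && echo match      # prints: match
[[ FOO.log == @(FOO|BAR|BAZ).@(txt|csv|py) ]] || echo no match   # prints: no match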

Kusalananda
  • This question is specifically for bash. Why use set -- when bash supports arrays? set -- is an ugly hack for ancient shells which don't have arrays, repurposing ARGV to be a pretend-array. – cas Aug 12 '21 at 08:48
  • @cas Bash supports manipulating the list of positional parameters with set -- too, and my code works in bash with no need for modifications. You may change sh -c into bash -c if you wish. I will add a bash variation for you in a short while, but it's going to be a bit more verbose. – Kusalananda Aug 12 '21 at 08:50
  • yes, this set -- hack works in bash. but you don't need to do it, because bash supports arrays. As does ksh. and zsh. and any other non-ancient bourne-like sh (yes, posix sh and dash and ash all qualify as "ancient". deliberately so). I already wrote my own answer based on bash arrays. – cas Aug 12 '21 at 08:54
  • @cas I would use arrays if I needed to. There is no real need here, but I do realize some people find it more comfortable to deal with named arrays than with the slightly mysterious list of positional parameters. – Kusalananda Aug 12 '21 at 09:00
  • Could you try to be just a little more condescending, please? I wasn't quite insulted enough. It's not "mysterious", this set -- trick is and always was an ugly hack. One of the main reasons why modern shells got arrays was so that we didn't have to use such ugly hacks any more. So we could programmatically build arg lists for commands without having to deal with all the associated white space and quoting hassles. Arrays work, without hackery. – cas Aug 12 '21 at 09:04
  • @cas Sorry. I thought your main concern was with readability and how easy it was to understand the code, and I do think that the list of positional parameters is mysterious to many people, so they would want to use named arrays instead. I did most certainly not intend to insult in any way. In fact, I'm happy that you pointed it out as I'm usually not using arrays in my answers unless I have to. As for "ugly hack", using set -- is at least portable to any POSIX shell that I'm aware of, which is why I'm using it. – Kusalananda Aug 12 '21 at 09:10
  • @cas As you may have noticed, I have added a bash variation of my code as a direct consequence of you pointing out the "ancient" nature of my first piece of code. Thank you again for bringing this to my attention, and do feel free to point this out to me again in the future as I'm bound to forget. – Kusalananda Aug 12 '21 at 09:11
  • -L 50 is to take words from 50 lines, it still processes whitespace and quoting in those lines. To process 50 newline delimited words, you'd need GNU xargs and xargs -rd '\n' -n 50 or use dirty and unreliable hacks POSIXly. – Stéphane Chazelas Aug 12 '21 at 09:28

With zsh (you are using zsh syntax in your code when you're leaving those parameter expansions unquoted):

names=(foo bar baz)
exts=(txt csv py)
print -rC1 - **/(${(~j[|])names}).(${(~j[|])exts})(ND)

Where ${(j[|])array} joins the elements of the array with |, the ~ flag making that | be treated as a glob operator. N is for nullglob, D for dotglob.

Or of course do directly:

print -rC1 - **/(foo|bar|baz).(csv|py|txt)(ND)

If names and extensions are found one per line in some files, use

names=( ${(f)"$(< names.txt)"} )
 exts=( ${(f)"$(< exts.txt)"}  )

You could also do:

print -rC1 - **/$^names.$^exts(ND)

But that would be a lot less efficient as that would expand a recursive glob for each combination of name + ext.

To use find to do the searching:

cmd=(find . '(') or=()  # start of the find command; $or is empty before the first -name test
for name ($^names.$^exts) cmd+=($or -name ${(b)name}) or=(-o)  # ${(b)...} backslash-quotes glob characters so the name is matched literally
cmd+=(')')
$cmd  # run the constructed find command