
How does the order of arguments in a find call affect the speed of the results?

Compare for example (A)

find -name dir -type d

and (B)

find -type d -name dir

Or any other combination of arguments (e.g. using -or or -and). I would expect find to be smart about this in some way.

I tried to collect some statistics by executing both A and B under time, with 5 repetitions each. However, with

 11.86, 7.23, 5.25, 5.87, 7.16

for A and for B:

9.73, 6.56, 8.69, 7.14, 6.35

this is not really conclusive: the mean is around 7.5 s for both, with quite a high variance.
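One thing I could try to reduce the variance (just a sketch; the temporary tree is only a stand-in for the directory actually being searched) is an untimed warm-up pass before the timed repetitions, so all runs hit a warm cache:

```shell
# Sketch of a warm-cache timing run (the temporary tree is just a
# stand-in for the real tree being searched):
tmp=$(mktemp -d)
mkdir -p "$tmp/a/dir" "$tmp/b"

find "$tmp" -name dir -type d > /dev/null    # warm-up pass, not timed
for i in 1 2 3 4 5; do
    time find "$tmp" -name dir -type d > /dev/null
done

hits=$(find "$tmp" -name dir -type d)
rm -rf "$tmp"
```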

So, to repeat my question, does the order of arguments matter using find?

Bernhard
  • There are definitely commands that will reduce the set of files and directories as find goes along parsing commands, but I'm not entirely sure that the order of -name .. and -type d would matter. Your testing would seem to indicate that it doesn't. – slm Nov 15 '13 at 15:20
  • @slm I am interested in more than this specific example. But I can also not explain the huge differences between the timings... – Bernhard Nov 15 '13 at 15:39
  • I would employ an strace to see what the underlying find is up to to diagnose further. – slm Nov 15 '13 at 15:46
  • @slm see my answer. That does not seem to help. – terdon Nov 15 '13 at 15:47
  • I would then move to something like fatrace, https://launchpad.net/fatrace. This will allow you to trace the file events that find is triggering which should lead you to a more accurate picture of what files are being "touched" when find is running. – slm Nov 15 '13 at 15:52
  • See fatrace in action on this Q&A: http://unix.stackexchange.com/questions/86875/determining-specific-file-responsible-for-high-i-o/87290#87290 – slm Nov 15 '13 at 15:57
  • I deleted my answer since Stephane explained that GNU find will reorder its arguments itself so both commands probably do the same thing under the hood. – terdon Nov 15 '13 at 16:28

1 Answer


What is costly is making system calls on the files (both for the system calls themselves and for the I/O they imply).

Predicates like -type and -mtime require an lstat(2) system call on the file. -name, -path and -regex don't (though of course find will already have made system calls on the directories that contain the files, in order to read their content).

Usually, find does an lstat() anyway (because it needs to know whether a file is a directory or not in order to descend into it, unless that information is provided by readdir()), but there are cases where it can do without it. For instance, if the link count of a directory is less than 3, then on some filesystems find knows it doesn't have subdirectories, and some find implementations optimise by not doing lstat()s in there.
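That link-count convention is easy to observe directly: on filesystems that follow it, a directory has 2 links (its own "." entry and its name in the parent) plus one per subdirectory. A quick sketch (assumes GNU or BSD stat; note that e.g. btrfs does not follow the convention and always reports 1 for directories):

```shell
# Sketch of the link-count convention (assumption: the filesystem
# maintains it; tmpfs and ext4 do, btrfs always reports 1 for dirs).
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b"

# GNU stat uses -c %h, BSD stat uses -f %l
links=$(stat -c %h "$tmp" 2>/dev/null || stat -f %l "$tmp")
echo "$links"    # 4 on conforming filesystems: 2 base + 2 subdirectories

rm -rf "$tmp"
```

A count below 3 therefore guarantees there is nothing to descend into, which is exactly what lets find skip the lstat() calls.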

-xtype will cause a stat(2); -printf ... and -ls may cause a stat(), lstat() or readlink(); -lname an lstat() and a readlink().

That's why you may want to put the -name/-path/-regex... first. If they can rule out a file, they can avoid one or more syscalls.

Now, a -regex may be more expensive than a -name, but I'm not too sure you'd get much by swapping them.

Also note that some find implementations like GNU find do reorder the checks by default when possible. See:

info find 'Optimisation Options'

on a GNU system (also published on gnu.org for the latest version of GNU findutils).
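On GNU find you can inspect that reordering yourself with the -D tree debug option (a sketch; the option's availability and its output format vary between findutils versions):

```shell
# Sketch (GNU find only; -D tree output format varies by version):
# dump the expression tree so you can see whether the optimiser
# moved -name ahead of -type.
tmp=$(mktemp -d)
mkdir -p "$tmp/dir"

tree=$(find -D tree "$tmp" -type d -name dir 2>&1)
printf '%s\n' "$tree"

rm -rf "$tmp"
```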

Typically, if you did your tests on a GNU system, both commands would do the same thing because find would have moved the -name forward anyway.

So, for the -type d -name ... vs -name ... -type d to make a difference, you need a find implementation that doesn't optimise by reordering those predicates and one that does some optimisation by not doing an lstat() on every file.
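As a quick sanity check (a sketch on a throwaway tree), you can verify that both orderings select exactly the same files, as expected on GNU find which reorders them anyway:

```shell
# Sketch: check that both argument orders select the same files.
tmp=$(mktemp -d)
mkdir -p "$tmp/dir/sub" "$tmp/other/dir"
touch "$tmp/dir/file"

a=$(find "$tmp" -name dir -type d | sort)
b=$(find "$tmp" -type d -name dir | sort)

[ "$a" = "$b" ] && echo "same results"

rm -rf "$tmp"
```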

Where there will be a (huge) difference regardless of the implementation is in:

find . -name 'x*' -exec test -d {} \; -print

vs:

find . -exec test -d {} \; -name 'x*' -print

find can't reorder the -exec, as doing so could introduce functional differences (find can't know whether the executed command is only a test or also does something else).

And of course -exec ... {} \; is several orders of magnitude more expensive than any other predicate, since it means forking a process, executing a command in it (itself making many system calls), and waiting for it and its exit code.

$ time find /usr/lib -exec test -d {} \; -name z\* -print > /dev/null
1.03s user 12.52s system 21% cpu 1:03.43 total
$ time find /usr/lib -name z\* -exec test -d {} \;  -print > /dev/null
0.09s user 0.14s system 62% cpu 0.367 total

(the first one calls test for every file in /usr/lib (56,685 files), the second one only on those whose name starts with z (147)).

Note that -exec test -d {} \; is not the same as -type d. It's the portable equivalent of the GNU-specific -xtype d.
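The difference shows up with a symlink to a directory: -type d tests the link itself (and fails), while test -d follows it (a sketch; -mindepth is a GNU/BSD extension, not plain POSIX):

```shell
# Sketch: -type d vs -exec test -d {} \; on a symlink to a directory.
tmp=$(mktemp -d)
mkdir "$tmp/realdir"
ln -s realdir "$tmp/linkdir"

bytype=$(find "$tmp" -mindepth 1 -type d | sort)
bytest=$(find "$tmp" -mindepth 1 -exec test -d {} \; -print | sort)

echo "$bytype"   # only realdir: -type d looks at the link itself
echo "$bytest"   # realdir and linkdir: test -d follows the symlink

rm -rf "$tmp"
```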

  • If I understand your first sentence, you're saying that each file/directory that is part of the "found set" needs to be dealt with by having a corresponding system call? The act of making this system call is one of the more costly items, b/c certain calls require, for example, lstat(2) information, right? – slm Nov 15 '13 at 16:01
  • @slm, I'm saying that when find has to decide whether a file is selected or not, it's better to put the checks that do not involve a syscall first. – Stéphane Chazelas Nov 15 '13 at 16:04
  • The -exec proof is pretty nice :) – Bernhard Nov 18 '13 at 10:35