The first command you mention, find . -type f -exec wc -l {} +, really
says "run wc -l on as many files as possible, until all of them have
been processed". This can run wc multiple times!
On the other hand, find . -type f -exec cat {} + | wc -l can run cat
several times, but will only run wc once. (In more detail, this is
because in this case cat is called by find, which can and does run it
as many times as it needs to, whereas the part after the pipe
character, wc -l, is beyond the reach of find, and is therefore run by
your shell, just once.)
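One way to watch find split the work is to substitute a tiny sh script for wc and have it report how many file names it received on each invocation. This is only a sketch: count_batches is a hypothetical helper name, and the number and size of the batches depend on your system and on the lengths of the file names.

```shell
# count_batches DIR: print one line per command invocation that
# find's "-exec ... {} +" makes for the regular files under DIR.
# (count_batches is a made-up name used only for this illustration.)
count_batches() {
    (
        cd "$1" || exit 1
        # "$#" is the number of file names handed to this one invocation.
        find . -type f -exec sh -c 'echo "$# file names"' sh {} +
    )
}
```

Calling count_batches . in a small directory should print a single line; in a very large tree it should print several, one for each time wc would have been started.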
You say that the first command "yields 394968", but it really does
not; on my system its output ends with:
(Many more lines elided...)
23 ./po/Makefile.win
64 ./po/README
1 ./VERSION-NICK
97 ./README
258450 total
Yet, by adding grep total, one can see that wc was really run twice:
$ find . -type f -exec wc -l {} + | grep total
1590407 total
258450 total
And, indeed, 1590407 plus 258450 is 1848857, which agrees with the second command.
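If you want the wc-per-file form to produce one grand total regardless of how many times wc is started, the per-file counts can be summed and the intermediate total lines skipped. A minimal sketch, assuming awk is available and that no file name contains a newline; count_lines is a hypothetical helper name.

```shell
# count_lines DIR: total number of newlines in all regular files under DIR.
# (count_lines is a made-up name used only for this illustration.)
count_lines() {
    (
        cd "$1" || exit 1
        # Paths printed by "find ." start with "./", so a second field that is
        # exactly "total" can only be the summary line of one wc invocation.
        find . -type f -exec wc -l {} + |
            awk '$2 != "total" { sum += $1 } END { print sum + 0 }'
    )
}
```

The result should agree with what find . -type f -exec cat {} + | wc -l reports for the same directory.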
An explanation of why wc was run more than once in the
find -exec wc + version of the command is vaguely hinted at by the
find man page:
-exec command {} +
This variant of the -exec action runs the specified command on
the selected files, but the command line is built by appending
each selected file name at the end; the total number of
invocations of the command will be much less than the number of
matched files. The command line is built in much the same way
that xargs builds its command lines.
Note how this says "much less than ..." rather than "only once". The
documentation for xargs hints that its option --max-chars is set
automatically if not set by the user:
--max-chars=max-chars
-s max-chars
Use at most max-chars characters per command line, including the
command and initial-arguments and the terminating nulls at the
ends of the argument strings.
The largest allowed value is system-dependent,
and is calculated as the argument length limit
for exec, less the size of your environment, less 2048 bytes of
headroom. If this value is more than 128KiB, 128Kib is used as
the default value; otherwise, the default value is the maximum.
This limits how many filenames can be passed to a single call to wc,
explaining why, for large numbers of files, several calls to wc will
occur, each operating on a partition of the input.
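The partitioning is easy to reproduce on a small scale by giving xargs an artificially low limit. Note the assumption here: this sketch uses -n (a cap on the number of arguments per command) rather than the size-based --max-chars limit quoted above, simply because it makes the batches predictable. batches_of_two is a hypothetical helper name.

```shell
# batches_of_two: feed five arguments to xargs, allowing at most two per
# invocation of echo, so that echo is started three times.
# (batches_of_two is a made-up name used only for this illustration.)
batches_of_two() {
    printf '%s\n' a b c d e | xargs -n 2 echo
}

batches_of_two
# prints:
# a b
# c d
# e
```

Each output line comes from a separate run of echo, just as each "total" line above comes from a separate run of wc. The system limit that the size-based default is derived from can be printed with getconf ARG_MAX; its value differs between systems.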