I have family pictures and movies in a folder /media/data/Selbstgemacht and I'd like to find the size of all pictures. In /media/data I use find Selbstgemacht -type f -iname '*.jpg' -exec du -ch '{}' + which returns 5,1GB.
However, if I step down into the folder "Selbstgemacht" and use find . -type f -iname '*.jpg' -exec du -ch '{}' + it returns 7,0GB.

I then compared the output of find to check if they find the same files:
From the parent folder: find Selbstgemacht -type f -iname '*.jpg' -printf '%P\n' | sort > test1.txt
From the subfolder: find . -type f -iname '*.jpg' -printf '%P\n' | sort > ../test2.txt

The two files are identical, so both find commands find exactly the same files, which leads me to think that the difference in the size du reports must be due to something else.

What exactly is the cause here?

System information:

  • Debian stable
  • find (GNU findutils) 4.4.2
    • D_TYPE O_NOFOLLOW(enabled)
    • LEAF_OPTIMISATION, FTS(), CBO(level=0)
  • du (GNU coreutils) 8.13
Jan

2 Answers

find ... -exec cmd {} + will execute cmd as many times as necessary so as not to break the limit of the size of the arguments passed to a command.

With find . -exec du {} +, every path starts with ./, so the file list takes up fewer bytes than with find verylongdirname -exec du {} +, where every path starts with verylongdirname/.
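To see how much the starting-point name adds, one can count the bytes of the path list that each invocation would hand to du. This is only a minimal sketch, assuming a find that supports -print0 (GNU find does); path_bytes is a hypothetical helper name, not part of find or du.

```shell
# path_bytes DIR: total size in bytes of all matching paths,
# including one NUL separator per path.
# (hypothetical helper, for illustration only)
path_bytes() {
    find "$1" -type f -iname '*.jpg' -print0 | wc -c
}

# From /media/data:
#   path_bytes Selbstgemacht
# From /media/data/Selbstgemacht:
#   path_bytes .
```

Since "Selbstgemacht" is 12 bytes longer than ".", the first figure should be larger by 12 bytes per file found.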

So it's likely that find verylongdirname will run more du commands than find . does. The total you see at the end is only the total of the last du run, which does not include all the files. Earlier runs printed their own totals; you can pipe the command to grep 'total$' to confirm.
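The check suggested above can be turned into a count of du runs, since each run with -c prints exactly one total line. A rough sketch; count_du_runs is a hypothetical helper name, and LC_ALL=C is set on the assumption that a localized du might print a translated word instead of "total".

```shell
# count_du_runs DIR: number of times find had to start du,
# counted as the number of "total" lines printed by du -c.
# (hypothetical helper, for illustration only)
count_du_runs() {
    LC_ALL=C find "$1" -type f -iname '*.jpg' -exec du -c {} + |
        grep -c 'total$'
}

# Example: count_du_runs Selbstgemacht
```

Any result greater than 1 means the last total line covers only part of the files.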

  • I think I understand what you mean and both invocations indeed have multiple totals. But summing up the totals, both are different (52,6GB vs 51,5GB). So the answer marked here http://unix.stackexchange.com/questions/41550/find-the-total-size-of-certain-files-within-a-directory-branch is not correct, is it? – Jan Aug 05 '14 at 09:23
  • The difference between the sums could be due to hard link discrepancies or to errors/imprecision in the calculation (try without -h). Yes, I've added a note to that other question. – Stéphane Chazelas Aug 05 '14 at 09:37

What you should notice is that in both cases you probably do not get the total disk usage of your pictures. If you have thousands of pictures, the argument list probably exceeds the limit for a single exec call in both cases.

Why? Well, -exec (...) + passes the file names as arguments to execvp(), a wrapper around the execve() system call. The execve man page describes the limit of that system call as follows:

Limits on size of arguments and environment
   Most UNIX implementations impose some limit on the total  size  of  the
   command-line argument (argv) and environment (envp) strings that may be
   passed to a new program. (...)

   On  kernel  2.6.23  and  later, most architectures support a size limit
   derived from the soft RLIMIT_STACK resource  limit  (see  getrlimit(2))
   that is in force at the time of the execve() call.  (...)   This change
   allows programs to have a much larger argument and/or environment list.
   For these  architectures,  the  total  size  is  limited  to 1/4 of the
   allowed stack size. (...) Since Linux 2.6.25, the kernel places a floor
   of 32 pages on this size limit, so that, even when RLIMIT_STACK is  set
   very low, applications are guaranteed to have at least as much argument
   and environment space as was provided by Linux 2.6.23 and earlier (This
   guarantee  was not provided in Linux 2.6.23 and 2.6.24.)  Additionally,
   the limit per string is 32 pages (the kernel constant  MAX_ARG_STRLEN),
   and the maximum number of strings is 0x7FFFFFFF.

So if you have a long list of files, you can quickly reach the system limit. In addition, when the relative path is longer, each file name takes up more of the argument space, so the limit is reached sooner and find has to split the list into more du runs. Hence the different results of your two commands.
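The limit in force on a given system can be inspected from the shell. A small sketch, assuming a Linux system with getconf available; arg_max_bytes is a hypothetical wrapper name, and find may apply a smaller limit of its own than the value reported here.

```shell
# arg_max_bytes: upper bound in bytes for argv plus envp,
# as reported by the C library.
# (hypothetical wrapper, for illustration only)
arg_max_bytes() {
    getconf ARG_MAX
}

arg_max_bytes

# Soft stack limit in KiB (or "unlimited"); per the man page extract
# above, the argument limit is derived from it on recent kernels.
ulimit -s
```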

There is a solution

A solution on GNU systems is to feed du the list of files through its --files0-from option. With your example:

find Selbstgemacht -type f -iname '*.jpg' -print0 | du --files0-from=- -ch

The first command lists all the files on standard output, separated by NUL (\0). du then reads this list from standard input (the - file name) and sums up the total.
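If only the grand total is wanted, for instance to compare the result from both directories, the same pipeline can be trimmed to its last line. A sketch assuming GNU du; jpg_total_kib is a hypothetical helper name, and -k is used instead of -h so the two figures can be compared exactly.

```shell
# jpg_total_kib DIR: disk usage in KiB of all *.jpg files below DIR,
# computed by a single du run so there is exactly one total line.
# (hypothetical helper, for illustration only)
jpg_total_kib() {
    find "$1" -type f -iname '*.jpg' -print0 |
        du --files0-from=- -ck |
        tail -n 1 |
        cut -f 1
}

# Example: jpg_total_kib Selbstgemacht
```

Because du runs only once here, the length of the starting-point name should no longer affect the result.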

Huygens
  • find doesn't invoke a shell to run the command, so there's no command line or space characters involved here. find uses execvp() (a wrapper around the execve() system call) directly, and that's where the limitation is (on the cumulative size of the argv and envp passed to that system call). Shells (except csh) don't have a limit of their own on their command line. – Stéphane Chazelas Aug 06 '14 at 11:59
  • Yes, that sounds correct regarding execvp. I will update my answer. Thank you. – Huygens Aug 06 '14 at 13:04