How to run a command only if a specific file has a certain size

Question

How can I execute a command only if a certain file exceeds a defined size? In the end, both the size check and the command should run as a one-liner in crontab.

Pseudocode:

* * * * * find /cache/myfile.csv -size +5G && echo "file is > 5GB"

If your goal is to trigger this when the file exceeds that size, but the file is only infrequently written to, you may want to use incron to trigger the check instead of running it every minute. — Austin Hemmelgarn, May 17 '23 at 20:34
I don't look closely at the answers. Just FYI, beware of sparse files reported wrongly. These are not easy to handle. — akostadinov, May 18 '23 at 00:02
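
To make the sparse-file caveat concrete, here is a minimal sketch that compares a file's apparent size with the space actually allocated for it. It assumes GNU stat (`%s` for the size in bytes, `%b` for the number of allocated blocks, `%B` for the size of each such block); the function names are made up for illustration.

```shell
#!/bin/sh
# Assumes GNU stat. Function names are hypothetical.

# apparent_size PATH: the size a size test such as "find -size" sees, in bytes.
apparent_size() {
    stat -c '%s' -- "$1"
}

# allocated_size PATH: bytes actually allocated on disk (blocks * block size).
# For a sparse file this can be far smaller than the apparent size.
allocated_size() {
    blocks=$(stat -c '%b' -- "$1") || return 1
    block_bytes=$(stat -c '%B' -- "$1") || return 1
    echo $(( blocks * block_bytes ))
}
```

Which of the two numbers is the right one to compare against the threshold depends on whether the concern is the length of the data or the disk space it consumes.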

cas · Answer 1 · 2023-05-18T00:12:35.007

11

If you have GNU stat, you can use its --printf option to get the file's size in bytes.

e.g.

size=$(stat --printf '%s' /cache/myfile.csv)
if [ "$size" -gt 5368709120 ] ; then  # 5 GiB = 5 * 1024 * 1024 * 1024
  echo "file is > 5GB"
fi

See man stat for details.

BSD's stat (e.g. on FreeBSD and on Mac) has a similar formatting option, -f:

size=$(stat -f '%z' /cache/myfile.csv)
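
The two stat variants can be wrapped so that the same check works with either. This is only a sketch: it tries the GNU/busybox form (`-c '%s'`) first and falls back to the BSD form (`-f '%z'`), and the function names `file_size` and `size_exceeds` are invented for illustration.

```shell
#!/bin/sh
# file_size PATH: print the size of PATH in bytes.
# Tries GNU/busybox syntax first, then BSD syntax. Hypothetical helper name.
file_size() {
    stat -c '%s' -- "$1" 2>/dev/null || stat -f '%z' -- "$1" 2>/dev/null
}

# size_exceeds PATH BYTES: succeed only if PATH can be stat'ed and is
# larger than BYTES.
size_exceeds() {
    size=$(file_size "$1") || return 1
    [ "$size" -gt "$2" ]
}

# Example (5 GiB = 5 * 1024 * 1024 * 1024 = 5368709120 bytes):
#   size_exceeds /cache/myfile.csv 5368709120 && echo "file is > 5 GiB"
#
# As a crontab one-liner with GNU stat. The % is escaped because cron
# implementations in the Vixie family turn an unescaped % into a newline:
#   * * * * * [ "$(stat -c '\%s' /cache/myfile.csv)" -gt 5368709120 ] && echo "file is > 5 GiB"
```

Because `size_exceeds` only returns an exit status, it can be chained with `&&` in front of whatever command should run.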

Alternatively, you could use perl's built-in stat function, or its -s file test operator (which is similar to bash's -s file test but it returns the file's size rather than just true if it exists and is non-empty). perl's stat function returns a 13-element list (array) of metadata about a file containing the following data (copied from perldoc -f stat):

[...] Not all fields are supported on all filesystem types. Here are
the meanings of the fields:
  0 dev      device number of filesystem
  1 ino      inode number
  2 mode     file mode  (type and permissions)
  3 nlink    number of (hard) links to the file 
  4 uid      numeric user ID of file's owner
  5 gid      numeric group ID of file's owner
  6 rdev     the device identifier (special files only) 
  7 size     total size of file, in bytes
  8 atime    last access time in seconds since the epoch
  9 mtime    last modify time in seconds since the epoch
 10 ctime    inode change time in seconds since the epoch (*)
 11 blksize  preferred I/O size in bytes for interacting with the
             file (may vary from file to file)
 12 blocks   actual number of system-specific blocks allocated
             on disk (often, but not always, 512 bytes each)
(The epoch was at 00:00 January 1, 1970 GMT.)

Field 7 is the one we need.

To return the file's size (for later use in a shell command or script) using stat:

# stat
perl -e 'print scalar((stat(shift))[7])' /cache/myfile.csv
# -s
perl -e 'print -s shift' /cache/myfile.csv

Or to do it all in perl:

# stat
perl -e 'print "File is > 5 GiB\n" if (stat(shift))[7] > 5*1024*1024*1024' /cache/myfile.csv
# -s
perl -e 'print "File is > 5 GiB\n" if -s shift > 5*1024*1024*1024' /cache/myfile.csv

See perldoc -f stat and perldoc -f -X (as well as help test in bash).

BTW, perl's shift function removes the first element of an array (by default @ARGV, the array of command line args, if not specified) and returns its value. It's often used in a loop to process all elements of an array, but here we're only interested in the first arg (the filename). See perldoc -f shift for details, including notes on lexical scope and use in a subroutine.

edited May 18 '23 at 00:12

answered May 16 '23 at 13:27

cas

78,579

The OP's question is whether the find command was successful – Gilles Quénot May 16 '23 at 13:40
13

yes, but the OP is asking the wrong question, using the wrong tool for the job. If you want to measure your door's height or width, you use a tape-measure or a ruler, not a bucket or a hammer. Similarly, if you want to know the size of a file, you use stat, not find (and not ls either). Part of our job when answering a question is to tell people when they're using the wrong tool or asking the wrong question, to find the underlying task hidden beneath the XY Problem. – cas May 16 '23 at 13:45
@cas If you want to test the size of a file portably, then find is the correct tool (albeit not with "+5G" as the argument to -size). – Kusalananda May 17 '23 at 13:08
2

Portability was never part of the question. It's tagged linux, and linux means GNU tools on everything but tiny distros with only busybox available (and even busybox stat has a -c formatting option with %s meaning size in bytes just like GNU stat). More to the point, find is the wrong tool for getting metadata about a file such as the file's size. That's stat's job, it's what it's for. If stat didn't have formatting options, the next best option is not find, it's perl with its built-in stat() function because that's a trivial one-liner compared to a dozen or so lines in C. – cas May 17 '23 at 15:48
e.g. perl -e '$size = (stat(shift))[7]; print $size' /cache/myfile.csv – cas May 17 '23 at 15:49
or, even shorter, perl -e 'print scalar((stat(shift))[7])' /cache/myfile.csv – cas May 17 '23 at 15:57
1

OP asked for a one-liner for use in crontab. How can this be used in the way OP asked? – marcelm May 17 '23 at 20:19
they already know how to use crontab, they didn't know how to test the file size. – cas May 17 '23 at 22:43
the perl version can be even shorter - perl's -s file test returns the size of the file (bash's -s test only returns true if the file exists and is not empty, false otherwise), so extracting the size from the list returned by stat() isn't necessary. e.g. perl -e 'print -s shift' filename to output the size for use in shell, or do it all in perl with perl -e 'print "File is > 5GB\n" if -s shift > 5*1024*1024*1024' filename. See perldoc -f -X for docs on perl's file tests (and help test in bash for bash's file tests). – cas May 17 '23 at 23:03

score 8 · Accepted Answer · edited May 17 '23 at 12:52

To use the file size as a precondition you can use stat or find:

[ -n "$(find /cache/myfile.csv -prune -size +5G 2>/dev/null)" ] && echo "file is > 5GB"

Or, if the target command (echo, here) is short, put it into the -exec part of find:

find /cache/myfile.csv -prune -size +5G -exec echo "file is > 5GB" \;

The -prune is in case myfile.csv might be a file of type directory, to prevent find from descending into it.
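
The reason for wrapping find in `[ -n "$(...)" ]` is that find exits with status 0 even when nothing matches, so `find ... && command` runs the command regardless of the file's size. A minimal sketch of the same test as a reusable function follows; it uses the `c` (bytes) suffix for -size, which POSIX specifies, instead of the `G` suffix. The function name is hypothetical.

```shell
#!/bin/sh
# find_size_exceeds PATH BYTES: succeed only if PATH is larger than BYTES.
# The test is on whether find printed anything, not on its exit status.
# PATH is assumed not to start with "-", which find would read as an option.
find_size_exceeds() {
    [ -n "$(find "$1" -prune -size +"$2"c 2>/dev/null)" ]
}

# Example (5 GiB = 5368709120 bytes):
#   find_size_exceeds /cache/myfile.csv 5368709120 && echo "file is > 5 GiB"
```

A missing file makes find print nothing on standard output, so the function fails in that case as well.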

Gilles Quénot · Answer 3 · 2023-05-16T13:34:36.177

4

If you need to process the files in a shell, both versions below run the shell command only if all conditions are met: the path is a regular file, is named myfile.csv, and is > 5G:

find /cache -name 'myfile.csv' -type f -size +5G -exec bash -c '
    echo "$1 is > 5GB"
' bash {} \;

or

find /cache -name 'myfile.csv' -type f -size +5G -exec bash -c '
    for file; do echo "$file is > 5GB"; done
' bash {} +

edited May 16 '23 at 13:34

answered May 16 '23 at 13:07

Gilles Quénot

33,867

I don't want to iterate the files. I just want to use this as a precondition before starting another process. I could as well write find .... +5G && start.sh. So, only start the 2nd command if the find command found the file which was above a certain size. – membersound May 16 '23 at 13:15
1

So use the first version, and replace echo by start.sh – Gilles Quénot May 16 '23 at 13:16
if you don't want to iterate over files, then don't use find. You could use stat instead. – cas May 16 '23 at 13:23
2

@membersound If /cache/myfile isn't a directory, neither command in the answer will do much iterating. Using find is about the only portable way of conditionally executing a command based on the size of a file. – Kusalananda May 16 '23 at 13:25
@Kusalananda, for readable files, wc -c can get the size of a file portably (though not always as efficiently in the wc implementations that don't do optimisations when the size of the file can be obtained other than by reading it). – Stéphane Chazelas May 17 '23 at 16:43
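
A sketch of the wc -c approach mentioned in the last comment, for readable regular files. The arithmetic expansion is there because some wc implementations pad the count with leading blanks; the function name is made up for illustration.

```shell
#!/bin/sh
# wc_size_exceeds PATH BYTES: succeed only if PATH is a readable regular
# file larger than BYTES. Hypothetical helper name.
wc_size_exceeds() {
    [ -f "$1" ] && [ -r "$1" ] || return 1
    # $(( ... )) normalises output such as "    2000" to "2000".
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt "$2" ]
}

# Example:
#   wc_size_exceeds /cache/myfile.csv 5368709120 && echo "file is > 5 GiB"
```

As the comment notes, a wc implementation that does not take the size from the file's metadata has to read the whole file, which matters for a multi-gigabyte file checked every minute.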

Stéphane Chazelas · Answer 4 · 2023-05-18T05:41:42.807

Note that some shells have the feature built-in.

SHELL=/bin/tcsh
* * * * * if (-Z /cache/myfile.csv > 5*1024*1024*1024) echo 'file is > 5GiB'

Or with zsh, here using glob qualifiers and an anonymous function, though zsh also has a stat builtin that predates both GNU and BSD stat:

SHELL=/bin/zsh
* * * * * (){ if (($#)) echo 'file is > 5GiB'; } /cache/myfile.csv(NLG+5)

(note that like for find -size +5G, we're talking of gibibytes (1GiB = 1,073,741,824 bytes) here, not gigabytes (1GB = 1,000,000,000 bytes))

For symlinks, tcsh will get the size of the file the symlink eventually resolves to, while zsh's LG+5 qualifier, like find's -size, will check the size of the symlink itself. Change to -LG+5 to check the size after symlink resolution. zsh's stat builtin gives you information after symlink resolution by default; use -L to change that. In GNU and BSD stat, that's reversed. Same with find, where -L tells it to follow symlinks.
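
The symlink difference can be seen with GNU stat, assumed here: without -L it reports on the symlink itself, with -L on the file the link resolves to. The function names are hypothetical.

```shell
#!/bin/sh
# Assumes GNU stat. Function names are hypothetical.

# link_size PATH: size of PATH itself; for a symlink, the size of the link.
link_size() {
    stat -c '%s' -- "$1"
}

# target_size PATH: size after following symlinks (-L dereferences).
target_size() {
    stat -L -c '%s' -- "$1"
}

# The find counterpart of target_size would place -L before the path:
#   find -L /cache/myfile.csv -prune -size +5G
```

If /cache/myfile.csv might be a symlink, the dereferencing form is the one that tests the size of the data.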

For more ways to get the size of a file, see How can I get the size of a file in a bash script?
