
If you have a list of hundreds of thousands of files across an enormous filesystem, what's the fastest way in bash to get the last modified time of all of them?

Assuming there's no way to sort them that improves the speed, the core of this question is really: what is the fastest way to get a last modified time on a file in bash? stat seems to be the most common method for this, but there's also find (with -printf "%T+") and date -r. Are there others?

(does it depend on the filesystem?)

  • You mean something other than xargs stat --printf='%Y %n\n' <list? – Kusalananda Apr 02 '22 at 10:47
  • There is no sorted index on last modified time so the only way to obtain them is to iterate across your list. The question as it stands doesn't really make a lot of sense: it would seem to read better as "how can I get the last modified time for a large list of files?" – Chris Davies Apr 02 '22 at 11:13
  • The iteration is fast, the slow part is getting the timestamp. I've adjusted the question title to focus on that part. And yes, I wanted to see if anything is faster than stat, or if all the ways to do it are basically the same. I tried three approaches in my answer below, but would love to know if there are more I could try. – Jun-Dai Bates-Kobashigawa Apr 02 '22 at 11:29

2 Answers


If you have GNU find (the one that supports -printf),

find /filesystem/mount/point -xdev -printf '%T@\t%p\0' > timestamps

is going to be the fastest. find is highly optimised to traverse directory trees, and it does the lstat() system calls itself to retrieve the timestamps. It also calls lstat() on paths relative to the directory where it finds the files, which means less work for the kernel than if lstat() were called on the full path.

With %T@, which prints the timestamp as decimal epoch time, all it has to do is convert the numbers (seconds and nanoseconds) from binary to decimal, which is a lot less effort than %T+, which needs to compute the calendar time in the user's timezone.
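
For illustration, here is how the two formats compare on a single file (the path is only an example; the actual values and fractional precision depend on the file, your find version and your timezone):

find /etc/hosts -printf '%T@\t%T+\n'
# prints something like: <seconds.fraction since the epoch><TAB><YYYY-MM-DD+HH:MM:SS.fraction>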

There are many different and incompatible implementations of a stat command, but none of them find files, they just do some stat()/lstat()/statx()/statfs() or equivalent to retrieve metadata information from the files whose paths are given as arguments, so you need something else to find the files and pass their full paths to stat.

Because on most systems commands can only take a limited number of arguments, you'll likely need to call the stat utility several times, each in its own process, each having to be loaded, initialise itself, process its arguments, and so on.
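
For example, a minimal sketch of that approach, assuming GNU stat and GNU xargs, and a NUL-delimited list of file paths stored in a file called list (a hypothetical name):

# xargs splits the list into as many stat invocations as needed to stay
# under the argument-length limit; each invocation is a separate process
xargs -r0 -a list stat --printf '%.9Y\t%n\0' -- > timestamps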

One exception is the stat builtin of zsh, which predates both GNU and BSD stat (though not GNU find's -printf).

zsh can find the files with its recursive globs, so it can do the whole process without having to run another command, but it is never going to be as efficient as find.
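
A minimal sketch of that zsh approach (assuming the zsh/stat module is available; the recursive glob and output format shown here are just one way of doing it):

zmodload zsh/stat || exit
# **/* recurses into subdirectories; the (DN) qualifiers include dot files
# and make the pattern expand to nothing if there is no match
for f (/filesystem/mount/point/**/*(DN))
  stat -LF %s.%9. -A t +mtime -- $f &&
    printf '%s\t%s\n' $t[1] $f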

Note that date -r (also a GNU non-standard extension) does a stat() or equivalent, not lstat(). So for symlinks, it reports the timestamp of the target (or fails if the link can't be resolved), not that of the symlink. Among the various stat implementations, some use stat(), some use lstat() by default but all can be told to switch between the two.
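
To see the difference on a symlink (the link name and target below are made up for illustration; GNU stat uses lstat() unless you pass it -L):

ln -s /no/such/target dangling   # dangling symlink
date -r dangling                 # fails: the target can't be resolved
stat -c '%Y %n' dangling         # reports the mtime of the symlink itself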

To optimise it further, you could implement it in C and do the directory traversal by hand, without some of the extra safeguards that find implements. On recent versions of Linux, using statx(), which can be told to retrieve less information, might help.

If you have locate/mlocate/plocate, using its cached list of files would save you from having to crawl the file system and might help speed up the process (at the risk of giving you stale information).

Since version 4.9, GNU find can be passed the list of files to process from stdin with -files0-from -, so you can do:

LC_ALL=C locate -0 '/filesystem/mount/point/*' |
  find -files0-from - -prune -printf '%T@\t%p\0' > timestamps

That would be more efficient than using something like | xargs -r0 stat --printf '%.9Y\t%n\0' -- (here assuming GNU stat and that none of the input filepaths is -) which would still run several invocations of stat.

You can use that same approach if you have a list of file paths stored as NUL-delimited records in a file. If in another format, you'd need to convert it first. For instance, for a text file containing one path per line (which means you can't store file paths that contain newline characters), you'd do tr '\n' '\0' < list.txt | find....
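
For instance, a sketch of that conversion for a list.txt with one path per line, feeding the result straight into GNU find (4.9+):

# convert newline-delimited paths to NUL-delimited ones, then let find
# look up the timestamps without being restarted per batch of arguments
tr '\n' '\0' < list.txt |
  find -files0-from - -prune -printf '%T@\t%p\0' > timestamps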

In my test here, it's still less efficient than letting find find the files by itself, possibly because find ends up calling lstat() on full paths which means the kernel has to do the full look-up for every file.

Also note that it won't be able to cope with file paths longer than PATH_MAX (usually around 4KiB on Linux, see the output of getconf PATH_MAX /mount/point).

In any case, for performance, the last thing you want to do is run an external utility such as GNU date or GNU stat for each file, like in a shell loop. If for some reason, you needed to process files and their timestamp in a loop in a shell such as bash that doesn't have a stat builtin, you'd do something like:

while IFS=/ read -u3 -rd '' timestamp filepath; do
  something with "$timestamp" and "$filepath"
done 3< <(find /filesystem/mount/point -xdev -printf '%T@/%p\0')

We use / as the separator as that's the only character that is guaranteed not to occur at the end of a filepath. The one exception is the directory that you pass to find itself: for instance, in the output of find / -xdev -printf '%T@/%p\0', the first record (and only the first) would end in /. It would contain <timestamp>//, and read would store the empty string instead of / in $filepath. You could work around that by using zsh instead of bash (where $IFS is truly treated as an internal field separator and not a delimiter), or by using ${filepath:-/} when referencing the filepath.
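
A sketch of the bash loop with that workaround applied (the printf stands in for whatever processing you actually need):

while IFS=/ read -u3 -rd '' timestamp filepath; do
  # ${filepath:-/} falls back to / for the record produced for / itself
  printf '%s\t%s\n' "$timestamp" "${filepath:-/}"
done 3< <(find / -xdev -printf '%T@/%p\0')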

Note that the read itself is quite inefficient, as it needs to read the input one byte at a time. See Why is using a shell loop to process text considered bad practice? for more details on that. You'd likely be better off using a proper programming language if performance is a concern.

Shells with builtin support for retrieving the modification time of a file (avoiding the prohibitive cost of running a separate utility for each file) that I know of are tcsh, zsh, ksh93 and busybox sh.

tcsh is not really usable for scripting.

For ksh93, you need it to have been built with the date or ls builtins included, which is rarely the case. And for busybox, while its sh applet can invoke its stat applet without re-executing itself, it still does so in a child process, and forking a process is quite expensive. Busybox stat (with a similar API to GNU stat) also doesn't support subsecond precision¹. Also, neither busybox sh nor ksh93 can process NUL-delimited records.
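
A minimal sketch for busybox sh, reading one (newline-free) filepath per line from a list file (assuming the stat applet is enabled in your busybox build; note the whole-second precision):

while IFS= read -r filepath; do
  # busybox forks a child for its stat applet but does not re-execute itself
  timestamp=$(stat -c %Y -- "$filepath") &&
    printf '%s\t%s\n' "$timestamp" "$filepath"
done < list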

With zsh with the list file containing the filepaths NUL-delimited:

zmodload zsh/stat || exit
for filepath (${(0)"$(<list)"})
  stat -LF %s.%9. -A timestamp +mtime -- $filepath &&
    something with $filepath and $timestamp

For a list that contains one (newline-free) filepath per line, replace (0) with (f).

With ksh93 with its builtin ls and list with one filepath per line:

builtin ls || exit
while IFS= read -ru3 filepath; do
  timestamp=${ ls -dZ '%(mtime:%s.%N)s' -- "$filepath"; } &&
    something with "$filepath" and "$timestamp"
done 3< list

You can also use builtin date; date -f %s.%N -m -- "$filepath" there but beware it does a stat() (as if passing -L to ls), not lstat().


¹ Its date applet can be configured at build time to support nanosecond precision, though it's not enabled in its default build.


The first one that comes to mind:

for file in `head -10000 files.txt`; do stat -c "%n %z" $file; done

It took 1m3.546s for 10,000 files the first time I ran it. Subsequent runs took 0m33.597s, 0m22.127s, 0m25.038s, 0m19.810s, and 0m25.246s.

And just to make sure I'm not wasting much time on the head and for etc., replacing stat with echo finishes in about 0.270s.
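
That baseline loop, for reference (same head and for, with echo substituted for stat):

for file in `head -10000 files.txt`; do echo $file; done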

It feels weird to use find on a single file, but surprisingly this runs a bit faster:

for file in `head -10000 files.txt`; do find $file -printf '%p %T+\n'; done;

Finishing in 0m29.357s, 0m20.185s, 0m30.540s, 0m31.000s, and 0m44.836s on various runs.

And then a third option:

for file in `head -10000 files.txt`; do echo "$file $(date -In -r $file)"; done;

Finishing times were 0m25.828s, 0m12.649s, 0m23.695s, 0m12.789s, 0m43.782s, 0m28.396s, 0m15.800s, 0m15.510s for various runs.

Obviously, to get a more definitive result I should do more runs and try different systems, but I'm getting the sense that the bulk of the time spent in all three is the same underlying operation and that they would converge on fairly similar numbers. With that said, date -r does seem consistently a bit faster based on this limited sample.

  • You should use $() instead of backticks. See https://unix.stackexchange.com/questions/126927/have-backticks-i-e-cmd-in-sh-shells-been-deprecated – Vilinkameni Apr 14 '22 at 16:48