Why is filesystem time always some msecs behind system time in Linux?

Question

In Linux, it seems that filesystem time is always some milliseconds behind system time, leading to inconsistencies if you want to check if a file has been modified before or after a given time in very narrow time ranges (milliseconds).

In any Linux system with a filesystem that supports nanosecond resolution (I tried with ext4 with 256-byte inodes and ZFS), if you try to do something like:

date +%H:%M:%S.%N; echo "hello" > test1; stat -c %y test1 | cut -d" " -f 2

the second output value (file modification time) is always some milliseconds behind the first one (system time), e.g.:

17:26:42.400823099
17:26:42.395348462

while it should be the other way around, since the file test1 is modified after calling the date command.

You can get the same result in python:

import os, time
def test():
    print(time.time())
    with open("test1", "w") as f:
        f.write("hello")
    # stat only after the file has been closed, so the write has actually happened
    print(os.stat("test1").st_mtime)
test()

1698255477.3125281
1698255477.3070245

Why is it so, and is there a way to avoid it, so that system time is consistent with filesystem time? The only workaround I found so far is to get filesystem "time" (whatever that means in practice) by creating a dummy temporary file and getting its modification time, like this:

import os, tempfile

def get_filesystem_time():
    """
    get the current filesystem time by creating a temporary file and getting
    its modification time.
    """
    with tempfile.NamedTemporaryFile() as f:
        return os.stat(f.name).st_mtime

but I wonder if there is a cleaner solution.

See also How to change modification date for files in dir to match name order? for some other implications of that. — Stéphane Chazelas, Oct 27 '23 at 13:50

Stephen Kitt · Accepted Answer · 2023-10-27T11:29:05.437

The time used for file timestamps is the time at the last timer tick, which is always slightly in the past. The current_time function in inode.c calls ktime_get_coarse_real_ts64:

/**
 * current_time - Return FS time
 * @inode: inode.
 *
 * Return the current time truncated to the time granularity supported by
 * the fs.
 *
 * Note that inode and inode->sb cannot be NULL.
 * Otherwise, the function warns and returns time without truncation.
 */
struct timespec64 current_time(struct inode *inode)
{
    struct timespec64 now;

    ktime_get_coarse_real_ts64(&now);

    if (unlikely(!inode->i_sb)) {
        WARN(1, "current_time() called with uninitialized super_block in the inode");
        return now;
    }

    return timestamp_truncate(now, inode);
}

and the latter is part of a family of functions documented as follows:

These are quicker than the non-coarse versions, but less accurate, corresponding to CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE in user space, along with the equivalent boottime/tai/raw timebase not available in user space.

The time returned here corresponds to the last timer tick, which may be as much as 10ms in the past (for CONFIG_HZ=100), same as reading the 'jiffies' variable. These [functions] are only useful when called in a fast path and one still expects better than second accuracy, but can't easily use 'jiffies', e.g. for inode timestamps. Skipping the hardware clock access saves around 100 CPU cycles on most modern machines with a reliable cycle counter, but up to several microseconds on older hardware with an external clocksource.

Note the specific mention of inode timestamps.
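
As a rough user-space illustration of that correspondence, the sketch below reads the coarse and the fine realtime clock from Python and asks the kernel for the coarse clock's resolution, which on Linux should be the tick length (10 ms for CONFIG_HZ=100, 1 ms for CONFIG_HZ=1000). It assumes a Linux system. The fallback value 5 is the number Linux assigns to CLOCK_REALTIME_COARSE in its headers and is only used if the time module does not export that constant; clock_snapshot is a name made up for this example.

```python
import time

# Assumption: Linux only. CLOCK_REALTIME_COARSE is clock id 5 in the Linux
# headers; use that number if this Python build does not export the constant.
CLOCK_REALTIME_COARSE = getattr(time, "CLOCK_REALTIME_COARSE", 5)


def clock_snapshot():
    """Return (coarse_ns, fine_ns, coarse_resolution_s), read in that order."""
    coarse_ns = time.clock_gettime_ns(CLOCK_REALTIME_COARSE)
    fine_ns = time.clock_gettime_ns(time.CLOCK_REALTIME)
    resolution_s = time.clock_getres(CLOCK_REALTIME_COARSE)
    return coarse_ns, fine_ns, resolution_s


if __name__ == "__main__":
    coarse_ns, fine_ns, resolution_s = clock_snapshot()
    print(f"coarse clock lags the fine clock by {fine_ns - coarse_ns} ns")
    print(f"coarse clock resolution: {resolution_s * 1e3:.3f} ms")
```

The reported lag should stay below one tick, which matches the few milliseconds of difference seen in the question.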

I’m not aware of any way of avoiding this entirely, short of modifying the kernel. You can reduce the impact by increasing CONFIG_HZ. There has been a recent proposal to improve this, which is still being worked on.
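
For the narrower problem in the question (deciding whether a file was modified before or after a reference instant), one possible workaround is to take the reference instant from the same coarse clock rather than from time.time(). The following is only a sketch under stated assumptions: Linux, a filesystem with nanosecond timestamps, and the numeric fallback 5 for CLOCK_REALTIME_COARSE from the Linux headers. The helper names filesystem_now_ns and write_and_compare are hypothetical.

```python
import os
import tempfile
import time

# Assumption: Linux only; 5 is CLOCK_REALTIME_COARSE in the Linux headers.
CLOCK_REALTIME_COARSE = getattr(time, "CLOCK_REALTIME_COARSE", 5)


def filesystem_now_ns():
    """Hypothetical helper: read the coarse realtime clock in nanoseconds."""
    return time.clock_gettime_ns(CLOCK_REALTIME_COARSE)


def write_and_compare(directory):
    """Write a file and return (coarse reference, wall clock, mtime), all in ns."""
    reference = filesystem_now_ns()   # coarse clock, taken before the write
    wall_clock = time.time_ns()       # fine clock, also taken before the write
    path = os.path.join(directory, "test1")
    with open(path, "w") as f:
        f.write("hello")
    mtime = os.stat(path).st_mtime_ns
    return reference, wall_clock, mtime


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        reference, wall_clock, mtime = write_and_compare(d)
    print("mtime - coarse reference:", mtime - reference, "ns")
    print("mtime - time.time_ns()  :", mtime - wall_clock, "ns")
```

On such a system the first difference should be zero or one tick, while the second is typically negative, as in the question. If the filesystem truncates timestamps to a coarser granularity, the first difference can become negative too, so this does not replace checking the filesystem's timestamp resolution.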

wow, would have thought that on "expensive" operations like storing metadata, the linux kernel would have actually used the CPU HRT, but yeah, makes sense, this has to work on all machines, not just modern x86/aarch64, and enjoying the interestingness of different code paths for different architectures in file systems is something that I can see people concerned with reliability might not think too highly of! — Marcus Müller, Oct 26 '23 at 11:00
Investigated this a bit more in my answer, which is really just to be seen as an add-on to yours. — Marcus Müller, Oct 26 '23 at 11:26
"The time returned here corresponds to the last timer tick [. . .] same as reading the 'jiffies' variable. These are only useful when...". The "these" in that sentence refers to the void ktime_get_coarse* family of functions. This is a bit clearer if one reads the documentation page in context. Commenting here because it was confusing me. — terdon, Oct 26 '23 at 21:14
Funny thing is that POSIX allows for a bit of uncertainty in the mtime, but requires that the value be at or after the time of the last write, for reasons hinted at in the question. They should have added one tick. — hobbs, Oct 27 '23 at 03:35
@stephen-kitt but this means that in practice any filesystem with ns resolution does not really have ns resolution but a lower one, based on CONFIG_HZ (eg. 10ms if CONFIG_HZ=100). If this is the case, this should be made clearer in the documentation, IMHO — Alberto Pianon, Oct 30 '23 at 14:33

Marcus Müller · Answer 2 · 2023-10-26T12:10:22.420

Stephen Kitt's answer seems to be spot-on.

We can reproduce this very nicely by actually getting the same "coarse" clock that the filesystem uses, at least on my kernel configuration; a C program that always gets the coarse realtime clock before accessing the file takes either exactly the timestamp of the file, or (rarely) a timestamp one system tick earlier:

// excerpt from the program linked above, not a relicensing
// …
    clock_gettime(CLOCK_REALTIME_COARSE, &now_coarse);
    clock_gettime(CLOCK_REALTIME, &now);
    int fd = open("temp", O_WRONLY | O_CREAT);
    write(fd, data, length);
    close(fd);
    clock_gettime(CLOCK_REALTIME, &now_after);
    stat("temp", &props);

    printf("Differences relative to coarse clock before:\n"
           "Fine Realtime before:   %+8jd ns\n"
           "File Modification Time: %+8jd ns\n"
           "Realtime clock after:   %+8jd ns\n",
           ns_difference(&now_coarse, &now),
           ns_difference(&now_coarse, &props.st_mtim),
           ns_difference(&now_coarse, &now_after));

yields something like

Differences relative to coarse clock before:
Fine Realtime before:   +1551810 ns
File Modification Time:       +0 ns
Realtime clock after:   +1626199 ns

with the aforementioned rare occurrence of a circa tick-duration delta, which happens when the system tick falls just between the getting of the now_coarse and the modification of the file:

Differences relative to coarse clock before:
Fine Realtime before:   +1497562 ns
File Modification Time:  +999992 ns
Realtime clock after:   +1609943 ns

By the way, repeating the above until 10,000 occurrences of this "tick jump" had been collected shows that the range of observed tick deltas is very narrow (note that the "Percentage" line is really a fraction: 10000/777618 ≈ 0.013, i.e. about 1.3 % of the runs):

Total processed:     777618
Observed tick progressions:      10000
Percentage:                 0.013
Minimum tick delta:     999992
Maximum tick delta:     999993
Average tick delta: 999992.905900

In other words, we have extremely close timing when it comes to the deltas between ticks.
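
A similar measurement can be approximated from Python without touching the filesystem, by busy-polling the coarse clock and recording how far it jumps each time its value changes. This is a sketch rather than a port of the C program above: it measures the step between consecutive coarse clock values, assumes Linux, uses the numeric fallback 5 for CLOCK_REALTIME_COARSE from the Linux headers, and the function name observe_tick_deltas is made up for this example.

```python
import time

# Assumption: Linux only; 5 is CLOCK_REALTIME_COARSE in the Linux headers.
CLOCK_REALTIME_COARSE = getattr(time, "CLOCK_REALTIME_COARSE", 5)


def observe_tick_deltas(count=200, timeout_s=3.0):
    """Busy-poll the coarse clock; return the size in ns of each observed step."""
    deltas = []
    deadline = time.monotonic() + timeout_s
    last = time.clock_gettime_ns(CLOCK_REALTIME_COARSE)
    while len(deltas) < count and time.monotonic() < deadline:
        now = time.clock_gettime_ns(CLOCK_REALTIME_COARSE)
        if now != last:
            # The coarse clock only changes when the kernel updates it on a tick.
            deltas.append(now - last)
            last = now
    return deltas


if __name__ == "__main__":
    deltas = observe_tick_deltas()
    if deltas:
        print("observed steps:", len(deltas))
        print("minimum step  :", min(deltas), "ns")
        print("maximum step  :", max(deltas), "ns")
        print("average step  :", sum(deltas) / len(deltas), "ns")
```

If the polling process is descheduled for longer than a tick, a single step can span several ticks, so the maximum may come out as a multiple of the tick length.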

@ilkkachu why, thank you! (Originally, wanted to do this in C++, then decided against it, because I'd just end up calling C functions, then regretted C as soon as there was data to be handled. The usual.). — Marcus Müller, Oct 26 '23 at 13:12
Now I wonder how badly I hammered my XFS filesystem by running the above "open (create)", "close" "unlink" cycle a couple million times… — Marcus Müller, Oct 26 '23 at 13:34
@MarcusMüller: A negligible amount, I'm sure. Running that program in a shell loop on an XFS filesystem on an otherwise-idle disk, basically zero bytes written (like 68 bytes / sec on average in the second minute, probably just one write to update the timestamp on the directory I was in.) XFS does lazy allocation of file data, and apparently lazy-enough updates of directory metadata and inodes. (My mount options were rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota. If I'd had non-lazy atime, then there'd have been writes, although still probably only on writeback intervals.) — Peter Cordes, Oct 27 '23 at 01:14
