While browsing folders via ncdu, I noticed that the apparent size of a file was sometimes much larger than the actual disk usage. Example via ncdu, then a to toggle between showing disk usage and showing apparent size, then i to show more details:

[screenshot: ncdu details view, apparent size much larger than disk usage]

I was told this may be due to some automatic process that keeps only a small portion of the data in a "fast" layer and keeps the rest in a slower place such as AWS S3. How can I check that?


As suggested by Chris Down, here is part of the output of hexdump run on that file:

[screenshot: hexdump output showing non-zero data throughout the file]

It seems to indicate the file isn't sparse.

As suggested by Artem S. Tashkinov, the file system is Lustre (checked with sudo df -T).

1 Answer

This is usual (but non-obvious) behaviour on hierarchical storage systems such as Lustre. ncdu and other sparseness-measuring tools generally rely on the st_blocks value returned by a stat call. On traditional, non-copy-on-write file systems, the behaviour is the one we’ve grown to expect: each file occupies at least as much space on disk as the non-zero data it contains, so st_blocks indicates at least the amount of actual non-zero data stored in the file. On Lustre, st_blocks represents only the storage used on the frontend system, not the overall amount of data stored in the file.
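
You can observe this mismatch directly by comparing the apparent size (st_size) with the allocated block count (st_blocks). A minimal sketch assuming GNU coreutils; myfile stands in for the file in question:

    # %s = apparent size (st_size), %b = allocated blocks (st_blocks),
    # %B = size of each block reported by %b (normally 512)
    stat --format='apparent: %s bytes, allocated: %b blocks of %B bytes' myfile

    # du shows the same discrepancy
    du -h myfile                   # allocated (disk usage)
    du -h --apparent-size myfile   # apparent size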

There is just one slight exception to this: “released” files (whose contents have been entirely removed from frontend storage) report that they occupy 512 bytes, not 0. This was implemented as a workaround for an issue with tools such as tar, which skip reading files with 0 st_blocks entirely, resulting in data loss on Lustre-based file systems (archiving a released file with sparse-file detection enabled, which is common in backup scenarios, would end up storing no data at all). When a file reports that it occupies 512 bytes, tools have to read it (or use the FIEMAP ioctl, etc.) to determine exactly what it contains; on Lustre, such actions prompt the file’s data to be retrieved from wherever it is stored in the hierarchy.
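
On Lustre specifically, you can ask the HSM layer directly whether a file has been released, rather than inferring it from block counts. A sketch assuming the Lustre client utilities are available; the path and output shown are illustrative:

    # Show the HSM flags for a file; a fully released file still reports
    # one 512-byte block to stat, but carries the "released" flag here
    lfs hsm_state /path/to/file
    # e.g. /path/to/file: (0x0000000d) released exists archived

    # The corresponding stat view: 1 allocated block of 512 bytes
    stat --format='%b blocks of %B bytes' /path/to/file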

With huge files, it’s unusual for the entire file to be restored to frontend storage, which is why you only end up with a partial “occupied block count” in some scenarios.
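
If you need a file’s data back on frontend storage (or want to push it out again) without relying on implicit triggers such as reads, the Lustre client utilities expose explicit commands for this. A sketch assuming HSM is configured on the file system, with an illustrative path:

    # Explicitly request that the file's contents be restored to frontend storage
    lfs hsm_restore /path/to/file

    # Release the frontend copy again, leaving only the 512-byte stub
    lfs hsm_release /path/to/file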

Stephen Kitt