Why does reading a file with hexdump sometimes change find's sparseness value (%S)?

Question

I use a Lustre file system. I've noticed that if I check find's sparseness value (%S) for a file, print the file with hexdump, and then check find's sparseness value again, the value has sometimes changed. Why does it change?

Command to look at find's sparseness value (%S) for the file myvideo.mp4:

find myvideo.mp4 -printf "%S"

Command to read the file myvideo.mp4 with hexdump:

hexdump myvideo.mp4

I noticed that behavior on several files. Examples of changes of find's sparseness values (%S):

0.000135559 to 0.631297
0.00466808 to 0.228736

Is it because the file is partly cached locally when it is read with hexdump? I noticed that this change isn't specific to hexdump, e.g. the same happens with nano (and likely with any other program that reads the file):

dernoncourt@server:/videos$ find myvideo.mp4 -printf "%S"
0.00302331
dernoncourt@server:/videos$ nano myvideo.mp4 
dernoncourt@server:/videos$ find myvideo.mp4 -printf "%S"
0.486752

From the manual about %S: *This is calculated as '(BLOCKSIZE*st_blocks / st_size)'*. (diskusage/apparentsize) So it's going to be the same reasons as in your other question. — Stéphane Chazelas, Mar 10 '24 at 08:33
@StéphaneChazelas thanks, I also voted to close the question but in hindsight a short answer repeating the first sentence of your comment+ a link to the other questions explaining the apparent size variations would be useful I think. — Franck Dernoncourt, Mar 10 '24 at 15:27

Franck Dernoncourt · Answer 1 · 2024-03-10T22:33:49.627

(All the information in this answer comes from Stéphane Chazelas's comment. Thanks!)

From find's man page (formatting is mine):

%S: File's sparseness. This is calculated as (BLOCKSIZE*st_blocks / st_size). The exact value you will get for an ordinary file of a certain length is system-dependent. However, normally sparse files will have values less than 1.0, and files which use indirect blocks may have a value which is greater than 1.0. The value used for BLOCKSIZE is system-dependent, but is usually 512 bytes. If the file size is zero, the value printed is undefined. On systems which lack support for st_blocks, a file's sparseness is assumed to be 1.0.

Note that:

BLOCKSIZE*st_blocks corresponds to the disk usage,
st_size corresponds to the apparent size.

Therefore, %S = BLOCKSIZE*st_blocks / st_size = diskusage/apparentsize.
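To make the formula concrete, here is a minimal sketch that recomputes the same ratio by hand. It assumes GNU stat (for the -c option) and awk; the function name sparseness is hypothetical and only used for illustration. The result should be close to what find prints for %S, although the exact formatting may differ.

```shell
# Hypothetical helper: allocated bytes divided by apparent size (GNU stat assumed).
sparseness() {
    # %b = allocated blocks (st_blocks), %B = bytes per block reported by %b,
    # %s = apparent size in bytes (st_size)
    stat -c '%b %B %s' -- "$1" |
        awk '{ if ($3 == 0) print "undefined"; else printf "%g\n", $1 * $2 / $3 }'
}
```

Usage: sparseness myvideo.mp4. For an empty file the helper prints "undefined", mirroring the man page's note that the value is undefined when the file size is zero.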

Furthermore, this answer addresses why the apparent size of a file is sometimes much larger than the actual disk usage on hierarchical storage systems such as Lustre even though it may not be sparse at all. It also mentions that it’s unusual for a large file to be entirely restored to frontend storage.

Since files are sometimes partly restored to frontend storage when read by programs such as hexdump or nano, the st_blocks value increases while st_size stays the same. This increases %S (= BLOCKSIZE*st_blocks / st_size), as in the examples from the question (e.g., 0.00302331 to 0.486752), which means the file looks less sparse to find after it has been read.
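One way to check that it is st_blocks rather than st_size that moves is to print both fields before and after a read. The sketch below assumes GNU stat; the function name read_and_compare is hypothetical, and cat stands in for any program that reads the whole file. On an ordinary local file system the two lines would normally be identical.

```shell
# Hypothetical helper: print "st_blocks st_size" before and after a full read.
read_and_compare() {
    before=$(stat -c '%b %s' -- "$1")
    cat -- "$1" > /dev/null    # any reader would do; cat is only an example
    after=$(stat -c '%b %s' -- "$1")
    printf 'before: %s\nafter:  %s\n' "$before" "$after"
}
```

Usage: read_and_compare myvideo.mp4. If the explanation above holds, the first number (allocated blocks) may grow between the two lines while the second (apparent size) stays fixed.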

Since find myvideo.mp4 -printf "%S" does not reliably tell whether a file is sparse on a Lustre file system, one can use the following solution by frostschutz to check if a file is sparse:

file="myvideo.mp4"
if [ "$((`stat -c '%b*%B-%s' -- "$file"`))" -lt 0 ]
then
    echo "$file" is sparse
else
    echo "$file" is not sparse
fi

Note that this solution has a few limitations, as noted in the comment section.

Also, note that Lustre does not directly support the SEEK_HOLE/SEEK_DATA interface currently, so we can't use solutions that rely on it to know if a file is sparse on Lustre.
