The organization of files in a directory structure can markedly impact the efficiency of file access (ref). For the sake of example, consider two directories, A and B, each containing 10^6 files: in A they all reside in a single flat directory, while in B they are spread across 10^3 subdirectories of 10^3 files each. Counting or listing all files is reproducibly slower in the former case. On my system:
Create files:
$ mkdir A; pushd A; seq -w 1E6 | xargs touch; popd
$ mkdir B; pushd B; seq -w 1E3 | xargs mkdir; for d in *; do pushd "$d"; seq -w 1E3 | xargs touch; popd; done; popd
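As a quick sanity check, both trees can be confirmed to contain the same number of files (each command should report 1000000):
$ find A -type f | wc -l
$ find B -type f | wc -l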
List files:
$ for d in A B; do
      time for i in {1..100}; do
          {
              echo 3 | sudo tee /proc/sys/vm/drop_caches
              find "$d" -type f
          }
      done >/dev/null
  done
# case A (0 subdirectories)
real 4m49.805s
user 1m43.696s
sys 1m13.540s
# case B (1000 subdirectories)
real 3m32.338s
user 0m40.824s
sys 1m13.176s
The difference is reproducible, independent of disk caching, and not specific to the find command (a difference of the same magnitude is found using ls -RU). The time spent in kernel space is essentially the same in both cases (1m13.5s vs. 1m13.2s), exonerating the kernel (and likely filesystem mechanics as well). Though I haven't done any profiling, the main syscalls being made are almost certainly readdir() and getdents(), and since the number of inodes is the same in both cases (to within 0.1%), as is the size of the files, the time required to execute these calls in the kernel would not be expected to differ. The difference in execution time is therefore attributable to user-space code.
Thread support has been added to some GNU coreutils (e.g., sort). Since my system has four hardware threads and I was not sure whether GNU find (my system's version is 4.7.0-git) has multithreaded capabilities, I verified that the discrepancy persists when the process is explicitly bound to a single hardware thread:
$ cat find.sh
#!/bin/bash
for i in {1..100}; do
    {
        echo 3 | sudo tee /proc/sys/vm/drop_caches
        find "$1" -type f
    }
done >/dev/null
$ for d in A B; do time taskset -c 0 ./find.sh "$d"; done
# case A (0 subdirectories)
real 4m7.827s
user 1m31.080s
sys 0m55.660s
# case B (1000 subdirectories)
real 2m53.059s
user 0m33.256s
sys 0m52.772s
Thus, my original question can be refined as follows: what user-space inefficiencies underlie the disparity in access times that stems purely from differences in filesystem organization? Knowledge of these inefficiencies should permit better implementations of file-access routines.
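Since the excess time is in user space, a profiler should be able to localize it. The following sketch (assuming perf is available; I have not run this myself) would show which user-space symbols dominate each case:
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ perf record -g -- find A -type f >/dev/null    # repeat for B
$ perf report --sort=dso,symbol                  # compare hot user-space symbols between the two runs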
Edit: I am using an ext4 filesystem on a machine running Linux kernel 4.9.0-4-amd64, though I would be interested to know to what extent the answer depends on the filesystem chosen.
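To probe that filesystem dependence, the same trees could be recreated on a scratch filesystem built in a loopback image and the timing loop rerun there; a rough sketch (the image size, paths, and the -N inode reservation are illustrative guesses):
$ dd if=/dev/zero of=/tmp/scratch.img bs=1M count=8192    # 8 GiB scratch image
$ mkfs.ext4 -N 1200000 /tmp/scratch.img                   # reserve enough inodes for 10^6 files
$ sudo mkdir -p /mnt/scratch
$ sudo mount -o loop /tmp/scratch.img /mnt/scratch
$ # recreate A and B under /mnt/scratch and repeat the timing loop above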