The organization of files in a directory structure can markedly impact the efficiency of file access (ref). For the sake of example, consider two directories, A and B, each containing 10^6 files: in A they all reside in a single flat directory, while in B they are spread across 10^3 subdirectories of 10^3 files each. Counting or listing all files is reproducibly slower in the former case. On my system:
Create files:
$ mkdir A; pushd A; seq -w 1E6 | xargs touch; popd
$ mkdir B; pushd B; seq -w 1E3 | xargs mkdir; for d in *; do pushd "$d"; seq -w 1E3 | xargs touch; popd; done; popd
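As a quick sanity check, both trees can be confirmed to contain the same number of files (each command should report 1000000):
$ find A -type f | wc -l
$ find B -type f | wc -l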
List files:
$ for d in A B; do
      time for i in {1..100}; do
          {
              echo 3 | sudo tee /proc/sys/vm/drop_caches
              find "$d" -type f
          }
      done >/dev/null
  done
# case A (0 subdirectories)
real 4m49.805s
user 1m43.696s
sys 1m13.540s
# case B (1000 subdirectories)
real 3m32.338s
user 0m40.824s
sys 1m13.176s
The difference is reproducible, independent of disk caching, and not specific to the find command (a difference of the same magnitude is found using ls -RU). The time spent in kernel space is essentially the same in both cases (1m13.5s vs. 1m13.2s), exonerating the kernel (and likely filesystem mechanics as well). Though I haven't done any profiling, the main syscalls being made are almost certainly readdir() and getdents(), and since the number of inodes is the same in both cases (to within 0.1%), as is the size of the files, the time required to execute these calls in the kernel would not be expected to differ. The difference in execution time is therefore attributable to user-space code.
Thread support has been added to some GNU coreutils (e.g., sort). Since my system has four hardware threads and I was not sure whether GNU find (my system's version is 4.7.0-git) has multithreaded capabilities, I verified that the discrepancy persists when the process is explicitly bound to a single hardware thread:
$ cat find.sh
#!/bin/bash
for i in {1..100}; do
    {
        echo 3 | sudo tee /proc/sys/vm/drop_caches
        find "$1" -type f
    }
done >/dev/null
$ for d in A B; do time taskset -c 0 ./find.sh "$d"; done
# case A (0 subdirectories)
real 4m7.827s
user 1m31.080s
sys 0m55.660s
# case B (1000 subdirectories)
real 2m53.059s
user 0m33.256s
sys 0m52.772s
Thus, my original question can be refined as follows: what user-space inefficiencies underlie the disparity in access times that stems purely from differences in filesystem organization? Knowledge of these inefficiencies should permit better implementations of file-access routines.
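Since the excess time is in user space, a profiler should be able to localize it. The following sketch (assuming perf is available; I have not run this myself) would show which user-space symbols dominate each case:
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ perf record -g -- find A -type f >/dev/null    # repeat for B
$ perf report --sort=dso,symbol                  # compare hot user-space symbols between the two runs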
Edit: I am using an ext4 filesystem on a machine running Linux kernel 4.9.0-4-amd64, though I would be interested to know to what extent the answer depends on the filesystem chosen.
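To probe that filesystem dependence, the same trees could be recreated on a scratch filesystem built in a loopback image and the timing loop rerun there; a rough sketch (the image size, paths, and the -N inode reservation are illustrative guesses):
$ dd if=/dev/zero of=/tmp/scratch.img bs=1M count=8192    # 8 GiB scratch image
$ mkfs.ext4 -N 1200000 /tmp/scratch.img                   # reserve enough inodes for 10^6 files
$ sudo mkdir -p /mnt/scratch
$ sudo mount -o loop /tmp/scratch.img /mnt/scratch
$ # recreate A and B under /mnt/scratch and repeat the timing loop above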