57

We have a large file system on which a full du (disk usage) summary takes over two minutes. I'd like to find a way to speed up a disk usage summary for arbitrary directories on that file system.

For small branches I've noticed that du results seem to be cached somehow, as repeat requests are much faster, but on large branches the speed-up becomes negligible.

Is there a simple way of speeding up du, or more aggressively caching results for branches that haven't been modified since the previous search?

Or is there an alternative command that can deliver disk usage summaries faster?

  • 12
    Two minutes doesn't seem that long to me. But the real question is: "Do you really want du to cache anything?" Shouldn't du give you exact, as-current-as-possible, real disk block counts? –  Mar 02 '11 at 17:47
  • I agree that replacing du would be bad, but a faster wrapper script with an identical interface would be very useful for us. Further, I would expect that caching results dependent on last-modified time (and assuming no disk-wide operations, eg. defragmentation) would give exact size results: am I missing something? – Ian Mackinnon Mar 02 '11 at 18:11
  • 2
    If you are concerned about too much disk usage you might consider implementing a quota. – pyasi Mar 02 '11 at 19:20
  • 3
    Bruce - you could ask the same question about find. But then there's locate. – Yuval Jul 19 '13 at 12:16
  • If you're on Android, take a look at StatFs for a super fast estimate of directory sizes. It was nearly 1000x faster for large, complex directories, compared to du. – Joshua Pinter Oct 16 '19 at 17:10
  • In some cases the df command is enough (and fast). – Michel de Ruiter Apr 26 '21 at 09:36
  • I wonder how du does directory traversal: does it follow physical disk layout, similar to fastar? – Nemo Dec 07 '21 at 10:05

8 Answers

31

Common usage of du can be immensely sped up by using ncdu.

ncdu - NCurses Disk Usage

It performs the du scan, caches the results and shows them in a nice command-line (ncurses) interface, somewhat comparable to du -hc -d 1 | sort -h. The initial indexing takes as long as du, but looking for the actual "culprit" that fills up precious space is sped up, because all subdirectories have the cached du information available.

If needed, subdirectories can be refreshed by pressing R and files/folders can be deleted by pressing D, both of which update stats for all parent directories. Deletion asks for confirmation.

If necessary, a further speedup can be achieved by running ncdu -1xo- / | gzip > export.gz in a cron job to pre-cache the scan, and later accessing it with zcat export.gz | ncdu -f-, but this obviously gives more outdated information.
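
For example, the cron part could look something like this (the export path and schedule are placeholders, not part of the answer above):

# hypothetical /etc/cron.d entry: export an ncdu scan of / every night at 04:00
0 4 * * * root ncdu -1xo- / | gzip > /var/cache/ncdu-export.gz

# later, browse the snapshot without rescanning the disk (data is only as fresh as the last export)
zcat /var/cache/ncdu-export.gz | ncdu -f-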

Philippos
DennisH
  • 1
    ncdu -1xo- / | gzip >export.gz in a cronjob and later accessing it with zcat export.gz | ncdu -f-

that's such a nice approach

    – temple Mar 08 '23 at 19:38
28

What you are seeing when you rerun a du command is the effect of disk buffering. Once a block has been read, it is kept in the buffer cache until that memory is needed for something else. For du you need to read the directory and the inode of each file in the directory. The du results are not cached in this case, but they can be derived with far less disk I/O.
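
You can see the buffering effect directly by timing a cold run against a warm one. A rough sketch (dropping the caches requires root, and the path is just an example):

# drop the page, dentry and inode caches so the first run is "cold" (root required)
sync
echo 3 > /proc/sys/vm/drop_caches

time du -s /some/large/dir    # cold: directories and inodes are read from disk
time du -s /some/large/dir    # warm: the same metadata is served from the buffer cache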

While it would be possible to force the system to cache this information, overall performance would suffer as the required buffer space would not be available for actively accessed files.

The directory itself has no idea how large a file is, so each file's inode needs to be accessed. To keep a cached total up to date, the cached value would have to be updated every time a file changed size. Since a file can be listed in zero or more directories, this would require each file's inode to know which directories it is listed in, which would greatly complicate the inode structure and reduce I/O performance. Also, since du lets you request results for different block sizes, the cache would have to maintain a value for each possible block size, further slowing performance.

BillThor
  • Has anyone tried to implement a filesystem with the feature you describe in the last paragraph? I'd love to read a more elaborate discussion by people who have attempted this, but can't find anything. I also wonder how much worse would IO performance actually be - catastrophically worse or within the same order of magnitude of a "normal" filesystem? – smheidrich May 19 '21 at 14:33
  • 1
    @smheidrich I believe FAT file systems combine the inode data into the directory entry. However, these file systems don't have the file link capabilities that most modern file systems provide. Implementing the capability into an inode-based file system could lead to deadlocking of the file structures. I/O performance should normally be only slightly slower, but for multiply linked files could become extremely slow. The inode structure would also become more complex, as would the transactions to update file size. – BillThor May 23 '21 at 01:32
23

duc (see https://duc.zevv.nl) might be what you're looking for.

Duc stores the disk usage in an optimized database, resulting in a fast user interface. No wait times once the index is complete.

Updating the index is very fast for me (less than 10 seconds for around 950k files in 121k directories, 2.8 TB). It has a GUI and an ncurses UI as well.

Usage e.g.:

duc index /usr
duc ui /usr

From the website:

Duc is built to scale to huge filesystems: it will index and display hundreds of millions of files on petabytes of storage without problems.
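
Since re-indexing is cheap, one workable pattern (a sketch: the schedule and paths are my placeholders, and it assumes the cron job runs as the same user so both commands use the same default database) is to refresh the index from cron and only ever browse the pre-built database:

# hypothetical crontab entry: refresh the duc index for /home every hour
0 * * * * /usr/bin/duc index /home

# browse or list from the index; no directory walk happens at this point
duc ls /home/some/project
duc ui /home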

Peter
10

I prefer to use agedu.

Agedu is a piece of software which attempts to find old and irregularly used files on the presumption that these files are most likely not to be wanted. (e.g. Downloads which have only been viewed once.)

It does basically the same sort of disk scan as du, but it also records the last-access times of everything it scans. Then it builds an index that lets it efficiently generate reports giving a summary of the results for each subdirectory, and then it produces those reports on demand.
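
A minimal round trip looks roughly like this (the paths are just examples; check agedu(1) for the exact options on your version):

agedu -s /srv/data         # scan once and write the index (agedu.dat) in the current directory
agedu -w                   # serve an interactive HTML report on localhost
agedu -t /srv/data/old     # or print a du-style text summary straight from the index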

SHW
8

If you can arrange for the different hierarchies of files to belong to different groups, you can set up disk quotas. Don't give an upper limit (or make it the size of the disk) unless you want one. You'll still be able to tell instantly how much of its (effectively infinite) quota the group is using.

This does require that your filesystem supports per-group quotas. Linux's Ext[234] and Solaris/*BSD/Linux's zfs do. It would be nice for your use case if group quotas took ACLs into account, but I don't think they do.
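
A rough sketch on Linux ext4 (the mount point is a placeholder, and the filesystem must be mounted with the grpquota option):

# create the group quota files and switch group quota accounting on
quotacheck -cgm /srv/data
quotaon -g /srv/data

# instant per-group usage report, with no tree walk
repquota -g /srv/data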

7

As mentioned by SHW, agedu indeed creates an index. I thought I'd share another way to create an index, after reading about locatedb. You can create your own version of a locatedb from du output:

du | awk '{print $2,$1}' | /usr/lib/locate/frcode > du.locatedb

awk rearranges the du output to have filenames first, so that frcode works right. Then use locate with this database to quickly report disk usage:

locate --database=du.locatedb pingus

You can expand this to suit your needs. I think it's a nice use of locatedb.
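
If you go this route, a small wrapper can keep the database reasonably fresh. An untested sketch (the database path, the directory, the frcode location and the one-day threshold are my choices, and the awk step will mangle filenames containing spaces):

#!/bin/sh
# rebuild the du database if it is missing or more than a day old, then query it
db=$HOME/.du.locatedb
if [ ! -f "$db" ] || [ -n "$(find "$db" -mmin +1440)" ]; then
    du /srv 2>/dev/null | awk '{print $2,$1}' | /usr/lib/locate/frcode > "$db"
fi
locate --database="$db" "$1"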

exussum
Yuval
2

I have a cron job set up to run updatedb every 10 minutes. It keeps all the filesystem buffers nice and fresh. Might as well use that cheap RAM for something good. Use slabtop to see 'before' and 'after'.
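
Concretely, that amounts to something like the following (the schedule and the slabtop invocation are just an illustration):

# hypothetical /etc/cron.d entry: prime the metadata caches every 10 minutes
*/10 * * * * root /usr/bin/updatedb

# one-shot slabtop snapshot; compare the dentry and inode cache lines before and after
slabtop -o | head -n 20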

Marcin
  • I don't understand how your answer relates to the question. updatedb says nothing about disk usage. If you're doing it just to traverse the disk, you're going to hurt overall performance. – Gilles 'SO- stop being evil' May 05 '11 at 21:54
  • 3
    Counting up file sizes for du is slow because you have to access metadata of a potentially large number of files, scattered around the disk. If you run updatedb aggressively, the metadata for all the files is forced to be stored in RAM. The next time you run any other metadata-heavy operation, instead of doing thousands of seeks across the disks, you use the cache. Normally you have a small chance of having that particular portion of the tree's metadata cached. With my 'metadata cache priming' it's highly probable that the data you want is freshly cached. No physical seeks == FAST. – Marcin May 06 '11 at 10:33
2

If you only need to know the size of the directory, you can speed it up a lot by simply avoiding writing all the per-directory information to the screen. Since the grand total is the last line of du's output, you can simply pipe it to tail.

du -hc | tail -n 1

A 2 GB directory structure takes over a second for the full listing, but less than a fifth of that with this form.
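
You can measure the difference yourself, for example (timings vary with directory size and terminal speed; in bash, time covers the whole pipeline):

time du -hc ~/some/big/dir               # writes every subdirectory to the terminal
time du -hc ~/some/big/dir | tail -n 1   # writes only the grand total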

sam
Frank