
I am looking for the most efficient method to determine the disk space usage by directory on a 5 petabyte disk.

The directory locations that I am trying to analyse are as follows:

/disk/user1/task1/
/disk/user1/task2/
/disk/user2/task3/
/disk/user100/task1/
etc

I need to find the size of each task and was wondering what is the most efficient command to use.

Currently I have tried ncdu -rx (this looks like it is going to take a couple of days).

Anybody know of a better method?

I am not the most adept at these commands so would appreciate it if the answers are spelt out.

James
  • Unfortunately quotas are not in place, and I do have the current disk usage, so I know how much I have left. The problem I am having is identifying the oversized directories to target for removal/cleaning. – James Dec 19 '17 at 17:33
  • Are the "tasks" directories under /disk/user/task/ or are they in /disk/user/task1, /disk/user/task2, etc? – Jeff Schaller Dec 19 '17 at 17:42
  • Yes, the directories I am trying to find the size of are as follows: /disk/user1/task1, /disk/user1/task2, /disk/user2/task1 etc etc... – James Dec 19 '17 at 17:44
  • That ↑ is different to what you said in your question. Please could you update it so that we can aim to solve the question you intended to ask. – Chris Davies Dec 19 '17 at 18:38

2 Answers


In situations like this, the slow part is not the size of the files, but the number of files. ncdu, du, and their ilk require stat()ing every single file, so if there are a lot of them, you're going to have a bad time.

If file size is related to the number of files (for example, if individual file sizes are roughly bounded), you may have some luck counting the files and grouping the counts by directory first to narrow down your list. In the basic case, this doesn't involve issuing stat() at all, mostly just readdir().

Unfortunately, common tools for this, like GNU find and friends, issue fstat() for each file no matter what, at least on my system. You could fairly easily write a small C program to get around this, using just opendir, readdir, and counting the number of objects returned.
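
For illustration only, here is a minimal sketch of what such a program could look like (it is not part of the original answer). It counts directory entries using only opendir()/readdir(), recursing into subdirectories via the d_type field (a Linux/BSD extension, not guaranteed by POSIX). Filesystems that report DT_UNKNOWN would need a stat() fallback, so treat this as a rough starting point:

/* Rough sketch: count entries per directory tree with opendir()/readdir()
 * only, never calling stat(). Recursion relies on d_type (DT_DIR), a
 * Linux/BSD extension that some filesystems do not fill in. */
#define _DEFAULT_SOURCE   /* expose d_type/DT_DIR on glibc */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static long count_entries(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL) {
        perror(path);
        return 0;
    }

    long count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        /* skip the "." and ".." entries */
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        count++;
        if (ent->d_type == DT_DIR) {
            /* descend into subdirectories; DT_UNKNOWN is not handled here */
            char sub[4096];
            snprintf(sub, sizeof sub, "%s/%s", path, ent->d_name);
            count += count_entries(sub);
        }
    }

    closedir(dir);
    return count;
}

int main(int argc, char *argv[])
{
    for (int i = 1; i < argc; i++)
        printf("%ld\t%s\n", count_entries(argv[i]), argv[i]);
    return 0;
}

Compiled with something like cc -O2 -o countfiles countfiles.c (the file name is just an example), it can be run as ./countfiles /disk/user*/task* to get a rough per-task entry count without paying for a stat() per file.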

If file size is not related to the number of files, you're out of luck, though. In future, consider setting up the filesystem in a way that allows O(1) (or similar) accounting of disk usage, using smaller partitions, or using things like btrfs subvolumes (which also have O(1) accounting).

Chris Down
  • Thank you for the reply. You are correct that the number of files is also an issue. Unfortunately, as you stated, the file size is not related to the number of files. I am therefore in the situation where I need to view the directory sizes and the only method is a very long-winded ncdu. – James Dec 20 '17 at 09:44

I would use a variation of "How do I get the size of a directory on the command line?":

du -sm /disk/user*/task* | sort -n | tee /tmp/disk-usage.rpt

Which does three things:

  • gathers the sum-total (-s) disk usage of each of the task directories under all of the user directories, in megabytes (-m)
  • sorts the output numerically by the first column; this puts the largest task directories at the bottom (put them at the top instead by reversing the sort with sort -rn)
  • sends a copy of that output both to your screen and to a file at /tmp/disk-usage.rpt

The saved copy of the output keeps you from having to re-run the du command (unless you want to) when you go back to investigate the next-largest task directory.

Jeff Schaller
  • If ncdu is choking on this, I doubt this will be much faster. ncdu's traversal looks just as efficient as this, see src/dir_scan.c: https://g.blicky.net/ncdu.git/tree/src/dir_scan.c – Chris Down Dec 19 '17 at 17:56
  • Agreed that nothing may be fast, but if you need the size of everything, as you pointed out, you need to stat everything. I thought saving the results to a file might help a tiny bit. – Jeff Schaller Dec 19 '17 at 18:00