
Since disk space is allocated in blocks, is it a more accurate representation of the actual space consumed by a directory to report it in blocks rather than bytes?

If a file of size 1,025 bytes resides on a file system where space is doled out in units of 1,024-byte blocks, that file consumes two whole blocks. That seems more accurate than saying the file consumes 1,025 bytes of space.

Edit: File system in question is ext4, no dedupe, no compression, fwiw.

This is my attempt:

import math
import os

def getDirUsage(filepath, block_size=1024):  # block_size as reported by os.statvfs()
    '''
    Return the number of blocks consumed by a directory tree.
    '''
    # Divide by float(block_size) so math.ceil rounds up under Python 2 as well,
    # where '/' between two ints would silently floor instead.
    # Debatable whether the directory entry itself should be included in the size.
    total_size = int(math.ceil(os.path.getsize(filepath) / float(block_size)))
    for f in os.listdir(filepath):
        p = os.path.join(filepath, f)
        if os.path.isdir(p):
            total_size += getDirUsage(p, block_size)
        else:
            # Round each file up to whole blocks.
            total_size += int(math.ceil(os.stat(p).st_size / float(block_size)))
    return total_size
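
A minimal usage sketch, assuming the block size is taken from os.statvfs (the path below is just an example, not part of the question; f_frsize is the fundamental block size, which on ext4 normally equals f_bsize):

import os

path = '/home/user'                       # example directory, not from the question
vfs = os.statvfs(path)
print(getDirUsage(path, vfs.f_frsize))    # number of blocks consumed by the tree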

3 Answers


Counting exactly what a file really occupies on disk is not trivial; a thorough answer is more complex than just rounding the file size up to the next disk block.

What about:

  • hard links?
  • exotic filesystems with unusual ways of allocating disk space (half-block allocation)?
  • compressed filesystems?
  • fancy features such as "de-duplication"?

Add all this up and the task is close to impossible without proper inner knowledge of how the filesystem really works (so definitely not a plain stat on the file).

System tools

There are, however, many system tools designed for this, or with "occupied blocks" options. Here are two of the most used:

  • du

From man du (FreeBSD's, which is more verbose):

    DESCRIPTION
        The du utility displays the file system block usage for each file
        argument and for each directory in the file hierarchy rooted in each
        directory argument.  If no file is specified, the block usage of the
        hierarchy rooted in the current directory is displayed.
    
  • ls -s

    From man ls:

    -s, --size
          print the allocated size of each file, in blocks
    

(it seems ls "blocks" are in fact old-fashioned 1,024-byte blocks, not the real disk blocks)

Example:

$  dumpe2fs /dev/mapper/vg_centurion-lv_root |head -20 |grep ^Block
dumpe2fs 1.41.12 (17-May-2010)
Block count:              13107200
Block size:               4096

So our root filesystem has 4k blocks.

$ ls -l
total 88
-rw-------. 1 root root  2510 Mar 20 18:00 anaconda-ks.cfg
-rw-r--r--. 1 root root 67834 Mar 20 17:59 install.log
-rw-r--r--. 1 root root 12006 Mar 20 17:57 install.log.syslog

With du:

$ du -h anaconda-ks.cfg
4.0K    anaconda-ks.cfg

And ls:

$ ls -ls
total 88
 4 -rw-------. 1 root root  2510 Mar 20 18:00 anaconda-ks.cfg
72 -rw-r--r--. 1 root root 67834 Mar 20 17:59 install.log
12 -rw-r--r--. 1 root root 12006 Mar 20 17:57 install.log.syslog
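
For reference, the same block size can also be read from Python via os.statvfs, without needing dumpe2fs; a minimal sketch (the mount point is just an example):

import os

vfs = os.statvfs('/')        # '/' is just an example mount point
print(vfs.f_frsize)          # fundamental block size of the filesystem, e.g. 4096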
Ouki
  • From many other posts on SO and real life experience, I've learned it's a bad idea to parse the output of ls, du, df et al. – John Schmitt Apr 03 '14 at 08:57

Yes, blocks are better, as they reflect physical disk usage, but you need to obtain them in a different manner.

Use the os.stat field st_blocks * 512. https://docs.python.org/2/library/os.html#os.stat

os.stat(path)
Perform the equivalent of a stat() system call on the given path. (This function follows symlinks; to stat a symlink use lstat().)

On some Unix systems (such as Linux), the following attributes may also be available:

st_blocks - number of 512-byte blocks allocated for file
st_blksize - filesystem blocksize for efficient file system I/O

You will generally end up with a size that is a multiple of st_blksize, unless you're using one of those newfangled file systems that have variable block sizes or de-duplication. You'd hope that de-duplication on such file systems would be accounted for in the block count by the file system implementation, but de-duplication is somewhat like soft/hard links, only to physical data. Maybe the FS could divide the blocks across all files, or only report blocks for a single file? Probably not.

You need to account for soft/hard links as well. Currently you are adding in the size of the target of each soft link (use lstat for those instead). One simple way around double-counting hard links is to divide the block count by the number of hard links (st_nlink), so across a whole drive each inode is only counted once; otherwise you have to track inode numbers.
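
A minimal sketch pulling those pieces together, assuming Linux semantics (st_blocks counted in 512-byte units), using lstat so symlinks are not followed, and tracking (st_dev, st_ino) pairs so hard-linked files are only counted once; os.walk and the function name are my own choices, not from the answer:

import os

def get_dir_usage_bytes(root):
    '''Sum allocated bytes under root, using st_blocks (512-byte units).'''
    seen = set()                        # (st_dev, st_ino) pairs already counted
    total = os.lstat(root).st_blocks * 512
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            st = os.lstat(os.path.join(dirpath, name))  # lstat: don't follow symlinks
            key = (st.st_dev, st.st_ino)
            if key in seen:             # hard link to an inode already counted
                continue
            seen.add(key)
            total += st.st_blocks * 512
    return total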

Unless this is a learning exercise... as Ouki mentions, just use du, as other people have already thought about this stuff.

Matt

And don't forget sparse files:

$ dd if=/dev/null of=MEAN_FILE bs=1024k seek=1024k
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.0298e-05 s, 0.0 kB/s
$ ls -lh MEAN_FILE 
-rw-r--r-- 1 yeti yeti 1,0T Apr  3 09:44 MEAN_FILE
$ du MEAN_FILE 
0       MEAN_FILE
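
The same effect is easy to reproduce from Python; a minimal sketch (the file name is just an example) comparing apparent size with allocated blocks:

import os

# Truncating a new, empty file upward creates a hole: a large apparent
# size with (on most filesystems, including ext4) no blocks allocated.
with open('sparse_example', 'wb') as f:
    f.truncate(1024 * 1024 * 1024)       # 1 GiB apparent size

st = os.stat('sparse_example')
print(st.st_size)                         # 1073741824 (apparent size in bytes)
print(st.st_blocks * 512)                 # 0 allocated bytes expected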