
I am writing four 4KB blocks to a file. The writes are consistently around 50% slower if I have used fallocate() to pre-allocate 9 blocks for the file, instead of pre-allocating only the 4 blocks I write. Why?

There seems to be a cut-off point between pre-allocating 8 and 9 blocks. I'm also wondering why, in the faster runs, the 1st and 2nd block writes are consistently slower than the 3rd and 4th.

This test is boiled down from some file copy code I'm playing with. Inspired by this question about dd (https://unix.stackexchange.com/questions/467844/dd-issues-as-always/467869), I am using O_DSYNC writes so that I can measure the real progress of the disk writes. (The full idea was to start by copying a small block to measure minimum latency, then adaptively increase the block size to improve throughput.)

I am testing on Fedora 28, on a laptop with a spinning hard disk drive. It was upgraded from an earlier Fedora release, so the filesystem is not brand-new. I don't think I have changed any of the filesystem defaults.

  • Kernel: 4.17.19-200.fc28.x86_64
  • Filesystem: ext4, on LVM.
  • Mount options: rw,relatime,seclabel
  • Fields from tune2fs -l
    • Default mount options: user_xattr acl
    • Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
    • Filesystem flags: signed_directory_hash
    • Block size: 4096
    • Free blocks: 7866091

Timings from strace -s3 -T test-program.py:

openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000048>
write(3, "\0\0\0"..., 4096)             = 4096 <0.036378>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033380>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033359>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033399>
close(3)                                = 0 <0.000033>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000110>
fallocate(3, 0, 0, 16384)               = 0 <0.016467>
fsync(3)                                = 0 <0.000201>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033062>
write(3, "\0\0\0"..., 4096)             = 4096 <0.013806>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008324>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008346>
close(3)                                = 0 <0.000025>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000070>
fallocate(3, 0, 0, 32768)               = 0 <0.019096>
fsync(3)                                = 0 <0.000311>
write(3, "\0\0\0"..., 4096)             = 4096 <0.032882>
write(3, "\0\0\0"..., 4096)             = 4096 <0.010824>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008188>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008266>
close(3)                                = 0 <0.000012>
openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000050>
fallocate(3, 0, 0, 36864)               = 0 <0.022417>
fsync(3)                                = 0 <0.000260>
write(3, "\0\0\0"..., 4096)             = 4096 <0.032953>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033265>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033317>
write(3, "\0\0\0"..., 4096)             = 4096 <0.033237>
close(3)                                = 0 <0.000019>

test-program.py:

#! /usr/bin/python3
import os

# Required third party module,
# install with "pip3 install --user fallocate".
from fallocate import fallocate

block = b'\0' * 4096

for alloc in [0, 4, 8, 9]:
    # Open file for writing, with implicit fdatasync().
    fd = os.open("out.tmp", os.O_WRONLY | os.O_DSYNC |
                            os.O_CREAT | os.O_TRUNC)

    # Try to pre-allocate space
    if alloc:
        fallocate(fd, 0, alloc * 4096)

    os.write(fd, block)
    os.write(fd, block)
    os.write(fd, block)
    os.write(fd, block)

    os.close(fd)
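
For readers without the third-party fallocate module, a rough equivalent can be sketched with the standard library alone. This is a variant under stated assumptions, not the program that produced the traces above: it uses os.posix_fallocate() (which on Linux with glibc normally ends up in the same fallocate() system call), it adds an explicit os.fsync() after the pre-allocation, and it times each write in-process with time.perf_counter() instead of strace. The name time_sync_writes is made up for illustration.

```python
import os
import time

BLOCK = b"\0" * 4096


def time_sync_writes(path, prealloc_blocks, n_writes=4):
    """Return the latency in seconds of each O_DSYNC write to `path`.

    Hypothetical helper: pre-allocates `prealloc_blocks` blocks (0 = none)
    and then writes `n_writes` blocks sequentially from offset 0.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        if prealloc_blocks:
            # Assumption: stands in for fallocate(fd, 0, length) from the
            # third-party module used in the original program.
            os.posix_fallocate(fd, 0, prealloc_blocks * len(BLOCK))
            os.fsync(fd)

        latencies = []
        for _ in range(n_writes):
            start = time.perf_counter()
            os.write(fd, BLOCK)
            latencies.append(time.perf_counter() - start)
        return latencies
    finally:
        os.close(fd)
```

Calling time_sync_writes("out.tmp", alloc) for each alloc in [0, 4, 8, 9] would mirror the loop above. The absolute numbers will differ from the strace timings, since strace adds its own overhead.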
sourcejedi
  • Your results do not surprise me at all. For the fallocate, see this https://unix.stackexchange.com/questions/292843/way-to-instantly-fill-up-use-up-lots-of-disk-space/292849#292849 and https://unix.stackexchange.com/questions/362789/files-created-by-fallocate-compared-to-normal-text-files/362805#362805 – Rui F Ribeiro Sep 15 '18 at 18:10
  • is the media SSD or spinning , if spinning is it shingled? – Jasen Sep 15 '18 at 21:17
  • @Jasen "I am testing Fedora 28, on a laptop with a spinning hard disk drive." It's the main drive. No shingling. – sourcejedi Sep 15 '18 at 22:22

2 Answers


The difference between 8 and 9 4KB blocks arises because ext4 uses a heuristic when converting an unwritten (allocated but uninitialized) extent created by fallocate() into a normal, written extent. For unwritten extents of 32KB or less, it just fills the whole extent with zeroes and rewrites the whole thing, while larger extents are split into two or three smaller extents and written out.

In the 8-block case, the whole 32KB extent is converted to a normal extent, the first 16KB is written with your data and the remainder is zero-filled and written out. In the 9-block case, the 36KB extent is split (because it is over 32KB), and you are left with a 16KB extent for your data and a 20KB unwritten extent.

Strictly speaking, the remaining 20KB unwritten extent is itself under 32KB, so it should also just be zero-filled and written out, but I suspect ext4 doesn't do that. Even if it did, that would only move the break-even point a bit (to 16KB + 32KB = 48KB, i.e. 12 blocks in your case); it wouldn't change the underlying behavior.
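
The arithmetic in the last three paragraphs can be written down as a toy model. This is only a restatement of the description above, not ext4's actual code: the 32KB limit is the figure quoted in this answer, the function name convert_unwritten is made up, and the model treats the 16KB of data as one unit rather than four separate 4KB writes.

```python
ZEROOUT_LIMIT_KB = 32  # threshold quoted in this answer (assumption)


def convert_unwritten(extent_kb, data_kb, limit_kb=ZEROOUT_LIMIT_KB):
    """Toy model of the heuristic described above (not ext4 source code).

    Returns a list of (state, size_kb) tuples describing the layout after
    `data_kb` of data is written at the start of an unwritten extent.
    """
    if extent_kb <= limit_kb:
        # Small extent: zero-fill and convert the whole thing at once.
        return [("written", extent_kb)]
    # Large extent: split off the written part, leave the rest unwritten.
    return [("written", data_kb), ("unwritten", extent_kb - data_kb)]


# 8 blocks pre-allocated: one 32KB written extent.
print(convert_unwritten(32, 16))
# 9 blocks pre-allocated: 16KB written plus 20KB still unwritten.
print(convert_unwritten(36, 16))
```

The split in the second case is what costs the extra metadata updates on each synchronous write.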

You could use filefrag -v out.tmp after the first write to see the block allocation layout on disk.
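
If you want to capture that layout from inside the test program rather than from a shell, one possible sketch is to shell out to filefrag. The helper name extent_layout is hypothetical, and the function returns None when filefrag is not installed or the filesystem does not support the query.

```python
import shutil
import subprocess


def extent_layout(path):
    """Return the text output of `filefrag -v path`, or None if unavailable."""
    if shutil.which("filefrag") is None:
        return None  # e2fsprogs not installed or not on PATH
    result = subprocess.run(
        ["filefrag", "-v", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None  # e.g. the filesystem does not support extent mapping
    return result.stdout
```

Calling extent_layout("out.tmp") right after the first os.write() and printing the result would show whether the pre-allocated extent has been split.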

That said, you could just avoid fallocate() and O_DSYNC completely and let the filesystem do its job of writing out the data as quickly as possible, instead of making the file layout worse than it needs to be.

sourcejedi
LustreOne
  • +1. Re last paragraph. Durability is one of the jobs of a filesystem. "[I used] O_DSYNC writes, so that it can measure the real progress of the disk writes". This is inspired by https://unix.stackexchange.com/questions/467844/dd-issues-as-always/467869 . It is possible that I'm getting into a mess due to starting my testing on a filesystem, when what I'm really targeting is block device writes. – sourcejedi Sep 16 '18 at 08:51
  • That said, very good point about my code making for a bad file layout. – sourcejedi Sep 16 '18 at 09:58

This difference might look interesting, but the most important thing to understand is that you are abusing fallocate(). fallocate() is only guaranteed to reserve space on disk. It is not guaranteed to improve the performance of synchronous writes, i.e. to avoid writes to filesystem metadata, which require a disk seek.

You can illustrate this by modifying test-program.py to pre-write some data blocks instead of using fallocate(). On my ext4 filesystem, this gives the lower "minimum latency" measurement for either size of pre-allocation. I should note that other filesystems will have different performance profiles. Specifically, this would not work on copy-on-write filesystems such as btrfs, which do not overwrite the pre-written blocks in place.

Code change:

     # Try to pre-allocate space
     if alloc:
-        fallocate(fd, 0, alloc * 4096)
+        os.pwrite(fd, block * alloc, 0)
+        os.fsync(fd)
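
As a standalone sketch, the pre-write step could be factored into a small helper. The name prewrite_blocks is made up for illustration, and the loop around os.pwrite() is an added precaution against short writes that the one-line diff above does not include.

```python
import os

BLOCK = b"\0" * 4096


def prewrite_blocks(fd, n_blocks):
    """Write n_blocks zero-filled blocks at offset 0, then flush them.

    os.pwrite() does not move the file offset, so later os.write() calls
    still start at the beginning and overwrite the pre-written blocks.
    """
    data = BLOCK * n_blocks
    offset = 0
    while offset < len(data):
        # pwrite may write fewer bytes than requested; continue from there.
        offset += os.pwrite(fd, data[offset:], offset)
    os.fsync(fd)
```

In test-program.py this would replace the fallocate() call, as in prewrite_blocks(fd, alloc).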

Results:

openat(AT_FDCWD, "out.tmp", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC|O_CLOEXEC, 0777) = 3 <0.000088>
pwrite64(3, "\0\0\0"..., 36864, 0)      = 36864 <0.035337>
fsync(3)                                = 0 <0.000366>
write(3, "\0\0\0"..., 4096)             = 4096 <0.015217>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008194>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008371>
write(3, "\0\0\0"..., 4096)             = 4096 <0.008299>
close(3)                                = 0 <0.000034>
sourcejedi