
I am trying to understand how data is written to the disk. I'm writing data with dd using various block sizes, but it looks like the disk is always getting hit with the same size blocks, according to iostat. For example, this command should write 128K blocks:

dd if=/dev/zero of=/dev/sdb bs=128K count=300000

Trimmed output of iostat -dxm 1:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00 129897.00    0.00 1024.00     0.00   512.00  1024.00   142.09  138.81    0.00  138.81   0.98 100.00

My reading of this is that it's writing 512 MB/s across 1024 write operations per second, so each write = 512 MB / 1024 = 512K.

Another way of calculating the same thing: the avgrq-sz column shows 1024 sectors, and according to gdisk the sector size of this Samsung 850 Pro SSD is 512 B, therefore each write is 1024 sectors * 512 B = 512K.
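(A quick way to cross-check the sector size, if blockdev is available; these are illustrative commands, substitute your own device:)

blockdev --getss /dev/sdb    # logical sector size in bytes
blockdev --getpbsz /dev/sdb  # physical sector size in bytes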

So my question is, why is it writing 512K blocks instead of the 128K specified with dd? If I change dd to write 4M blocks, the iostat result is exactly the same. The merges number (wrqm/s) doesn't make sense to me either.

That was writing directly to the block device, but if I format it with XFS and write to a file on the filesystem, the numbers are the same except for zero merges:

dd if=/dev/zero of=/mnt/ddtest bs=4M count=3000

Now iostat shows

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 1024.00     0.00   512.00  1024.00   142.31  138.92    0.00  138.92   0.98 100.00

I'm using RHEL 7.7 by the way.

Elliott B
  • Perhaps you could edit to quote what man iostat says about avgrq-sz. I notice on Fedora 30, the man page says "areq-sz The average size (in kilobytes) of the I/O requests that were issued to the device. Note: In previous versions, this field was known as avgrq-sz and was expressed in sectors." – sourcejedi Sep 10 '19 at 10:17

1 Answer


If you do not specify O_DIRECT, then I/O requests go to the page cache. When this write cache is written back to the device, the kernel uses large writes where possible.

(I am not sure whether these write-back requests are expected to be counted as merged writes or not.)
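(As a sketch of how to take the page cache out of the picture, dd can be asked to open the device with O_DIRECT; same device and sizes as in the question, shown for illustration only. With direct I/O, iostat might report request sizes closer to bs=, subject to the device limits discussed below.)

dd if=/dev/zero of=/dev/sdb bs=128K count=300000 oflag=direct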

The size of writes generated by the kernel memory management can be limited by a number of different variables.

In particular: Why is the size of my IO requests being limited, to about 512K?

The analysis at that link is very long, but the upshot is that when your physical RAM becomes fragmented, the kernel can only find individual pages to use, not physically contiguous runs of multiple pages. The size of an individual IO is then limited by the maximum length of the "scatter/gather list".
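(For reference, the relevant request-size limits can be read from sysfs; sdb is the device from the question, and the exact values will differ per system:)

cat /sys/class/block/sdb/queue/max_sectors_kb    # largest request the block layer will issue, in KiB
cat /sys/class/block/sdb/queue/max_segments      # maximum number of scatter/gather segments per request
cat /sys/class/block/sdb/queue/max_segment_size  # maximum size of one segment, in bytes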

sourcejedi
  • Thanks, that's starting to make sense, I'll have to let that analysis soak in. So if I free up memory with drop_caches then it might stop doing this? I will check with oflag=direct – Elliott B Sep 10 '19 at 14:00
  • @ElliottB yes, good test idea, after drop_caches you might not be limited by fragmentation. The next limit might be /sys/class/block/sdb/queue/max_sectors_kb - it was 1280 on my system. I think oflag=direct might stop iostat showing larger requests than you specified with bs=, but both of the above maximum limits will still apply. – sourcejedi Sep 10 '19 at 15:31