
I read /dev/sda using a 1 MiB block size. Linux seems to limit the I/O requests to an average size of 512 KiB. What is happening here? Is there a configuration option for this behaviour?

$ sudo dd iflag=direct if=/dev/sda bs=1M of=/dev/null status=progress
1545601024 bytes (1.5 GB, 1.4 GiB) copied, 10 s, 155 MB/s
1521+0 records in
1520+0 records out
...

While my dd command is running, rareq-sz is 512.

rareq-sz The average size (in kilobytes) of the read requests that were issued to the device.

-- man iostat

$ iostat -d -x 3
...
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda            309.00    0.00 158149.33      0.00     0.00     0.00   0.00   0.00    5.24    0.00   1.42   511.81     0.00   1.11  34.27
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-3             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
...

The kernel version is 5.1.15-300.fc30.x86_64. max_sectors_kb is 1280.

$ cd /sys/class/block/sda/queue
$ grep -H . max_sectors_kb max_hw_sectors_kb max_segments max_segment_size optimal_io_size logical_block_size chunk_sectors
max_sectors_kb:1280
max_hw_sectors_kb:32767
max_segments:168
max_segment_size:65536
optimal_io_size:0
logical_block_size:512
chunk_sectors:0

By default I use the BFQ I/O scheduler. I also tried repeating the test after echo 0 | sudo tee wbt_lat_usec, and again after echo mq-deadline | sudo tee scheduler. The results remained the same.

Apart from WBT, I used the default settings for both I/O schedulers. E.g. for mq-deadline, iosched/read_expire is 500, which is equivalent to half a second.

During the last test (mq-deadline, WBT disabled), I ran btrace /dev/sda. It shows all the requests were split into two unequal pieces:

  8,0    0     3090     5.516361551 15201  Q   R 6496256 + 2048 [dd]
  8,0    0     3091     5.516370559 15201  X   R 6496256 / 6497600 [dd]
  8,0    0     3092     5.516374414 15201  G   R 6496256 + 1344 [dd]
  8,0    0     3093     5.516376502 15201  I   R 6496256 + 1344 [dd]
  8,0    0     3094     5.516388293 15201  G   R 6497600 + 704 [dd]
  8,0    0     3095     5.516388891 15201  I   R 6497600 + 704 [dd]
  8,0    0     3096     5.516400193   733  D   R 6496256 + 1344 [kworker/0:1H]
  8,0    0     3097     5.516427886   733  D   R 6497600 + 704 [kworker/0:1H]
  8,0    0     3098     5.521033332     0  C   R 6496256 + 1344 [0]
  8,0    0     3099     5.523001591     0  C   R 6497600 + 704 [0]
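(Key to the blkparse action codes above: Q = queued, X = split, G = get request, I = inserted, D = issued to the driver, C = completed. Each "R start + n" is a read of n sectors beginning at the start sector; the X line shows the sector at which the request was split.)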

X -- split On [software] raid or device mapper setups, an incoming i/o may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of i/o.

-- man blkparse

Things to ignore in iostat

Ignore the %util number. It is broken in this version. (`dd` is running at full speed, but I only see 20% disk utilization. Why?)

I thought aqu-sz would also be affected, since it is based on %util. Although in that case I would have expected it to be about three times too large here (100/34.27).

Ignore the svctm number. "Warning! Do not trust this field any more. This field will be removed in a future sysstat version."

sourcejedi
    Hmm. Curiously this question has been asked before over on https://stackoverflow.com/questions/10494593/linux-writes-are-split-into-512k-chunks but I'm not entirely satisfied by the answer. Need to keep digging... – Anon Aug 02 '19 at 03:07

1 Answer


Why is the size of my IO requests being limited, to about 512K?

I posit that I/O is being limited to "about" 512 KiB due to the way it is being submitted and various limits being reached (in this case /sys/block/sda/queue/max_segments). The questioner took the time to include various pieces of side information (such as kernel version and the blktrace output) that allow us to take a guess at this mystery, so let's see how I came to that conclusion.

Why [...] limited, to about 512K?

It's key to note the questioner carefully said "about" in the title. While the iostat output makes us think we should be looking for values of 512 KiB:

Device         [...] aqu-sz rareq-sz wareq-sz  svctm  %util
sda            [...]   1.42   511.81     0.00   1.11  34.27

the blktrace (via blkparse) gives us some exact values:

  8,0    0     3090     5.516361551 15201  Q   R 6496256 + 2048 [dd]
  8,0    0     3091     5.516370559 15201  X   R 6496256 / 6497600 [dd]
  8,0    0     3092     5.516374414 15201  G   R 6496256 + 1344 [dd]
  8,0    0     3093     5.516376502 15201  I   R 6496256 + 1344 [dd]
  8,0    0     3094     5.516388293 15201  G   R 6497600 + 704 [dd]
  8,0    0     3095     5.516388891 15201  I   R 6497600 + 704 [dd]

(We typically expect a single sector to be 512 bytes in size.) So the read I/O from dd for sector 6496256, sized 2048 sectors (1 MiB), was split into two pieces: one read starting at sector 6496256 for 1344 sectors and another starting at sector 6497600 for 704 sectors (note 1344 + 704 = 2048, so the two pieces exactly cover the original request). So the maximum size of a request before it is split is slightly more than 1024 sectors (512 KiB)... but why?

The questioner mentions a kernel version of 5.1.15-300.fc30.x86_64. Doing a Google search for linux split block i/o kernel turns up "Chapter 16. Block Drivers" from Linux Device Drivers, 3rd Edition and that mentions

[...] a bio_split call that can be used to split a bio into multiple chunks for submission to more than one device

While we're not splitting bios because we intend to send them to different devices (in the way md or device mapper might), this still gives us an area to explore. Searching LXR's 5.1.15 Linux kernel source for bio_split turns up a link to the file block/blk-merge.c. Inside that file there is blk_queue_split(), and for non-special I/Os that function calls blk_bio_segment_split().

(If you want to take a break and explore LXR, now's a good time. I'll continue the investigation below and try to be more terse going forward.)

In blk_bio_segment_split(), the max_sectors variable ultimately comes from aligning the value returned by blk_max_size_offset(), which looks at q->limits.chunk_sectors and, if that is zero, simply returns q->limits.max_sectors. Clicking around, we see how max_sectors is derived from max_sectors_kb in queue_max_sectors_store(), which is in block/blk-sysfs.c. Back in blk_bio_segment_split(), the max_segs variable comes from queue_max_segments(), which returns q->limits.max_segments.
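As a rough sketch (paraphrased from the behaviour just described, not a verbatim copy of the 5.1.15 source), blk_max_size_offset() amounts to something like this:

    /* Paraphrased sketch, not a verbatim copy of the 5.1.15 source. */
    static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                   sector_t offset)
    {
        /* chunk_sectors is 0 for the questioner's sda, so this is
         * the branch taken and max_sectors is the limit that applies. */
        if (!q->limits.chunk_sectors)
            return q->limits.max_sectors;

        /* Otherwise: however many sectors remain before the next
         * chunk_sectors boundary, capped at max_sectors. */
        return min(q->limits.max_sectors,
                   (unsigned int)(q->limits.chunk_sectors -
                                  (offset & (q->limits.chunk_sectors - 1))));
    }

Continuing down blk_bio_segment_split() we see the following: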

    bio_for_each_bvec(bv, bio, iter) {

According to block/biovecs.txt we're iterating over multi-page bvecs.

        if (sectors + (bv.bv_len >> 9) > max_sectors) {
            /*
             * Consider this a new segment if we're splitting in
             * the middle of this vector.
             */
            if (nsegs < max_segs &&
                sectors < max_sectors) {
                /* split in the middle of bvec */
                bv.bv_len = (max_sectors - sectors) << 9;
                bvec_split_segs(q, &bv, &nsegs,
                        &seg_size,
                        &front_seg_size,
                        &sectors, max_segs);
            }
            goto split;
        }

So if the I/O size is bigger than max_sectors_kb (which is 1280 KiB in the questioner's case) it will be split (if there are spare segments and sector space then we'll fill the current I/O as much as possible before splitting, by dividing it into segments and adding as many as possible). But in the questioner's case the I/O is "only" 1 MiB (2048 sectors), which is smaller than 1280 KiB (2560 sectors), so we're not in this case... Further down we see:

        if (bvprvp) {
            if (seg_size + bv.bv_len > queue_max_segment_size(q))
                goto new_segment;
        [...]

queue_max_segment_size() returns q->limits.max_segment_size. Given some of what we've seen earlier (if (sectors + (bv.bv_len >> 9) > max_sectors)), bv.bv_len is going to be in terms of bytes (otherwise why would we divide it by 512?) and the questioner said /sys/block/sda/queue/max_segment_size was 65536. If only we knew what value bv.bv_len was...

[...]
new_segment:
        if (nsegs == max_segs)
            goto split;

        bvprv = bv;
        bvprvp = &bvprv;

        if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
            nsegs++;
            seg_size = bv.bv_len;
            sectors += bv.bv_len >> 9;
            if (nsegs == 1 && seg_size > front_seg_size)
                front_seg_size = seg_size;
        } else if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
                    &front_seg_size, &sectors, max_segs)) {
            goto split;
        }
    }

    do_split = false;

So for each bv we check whether it is a single-page or a multi-page bvec (by checking whether its size is <= PAGE_SIZE). If it's a single-page bvec we add one to the segment count and do some bookkeeping. If it's a multi-page bvec we check whether it needs splitting into smaller segments: the code in bvec_split_segs() does comparisons against get_max_segment_size(), which in this case means it will split the segment into multiple segments no bigger than 64 KiB (earlier we said /sys/block/sda/queue/max_segment_size was 65536), subject to there being no more than 168 (max_segs) segments in total. If bvec_split_segs() reached the segment limit without covering all of the bv's length, then we jump to split. However, if we assume we take that goto split case, we would only generate 1024 / 64 = 16 segments, so ultimately we wouldn't have to submit an I/O smaller than 1 MiB, and this is not the path the questioner's I/O went through...

Working backwards, if we assume there were "only single-page sized segments" this means we can deduce bv.bv_offset + bv.bv_len <= 4096, and since bv_offset is an unsigned int that means 0 <= bv.bv_len <= 4096. Thus we can also deduce we never took the condition body that led to goto new_segment earlier. We then conclude that the original biovec must have had at least 1024 / 4 = 256 segments (the I/O is 1024 KiB and each single-page segment can cover at most the 4 KiB page size). 256 > 168, so we would have jumped to split just after new_segment, generating one I/O of 168 segments and another of 88 segments. 168 * 4096 = 688128 bytes, 88 * 4096 = 360448 bytes, but so what? Well:

688128 / 512 = 1344

360448 / 512 = 704

Which are the numbers we saw in the blktrace output:

[...]   R 6496256 + 2048 [dd]
[...]   R 6496256 / 6497600 [dd]
[...]   R 6496256 + 1344 [dd]
[...]   R 6496256 + 1344 [dd]
[...]   R 6497600 + 704 [dd]
[...]   R 6497600 + 704 [dd]

So I propose that the dd command line you're using is causing the I/O to be formed into single-page bvecs, and because the maximum number of segments is being reached, the I/O is split at boundaries of 672 KiB (168 * 4 KiB) for each I/O.
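To double check the arithmetic, here's a small standalone program (written just for this answer; the limit values are hard-coded from the questioner's sysfs output rather than read from the device):

    #include <stdio.h>

    int main(void)
    {
        const unsigned io_bytes     = 1024 * 1024; /* dd bs=1M */
        const unsigned page_size    = 4096;        /* x86 PAGE_SIZE */
        const unsigned max_segments = 168;         /* questioner's max_segments */
        const unsigned sector_size  = 512;

        /* With single-page bvecs, each segment covers one page. */
        unsigned total_segs  = io_bytes / page_size;     /* 256 */
        unsigned first_segs  = max_segments;             /* 168 */
        unsigned second_segs = total_segs - first_segs;  /* 88 */

        printf("first piece:  %u sectors\n",
               first_segs * page_size / sector_size);    /* 1344 */
        printf("second piece: %u sectors\n",
               second_segs * page_size / sector_size);   /* 704 */
        return 0;
    }

which prints the 1344 and 704 seen in the blktrace output.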

I suspect if we'd submitted I/O a different way (e.g. via buffered I/O) such that multi-page bvecs were generated then we would have seen a different splitting point.
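For reference, the direct I/O path dd takes boils down to something like the following sketch (a minimal illustration, not dd's actual source; the device and block size are taken from the question, and it needs root to run):

    #define _GNU_SOURCE /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t bs = 1024 * 1024; /* dd bs=1M */
        void *buf;
        ssize_t n;
        int fd;

        fd = open("/dev/sda", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* O_DIRECT needs an aligned buffer. This one is page-aligned
         * and virtually contiguous, but its pages may be scattered
         * across physical memory, hence (presumably) the single-page
         * bvecs seen above. */
        if (posix_memalign(&buf, 4096, bs) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            close(fd);
            return 1;
        }

        while ((n = read(fd, buf, bs)) > 0)
            ; /* discard the data, like of=/dev/null */

        free(buf);
        close(fd);
        return 0;
    }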

Is there a configuration option for this behaviour?

Sort of - /sys/block/<block device>/queue/max_sectors_kb is a control on the maximum size that a normal I/O submitted through the block layer can be before it is split, but it is only one of many criteria - if other limits are reached (such as the maximum segments) then a block-based I/O may be split at a smaller size. Also, if you use raw SCSI commands it's possible to submit an I/O up to /sys/block/<block device>/queue/max_hw_sectors_kb in size, but then you're bypassing the block layer and bigger I/Os will just be rejected.
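If you want to check those limits programmatically rather than with grep, a tiny sketch like this works (the "sda" device name is just the questioner's; adjust to taste):

    #include <stdio.h>

    /* Read one integer limit from a device's queue directory in sysfs. */
    static long read_queue_limit(const char *dev, const char *name)
    {
        char path[256];
        long value = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, name);
        f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &value) != 1)
                value = -1;
            fclose(f);
        }
        return value;
    }

    int main(void)
    {
        printf("max_sectors_kb:    %ld\n", read_queue_limit("sda", "max_sectors_kb"));
        printf("max_hw_sectors_kb: %ld\n", read_queue_limit("sda", "max_hw_sectors_kb"));
        printf("max_segments:      %ld\n", read_queue_limit("sda", "max_segments"));
        return 0;
    }

Note that max_sectors_kb can be written as well as read (up to max_hw_sectors_kb), but as shown above, lowering it cannot raise the effective request size past what max_segments allows.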

In fact you can see Ilya Dryomov describing this max_segments limitation in a June 2015 Ceph Users thread "krbd splitting large IO's into smaller IO's" and a fix later went in for rbd devices (which itself was later fixed).

Further validation of the above comes via a document titled "When 2MB turns into 512KB" by kernel block layer maintainer Jens Axboe, which has a section titled "Device limitations" covering the maximum segments limitation more succinctly.

Anon
  • ohh :). This would explain why I have sometimes seen 1MiB IO's. I.e. this happens because the buffer is not physically contiguous. But if it happens to get allocated contiguously, for whatever reason, we can do IO up to 65536 * 168 (in this case). – sourcejedi Aug 04 '19 at 22:22
  • I was able to follow almost all of this, but not this one line: " We then go on to conclude that the original biovec must have had 1024 / 4 = 256 segments." Can you elaborate on how that follows? – aggieNick02 Mar 03 '22 at 22:05
  • @aggieNick02 The dd I/O was 1 megabyte which is 1024 kilobytes and the page size on x86 is 4096 bytes or 4 kilobytes. Given we're in the single page segment case that means the largest a biovec segment can be is a page size thus 1024 / 4 = 256 segments is the minimum that could be generated. – Anon Mar 05 '22 at 10:58
  • Ah, got it. Thanks @Anon! – aggieNick02 Mar 07 '22 at 19:55