Why is the size of my IO requests being limited, to about 512K?
I posit that I/O is being limited to "about" 512 KiB due to the way it is being submitted and various limits being reached (in this case /sys/block/sda/queue/max_segments). The questioner took the time to include various pieces of side information (such as the kernel version and the blktrace output) that allow us to take a guess at this mystery, so let's see how I came to that conclusion.
Why [...] limited, to about 512K?
It's key to note the questioner carefully said "about" in the title. While the iostat
output makes us think we should be looking for values of 512 KiB:
Device [...] aqu-sz rareq-sz wareq-sz svctm %util
sda [...] 1.42 511.81 0.00 1.11 34.27
the blktrace
(via blkparse
) gives us some exact values:
8,0 0 3090 5.516361551 15201 Q R 6496256 + 2048 [dd]
8,0 0 3091 5.516370559 15201 X R 6496256 / 6497600 [dd]
8,0 0 3092 5.516374414 15201 G R 6496256 + 1344 [dd]
8,0 0 3093 5.516376502 15201 I R 6496256 + 1344 [dd]
8,0 0 3094 5.516388293 15201 G R 6497600 + 704 [dd]
8,0 0 3095 5.516388891 15201 I R 6497600 + 704 [dd]
(We typically expect a single sector to be 512 bytes in size.) So the read I/O from dd for sector 6496256, sized at 2048 sectors (1 MiB), was split into two pieces: one read starting at sector 6496256 for 1344 sectors and another read starting at sector 6497600 for 704 sectors. So the maximum size of a request before it is split looks to be slightly more than 1024 sectors (512 KiB)... but why?
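As a quick sanity check on those numbers, here is a tiny standalone sketch (plain userspace C, assuming the usual 512-byte sector) that converts the sector counts from the blkparse output into sizes:

#include <stdio.h>

int main(void)
{
    /* Sector counts taken from the blkparse output above, assuming the
     * usual 512-byte sector */
    unsigned long sector_bytes = 512;
    unsigned long original = 2048, first = 1344, second = 704;

    printf("original request: %lu KiB\n", original * sector_bytes / 1024); /* 1024 KiB = 1 MiB */
    printf("first piece:      %lu KiB\n", first * sector_bytes / 1024);    /* 672 KiB */
    printf("second piece:     %lu KiB\n", second * sector_bytes / 1024);   /* 352 KiB */
    return 0;
}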
The questioner mentions a kernel version of 5.1.15-300.fc30.x86_64
. Doing a Google search for linux split block i/o kernel turns up "Chapter 16. Block Drivers" from Linux Device Drivers, 3rd Edition and that mentions
[...] a bio_split
call that can be used to split a bio
into multiple chunks for submission to more than one device
While we're not splitting bios because we intend to send them to different devices (in the way md or device mapper might), this still gives us an area to explore. Searching LXR's 5.1.15 Linux kernel source for bio_split turns up a link to the file block/blk-merge.c. Inside that file there is blk_queue_split(), and for non-special I/Os that function calls blk_bio_segment_split().
(If you want to take a break and explore LXR now's a good time. I'll continue the investigation below and try and be more terse going forward)
In blk_bio_segment_split() the max_sectors variable ultimately comes from aligning the value returned by blk_max_size_offset(), which looks at q->limits.chunk_sectors and, if that's zero, just returns q->limits.max_sectors. Clicking around, we see how max_sectors is derived from max_sectors_kb in queue_max_sectors_store(), which is in block/blk-sysfs.c.
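A quick aside on units: max_sectors_kb is expressed in KiB while max_sectors counts 512-byte sectors, so the limit the kernel stores is simply double the sysfs value. Here's a throwaway userspace sketch of that relationship (not kernel code), using the questioner's value of 1280 KiB, which we'll meet again below:

#include <stdio.h>

int main(void)
{
    /* The questioner's /sys/block/sda/queue/max_sectors_kb */
    unsigned long max_sectors_kb = 1280;
    /* 1 KiB is two 512-byte sectors, mirroring the conversion the kernel does */
    unsigned long max_sectors = max_sectors_kb * 2;
    /* The 1 MiB dd read we saw in the blktrace output */
    unsigned long io_sectors = 2048;

    printf("max_sectors = %lu sectors, I/O = %lu sectors -> %s\n",
           max_sectors, io_sectors,
           io_sectors > max_sectors ? "split on this limit"
                                    : "fits, no split on this limit");
    return 0;
}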
Back in blk_bio_segment_split(), the max_segs variable comes from queue_max_segments(), which returns q->limits.max_segments. Continuing down blk_bio_segment_split() we see the following:
bio_for_each_bvec(bv, bio, iter) {
According to block/biovecs.txt we're iterating over multi-page bvecs.
if (sectors + (bv.bv_len >> 9) > max_sectors) {
/*
* Consider this a new segment if we're splitting in
* the middle of this vector.
*/
if (nsegs < max_segs &&
sectors < max_sectors) {
/* split in the middle of bvec */
bv.bv_len = (max_sectors - sectors) << 9;
bvec_split_segs(q, &bv, &nsegs,
&seg_size,
&front_seg_size,
&sectors, max_segs);
}
goto split;
}
So if the I/O size is bigger than max_sectors_kb (which is 1280 KiB in the questioner's case) it will be split (though if there are spare segments and sector space, the current I/O is filled as full as possible before splitting, by dividing it into segments and adding as many as will fit). But in the questioner's case the I/O is "only" 1 MiB, which is smaller than 1280 KiB, so we're not in this case... Further down we see:
if (bvprvp) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
[...]
queue_max_segment_size() returns q->limits.max_segment_size. Given some of what we've seen earlier (if (sectors + (bv.bv_len >> 9) > max_sectors)), bv.bv_len is going to be in terms of bytes (otherwise why would we divide it by 512?) and the questioner said /sys/block/sda/queue/max_segment_size was 65536 (64 KiB). If only we knew what value bv.bv_len was...
[...]
new_segment:
if (nsegs == max_segs)
goto split;
bvprv = bv;
bvprvp = &bvprv;
if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
nsegs++;
seg_size = bv.bv_len;
sectors += bv.bv_len >> 9;
if (nsegs == 1 && seg_size > front_seg_size)
front_seg_size = seg_size;
} else if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
&front_seg_size, &sectors, max_segs)) {
goto split;
}
}
do_split = false;
So for each bv we check to see whether it is a single-page or a multi-page bvec (by checking whether its size is <= PAGE_SIZE). If it's a single-page bvec we add one to the segment count and do some bookkeeping. If it's a multi-page bvec we check whether it needs splitting into smaller segments: the code in bvec_split_segs() does comparisons against get_max_segment_size(), which in this case means it will split the segment into multiple segments no bigger than 64 KiB (earlier we said /sys/block/sda/queue/max_segment_size was 65536), subject to there being no more than 168 (max_segs) segments. If bvec_split_segs() reaches the segment limit without covering all of the bv's length then we jump to split. However, even IF we took the goto split case, a 1 MiB I/O chopped into 64 KiB segments would only generate 1024 / 64 = 16 segments, so ultimately we wouldn't have had to submit anything smaller than a 1 MiB I/O; this is not the path the questioner's I/O went through...
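Before the last step of the deduction it helps to recall what a bv actually is. To the best of my recollection, in this era of the kernel a bio_vec is just a (page, length, offset) triple along these lines (see include/linux/bvec.h):

/* Sketch of the 5.1-era definition from include/linux/bvec.h */
struct page;                     /* opaque here; defined by the kernel's mm code */

struct bio_vec {
    struct page     *bv_page;    /* the page backing this vector */
    unsigned int    bv_len;      /* length of the data, in bytes */
    unsigned int    bv_offset;   /* byte offset of the data within bv_page */
};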
Working backwards, if we assume there were "only single-page sized segments" this means we can deduce bv.bv_offset + bv.bv_len <= 4096, and since bv_offset is an unsigned int that means 0 <= bv.bv_len <= 4096. Thus we can also deduce we never took the condition body that led to goto new_segment earlier (seg_size + bv.bv_len could be at most 8192, well under the 65536-byte max_segment_size). We then go on to conclude that the original bio must have been made up of 1024 KiB / 4 KiB = 256 segments. 256 > 168, so we would have caused a jump to split just after new_segment, thus generating one I/O of 168 segments and another of 88 segments. 168 * 4096 = 688128 bytes and 88 * 4096 = 360448 bytes, but so what? Well:
688128 / 512 = 1344
360448 / 512 = 704
Which are the numbers we saw in the blktrace
output:
[...] R 6496256 + 2048 [dd]
[...] R 6496256 / 6497600 [dd]
[...] R 6496256 + 1344 [dd]
[...] R 6496256 + 1344 [dd]
[...] R 6497600 + 704 [dd]
[...] R 6497600 + 704 [dd]
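If you want to double-check that chain of arithmetic, here's a throwaway userspace sketch (assuming 4 KiB pages, 512-byte sectors, and the questioner's limits of 168 segments and a 64 KiB maximum segment size) that walks both cases:

#include <stdio.h>

int main(void)
{
    unsigned long io_bytes = 1024UL * 1024;  /* the 1 MiB dd read */
    unsigned long page_bytes = 4096;         /* assumed page size */
    unsigned long sector_bytes = 512;
    unsigned long max_segs = 168;            /* /sys/block/sda/queue/max_segments */
    unsigned long max_seg_bytes = 64 * 1024; /* /sys/block/sda/queue/max_segment_size */

    /* Multi-page bvecs: segments may be up to 64 KiB, so 1 MiB fits easily */
    printf("64 KiB segments needed: %lu (limit %lu) -> no split\n",
           io_bytes / max_seg_bytes, max_segs);

    /* Single-page bvecs: one 4 KiB segment per page blows through the limit */
    unsigned long pages = io_bytes / page_bytes;              /* 256 */
    unsigned long first = pages > max_segs ? max_segs : pages;
    unsigned long second = pages - first;                     /* 88 */

    printf("first I/O:  %lu segments = %lu sectors\n",
           first, first * page_bytes / sector_bytes);         /* 1344 */
    printf("second I/O: %lu segments = %lu sectors\n",
           second, second * page_bytes / sector_bytes);       /* 704 */
    return 0;
}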
So I propose that the dd command line you're using is causing I/O to be formed into single-page bvecs and, because the maximum number of segments is being reached, splitting of the I/O happens at a boundary of 672 KiB for each I/O.
I suspect if we'd submitted I/O a different way (e.g. via buffered I/O) such that multi-page bvecs were generated then we would have seen a different splitting point.
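For the curious: my belief is that the questioner's dd was doing direct I/O, which in userspace boils down to something like the sketch below - an O_DIRECT read of 1 MiB into an aligned buffer. I'm not claiming this is literally what dd does internally, and the device name and the 4 KiB alignment are assumptions for illustration.

#define _GNU_SOURCE /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1024 * 1024;  /* 1 MiB, matching bs=1M */
    void *buf;
    int fd;
    ssize_t got;

    /* O_DIRECT needs an aligned buffer; align to an assumed 4 KiB */
    if (posix_memalign(&buf, 4096, len) != 0) {
        perror("posix_memalign");
        return 1;
    }

    fd = open("/dev/sda", O_RDONLY | O_DIRECT);  /* device name is an assumption */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    got = read(fd, buf, len);  /* the sort of 1 MiB read we watched get split */
    printf("read returned %zd bytes\n", got);

    close(fd);
    free(buf);
    return 0;
}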
Is there a configuration option for this behaviour?
Sort of - /sys/block/<block device>/queue/max_sectors_kb is a control on the maximum size a normal I/O submitted through the block layer can be before it is split, but it is only one of many criteria - if other limits are reached (such as the maximum number of segments) then a block-based I/O may be split at a smaller size. Also, if you use raw SCSI commands it's possible to submit an I/O up to /sys/block/<block device>/queue/max_hw_sectors_kb in size, but then you're bypassing the block layer and bigger I/Os will just be rejected.
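If you want to eyeball all of the limits mentioned in this answer for a device in one go, a small reader like the sketch below will do (the device name and the particular set of files are just the ones referenced above):

#include <stdio.h>

int main(void)
{
    /* The queue limits referenced in this answer; swap "sda" for your device */
    const char *files[] = {
        "/sys/block/sda/queue/max_sectors_kb",
        "/sys/block/sda/queue/max_hw_sectors_kb",
        "/sys/block/sda/queue/max_segments",
        "/sys/block/sda/queue/max_segment_size",
    };

    for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
        FILE *f = fopen(files[i], "r");
        char val[64];

        if (f && fgets(val, sizeof(val), f))
            printf("%s: %s", files[i], val);  /* sysfs values already end in '\n' */
        else
            printf("%s: <unreadable>\n", files[i]);
        if (f)
            fclose(f);
    }
    return 0;
}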
In fact you can see Ilya Dryomov describing this max_segments limitation in a June 2015 Ceph Users thread, "krbd splitting large IO's into smaller IO's", and a fix later went in for rbd devices (which itself was later fixed).
Further validation of the above comes via a document titled "When 2MB turns into 512KB" by kernel block layer maintainer Jens Axboe, which has a section titled "Device limitations" covering the maximum segments limitation more succinctly.