Why does a gunzip to dd pipeline slow down at the end?

Question

my command:

gunzip -c serial2udp.image.gz |
sudo dd of=/dev/mmcblk0 conv=fsync,notrunc status=progress bs=4M

my output:

15930949632 bytes (16 GB, 15 GiB) copied, 1049 s, 15.2 MB/s    

0+331128 records in
0+331128 records out
15931539456 bytes (16 GB, 15 GiB) copied, 1995.2 s, 8.0 MB/s

the card: SanDisk Ultra 32GB MicroSDHC Class 10 UHS Memory Card Speed Up To 30MB/s
distribution: 16.0.4 xenial with xfce
kernel version: 4.13.0.37-generic

i understand taking 17 minutes seems reasonable from what I've read. playing with block size doesn't really seem to make much of a difference (bs=100M still exhibits this behaviour with similar timestamps). why do the updates hang and it doesn't produce a finished report for another 16 minutes??

iotop tells me that mmcqd/0 is still running in the background at this point (at 99% IO), so I figure there is a cache somewhere that is holding up the final 5MB but I thought fsync should make sure that doesn't happen iotop shows no traffic crossing at this time either for dd. ctrl-c is all but useless and i don't want to corrupt my drive after writing to it.

updated 4.13.0.37-generic kernel. thanks – cts Sep 10 '18 at 02:59 — cts, Sep 10 '18 at 02:59

sourcejedi · Accepted Answer · 2018-09-24T10:22:52.053

I figure there is a cache somewhere that is holding up the final 5MB but I thought fsync should make sure that doesn't happen

conv=fsync means to write back any caches by calling fsync - after dd has written all the data. Hanging at the end is exactly what it will do.

When the output file is slower than the input file, the data written by dd can pile up in caches. The kernel cache can sometimes fill a significant fraction of system RAM. This makes for very misleading progress information. Your "final 5MB" was just an artefact of how dd shows progress.

If your system was indeed caching about 8GB (i.e. half of the 16GB of written data), then I think you either must have about 32GB of RAM, or have been fiddling with certain kernel options. See the lwn.net link below. I agree that not getting any progress information for 15 minutes is pretty frustrating.

There are alternative dd commands you could use. If you want dd to show more accurate progress, you might have to accept more complexity. I expect the following would work without degrading your performance, though maybe reality has other ideas than I do.

gunzip -c serial2udp.image.gz |
dd iflag=fullblock bs=4M |
sudo dd iflag=fullblock oflag=direct conv=fsync status=progress bs=4M of=/dev/mmcblk0

oflag=direct iflag=fullblock avoids piling up kernel cache, because it bypasses it altogether.
iflag=fullblock is required in such a command AFAIK (e.g. because you are reading from a pipe and writing using direct IO). The effect of missing fullblock is another unfortunate complexity of dd. Some posts on this site use this to argue you should always prefer to use a different command. It's hard to find another way to do direct or sync IO though.
conv=fsync should still be used, to write back the device cache.
I added an extra dd after gunzip, to buffer the decompressed output in parallel with the disk write. This is one of the issues that makes the performance with oflag=direct or oflag=sync a bit complex. Normal IO (non-direct, non-sync) is not supposed to need this, as it is already buffered by the kernel cache. You also might not need the extra buffer if you were writing to a hard drive with 4M of writeback cache, but I don't assume an SD card has that much.

You could alternatively use oflag=direct,sync (and not need conv=fsync). This might be useful for good progress information if you had a weird output device with hundreds of megabytes of cache. But normally I think of oflag=sync as a potential barrier to performance.

There is a 2013 article https://lwn.net/Articles/572911/ which mentions minute-long delays like yours. Many people see this ability to cache minutes worth of writeback data as undesirable. The problem was that the limit on the cache size was applied indiscriminately, to both fast and slow devices. Note that it is non-trivial for the kernel to measure device speed, because it varies depending on the data locations. E.g. if the cached writes are scattered in random locations, a hard drive will take longer from repeatedly moving the write head.

why do the updates hang

The fsync() is a single system call that applies to the entire range of the ~~file~~ device. It does not return any status updates before it is done.

just as a follow up, this code suggested by @sourcejedi worked as in it gave me a much more accurate representation of the progress and the system did not hang at the end. it did take longer to process though: 3798+1 records in 3798+1 records out 15931539456 bytes (16 GB, 15 GiB) copied, 2838.48 s, 5.6 MB/s identical file size and block size used. so this seems to be a trade off of realtime progress versus speed. that being said, I had tried this previously on 4 different attempts and got corrupted writes for the raspi.img, and this time it worked on the first write. — cts, Sep 10 '18 at 08:15
@cts Thank you for feedback! I can't think of any explanation why it would be that much slower for your device. So I don't have any other specific suggestion. In principle direct could want a bigger bs=, but on my system cat /sys/class/block/mmcblk0/queue/max_hw_sectors_kb already suggests it is capped at 512 KB. I have a 16GB Ultra bought in 2017, so I've tested it with 500MB of data. I got ~13.1 MB/s overall with your original command, and 13.5 MB/s with mine. — sourcejedi, Sep 10 '18 at 09:15

RalfFriedl · Answer 2 · 2018-09-09T14:55:45.050

0

You want to use fullblock. notrunc has no effect on device files.

dd of=/dev/mmcblk0 iflag=fullblock status=progress bs=4M

edited Sep 09 '18 at 14:55

answered Sep 09 '18 at 13:09

RalfFriedl

8,981

1

Very impressive if fsync() is not doing anything on the block device and the other issue is causing a 16 minute delay after writing 99.99% of data in 16 minutes. Anyway, if the answer doesn't assert fsync is a no-op then I have no downvote-worthy issue. I found a claim that device writes are synchronous on BSD, maybe there is a difference between OS's. – sourcejedi Sep 09 '18 at 15:08

agc · Answer 3 · 2018-09-24T21:29:11.193

0

Not an answer as such, but more of a diagnostic. Use the pipe view util pv to see how the respective rates of pipe flow are going, e.g.:

pv -c -N raw serial2udp.image.gz |
gunzip |
pv -c -N uncompressed |
sudo dd of=/dev/mmcblk0 conv=fsync,notrunc status=progress bs=4M

Hypotheses:

if it slows down at the end, it might be that the .gz file is most compressed at the end, or is otherwise giving gunzip more work to do.
Another possibility is that the write media itself is slower on some portion of it, perhaps due to a design peculiarity or even bad/marginal blocks. a non-destructive read/write test could be done with: badblocks -nv /dev/mmcblk0

edited Sep 24 '18 at 21:29

answered Sep 23 '18 at 21:15

agc

7,223

1

Why do you run pv under sudo? The OP runs dd under sudo because dd is opening /dev/mmcblk0 (with of=), but, in your command, the shell is opening /dev/mmcblk0. – G-Man Says 'Reinstate Monica' Sep 24 '18 at 21:14
@G-Man, Thanks, that wasn't going to work... See revised answer for fix. – agc Sep 24 '18 at 21:30

Why does a gunzip to dd pipeline slow down at the end?

3 Answers3

Linked