I'm trying to replace a failed disk in a RAID1 btrfs filesystem.
I can still mount the partition rw (after about a 5-minute delay and lots of I/O kernel errors).
I started `replace` with `-r` in an attempt to have the failed disk not impact the speed of the operation, since the option is described as:

> -r only read from <srcdev> if no other zero-defect mirror exists. (enable this if your drive has lots of read errors, the access would be very slow)
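For context, the replace was started roughly like this (a sketch, not my exact command: `/dev/sdX` and `/mnt/ark` are placeholders for the new disk and the mount point; `/dev/mapper/vg4TBd2-ark` is the failing device seen in the logs below):

```
# Start an online replace of the failing device; with -r, only read from it
# when no other good mirror exists.
btrfs replace start -r /dev/mapper/vg4TBd2-ark /dev/sdX /mnt/ark
```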
Still, I'm getting really poor performance. The partition is 3.6 TiB, and in 9.25 hours I got:
3.8% done, 0 write errs, 0 uncorr. read errs
At this rate, it will take over 10 days to complete!!!
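That progress figure is from `btrfs replace status`; checking it once and extrapolating (mount point `/mnt/ark` is a placeholder):

```
# -1 prints the status once instead of monitoring continuously.
btrfs replace status -1 /mnt/ark
# Extrapolating: 9.25 h for 3.8%  =>  9.25 / 0.038 ≈ 243 h ≈ 10 days.
```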
Due to circumstances beyond my control, this is too long to wait.
I'm seeing kernel errors regarding the failed disk quite commonly, averaging every 5 minutes or so:
Jan 26 09:31:53 tara kernel: print_req_error: I/O error, dev sdc, sector 68044920
Jan 26 09:31:53 tara kernel: BTRFS warning (device dm-3): lost page write due to IO error on /dev/mapper/vg4TBd2-ark
Jan 26 09:31:53 tara kernel: BTRFS error (device dm-3): bdev /dev/mapper/vg4TBd2-ark errs: wr 8396, rd 3024, flush 58, corrupt 0, gen 3
Jan 26 09:31:53 tara kernel: BTRFS error (device dm-3): error writing primary super block to device 2
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 02 eb 9e 23 00 00 04 00
Jan 26 09:32:32 tara kernel: print_req_error: critical medium error, dev sdc, sector 391967000
I'm guessing that the errors are due to btrfs trying to write accounting data to the failed disk (even though the filesystem is otherwise completely idle).
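Those per-device error counters (the `wr`/`rd`/`flush` numbers in the log above) can also be read directly, assuming the filesystem is mounted at `/mnt/ark` (placeholder):

```
# Show write/read/flush I/O error, corruption and generation counters per device.
btrfs device stats /mnt/ark
```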
Even mounted `ro`, btrfs may try to write to a disk. On the `-o nologreplay` mount option:

> Warning: currently, the tree log is replayed even with a read-only mount! To disable that behaviour, mount also with nologreplay.
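So a mount that truly avoids writing would need something like this (`/dev/mapper/vg4TBd1-ark` is my guess at the good device's name, and `/mnt/ark` a placeholder mount point):

```
# Read-only mount that also skips the tree-log replay, so nothing is written.
mount -o ro,nologreplay /dev/mapper/vg4TBd1-ark /mnt/ark
```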
How can I speed up the process?
This article says that a `replace` will continue after reboot.
I'm thinking (sketched below):

- Cancel the current `replace`
- Remove the failed disk
- `mount -o degraded,rw` (and hope there's no power outage, given the gotcha that this is a one-time-only mount option)
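A minimal sketch of those steps (the mount point and the surviving device name are placeholders):

```
# 1. Cancel the running replace.
btrfs replace cancel /mnt/ark

# 2. Unmount, physically remove/disconnect the failed disk, then remount
#    the remaining device degraded and writable.
umount /mnt/ark
mount -o degraded,rw /dev/mapper/vg4TBd1-ark /mnt/ark
```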
At this point, I propose to simultaneously (see the sketch after this list):

- Allow `replace` to continue without the failed disk present (a recent `scrub` showed that the good disk has all the data)
- Convert the data to `single` to allow mounting `rw` again in case of a power outage during the process
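A sketch of the conversion step, assuming the filesystem is mounted degraded at `/mnt/ark` (placeholder); whether it is safe to run this while the replace is still in progress is part of what I'm asking:

```
# Rewrite the data chunks from RAID1 to the single profile, as proposed above
# (metadata profile is left alone in this sketch).
btrfs balance start -dconvert=single /mnt/ark
```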
Is this a sound plan to have the `replace` complete earlier?

My calculations say 6.5 hours (not 10 days) would be feasible given disk I/O speeds.
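For reference, the 6.5-hour estimate assumes the copy runs at roughly the sequential speed of one healthy disk, which I'm taking as about 160 MiB/s:

```
# 3.6 TiB at ~160 MiB/s (assumed) => hours to copy the whole partition.
echo "scale=1; 3.6*1024*1024/160/3600" | bc   # ≈ 6.5 hours
```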
Comments:

- `replace` after changing your profile, `replace` isn't going to "know" what to do; since you won't have RAID1 anymore, `replace` won't know to complete the steps described above. – Emmanuel Rosa Jan 26 '19 at 03:16
- `convert` after the `replace` (re)starts. I've read that doing a `replace` is much faster than doing a `balance`. There must have been a reason that `replace` was created at a later date (as opposed to continuing to use `add` then `remove`, which would rebalance as part of the `remove`). – Tom Hale Jan 26 '19 at 06:34
- `replace` is 2-3x faster than rebalancing at `remove`. I also read that `replace` operates at 90% of the disk's I/O capacity, perhaps this is why. However, things may be different if the failed drive is already removed. – Tom Hale Jan 26 '19 at 07:04