
I'm trying to replace a failed disk in a RAID1 btrfs filesystem.

I can still mount the partition rw (after about a 5-minute delay and lots of kernel I/O errors).

I started the replace with -r in an attempt to keep the failed disk from slowing down the operation:

      -r
           only read from <srcdev> if no other zero-defect mirror exists.
           (enable this if your drive has lots of read errors, the access
           would be very slow)
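
For reference, the command was something like this (using the failed disk's devid, 2, and the new LV as the target, both taken from the kernel log further down; the mount point is a placeholder):

    sudo btrfs replace start -r 2 /dev/mapper/vg6TBd1-ark /mountpoint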

Still, I'm getting really poor performance. The partition is 3.6TiB, and in 9.25 hours I got:

3.8% done, 0 write errs, 0 uncorr. read errs

At this rate, it will take over 10 days to complete!!!
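
(That figure follows from 9.25 h ÷ 0.038 ≈ 243 h, i.e. a little over 10 days.)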

Due to circumstances beyond my control, this is too long to wait.

I'm seeing kernel errors about the failed disk quite regularly, every 5 minutes or so on average:

Jan 26 09:31:53 tara kernel: print_req_error: I/O error, dev sdc, sector 68044920
Jan 26 09:31:53 tara kernel: BTRFS warning (device dm-3): lost page write due to IO error on /dev/mapper/vg4TBd2-ark
Jan 26 09:31:53 tara kernel: BTRFS error (device dm-3): bdev /dev/mapper/vg4TBd2-ark errs: wr 8396, rd 3024, flush 58, corrupt 0, gen 3
Jan 26 09:31:53 tara kernel: BTRFS error (device dm-3): error writing primary super block to device 2
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error
Jan 26 09:32:32 tara kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 02 eb 9e 23 00 00 04 00
Jan 26 09:32:32 tara kernel: print_req_error: critical medium error, dev sdc, sector 391967000

I'm guessing that the errors are due to btrfs trying to write accounting data to the disk (even though it is completely idle).

Even when mounted ro, btrfs may try to write to a disk. From the documentation of the -o nologreplay mount option:

        Warning
           currently, the tree log is replayed even with a read-only
           mount! To disable that behaviour, mount also with nologreplay.
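
For reference, a read-only mount that also skips log replay would look something like this (the member device path is a placeholder):

    sudo mount -o ro,nologreplay /dev/mapper/<member-dev> /mountpoint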

How can I speed up the process?

This article says that a replace will continue after reboot.

I'm thinking:

  1. Cancel the current replace
  2. Remove the failed disk
  3. mount -o degraded,rw

At that point, I propose to simultaneously (see the command sketch after this list):

  1. Allow replace to continue without the failed disk present (a recent scrub showed that the good disk has all the data)
  2. Convert the data to single to allow mounting again rw in case of power outage during the process
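
In command form, the first list plus the convert would be roughly as follows (the surviving device path and mount point are placeholders):

    sudo btrfs replace cancel /mountpoint
    # physically remove or disable the failed disk, then:
    sudo mount -o degraded,rw /dev/mapper/<good-dev> /mountpoint
    # while the replace continues, convert data to the single profile:
    sudo btrfs balance start -dconvert=single /mountpoint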

Is this a sound plan to have the replace complete earlier?

My calculations say 6.5 hours (not 10 days) would be feasible given disk I/O speeds.
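
(Roughly: 3.6 TiB ≈ 3.96 × 10^12 bytes, which at a sequential speed of around 160-170 MB/s works out to about 6.5-7 hours.)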

Tom Hale
  • If you convert the data to single (and metadata to dup), why would you need to bother with the replace? After the profile change you should be able to remove the bad device, add the replacement, change your profile back to RAID1 and then re-balance. I suspect that if you attempt to use replace after changing your profile, replace isn't going to "know" what to do; since you won't have RAID1 anymore, replace won't know how to complete the steps described above. – Emmanuel Rosa Jan 26 '19 at 03:16
  • @EmmanuelRosa I'm proposing to do the convert after the replace (re)starts. I've read that doing a replace is much faster than doing a balance. There must have been a reason that replace was created later (as opposed to continuing to use add then remove, which rebalances as part of the remove). – Tom Hale Jan 26 '19 at 06:34
  • This says replace is 2-3x faster than rebalancing at remove. I also read that replace operates at 90% of the disk's I/O capacity, which is perhaps why. However, things may be different if the failed drive is already removed. – Tom Hale Jan 26 '19 at 07:04

4 Answers

2

In this particular case, the 5 minutes of waiting that you see is the kernel trying to access data on the failed disk. In most cases this is intentional, as it increases the chance of recovering complete data. Most likely btrfs is skipping the damaged data, but the kernel still tries to access the damaged drive for some time before giving up.

According to https://superuser.com/questions/905811/faster-recovery-from-a-disk-with-bad-sectors, you can set the drive's TLER (SCT ERC) limit so that it gives up on unreadable sectors sooner instead of retrying for a long time.

To check default values:

# smartctl -l scterc /dev/sda

To set it to 5 seconds on both write and read:

# smartctl -l scterc,50,50 /dev/sda
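
The values are in tenths of a second, so 50 means 5 seconds. In this question's case the failing drive is /dev/sdc, so (assuming smartctl can talk to it directly) the same setting would be:

# smartctl -l scterc,50,50 /dev/sdc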

Be aware that "disabled" means the drive will spend an unlimited time on recovery; setting a limit can therefore reduce the amount of data recovered from the disk.

1

If you have important data on a failing drive, the program you want is ddrescue.

First, copy out anything vital

If there is any data on the filesystem you can't do without for a prolonged time, then do this first.

  1. Disconnect the failing drive.

  2. Mount the filesystem as Read-Only and Degraded.

    sudo mount -o degraded,ro /dev/sdX /mount/dir
    
  3. Copy out the data you need to another location.

Then ddrescue the drive

Now to get the rest of the data, we create an image of the drive with ddrescue.

  1. Unmount the Btrfs filesystem. Do not mount it. Do not mount it as read only.

  2. Have a new drive that is formatted Ext4, or Btrfs with Copy On Write disabled.
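
     If the destination is Btrfs, one way to disable copy-on-write is to set the NOCOW attribute on the (empty) directory that will hold the image; this is just a sketch, with the path as an example:

    sudo mkdir -p /path/to
    sudo chattr +C /path/to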

  3. Run ddrescue to create an image of the dying drive on the new drive:

    sudo ddrescue /dev/sdX /path/to/save.img /path/to/save.map
    
  4. Ddrescue could take hours or even days to finish, depending on the drive's size and speed. It may also never complete if the drive is failing too badly. The time you allow it to rescue is up to you.

  5. When ddrescue finishes, remove/disconnect the failing drive and do not reconnect it again.

  6. Mount the drive image to a loop device.

    sudo losetup -Pf --show /path/to/save.img
    
  7. Now you should be able to mount your Btrfs raid filesystem with the normal mount command, and not degraded mode. It will automatically use the loop image device in place of the missing drive.

  8. After you have mounted the Btrfs drive, immediately run a scrub on it to repair any data that ddrescue may have been unable to recover.
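
     A sketch of that scrub, using the same mount point as above:

    sudo btrfs scrub start /mount/dir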

From there you have 2 options. You can keep running the Btrfs filesystem with the loop device, or you can replace the loop device with another drive.

Rucent88
0

This answer mentions writes to the failed disk causing the replace to grind to a halt.

It suggests using dmsetup to set up a COW device on top of the failed disk so that any writes appear to succeed.

Caution: In this case the filesystem was enclosed within a dmcrypt device. See my comment regarding the "gotcha" and potential data loss if this is not the case.
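
For illustration, the overlay looks roughly like this (the device name, overlay file size and chunk size are assumptions; see the linked answer for the exact procedure):

    # size of the failing device in 512-byte sectors
    SIZE=$(sudo blockdev --getsz /dev/mapper/vg4TBd2-ark)
    # sparse file, attached to a loop device, to absorb the writes
    truncate -s 4G /tmp/cow-file
    LOOP=$(sudo losetup -f --show /tmp/cow-file)
    # reads come from the failing device, writes land in the COW file
    sudo dmsetup create failed-cow \
        --table "0 $SIZE snapshot /dev/mapper/vg4TBd2-ark $LOOP N 8"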

Tom Hale
0

Given that the replace was crawling, I did the following:

  1. Ensured that the degraded filesystem was noauto in /etc/fstab
  2. Rebooted the machine (which took about 20 minutes due to I/O hangs)
  3. Disabled the LVM VG containing the btrfs fs on the failed drive:

    sudo vgchange -an <failed-vg>
    
  4. Disabled the failed device:

    echo 1 | sudo tee /sys/block/sdb/device/delete
    
  5. Mounted the filesystem -o ro,degraded (degraded can only be used once)
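
     For example (the surviving device path is a placeholder):

    sudo mount -o ro,degraded /dev/mapper/<good-dev> /mountpoint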

  6. Checked replace status and saw it was suspended:

    Started on 26.Jan 00:36:12, suspended on 26.Jan 10:13:30 at 4.1%, 0 write errs, 0 
    
  7. Mounted -o remount,rw and saw the replace continue:

    kernel: BTRFS info (device dm-5): continuing dev_replace from <missing disk> (devid 2) to target /dev/mapper/vg6TBd1-ark @4%
    

As I'm writing this:

  • replace status shows a healthy 0.1% progress every 30 seconds or so
  • iostat -d 1 -m <target-dev> shows about 145MB/s (Seagate advertises 160MB/s)

Update:

After completion, I noticed that btrfs device usage /mountpoint was showing some Data,DUP and Metadata,single, rather than only RAID1, so I rebalanced:

btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mountpoint

Also, consider resizing if both devices now contain slack:

btrfs filesystem resize max /mountpoint

I would also recommend a scrub, as I had 262016 correctable csum errors seemingly related to the interrupted replace.

Tom Hale