I hope that this is not a duplicate question. I have seen several similar questions, where the answer was to blacklist the respective device or partition. But in my case, I can't do that (see below). Having said this:

On a Debian buster x64 host, I have created a VM (based on QEMU). The VM runs on a block device partition, let's say /dev/sdc1. I installed the Debian guest system on that partition roughly as follows (some steps omitted):

#> mkfs.ext4 -j /dev/sdc1
#> mount /dev/sdc1 /mnt/target
#> debootstrap ... bullseye /mnt/target

Then I bind-mounted the necessary directories (/dev, /sys etc.), chrooted into /mnt/target, completed the guest OS installation and booted the VM.

The VM first started without issues. But with every VM reboot, the VM got more problems, which I was repairing at the GRUB and initramfs prompts, until repairing was not possible any more because obviously the ext4 file system had been damaged.

Because I originally thought that I had done something wrong, e.g. forgot to unmount the ext4 partition before starting the VM, I repeated the whole installation from scratch multiple times. The result was the same in every case: After a few restarts, the ext4 file system was so damaged that I couldn't repair it.

By accident, I have found the reason for this, but I have no idea how to solve the problem. I noticed that e2fsck refused to operate on that partition, claiming that it was in use although it was not mounted and the VM was not running. Further investigation showed that a kernel thread named jbd2/sdc existed.

That means that the host kernel accesses the journal on that partition / file system. When I start the VM, the guest kernel of course does the same. I am nearly sure that the corruption of the file system is due to both kernels accessing the file system, notably the journal, at the same time.
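A minimal sketch of how the presence of such a journal thread could be checked on the host. The helper name jbd2_threads_for is hypothetical, and the sketch assumes the thread name starts with jbd2/ followed by the device name, as observed above:

```shell
#!/bin/sh
# Print jbd2 kernel thread names that refer to the given device prefix.
# Process names are read from stdin, so the filter itself needs no root.
# jbd2_threads_for is a hypothetical helper name, not an existing tool.
jbd2_threads_for() {
    # Assumption: journal threads are named after the device, e.g. "jbd2/sdc1-8".
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep "^jbd2/$1" || true
}

# Typical use on the host, after umount and with the VM stopped:
#   ps -e -o comm= | jbd2_threads_for sdc
```

If this prints anything while the partition is unmounted and the VM is stopped, the host kernel still holds the journal.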

How can I solve the problem?

I cannot blacklist the respective disk or the respective partition on the host, because I need to mount them there to prepare or complete the guest OS installation in a chroot. On the other hand, it doesn't seem possible to tell the host kernel to release the journal as soon as the VM starts.

I have installed a lot of VMs in the past years exactly the same way, but did not turn on the journal when creating their ext4 file system. Consequently, I didn't have that issue with those VMs.

Edit 1

In case it is relevant, when mounting the partition and chrooting into it to complete the guest OS installation, I use the following commands:

cd /mnt
mkdir target

mount /dev/sdc1 target

mount --rbind /dev target/dev
mount --make-rslave target/dev
mount --rbind /proc target/proc
mount --make-rslave target/proc
mount --rbind /sys target/sys
mount --make-rslave target/sys

LANG=C.UTF-8 chroot target /bin/bash --login

When unmounting, I just do

umount -R target

The umount command does not report any error.
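To double-check that nothing is left mounted at or below the target, the mount table can be filtered, roughly as sketched here. The helper name mounts_under is hypothetical; the sketch relies on the fifth field of /proc/self/mountinfo being the mount point:

```shell
#!/bin/sh
# List mount points at or below a directory, based on a mountinfo-style file.
# mounts_under is a hypothetical helper name; the second argument is optional
# and defaults to the mount table of the current process.
mounts_under() {
    # Field 5 of each mountinfo line is the mount point.
    awk -v t="$1" '$5 == t || index($5, t "/") == 1 { print $5 }' \
        "${2:-/proc/self/mountinfo}"
}

# Typical use after "umount -R target":
#   mounts_under /mnt/target
```

An empty result only shows that the current mount namespace has no such mount; it does not prove that the kernel has released the journal.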

Binarus
  • 3,340
  • 1
    Pass -o norecovery to mount ? – steve Jun 22 '22 at 06:49
  • After installation, make sure to exit chroot, umount the partition you installed the guest OS on and then start QEMU. There is no need for it to be mounted on host after you finish installation. – Vilinkameni Jun 22 '22 at 06:51
  • Of course, I exited the chroot and unmounted the partition before starting the VM. The problem is that the host kernel nevertheless puts its hands on the journal. Therefore, I even can't do e2fsck on that partition after having it unmounted. I even can't create a new ext4 file system on that partition after having it unmounted! – Binarus Jun 22 '22 at 06:55
  • Or just bin the journal altogether? tune2fs -O ^has_journal /dev/sdXY – steve Jun 22 '22 at 07:00
  • Did the umount command give any errors? Basically, what you described shouldn't happen if the partition was successfully unmounted. – Vilinkameni Jun 22 '22 at 07:00
  • 1
    @steve Thanks for your suggestions. The first one looks promising. However, I am still asking myself why the host kernel grabs that journal even after having dismounted the partition. I don't want to dump the journal completely, though. I'll try your first suggestion and report back. – Binarus Jun 22 '22 at 07:05
  • Also, run e2fsck -p /dev/sdc1, where /dev/sdc1 is the guest's partition, on the host when it is not mounted and not used by QEMU. – Vilinkameni Jun 22 '22 at 07:05
  • @Vilinkameni The weird thing is that umount did not give any errors. Maybe the problem is due to the rbinds? I'll update my question to show them exactly. – Binarus Jun 22 '22 at 07:06
  • @Vilinkameni The problem is that I can't even run e2fsck on that partition because e2fsck claims that it is in use after unmounting it. That was what put me on the right track regarding the cause of the problem ... – Binarus Jun 22 '22 at 07:09
  • You can try using lsof(1) to find out what process is using specific files. – Vilinkameni Jun 22 '22 at 07:11
  • Yes, as described in my post, I did find out that there is a host kernel thread jbd2/sdc even after the file system has been dismounted; I was using lsof to find that out. This means that the host kernel accesses the journal of that ext4 file system even after it has been unmounted. – Binarus Jun 22 '22 at 07:18
  • 1
    Is there a reason why --rbind and --make-rslave and double calls of mount for each mountpoint are necessary instead of single mount calls per mountpoint with a simple -B? That might be causing the issue. – Vilinkameni Jun 22 '22 at 09:05
  • 1
    Also the comments in this question seem to suggest this might be a kernel bug or related to mount namespaces. What does the /proc/self/mountinfo show? – Vilinkameni Jun 22 '22 at 09:20
  • @Vilinkameni mount as well as /proc/self/mountinfo as well as /proc/self/mounts did not output / contain an entry relating to the partition or directory in question. Neither did lsof or fuser. I believe that I am a victim of that kernel bug. debian buster comes with 4.19 with debian patches. – Binarus Jun 22 '22 at 10:18
  • @Vilinkameni There are multiple reasons not to use mount -B: It does not bind recursively, and nowadays it does a shared bind (instead of the usual private bind) due to some changes in systemd, which is fatal in use cases like the one described above. For more information, see https://wiki.debian.org/systemd#Known_Issues_and_Workarounds – Binarus Jun 22 '22 at 10:24
  • 1
    @steve Your first suggestion seems to have mitigated the problem. Thank you very much! When I consistently use -o norecovery, the host kernel does not put its hands on the ext4 partition's journal, and there are no jbd2/sdc entries any more in the output of lsof. If you make your comment an answer, I'll accept it. Besides that, I guess that the Debian kernel is buggy: I still can't even run e2fsck on that partition once I have mounted and unmounted it, but at least it doesn't damage the file system any more. – Binarus Jun 22 '22 at 10:30
  • 1
    why do you think you need to install debian in a VM using debootstrap? it's better AND easier to just install from a CD/DVD image .iso file. debootstrap's fine for a chroot or container, but a VM should have a proper install just like a physical machine. – cas Jun 22 '22 at 11:42
  • @cas I am not sure whether the sort of installation you propose wouldn't produce the same problem. But thanks for the tip, I'll try it out of curiosity. – Binarus Jun 22 '22 at 15:59

1 Answer

By passing -o norecovery to mount, you could mount the filesystem without making use of the journal at all.

Man page for mount, ext3 section:

norecovery/noload

Don't load the journal on mounting. Note that if the filesystem was not unmounted cleanly, skipping the journal replay will lead to the filesystem containing inconsistencies that can lead to any number of problems.
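As a rough sketch, the host-side mount from the question could be wrapped like this. The helper name mount_guest_norecovery and the DRY_RUN variable are hypothetical, and /dev/sdc1 and /mnt/target are simply the example paths used in the question:

```shell
#!/bin/sh
# Mount a guest partition on the host without loading its journal.
# mount_guest_norecovery and DRY_RUN are hypothetical names for illustration.
mount_guest_norecovery() {
    dev=$1
    target=$2
    # With DRY_RUN set to a non-empty value, only print the command.
    ${DRY_RUN:+echo} mount -o norecovery "$dev" "$target"
}

# Print the command without running it:
#   DRY_RUN=1 mount_guest_norecovery /dev/sdc1 /mnt/target
# Run it for real (needs root):
#   mount_guest_norecovery /dev/sdc1 /mnt/target
```

Given the man page warning above, this is probably only safe when the guest was shut down cleanly before the partition is mounted on the host.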

steve
  • 21,892