
I'm running Debian Stable (Debian 12, Bookworm) on a new Dell PowerEdge R760xd2 server with 24 disks and 256 GByte of RAM. The initial installation (including a reboot into the newly installed OS) worked fine, but now GRUB fails to start:

error: no such device: [some UUID].
Loading Linux 6.1.0-17-amd64
error: out of memory.
Loading initial ramdisk ...
error: you need to load the kernel first.

As you can see, GRUB is unable to load the kernel itself, so this is unrelated to any possible ramdisk (initrd) issues.

I also observed:

  • "Welcome to GRUB!" takes around a minute
  • when I remove a (virtual) bootable CD while this happens, I see error messages related to several disks
  • ls (hd22,gpt1)/ gives out of memory (in the recovery console)
  • enabling/disabling Secure Boot does not change any of this
  • with a bootable image (grml) in the virtual CD drive, data is read from that device while "Welcome to GRUB!" is shown: 297 MByte out of the 493 MByte image. With the CD present, the "Welcome to GRUB!" phase takes much longer
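Taken together, a session in the GRUB rescue console illustrating the failure might look like this (a sketch: the device list is abbreviated and hypothetical; only the error line is as reported above):

```
grub rescue> ls
(hd0) (hd0,gpt1) (hd0,gpt2) ... (hd22,gpt1) (hd22,gpt2) (cd0)
grub rescue> ls (hd22,gpt1)/
error: out of memory.
```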

I'm using UEFI and created a 500 MByte EFI System Partition (using Debian's installer). The boot device is a hardware RAID1 made from two of the disks.

Between the previous successful reboot and the failure, I configured ZFS on 22 of the 24 disks. Furthermore, the remaining space on the boot RAID1 is now also used as a second zpool (ZFS). I think each of the 22 disks now has two (GPT?) partitions, but I don't know why, as I handed the whole disks to ZFS.
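For context: when ZFS on Linux is given a whole disk, it typically writes its own GPT with a large pool partition and a small reserved partition, which would explain the two partitions per disk. A hypothetical lsblk view of one such disk (sizes and device names are placeholders, not from this system):

```
# lsblk output for one of the 22 ZFS disks (illustrative only)
sdc      10.9T  disk
├─sdc1   10.9T  part   zfs_member   # the pool data
└─sdc9      8M  part                # small reserved partition created by ZFS
```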

My gut feeling is that grub scans all disks and is a bit overwhelmed by the sheer number of disks/partitions.

How can I get the system to boot again?

C-Otto
  • You could try your luck with GRUB 2.12? One of the mentioned changes is "Support for dynamic GRUB runtime memory addition using firmware calls", which could be what you need - and I'm not sure if I remember correctly, but there may have been a few other memory-related bugfixes recently as well. Otherwise you can avoid scanning all devices for the UUID by providing a --hint in your GRUB configuration, or just boot off a static device directly, not using a UUID at all...? – frostschutz Jan 18 '24 at 22:24
  • Another option might be to remove functionality (modules) from GRUB that is not boot-related; so if your kernel/initramfs is not on ZFS but on the EFI partition, remove zfs and other unrelated filesystem modules? – frostschutz Jan 18 '24 at 22:29
  • The ZFS data is not necessary to boot. I only saw GPT and ext2 modules in grub. Currently, I fail to boot any kernel (including using a recovery disk), and I'll experiment with your ideas once I get into some kind of running system. – C-Otto Jan 18 '24 at 22:32
  • Does the GRUB shell still work? You can check loaded modules with lsmod and remove them with rmmod. Not sure if that would free up memory, if it's something like the zfs module using it... – frostschutz Jan 18 '24 at 22:34
  • What's the maximum disk size GRUB can use? I don't think your issue is the number of disks. Disconnect the 22 disks that aren't being used and see if the system boots. Once you get the system booting with 2 disks, rebuild the storage array using the PowerEdge UEFI config tool and configure your ZFS pools. The UEFI partition only needs to be type FAT32 with the boot flag set, on the FIRST DISK ONLY. I believe the UUIDs are assigned to individual zpools, not the actual disks. – eyoung100 Jan 18 '24 at 22:59

1 Answer


I got it working.

  1. Change boot to "BIOS" (instead of UEFI)
  2. Boot grml (or some other kind of recovery disk); this wasn't possible with UEFI, as I wasn't able to figure out how to change the boot order.
  3. Add --hint hd22,gpt2 to the search command in grub.cfg on the UEFI partition
  4. Reboot and change back to UEFI
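For reference, the edit in step 3 might look roughly like this in grub.cfg (a sketch; the UUID shown is a placeholder, and --hint is an option of GRUB's search command):

```
# Before: GRUB scans every disk/partition for the filesystem UUID
#   search --no-floppy --fs-uuid --set=root 1234abcd-...
# After: the hint tells GRUB which device to try first
search --no-floppy --fs-uuid --set=root --hint hd22,gpt2 1234abcd-...
```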

Aside from not throwing errors, GRUB was also MUCH faster. This makes me believe that searching all disks for the UUID is the issue, and providing a hint fixed that. This doesn't sound like a long-term solution, though.

C-Otto