MAAS+LXD Network booting

Hi folks,

I’m trying to set up an LXD environment where VMs network boot from MAAS. I got the networking set up properly and MAAS is receiving the boot requests from the LXD VMs, but the VMs stop without any clear indication of what is stopping them:

LXC_IMAGE=6bc6c743ff33 # ubuntu/focal/cloud
root@buneary:/home/ubuntu# sudo lxc init --vm images:${LXC_IMAGE} vm --profile vms -c security.secureboot=false
root@buneary:/home/ubuntu# sudo lxc config device override vm root size=${DATA_DISK_SIZE}
root@buneary:/home/ubuntu# lxc start vm
root@buneary:/home/ubuntu# lxc info --show-log vm
Name: vm
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Created: 2022/03/22 18:39 UTC
Last Used: 2022/03/22 18:50 UTC

Log:

warning: tap: open vhost char device failed: Permission denied
warning: tap: open vhost char device failed: Permission denied

I got this same setup working a few days ago after changing the lxc image from ‘images:ubuntu/20.04’ → ‘images:ubuntu/20.04/cloud’, but it seems that image has since been deleted, and I can’t get it to work again. I still have one VM running in one of the envs in case I need to get any information about its config.

The output of the console also is not too helpful: lxd.boot.log · GitHub

Some info about the setup:
lxc config show vm --expanded
qemu.conf
MAAS: 3.1/stable
LXD: 4.24/stable

Any idea what the problem could be, or how to get more info about it?

Do you see anything if you start the VM with the --console flag?

Also, can you run lxc monitor in one window, then try starting the VM in another and post the output here? It may reveal where in the process it’s going wrong.
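A minimal sketch of that two-window workflow, assuming the instance is named vm as in the original post. The function names are hypothetical helpers; the lxc subcommands and flags are the standard LXD client ones.

```shell
#!/bin/sh
# Window 1: stream LXD daemon log events in a human-readable form.
watch_lxd_logs() {
    lxc monitor --type=logging --pretty
}

# Window 2: start the instance and attach to its console right away,
# so the early firmware and PXE output is not missed.
start_with_console() {
    instance="$1"
    lxc start "$instance" --console
}
```

Run watch_lxd_logs in one terminal first, then start_with_console vm in a second one, and copy the output of both.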

Yes, I can open the console right after and get this:

>>Start PXE over IPv4.
  Station IP address is 10.10.20.228

  Server IP address is 10.10.20.2
  NBP filename is bootx64.efi
  NBP filesize is 955656 Bytes
 Downloading NBP file...

  NBP file downloaded successfully.
BdsDxe: loading Boot0002 "UEFI PXEv4 (MAC:00163E6E6DCA)" from PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163E6E6DCA,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
BdsDxe: starting Boot0002 "UEFI PXEv4 (MAC:00163E6E6DCA)" from PciRoot(0x0)/Pci(0x1,0x4)/Pci(0x0,0x0)/MAC(00163E6E6DCA,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
Fetching Netboot Image
Booting under MAAS direction...

In another host with debug:

Are you saying that the gist is from another working server?

No, the gist is also from the one that is failing, but that one has low-level debugging enabled. The working VM is one that I was able to boot a while ago, so I didn’t delete it just in case I needed to get some info from it.

Well, I’m certainly no expert in booting from MAAS, but this looks suspicious:

kern/disk.c:196: Opening `http,10.10.20.2:5248'...
disk/efi/efidisk.c:482: opening http
kern/disk.c:281: Opening `http,10.10.20.2:5248' failed.
kern/disk.c:295: Closing `http'.

Any ideas @stgraber ?

tcpdump output between the VM and MAAS may be useful to figure out what’s going on here.
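One way to take such a capture, as a rough sketch. The bridge name and output path are placeholders you need to supply (they are not given in this thread); the VM address 10.10.20.228 comes from the console output above. The filter keeps DHCP traffic (which starts before the VM has an address) plus everything to or from the VM, which covers the TFTP transfer and the HTTP requests to port 5248 seen in the GRUB debug log.

```shell
#!/bin/sh
# Build a capture filter: DHCP on UDP 67/68 (the initial discover comes from
# 0.0.0.0, so it cannot be matched by host), plus all traffic for the VM IP.
pxe_filter() {
    vm_ip="$1"
    printf 'udp port 67 or udp port 68 or host %s' "$vm_ip"
}

# Capture on the host-side bridge the VM NIC is attached to and write a pcap
# file that can be attached here or opened in Wireshark.
# $1 = bridge interface (assumption: depends on your network setup)
# $2 = VM IP address, $3 = output file
capture_pxe() {
    bridge="$1"; vm_ip="$2"; out="$3"
    tcpdump -n -i "$bridge" -w "$out" "$(pxe_filter "$vm_ip")"
}
```

For example, capture_pxe BRIDGE 10.10.20.228 /tmp/maas-boot.pcap (run as root, with BRIDGE replaced by your interface), started before lxc start vm and stopped with Ctrl-C once the VM has died.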

Our own CI heavily relies on MAAS VMs running on LXD and has been working fairly well recently. We’ve had issues in the past when using HTTP boot rather than PXE boot, but as you’re using PXE in this case, it should work.

What kind of info would you want to see from that? From the debug output on the console, it seems the image was transferred successfully over the network, so the VM must be dying after QEMU starts executing it.

Is there a way to attach GDB to that process, or to manually execute the same steps that LXD performs?

Regarding your environment, can you tell me which versions you are running?

Mine are:

Host Version: 20.04.4 LTS (Focal Fossa)
Host Kernel: 5.4.0-105-generic
QEMU Version: 4.2.1 (Debian 1:4.2-3ubuntu6.21)
LXD: 4.23
LXD image: 6bc6c743ff33 # ubuntu/focal/cloud
MAAS: 3.1.0-10901-g.f1f8f1505
MAAS Enlisting Kernel Version: Focal ga-20.04

Host is 20.04, kernel is 5.16, LXD is 4.24, QEMU is 6.1.1.
Image shouldn’t matter because the VM is PXE booting from MAAS.

The fact that you’re not using the snap may explain some of it though.
The snap package of LXD includes a build of the EDK2 UEFI firmware which is likely much more recent than what’s in Debian and is built with specific config tweaks that we found makes it work best with QEMU.

It is LXD from the snap, just not the latest one. I’ll upgrade to those versions and try again.

Ah, okay, but even LXD 4.23 should be on QEMU 6.1.1 then.
Can you show the output of lxc info? That should report the version of QEMU from the snap.
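A small sketch for pulling just the relevant lines out of that output. It assumes the environment section of lxc info contains driver and driver_version keys, as recalled from LXD 4.x; treat the exact key names as an assumption and fall back to reading the full output if nothing matches. The function name is a hypothetical helper.

```shell
#!/bin/sh
# Print only the driver lines from `lxc info`. These are expected to list
# the LXC and QEMU versions that the LXD daemon is actually using.
show_driver_versions() {
    lxc info | grep -E '^[[:space:]]*driver(_version)?:'
}
```

If the QEMU version shown there differs from what qemu-system-x86_64 --version reports on the host, the host package is not the one LXD is running.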

Not that it should really matter, our CI environment has been working for a couple of years so a newer QEMU or LXD shouldn’t really change anything.

So, yeah, those versions should indeed work. I even got it working once, and the only thing I changed at the time was switching to images:ubuntu/20.04/cloud instead of images:ubuntu/20.04. A few days later I ran the very same script again, and that image had vanished from the image list; only an arm image was being downloaded, so I had to pin the other one I pointed to previously. In the end, even that change doesn’t make sense to me, because at this early stage I understand that the MAAS image+initrd should be what is used. If the image were the problem, I should at least be able to get some output.

Our VMs here are all using effectively:

  • lxc init vm1 --empty --vm -c limits.cpu=4 -c limits.memory=8GiB -c security.secureboot=false
  • lxc config device override vm1 root size=50GiB
  • lxc config device override vm1 eth0 boot.priority=10

So they’re not created from an image at all, they’re completely blank and just have CPU, memory and disk set properly and the NIC configured as the boot priority.
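The steps above can be wrapped into one helper, sketched here under a few assumptions: the function name and its parameters are hypothetical, and the profile applied to the instance must already provide devices named root and eth0, since lxc config device override only works on devices inherited from a profile. Note that override takes the instance name first, then the device name.

```shell
#!/bin/sh
# Create a blank VM (no image) that tries its NIC first when booting.
# $1 = instance name, $2 = CPU count, $3 = memory, $4 = root disk size
create_pxe_vm() {
    name="$1"; cpus="$2"; mem="$3"; disk="$4"
    lxc init "$name" --empty --vm \
        -c limits.cpu="$cpus" -c limits.memory="$mem" \
        -c security.secureboot=false &&
    lxc config device override "$name" root size="$disk" &&
    lxc config device override "$name" eth0 boot.priority=10
}
```

Calling create_pxe_vm vm1 4 8GiB 50GiB reproduces the three commands listed above; lxc start vm1 --console then lets you watch the PXE boot from the start.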