How to debug VM boot on LXD?


Hi,

Since a power outage on my LXD cluster two days ago, I’m not able to boot existing or new VMs with the ubuntu/focal/cloud image.
I’m not sure whether this is due to the power outage or just the reboot, but after trying many things I have no solution.

When I boot an Ubuntu VM the console is just dark: no text, no error, nothing.
It works with other images like debian/10 (why? :man_shrugging: ).

Is there a way to enable debug output for the VM, something like lxc.loglevel?

Details of my installation

Hardware: 4x Raspberry Pi 4 8GB
OS: Ubuntu 20.04
LXD version: 4.12 (from snap)
VM storage pool: btrfs

Thanks,
Élie

You can try starting it with the --console flag to see the boot messages.

Perhaps the image volume that is the basis of both new and existing instances has been corrupted.

Can you launch VMs using a different image?
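For reference, a few commands that tend to be useful for this kind of debugging (the instance name vm-ubuntu is just an example):

$ lxc start vm-ubuntu --console     # attach to the VM console during boot
$ lxc info vm-ubuntu --show-log     # show the instance log kept by LXD
$ lxc monitor --pretty              # stream LXD daemon events/log messages while the instance starts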

Hi Thomas,
Thanks for your help

$ lxc start vm-ubuntu --console
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.

Yes, I can launch VMs using debian/10 without problems.

Debian 10 console
$ lxc console debian10
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
System BootOrder not found.  Initializing defaults.
Creating boot entry "Boot0008" with label "debian" for file "\EFI\debian\shimaa64.efi"
Loading Linux 4.19.0-14-arm64 ...
Loading initial ramdisk ...

Debian GNU/Linux 10 distrobuilder-47bec3f2-aaef-416a-9548-54435aad19ed ttyAMA0

distrobuilder-47bec3f2-aaef-416a-9548-54435aad19ed login:

Perhaps the image volume that is the basis of both new and existing instances has been corrupted.

I had deleted the ubuntu/focal/cloud image before recreating the new VMs.

What storage pool type are you using?

I’m using btrfs

$ lxc storage ls
+---------+-------------+--------+---------+---------+
|  NAME   | DESCRIPTION | DRIVER |  STATE  | USED BY |
+---------+-------------+--------+---------+---------+
| default |             | btrfs  | CREATED | 8       |
+---------+-------------+--------+---------+---------+

Do you see any messages in dmesg output about btrfs corruption (perhaps csum errors)?
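For reference, something like the following will surface any btrfs complaints; the mount point is a placeholder for wherever your pool lives:

$ dmesg -T | grep -iE 'btrfs|csum'
$ sudo btrfs scrub start -Bdr /path/to/pool-mount    # read-only scrub, verifies checksums without changing anything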

No, as you can see I only have these 6 messages (at VM start):

[Thu Mar 25 15:33:41 2021] br0: port 2(tap69c8c68e) entered blocking state
[Thu Mar 25 15:33:41 2021] br0: port 2(tap69c8c68e) entered disabled state
[Thu Mar 25 15:33:41 2021] device tap69c8c68e entered promiscuous mode
[Thu Mar 25 15:33:41 2021] audit: type=1400 audit(1616686420.917:82): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-hass_</var/snap/lxd/common/lxd>" pid=140313 comm="apparmor_parser"
[Thu Mar 25 15:33:42 2021] br0: port 2(tap69c8c68e) entered blocking state
[Thu Mar 25 15:33:42 2021] br0: port 2(tap69c8c68e) entered forwarding state

I was able to export the VM and mount the filesystem from the image; there is no corruption, I guess.
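For reference, the export and inspection were along these lines (file names are only examples, and the exact path of the root disk inside the tarball may differ):

$ lxc export vm-ubuntu vm-ubuntu-backup.tar.gz
$ tar -xzf vm-ubuntu-backup.tar.gz
$ qemu-img info <extracted-root-disk>    # placeholder path; the image can also be mounted read-only for inspection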

Oh I found this:

BTRFS error (device mmcblk0p3): qgroup generation mismatch, marked as inconsistent

I’ll move all my VMs and containers to another node and try to repair it.
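For reference, stopped instances can be relocated between cluster members with something like this (names are examples):

$ lxc stop vm-ubuntu
$ lxc move vm-ubuntu --target rpi02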

I still have errors on other nodes:

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi02 -c limits.cpu=4 -c limits.memory=4GB --console
Creating ubuntu-focal
Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.

(and then a black screen)

There are no BTRFS errors on my 3 other nodes.

Can you try creating a new BTRFS storage pool on the original node and launching a VM again? That way you can rule out any issues with the current storage pool. Are you using a loopback file for the storage pools or a directory on the host?
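For reference, a loop-backed btrfs pool can be created on a cluster roughly like this (pool name and size are examples; each member needs a pending entry before the final create):

$ lxc storage create test-btrfs btrfs size=10GB --target=rpi01
$ lxc storage create test-btrfs btrfs size=10GB --target=rpi02
$ lxc storage create test-btrfs btrfs size=10GB --target=rpi03
$ lxc storage create test-btrfs btrfs size=10GB --target=rpi04
$ lxc storage create test-btrfs btrfs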

I use a dedicated partition.
I created a new storage pool with a new profile and my VM still doesn’t boot up :cry:

$ lxc storage ls --format yaml
- config:
    source: /dev/mmcblk0p3
    volatile.initial_source: /dev/mmcblk0p3
  description: ""
  name: default
  driver: btrfs
  used_by:
  - /1.0/profiles/default
  status: Created
  locations:
  - rpi01
  - rpi03
  - rpi02
  - rpi04
- config:
    size: 5GB
    source: /var/snap/lxd/common/lxd/disks/test1.img
  description: ""
  name: test1
  driver: btrfs
  used_by:
  - /1.0/instances/ubuntu-focal
  - /1.0/profiles/test1
  status: Created
  locations:
  - rpi01
  - rpi03
  - rpi02
  - rpi04

After waiting for a while attached to the console, I can see this error:

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi01 --console -p test1
Creating ubuntu-focal

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.
[  242.666848] INFO: task sfdisk:338 blocked for more than 120 seconds.
[  242.675632]       Not tainted 5.4.0-70-generic #78-Ubuntu
[  242.681203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

PS: I solved my BTRFS problems; dmesg is now clean.

Can you launch new VMs of the same image in the new storage pool? I’m trying to work out whether the VM itself and the source image on the original pool are corrupt, or whether it’s something else on your system.

@stgraber any ideas about repairing a potentially damaged btrfs filesystem after power cut?

How did you solve the btrfs dmesg issue?

Your VM size looks a bit small, have you tried specifying a larger size, say 10GB?
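For reference, if it is the instance’s root disk rather than the pool that needs growing, the size can be overridden per instance with something like this (names and size are examples):

$ lxc init images:ubuntu/focal/cloud ubuntu-focal --vm -s test1
$ lxc config device override ubuntu-focal root size=10GB
$ lxc start ubuntu-focal --console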

Can you launch new VMs of the same image in the new storage pool? I’m trying to work out whether the VM itself and the source image on the original pool are corrupt, or whether it’s something else on your system.

I created it on the new storage pool in my previous message (the storage pool is attached to my profile):

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi01 --console -p test1
Creating ubuntu-focal

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.
[  242.666848] INFO: task sfdisk:338 blocked for more than 120 seconds.
[  242.675632]       Not tainted 5.4.0-70-generic #78-Ubuntu
[  242.681203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

How did you solve the btrfs dmesg issue?

I did this (twice):

# unmount the pool from inside the snap's mount namespace
nsenter -t $(pgrep daemon.start) -m -- umount /var/snap/lxd/common/lxd/storage-pools/default
# repair the filesystem using the btrfs tools shipped in the snap
nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs check --repair /dev/mmcblk0p3
# rescan/rebuild the qgroup (quota) accounting
nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs quota rescan /var/snap/lxd/common/lxd/storage-pools/default

Your VM size looks a bit small, have you tried specifying a larger size, say 10GB?

I recreated my pool with 10GB and it’s still the same:

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi01 -p test1 -s test1 --console
Creating ubuntu-focal

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.
$ lxc profile show test1
config: {}
description: ""
devices:
  root:
    path: /
    pool: test1
    type: disk
name: test1
used_by:
- /1.0/instances/ubuntu-focal

$ lxc storage info test1 --target=rpi01
info:
  description: ""
  driver: btrfs
  name: test1
  space used: 4.39GB
  total space: 10.00GB
used by:
  images:
  - 0b37fd286f09bfba70ffb9c9a7ba020a779945fd94dc3d29e326c4f3a47753ca
  instances:
  - ubuntu-focal
  profiles:
  - test1

I tested with a dir storage pool and it doesn’t work either :cry:

I found this message in /var/snap/lxd/common/lxd/logs/ubuntu-focal/qemu.log:

root@rpi01:~# cat /var/snap/lxd/common/lxd/logs/ubuntu-focal/qemu.log
qemu-system-aarch64: Failed to read from slave.

FYI: I can boot ubuntu/focal but not ubuntu/focal/cloud:

$ lxc launch images:ubuntu/focal ubuntu-focal2 --vm --target=rpi01 --console
Creating ubuntu-focal2
Starting ubuntu-focal2
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.

Ubuntu 20.04.2 LTS distrobuilder-1ba1b7b0-ec4b-42f4-847e-f756e59b0bf4 ttyAMA0

distrobuilder-1ba1b7b0-ec4b-42f4-847e-f756e59b0bf4 login:

Interesting.

Can you remove all instances from your new test storage pool, to ensure any cached image volumes are removed, and then delete the focal cloud image?

Then launch a new VM on that storage pool so it downloads a fresh image, and then show me the output of ‘lxc image list’.

Thanks
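For reference, that sequence would be roughly the following (instance name is an example, <fingerprint> is a placeholder):

$ lxc delete -f ubuntu-focal                  # remove the test instance
$ lxc image delete <fingerprint>              # drop the cached image so the next launch downloads a fresh copy
$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm -s test1
$ lxc image list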

I have deleted my storage pool and recreated it:

$ lxc image ls
+-------+-------------+--------+-------------+--------------+------+------+-------------+
| ALIAS | FINGERPRINT | PUBLIC | DESCRIPTION | ARCHITECTURE | TYPE | SIZE | UPLOAD DATE |
+-------+-------------+--------+-------------+--------------+------+------+-------------+

$ lxc storage create test1 dir --target=rpi01
Storage pool test1 pending on member rpi01
$ lxc storage create test1 dir --target=rpi02
Storage pool test1 pending on member rpi02
$ lxc storage create test1 dir --target=rpi03
Storage pool test1 pending on member rpi03
$ lxc storage create test1 dir --target=rpi04
Storage pool test1 pending on member rpi04
$ lxc storage create test1 dir
Storage pool test1 created

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal-cloud --vm --target=rpi01 -s test1 --console

$ lxc image ls
+-------+--------------+--------+-------------------------------------+--------------+-----------------+----------+------------------------------+
| ALIAS | FINGERPRINT  | PUBLIC |             DESCRIPTION             | ARCHITECTURE |      TYPE       |   SIZE   |         UPLOAD DATE          |
+-------+--------------+--------+-------------------------------------+--------------+-----------------+----------+------------------------------+
|       | be1517ac26ad | no     | Ubuntu focal arm64 (20210328_08:04) | aarch64      | VIRTUAL-MACHINE | 261.13MB | Mar 29, 2021 at 9:31am (UTC) |
+-------+--------------+--------+-------------------------------------+--------------+-----------------+----------+------------------------------+

Could it be a cloud-init problem with this image? Can I clear a cache or something like that?
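For reference, the locally cached copy of the image can be removed so the next launch re-downloads it (fingerprint taken from the listing above):

$ lxc image delete be1517ac26ad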