How to debug VM boot on LXD?


Hi,

Since a power outage on my LXD cluster two days ago, I’m not able to boot existing or new VMs with the ubuntu/focal/cloud image.
I’m not sure whether this is due to the power outage or just the reboot, but after trying many things I have no solution.

When I boot an Ubuntu VM the console is just dark: no text, no error, nothing.
It works with other images like debian/10 (why? :man_shrugging: ).

Is there a way to enable debug output for the VM, something like lxc.loglevel?

Details of my installation

Hardware: 4x Raspberry Pi 4 8GB
OS: Ubuntu 20.04
LXD version: 4.12 (from snap)
VM storage pool: btrfs

Thanks,
Élie

You can try starting it with the --console flag to see the boot messages.

Perhaps the image volume that is the basis of both new and existing instances has been corrupted.

Can you launch VMs using a different image?
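For reference, a few commands that tend to be useful for this kind of debugging (the instance name vm-ubuntu is just an example):

$ lxc start vm-ubuntu --console     # attach to the VM console during boot
$ lxc info vm-ubuntu --show-log     # show the instance log kept by LXD
$ lxc monitor --pretty              # stream LXD daemon events/log messages while the instance starts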

Hi Thomas,
Thanks for your help

$ lxc start vm-ubuntu --console
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.

Yes, I can launch VMs using debian/10 without problems.

Debian 10 console
$ lxc console debian10
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
System BootOrder not found.  Initializing defaults.
Creating boot entry "Boot0008" with label "debian" for file "\EFI\debian\shimaa64.efi"
Loading Linux 4.19.0-14-arm64 ...
Loading initial ramdisk ...

Debian GNU/Linux 10 distrobuilder-47bec3f2-aaef-416a-9548-54435aad19ed ttyAMA0

distrobuilder-47bec3f2-aaef-416a-9548-54435aad19ed login:

Perhaps the image volume that is the basis of both new and existing instances has been corrupted.

I had deleted the ubuntu/focal/cloud image before recreating the new VMs.

What storage pool type are you using?

I’m using btrfs

$ lxc storage ls
+---------+-------------+--------+---------+---------+
|  NAME   | DESCRIPTION | DRIVER |  STATE  | USED BY |
+---------+-------------+--------+---------+---------+
| default |             | btrfs  | CREATED | 8       |
+---------+-------------+--------+---------+---------+

Do you see any messages in dmesg output about btrfs corruption (perhaps csum errors)?
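For reference, something like the following will surface any btrfs complaints; the mount point is a placeholder for wherever your pool lives:

$ dmesg -T | grep -iE 'btrfs|csum'
$ sudo btrfs scrub start -Bdr /path/to/pool-mount    # read-only scrub, verifies checksums without changing anything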

No, as you can see I only have these 6 messages (at VM start):

[Thu Mar 25 15:33:41 2021] br0: port 2(tap69c8c68e) entered blocking state
[Thu Mar 25 15:33:41 2021] br0: port 2(tap69c8c68e) entered disabled state
[Thu Mar 25 15:33:41 2021] device tap69c8c68e entered promiscuous mode
[Thu Mar 25 15:33:41 2021] audit: type=1400 audit(1616686420.917:82): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-hass_</var/snap/lxd/common/lxd>" pid=140313 comm="apparmor_parser"
[Thu Mar 25 15:33:42 2021] br0: port 2(tap69c8c68e) entered blocking state
[Thu Mar 25 15:33:42 2021] br0: port 2(tap69c8c68e) entered forwarding state

I was able to export the VM and mount the filesystem from the image; there is no corruption, I guess.
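For reference, the export and inspection were along these lines (file names are only examples, and the exact path of the root disk inside the tarball may differ):

$ lxc export vm-ubuntu vm-ubuntu-backup.tar.gz
$ tar -xzf vm-ubuntu-backup.tar.gz
$ qemu-img info <extracted-root-disk>    # placeholder path; the image can also be mounted read-only for inspection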

Oh I found this:

BTRFS error (device mmcblk0p3): qgroup generation mismatch, marked as inconsistent

I’ll move all my VMs and containers to another node and try to repair it.
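For reference, stopped instances can be relocated between cluster members with something like this (names are examples):

$ lxc stop vm-ubuntu
$ lxc move vm-ubuntu --target rpi02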

I still have errors on other nodes:

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi02 -c limits.cpu=4 -c limits.memory=4GB --console
Creating ubuntu-focal
Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.

(and then a black screen)

There are no BTRFS errors on my 3 other nodes.

Can you try creating a new BTRFS storage pool on the original node and launching a VM again? That way you can rule out any issues with the current storage pool. Are you using a loopback file for the storage pools or a directory on the host?
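For reference, a loop-backed btrfs pool can be created on a cluster roughly like this (pool name and size are examples; each member needs a pending entry before the final create):

$ lxc storage create test-btrfs btrfs size=10GB --target=rpi01
$ lxc storage create test-btrfs btrfs size=10GB --target=rpi02
$ lxc storage create test-btrfs btrfs size=10GB --target=rpi03
$ lxc storage create test-btrfs btrfs size=10GB --target=rpi04
$ lxc storage create test-btrfs btrfs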

I use a dedicated partition.
I created a new storage pool with a new profile and my VM still doesn’t boot up :cry:

$ lxc storage ls --format yaml
- config:
    source: /dev/mmcblk0p3
    volatile.initial_source: /dev/mmcblk0p3
  description: ""
  name: default
  driver: btrfs
  used_by:
  - /1.0/profiles/default
  status: Created
  locations:
  - rpi01
  - rpi03
  - rpi02
  - rpi04
- config:
    size: 5GB
    source: /var/snap/lxd/common/lxd/disks/test1.img
  description: ""
  name: test1
  driver: btrfs
  used_by:
  - /1.0/instances/ubuntu-focal
  - /1.0/profiles/test1
  status: Created
  locations:
  - rpi01
  - rpi03
  - rpi02
  - rpi04

After waiting for a while attached to the console, I can see this error:

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi01 --console -p test1
Creating ubuntu-focal

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.
[  242.666848] INFO: task sfdisk:338 blocked for more than 120 seconds.
[  242.675632]       Not tainted 5.4.0-70-generic #78-Ubuntu
[  242.681203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

PS: I solved my BTRFS problems; dmesg is now clean.

Can you launch new VMs of the same image in the new storage pool? I’m trying to work out whether the VM itself and the source image on the original pool are corrupt, or whether it’s something else on your system.

@stgraber any ideas about repairing a potentially damaged btrfs filesystem after power cut?

How did you solve the btrfs dmesg issue?

Your VM size looks a bit small, have you tried specifying a larger size, say 10GB?
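For reference, if it is the instance’s root disk rather than the pool that needs growing, the size can be overridden per instance with something like this (names and size are examples):

$ lxc init images:ubuntu/focal/cloud ubuntu-focal --vm -s test1
$ lxc config device override ubuntu-focal root size=10GB
$ lxc start ubuntu-focal --console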

Can you launch new VMs of the same image in the new storage pool? I’m trying to work out whether the VM itself and the source image on the original pool are corrupt, or whether it’s something else on your system.

I created it on the new storage pool in my previous message (the storage pool is attached to my profile):

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi01 --console -p test1
Creating ubuntu-focal

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0001 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.
[  242.666848] INFO: task sfdisk:338 blocked for more than 120 seconds.
[  242.675632]       Not tainted 5.4.0-70-generic #78-Ubuntu
[  242.681203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

How did you solve the btrfs dmesg issue?

I did this (twice):

# unmount the pool from inside the snap's mount namespace
nsenter -t $(pgrep daemon.start) -m -- umount /var/snap/lxd/common/lxd/storage-pools/default
# repair the filesystem using the btrfs tools shipped in the snap
nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs check --repair /dev/mmcblk0p3
# rescan/rebuild the qgroup (quota) accounting
nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs quota rescan /var/snap/lxd/common/lxd/storage-pools/default

Your VM size looks a bit small, have you tried specifying a larger size, say 10GB?

I recreated my pool with 10GB and it’s still the same:

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm --target=rpi01 -p test1 -s test1 --console
Creating ubuntu-focal

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu-focal
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.
$ lxc profile show test1
config: {}
description: ""
devices:
  root:
    path: /
    pool: test1
    type: disk
name: test1
used_by:
- /1.0/instances/ubuntu-focal

$ lxc storage info test1 --target=rpi01
info:
  description: ""
  driver: btrfs
  name: test1
  space used: 4.39GB
  total space: 10.00GB
used by:
  images:
  - 0b37fd286f09bfba70ffb9c9a7ba020a779945fd94dc3d29e326c4f3a47753ca
  instances:
  - ubuntu-focal
  profiles:
  - test1

I tested with a dir storage pool and it doesn’t work either :cry:

I found this message in /var/snap/lxd/common/lxd/logs/ubuntu-focal/qemu.log:

root@rpi01:~# cat /var/snap/lxd/common/lxd/logs/ubuntu-focal/qemu.log
qemu-system-aarch64: Failed to read from slave.

FYI: I can boot ubuntu/focal but not ubuntu/focal/cloud:

$ lxc launch images:ubuntu/focal ubuntu-focal2 --vm --target=rpi01 --console
Creating ubuntu-focal2
Starting ubuntu-focal2
To detach from the console, press: <ctrl>+a q
BdsDxe: loading Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
BdsDxe: starting Boot0002 "UEFI QEMU QEMU HARDDISK " from PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/Scsi(0x0,0x1)
error: file `/boot/' not found.
error: no such device: /.disk/info.
error: no such device: /.disk/mini-info.

Ubuntu 20.04.2 LTS distrobuilder-1ba1b7b0-ec4b-42f4-847e-f756e59b0bf4 ttyAMA0

distrobuilder-1ba1b7b0-ec4b-42f4-847e-f756e59b0bf4 login:

Interesting.

Can you remove all instances from your new test storage pool, to ensure any cached image volumes are removed, and then delete the focal cloud image?

Then launch a new VM on that storage pool so it downloads a fresh image, and then show me the output of ‘lxc image list’.

Thanks
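For reference, that sequence would be roughly the following (instance name is an example, <fingerprint> is a placeholder):

$ lxc delete -f ubuntu-focal                  # remove the test instance
$ lxc image delete <fingerprint>              # drop the cached image so the next launch downloads a fresh copy
$ lxc launch images:ubuntu/focal/cloud ubuntu-focal --vm -s test1
$ lxc image list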

I have deleted my storage pool and recreated it:

$ lxc image ls
+-------+-------------+--------+-------------+--------------+------+------+-------------+
| ALIAS | FINGERPRINT | PUBLIC | DESCRIPTION | ARCHITECTURE | TYPE | SIZE | UPLOAD DATE |
+-------+-------------+--------+-------------+--------------+------+------+-------------+

$ lxc storage create test1 dir --target=rpi01
Storage pool test1 pending on member rpi01
$ lxc storage create test1 dir --target=rpi02
Storage pool test1 pending on member rpi02
$ lxc storage create test1 dir --target=rpi03
Storage pool test1 pending on member rpi03
$ lxc storage create test1 dir --target=rpi04
Storage pool test1 pending on member rpi04
$ lxc storage create test1 dir
Storage pool test1 created

$ lxc launch images:ubuntu/focal/cloud ubuntu-focal-cloud --vm --target=rpi01 -s test1 --console

$ lxc image ls
+-------+--------------+--------+-------------------------------------+--------------+-----------------+----------+------------------------------+
| ALIAS | FINGERPRINT  | PUBLIC |             DESCRIPTION             | ARCHITECTURE |      TYPE       |   SIZE   |         UPLOAD DATE          |
+-------+--------------+--------+-------------------------------------+--------------+-----------------+----------+------------------------------+
|       | be1517ac26ad | no     | Ubuntu focal arm64 (20210328_08:04) | aarch64      | VIRTUAL-MACHINE | 261.13MB | Mar 29, 2021 at 9:31am (UTC) |
+-------+--------------+--------+-------------------------------------+--------------+-----------------+----------+------------------------------+

Could it be a cloud-init problem with this image? Can I clear a cache or something like that?
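For reference, the locally cached copy of the image can be removed so the next launch re-downloads it (fingerprint taken from the listing above):

$ lxc image delete be1517ac26ad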