"Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported"

Hi there,

The issue

After making some changes to my LXD setup today, a VM has started misbehaving. When trying to start it I get the following:

Error: Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/21902/bin/qemu-system-x86_64 -S -name mediastack -uuid 1ef955a7-dd26-4c11-9932-6626651f4dd9 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/mediastack/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/mediastack/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/mediastack/qemu.pid -D /var/snap/lxd/common/lxd/logs/mediastack/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/1 (label console)
: Process exited with non-zero value 1
Try `lxc info --show-log mediastack` for more info

Running lxc info --show-log mediastack as suggested tells me there is no log file. Looking in /var/log/syslog I noticed the following:

Feb  7 16:48:58 aphrodite lxd.daemon[1049302]: t=2022-02-07T16:48:58+0000 lvl=warn msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=mediastack instanceType=virtual-machine project=default
Feb  7 16:49:00 aphrodite kernel: [1209265.648306] audit: type=1400 audit(1644252540.323:356): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd-mediastack_</var/snap/lxd/common/lxd>" pid=1988729 comm="apparmor_parser"

The first line in particular seems odd to me; as far as I know, I’m not doing any stateful migrations.

The reason, maybe

The LXD host used to be a separate node, but as of today it is part of an LXD cluster. This was done as follows (a command sketch follows the list):

  • Export all instances with lxc export --instance-only <name>
  • Reinstall LXD snap
  • Run lxd init and join the cluster
  • Reimport the exported instances
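For reference, a rough sketch of those steps as shell commands (the instance name is an example; the lxd init answers depend on the cluster being joined):

# Sketch of the migration steps above; repeat export/import per instance
lxc export mediastack mediastack.tar.gz --instance-only
sudo snap remove lxd --purge
sudo snap install lxd
sudo lxd init    # answer the clustering questions to join the existing cluster
lxc import mediastack.tar.gz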

Environment

  • lxc --version: 4.20
  • uname -a: Linux aphrodite 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Storage: ZFS on a mergerfs ext4 filesystem
  • Networking: bridge

Everything appeared fine at first; however, that is no longer the case.

I’m honestly not sure what would cause the issue, or how to solve it. Hopefully someone here does.

Thanks!

Can you show zfs list -t all?

Of course:

NAME                                             USED  AVAIL     REFER  MOUNTPOINT
storage-pool                                    22.3G  12.1M       24K  none
storage-pool/containers                         19.3G  12.1M       24K  none
storage-pool/containers/jellyfin                10.9G  12.1M     10.9G  none
storage-pool/containers/kc-server               8.36G  12.1M     8.36G  none
storage-pool/custom                               24K  12.1M       24K  none
storage-pool/deleted                             120K  12.1M       24K  none
storage-pool/deleted/containers                   24K  12.1M       24K  none
storage-pool/deleted/custom                       24K  12.1M       24K  none
storage-pool/deleted/images                       24K  12.1M       24K  none
storage-pool/deleted/virtual-machines             24K  12.1M       24K  none
storage-pool/images                               24K  12.1M       24K  none
storage-pool/virtual-machines                   2.99G  12.1M       24K  none
storage-pool/virtual-machines/mediastack        5.51M  12.1M     5.51M  none
storage-pool/virtual-machines/mediastack.block  2.98G  12.1M     2.98G  -

Note that the pool was auto-created by LXD when the node joined the cluster.

And mediastack is the one that’s got the issue?
If so, can you show the output of lxc config show --expanded mediastack?

And it may also be useful for you to run lxc monitor --pretty in a terminal while running lxc start mediastack in another to get more debug output.

Correct, mediastack is the one with issues. I’ve updated to 4.22 in the meantime as well.

Here’s the config:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20211228_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20211228_07:42"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.cpu: "4"
  limits.memory: 4096MB
  migration.stateful: "false"
  user.user-data: |
    #cloud-config
    package_upgrade: true
    packages:
    - openssh-server
    - curl
    - wget
    - nano
    - htop
    phone_home:
      url: http://10.10.0.1:4040/phone-home
      post:
      - hostname
      tries: 10
  volatile.base_image: 3909be5d5c59409c001c40805c86bcb29ac787e10618a3c10ddfd425300d7adb
  volatile.eth0.hwaddr: 00:16:3e:a1:cb:60
  volatile.last_state.power: RUNNING
  volatile.uuid: 1ef955a7-dd26-4c11-9932-6626651f4dd9
  volatile.vsock_id: "125"
devices:
  eth0:
    nictype: bridged
    parent: br1
    type: nic
  root:
    path: /
    pool: storage-pool
    size: 10GB
    size.state: 5GB
    type: disk
ephemeral: false
profiles:
- default
- cloud-init
- limits.medium
stateful: false
description: ""

The fields config.migration.stateful and devices.root.size.state were added to get it to work with 4.22; otherwise it would fail with Error: Stateful start requires that the instance limits.memory is less than size.state on the root disk device (note that passing --stateless does not fix this).
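For reference, those two fields were added with roughly the following commands (a sketch; values mirror the config above):

# Sketch: add the two fields mentioned above
lxc config set mediastack migration.stateful false
lxc config device set mediastack root size.state 5GB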

The logs from lxc monitor --pretty:

DBUG[02-07|18:39:17] New task Operation: 11446d3a-2052-43dd-a4b2-c1ee8440fc90 
DBUG[02-07|18:39:17] Started task operation: 11446d3a-2052-43dd-a4b2-c1ee8440fc90 
INFO[02-07|18:39:17] ID: 11446d3a-2052-43dd-a4b2-c1ee8440fc90, Class: task, Description: Starting instance CreatedAt=2022-02-07T18:39:17+0000 UpdatedAt=2022-02-07T18:39:17+0000 Status=Pending StatusCode=Pending Resources=map[instances:[/1.0/instances/mediastack]] Metadata=map[] MayCancel=false Err= Location=aphrodite
INFO[02-07|18:39:17] ID: 11446d3a-2052-43dd-a4b2-c1ee8440fc90, Class: task, Description: Starting instance CreatedAt=2022-02-07T18:39:17+0000 UpdatedAt=2022-02-07T18:39:17+0000 Status=Pending StatusCode=Pending Resources=map[instances:[/1.0/instances/mediastack]] Metadata=map[] MayCancel=false Err= Location=aphrodite
DBUG[02-07|18:39:17] Instance operation lock created          reusable=false action=start instance=mediastack project=default
DBUG[02-07|18:39:17] Start started                            instanceType=virtual-machine project=default stateful=false instance=mediastack
INFO[02-07|18:39:17] ID: 11446d3a-2052-43dd-a4b2-c1ee8440fc90, Class: task, Description: Starting instance CreatedAt=2022-02-07T18:39:17+0000 UpdatedAt=2022-02-07T18:39:17+0000 Status=Running StatusCode=Running Resources=map[instances:[/1.0/instances/mediastack]] Metadata=map[] MayCancel=false Err= Location=aphrodite
INFO[02-07|18:39:17] ID: 11446d3a-2052-43dd-a4b2-c1ee8440fc90, Class: task, Description: Starting instance CreatedAt=2022-02-07T18:39:17+0000 UpdatedAt=2022-02-07T18:39:17+0000 Status=Running StatusCode=Running Resources=map[instances:[/1.0/instances/mediastack]] Metadata=map[] MayCancel=false Err= Location=aphrodite
DBUG[02-07|18:39:17] Handling API request                     ip=@ method=GET protocol=unix url=/1.0/operations/11446d3a-2052-43dd-a4b2-c1ee8440fc90 username=tobias
DBUG[02-07|18:39:17] MountInstance started                    driver=zfs instance=mediastack pool=storage-pool project=default
DBUG[02-07|18:39:17] MountInstance finished                   driver=zfs instance=mediastack pool=storage-pool project=default
DBUG[02-07|18:39:17] Skipping lxd-agent install as unchanged  project=default srcPath=/snap/lxd/22340/bin/lxd-agent installPath=/var/snap/lxd/common/lxd/virtual-machines/mediastack/config/lxd-agent instance=mediastack instanceType=virtual-machine
DBUG[02-07|18:39:17] MountInstance started                    driver=zfs instance=mediastack pool=storage-pool project=default
DBUG[02-07|18:39:17] MountInstance finished                   pool=storage-pool project=default driver=zfs instance=mediastack
DBUG[02-07|18:39:17] UnmountInstance started                  project=default driver=zfs instance=mediastack pool=storage-pool
DBUG[02-07|18:39:17] UnmountInstance finished                 driver=zfs instance=mediastack pool=storage-pool project=default
DBUG[02-07|18:39:17] Skipping unmount as in use               volName=mediastack driver=zfs pool=storage-pool refCount=1
DBUG[02-07|18:39:18] Starting device                          device=eth0 instance=mediastack instanceType=virtual-machine project=default type=nic
DBUG[02-07|18:39:18] Database error: api.StatusError{status:404, msg:"Network not found"} 
DBUG[02-07|18:39:18] Scheduler: network: tapd201eb3e has been added: updating network priorities 
DBUG[02-07|18:39:18] Starting device                          device=root instance=mediastack instanceType=virtual-machine project=default type=disk
DBUG[02-07|18:39:19] UpdateInstanceBackupFile started         project=default driver=zfs instance=mediastack pool=storage-pool
DBUG[02-07|18:39:19] Skipping unmount as in use               driver=zfs pool=storage-pool refCount=1 volName=mediastack
DBUG[02-07|18:39:19] UpdateInstanceBackupFile finished        pool=storage-pool project=default driver=zfs instance=mediastack
DBUG[02-07|18:39:19] Instance operation lock finished         action=start err="Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/22340/bin/qemu-system-x86_64 -S -name mediastack -uuid 1ef955a7-dd26-4c11-9932-6626651f4dd9 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/mediastack/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/mediastack/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/mediastack/qemu.pid -D /var/snap/lxd/common/lxd/logs/mediastack/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/1 (label console)\n: Process exited with non-zero value 1" instance=mediastack project=default reusable=false
DBUG[02-07|18:39:19] Stopping device                          device=root instance=mediastack instanceType=virtual-machine project=default type=disk
DBUG[02-07|18:39:19] Stopping device                          instanceType=virtual-machine project=default type=nic device=eth0 instance=mediastack
DBUG[02-07|18:39:19] Database error: api.StatusError{status:404, msg:"Network not found"} 
DBUG[02-07|18:39:20] Clearing instance firewall static filters parent=br1 project=default dev=eth0 host_name=tapd201eb3e hwaddr=00:16:3e:a1:cb:60 instance=mediastack ipv4=0.0.0.0 ipv6=::
DBUG[02-07|18:39:20] UnmountInstance started                  instance=mediastack pool=storage-pool project=default driver=zfs
DBUG[02-07|18:39:30] Start finished                           instance=mediastack instanceType=virtual-machine project=default stateful=false
DBUG[02-07|18:39:30] UnmountInstance finished                 driver=zfs instance=mediastack pool=storage-pool project=default
DBUG[02-07|18:39:30] Failure for task operation: 11446d3a-2052-43dd-a4b2-c1ee8440fc90: Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/22340/bin/qemu-system-x86_64 -S -name mediastack -uuid 1ef955a7-dd26-4c11-9932-6626651f4dd9 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/mediastack/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/mediastack/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/mediastack/qemu.pid -D /var/snap/lxd/common/lxd/logs/mediastack/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/1 (label console)
: Process exited with non-zero value 1 
INFO[02-07|18:39:30] ID: 11446d3a-2052-43dd-a4b2-c1ee8440fc90, Class: task, Description: Starting instance CreatedAt=2022-02-07T18:39:17+0000 UpdatedAt=2022-02-07T18:39:17+0000 Status=Failure StatusCode=Failure Resources=map[instances:[/1.0/instances/mediastack]] Metadata=map[] MayCancel=false Err="Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/22340/bin/qemu-system-x86_64 -S -name mediastack -uuid 1ef955a7-dd26-4c11-9932-6626651f4dd9 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/mediastack/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/mediastack/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/mediastack/qemu.pid -D /var/snap/lxd/common/lxd/logs/mediastack/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/1 (label console)\n: Process exited with non-zero value 1" Location=aphrodite
INFO[02-07|18:39:30] ID: 11446d3a-2052-43dd-a4b2-c1ee8440fc90, Class: task, Description: Starting instance CreatedAt=2022-02-07T18:39:17+0000 UpdatedAt=2022-02-07T18:39:17+0000 Status=Failure StatusCode=Failure Resources=map[instances:[/1.0/instances/mediastack]] Metadata=map[] MayCancel=false Err="Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/22340/bin/qemu-system-x86_64 -S -name mediastack -uuid 1ef955a7-dd26-4c11-9932-6626651f4dd9 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/mediastack/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/mediastack/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/mediastack/qemu.pid -D /var/snap/lxd/common/lxd/logs/mediastack/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/1 (label console)\n: Process exited with non-zero value 1" Location=aphrodite

The only thing I noticed is the repeated 404 ‘Network not found’ error. I’m not sure what causes it, as the config lists br1 as the NIC’s parent, which is present (lxc network list):

+------+----------+---------+------+------+-------------+---------+-------+
| NAME |   TYPE   | MANAGED | IPV4 | IPV6 | DESCRIPTION | USED BY | STATE |
+------+----------+---------+------+------+-------------+---------+-------+
| br1  | bridge   | NO      |      |      |             | 25      |       |
+------+----------+---------+------+------+-------------+---------+-------+
| eno1 | physical | NO      |      |      |             | 0       |       |
+------+----------+---------+------+------+-------------+---------+-------+
| eno2 | physical | NO      |      |      |             | 0       |       |
+------+----------+---------+------+------+-------------+---------+-------+
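For completeness, a couple of ways to cross-check the bridge from both LXD’s and the kernel’s side (a sketch using the names above):

# Sketch: confirm br1 exists as both LXD and the kernel see it
lxc network show br1
ip link show br1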

What do you have in /var/snap/lxd/common/lxd/logs/mediastack/?

I’ll have to get back to you on that later; I’m currently unable to attempt starting the VM due to some clustering issues I’m experiencing.

With the clustering fixed, I was able to continue this endeavor. The node mediastack ran on was reset, so the mediastack VM was reimported from its export. Interestingly enough, it ran into the same issue once more.

total 24
-rw-r----- 1 root root 5100 Feb  8 11:32 qemu.conf
-rw-r--r-- 1 root root   53 Feb  8 11:32 qemu.early.log
-rw-r--r-- 1 root root  136 Feb  8 11:32 qemu.log
-rw-r--r-- 1 root root  252 Feb  8 08:54 qemu.log.old
srwxr-x--- 1 root root    0 Feb  8 11:32 qemu.monitor
srwxr-x--- 1 root root    0 Feb  8 11:32 qemu.spice
-rw-r--r-- 1 root root  199 Feb  8 11:32 virtiofsd.pid

Looking in qemu.log:

qemu-system-x86_64:/var/snap/lxd/common/lxd/logs/mediastack/qemu.conf:128: vhost-vsock: unable to set guest cid: Address already in use

And indeed, looking at the output of ps aux | grep qemu | grep mediastack:

lxd       293627  8.9 10.2 5027204 3371060 ?     Sl   08:54  14:19 /snap/lxd/22340/bin/qemu-system-x86_64 -S -name mediastack -uuid 1ef955a7-dd26-4c11-9932-6626651f4dd9 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/mediastack/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/mediastack/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/mediastack/qemu.pid -D /var/snap/lxd/common/lxd/logs/mediastack/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd

That seems odd to me, as LXD reports the VM as stopped. Killing the QEMU process, followed by lxc start mediastack, works. However, I don’t know how long this will keep working. Would you have any idea what causes this ‘desync’ between QEMU and LXD?
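For reference, a sketch of the manual cleanup described above, plus a way to see which processes hold the vsock device (assumes lsof is installed; the pgrep pattern matches the command line shown earlier):

# Sketch: inspect vsock usage, then kill the stale QEMU process and restart
sudo lsof /dev/vhost-vsock                               # each open handle holds a guest CID
pgrep -af 'qemu-system-x86_64 .* -name mediastack'       # find the stale process
sudo kill "$(pgrep -f 'qemu-system-x86_64 .* -name mediastack')"
lxc start mediastack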

Can you show the latest:
lxc config show --expanded mediastack

Are you using any other technologies that may be using vsock on your host, or are you running nested VMs by any chance?

Of course, the latest config is as follows:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20211228_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20211228_07:42"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.cpu: "4"
  limits.memory: 4096MB
  user.user-data: |
    #cloud-config
    package_upgrade: true
    packages:
    - openssh-server
    - curl
    - wget
    - nano
    - htop
    users:
    - name: tobias
      sudo: ALL=(ALL) NOPASSWD:ALL
      shell: /bin/bash
      lock_passwd: false
      ssh_authorized_keys:
      - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJvgqqT0ZmhYxcq1ZbOkGG53Pe5GDlWNXBdiZc8zodLQ t.debruijn@array21.dev
    phone_home:
      url: http://10.10.0.1:4040/phone-home
      post:
      - hostname
      tries: 10
  volatile.base_image: 3909be5d5c59409c001c40805c86bcb29ac787e10618a3c10ddfd425300d7adb
  volatile.eth0.host_name: tap01d55bf2
  volatile.eth0.hwaddr: 00:16:3e:a1:cb:60
  volatile.last_state.power: RUNNING
  volatile.uuid: 1ef955a7-dd26-4c11-9932-6626651f4dd9
  volatile.vsock_id: "133"
devices:
  eth0:
    nictype: bridged
    parent: br1
    type: nic
  root:
    path: /
    pool: storage-pool
    size: 10GB
    type: disk
ephemeral: false
profiles:
- default
- cloud-init
- limits.medium
stateful: false
description: ""

To my knowledge, LXD is the only thing using vsock. No nested VMs either. LXD is the only thing running on the host, really; it was installed with snap install lxd --channel 4.22/stable.

Can you run lxc config get <instance> volatile.vsock_id for each of your VMs and check that no two share the same ID?
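For example, a quick loop over all VMs to spot duplicates (a sketch; the type=virtual-machine filter works on recent LXD):

# Sketch: print volatile.vsock_id for every VM
for vm in $(lxc list type=virtual-machine --format csv -c n); do
    echo "$vm: $(lxc config get "$vm" volatile.vsock_id)"
done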