Unable to start/launch VMs

On a Focal server with LXD 4.0 LTS I’m suddenly unable to start pre-existing VMs, and I can’t launch new VMs either; containers are fine. Running lxc info --show-log on the VM tells me that the log is empty, similar to “Fail to start VM / saving config failed (disk full)”, but in my case there are no storage problems. Starting with debug I get:

DBUG[11-29|11:21:36] Got response struct from LXD 
DBUG[11-29|11:21:36] 
        {
                "id": "13076d4b-e948-4c4e-9a6f-a71de7532172",
                "class": "task",
                "description": "Starting instance",
                "created_at": "2022-11-29T11:21:36.991844668Z",
                "updated_at": "2022-11-29T11:21:36.991844668Z",
                "status": "Running",
                "status_code": 103,
                "resources": {
                        "instances": [
                                "/1.0/instances/docker-host"
                        ]
                },
                "metadata": null,
                "may_cancel": false,
                "err": "",
                "location": "none"
        } 
Error: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 fd=5 -- /snap/lxd/23991/bin/qemu-system-x86_64 -S -name docker-host -uuid 3f9e8cd9-186c-44db-b58f-331485b00a96 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/docker-host/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/docker-host/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/docker-host/qemu.pid -D /var/snap/lxd/common/lxd/logs/docker-host/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : Process exited with non-zero value -1
Try `lxc info --show-log docker-host` for more info
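For reference, the output above is what I get when starting the VM with the client’s debug flag enabled, something like:

$ lxc start docker-host --debug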

Empty log:

$ lxc info --show-log docker-host
Name: docker-host
Location: none
Remote: unix://
Architecture: x86_64
Created: 2022/03/17 19:23 UTC
Status: Stopped
Type: virtual-machine
Profiles: default, default-net, docker-net-80, docker-net-81, docker-net-82, docker-net-83, storage-net
Snapshots:
  snap121 (taken at 2022/11/25 03:14 UTC) (expires at 2022/11/30 03:14 UTC) (stateless)
  snap122 (taken at 2022/11/26 03:14 UTC) (expires at 2022/12/01 03:14 UTC) (stateless)
  snap123 (taken at 2022/11/27 03:14 UTC) (expires at 2022/12/02 03:14 UTC) (stateless)
  snap124 (taken at 2022/11/28 03:14 UTC) (expires at 2022/12/03 03:14 UTC) (stateless)
  snap125 (taken at 2022/11/29 03:14 UTC) (expires at 2022/12/04 03:14 UTC) (stateless)

Log:


And storage:

$ lxc storage info default
info:
  description: ""
  driver: zfs
  name: default
  space used: 348.30GiB
  total space: 5.07TiB
used by:
  images:
  - 2cb17b544f303fefdfae6bcbe70cece933b924019b8d25f3236ce75c2dd6c301
  - 672583a05778fdd42770408068c5f932d8383539b4cdf20f11b92a0bf3b24d45
  instances:
  - docker-host
  profiles:
  - default

Any ideas? I suspect this is a QEMU problem caused by a kernel update.

$ uname -a
Linux plopp 5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
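If a kernel update is the culprit, one quick sanity check is to compare the running kernel against the installed kernel packages, for example:

$ dpkg -l 'linux-image-*' | grep ^ii
$ uname -r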

What LXD version are you running?
Please show the lxc info output, thanks.

lxd     4.0.9-eb5e237  23991  4.0/stable/…   canonical✓  -
$ lxc info
config:
  core.https_address: 192.168.20.15:8443
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- resources_system
- usedby_consistency
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- storage_rsync_compression
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_state_vlan
- gpu_sriov
- migration_stateful
- disk_state_quota
- storage_ceph_features
- gpu_mig
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- network_counters_errors_dropped
- image_source_project
- database_leader
- instance_all_projects
- ceph_rbd_du
- qemu_metrics
- gpu_mig_uuid
- event_project
- instance_allow_inconsistent_copy
- image_restrictions
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.20.15:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
  certificate_fingerprint: 
  driver: lxc | qemu
  driver_version: 4.0.12 | 7.1.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-132-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: plopp
  server_pid: 3116
  server_version: 4.0.9
  storage: zfs | lvm
  storage_version: 0.8.3-1ubuntu12.14 | 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30)
    / 4.41.0
  storage_supported_drivers:
  - name: zfs
    version: 0.8.3-1ubuntu12.14
    remote: false
  - name: ceph
    version: 15.2.16
    remote: true
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.16
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.41.0
    remote: false

Do you see the same issue if you switch to the current LXD 5.0 LTS series?

sudo snap refresh lxd --channel=5.0/stable
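You can confirm which channel the snap is tracking afterwards with something like:

$ snap list lxd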

I haven’t tried that yet. I’m trying to find some sort of error message before I resort to stabs in the dark.

We’ve been improving the error messages in LXD 5.0 so it may help identify the issue (if it still occurs).

Also, the normal places to look for issues would be sudo journalctl -b and sudo dmesg; look for “DENIED” occurrences, as it may be AppArmor.
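Something like this would surface them:

$ sudo journalctl -b | grep -i denied
$ sudo dmesg | grep -i denied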

Trying to launch a new VM, I see nothing too alarming:

Nov 29 13:03:47 plopp systemd[16336]: Started snap.lxd.lxc.87aff40a-216d-42ff-b608-1242f46b1863.scope.
Nov 29 13:03:50 plopp zed: eid=99 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=100 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=101 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=102 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp multipathd[1667]: zd0: failed to get udev uid: Invalid argument
Nov 29 13:03:50 plopp multipathd[1667]: zd0: failed to get unknown uid: Invalid argument
Nov 29 13:03:50 plopp zed: eid=103 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=104 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=105 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=106 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=107 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=108 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=109 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=110 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:50 plopp zed: eid=111 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp multipathd[1667]: zd0: path already removed
Nov 29 13:03:51 plopp zed: eid=112 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp zed: eid=113 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp zed: eid=114 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp zed: eid=115 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp zed: eid=116 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp zed: eid=117 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:51 plopp kernel: [ 8081.938266] audit: type=1400 audit(1669727031.912:299): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-ubuntu-test_</var/snap/lxd/common/lxd>" pid=537002 comm="apparmor_parser"
Nov 29 13:03:51 plopp kernel: [ 8082.013431] audit: type=1326 audit(1669727031.988:300): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=537005 comm="qemu-system-x86" exe="/snap/lxd/23991/bin/qemu-system-x86_64" sig=31 arch=c000003e syscall=56 compat=0 ip=0x7fb6a61b9f3f code=0x80000000
Nov 29 13:03:52 plopp multipathd[1667]: zd0: failed to get udev uid: Invalid argument
Nov 29 13:03:52 plopp multipathd[1667]: zd0: failed to get unknown uid: Invalid argument
Nov 29 13:03:52 plopp systemd[16336]: snap.lxd.lxc.87aff40a-216d-42ff-b608-1242f46b1863.scope: Succeeded.
Nov 29 13:03:52 plopp zed: eid=118 class=history_event pool_guid=0x2D40D0DB19BFF7F7  
Nov 29 13:03:53 plopp multipathd[1667]: zd0: path already removed

Can you launch a new VM and container on the same storage pool?
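For example (the image alias and instance names here are just placeholders):

$ lxc launch ubuntu:20.04 test-vm --vm -s default
$ lxc launch ubuntu:20.04 test-c1 -s default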

The machine has one storage pool. Launching a new VM is shown above and fails. Launching a container succeeds.

If you create a dir pool using lxc storage create dir dir, can you then launch a VM on it by appending the -s dir flag to the lxc launch command?
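That would look something like this (image and instance name are just examples):

$ lxc storage create dir dir
$ lxc launch ubuntu:20.04 test-vm2 --vm -s dir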

After updating to 5.0 the problem is gone. The VM starts successfully.

Ah excellent. It contains a new version of QEMU too, so perhaps that fixed it.

It’s frustrating having no idea what went wrong. I don’t have the resources to try to reproduce this elsewhere, so I’ll just have to live with it. Thanks for your help @tomp
