lxc restore reports this error: "Error: In use" (LXD 4.0.7)

Hello,

Required information

  • Distribution: Ubuntu
  • Distribution version: 20.04.2 LTS
  • The output of “lxc info” or if that fails:
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- resources_system
- usedby_consistency
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- storage_rsync_compression
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_state_vlan
- gpu_sriov
- migration_stateful
- disk_state_quota
- storage_ceph_features
- gpu_mig
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.XX.XX.XX:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB9TCCAXugAwIBAgIRAJTLnHpytJsUJNw0OX1s23swCgYIKoZIzj0EAwMwNTEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBzcnYt
    MzA0MB4XDTE5MTAwMzE0MzEyNFoXDTI5MDkzMDE0MzEyNFowNTEcMBoGA1UEChMT
    bGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBzcnYtMzA0MHYwEAYH
    KoZIzj0CAQYFK4EEACIDYgAEs/Yanxasioy7YGzZIJRNAP0+TQBuX5BAF/UNI+o0
    abNC1S1BOwEK2g4P0hRA5NxOFPnXhHfJNoaLyXoO30j7IdQOP+Pevf347bjV1gFM
    CipXsIFuvIaoV9PNYT7F44Jyo08wTTAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAww
    CgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAYBgNVHREEETAPggdzcnYtMzA0hwQK
    JFpNMAoGCCqGSM49BAMDA2gAMGUCMQDmmjFku+mDKOQLDDNY6chVjsB28wDsbtf3
    baT2zbQbkFTzzkI9qgPi6n0KbXjdjz4CMBe+RnIQF853bwaAsWXCOMivR9qsX+oZ
    GEat/mPOEGoTgZEfLf1gpJLQVZxrHbDxLA==
    -----END CERTIFICATE-----
  certificate_fingerprint: 1830fb98b7c3920f0e3f3ad1762b02905d2c98418bee29fb3cf3d99eb60f5497
  driver: lxc | qemu
  driver_version: 4.0.10 | 5.2.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-74-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: srv-304
  server_pid: 7836
  server_version: 4.0.7
  storage: zfs
  storage_version: 0.8.3-1ubuntu12.7
  storage_supported_drivers:
  - name: btrfs
    version: 4.15.1
    remote: false
  - name: cephfs
    version: 12.2.13
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.02.176(2) (2017-11-03) / 1.02.145 (2017-11-03) / 4.41.0
    remote: false
  - name: zfs
    version: 0.8.3-1ubuntu12.7
    remote: false
  - name: ceph
    version: 12.2.13
    remote: true

Issue description

When I try to restore a container I get the error “Error: In use”. I’ve been doing this rollback for about two weeks for internal testing and I haven’t changed anything.

The problem seems to be strongly related to the upgrade from version 4.0.6 to version 4.0.7.

Steps to reproduce

  1. lxc snapshot container snapshot
  2. lxc restore container snapshot
  3. The restore returns “Error: In use” and the container remains stopped

Information to attach

  • Container log (lxc info NAME --show-log)
lxc info container --show-log
Name: container
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/06/17 15:44 UTC
Status: Stopped
Type: container
Profiles: net-vlan301, cpu-4-cores, mem-4GB
Snapshots:
  after_ansible_2 (taken at 2021/06/21 12:47 UTC) (stateless)

Log:

lxc container 20210722131513.485 WARN     conf - conf.c:lxc_map_ids:3389 - newuidmap binary is missing
lxc container 20210722131513.485 WARN     conf - conf.c:lxc_map_ids:3395 - newgidmap binary is missing
lxc container 20210722131513.487 WARN     conf - conf.c:lxc_map_ids:3389 - newuidmap binary is missing
lxc container 20210722131513.487 WARN     conf - conf.c:lxc_map_ids:3395 - newgidmap binary is missing
lxc container 20210722131513.488 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1296 - No such file or directory - Failed to fchownat(43, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc container 20210722131643.667 WARN     conf - conf.c:lxc_map_ids:3389 - newuidmap binary is missing
lxc container 20210722131643.668 WARN     conf - conf.c:lxc_map_ids:3395 - newgidmap binary is missing
root@srv-XXX:~ # grep containers/container /proc/*/mountinfo
/proc/7247/mountinfo:624 1580 0:93 / /var/snap/lxd/common/lxd/storage-pools/default/containers/container rw,noatime shared:342 - zfs nvme/lxd/containers/container rw,xattr,posixacl
/proc/7836/mountinfo:624 1580 0:93 / /var/snap/lxd/common/lxd/storage-pools/default/containers/container rw,noatime shared:342 - zfs nvme/lxd/containers/container rw,xattr,posixacl
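
For reference, those PIDs can be mapped back to processes with a quick check (the PIDs are the ones from the grep output above; 7836 matches the server_pid reported by lxc info, i.e. the LXD daemon itself):

ps -o pid,comm -p 7247,7836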

Out of interest, does reloading LXD solve it?

What is the command to reload LXD with a snap installation?

sudo systemctl reload snap.lxd.daemon

Also, you haven’t said which storage pool you’re using; please can you post the output of:

lxc storage show default

And

lxc config show container --expanded

I’m assuming zfs from the mount output though.

I’ve not been able to reproduce immediately:

snap install lxd --channel=4.0/stable
lxd init --auto
lxc storage create zfs zfs
lxc launch images:ubuntu/focal c1 -s zfs
lxc snapshot c1
lxc restore c1 snap0 # container is running 
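
To double-check the state after the restore, listing the instance is enough (c1 being the test instance from the commands above):

lxc list c1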

Also, can you give the output of:

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- mount | grep container

After reloading LXD and restarting the container.

Thanks

OK, the problem seems to be solved by reloading LXD:

root@srv-XXX:~ # systemctl reload snap.lxd.daemon
root@srv-XXX:~ # 
root@srv-XXX:~ # lxc restore container after_ansible_2 
root@srv-XXX:~ # 

We can see that the mount points are no longer present either:

root@srv-XXX:~ # grep containers/container /proc/*/mountinfo
grep: /proc/157278/mountinfo: No such file or directory
grep: /proc/157280/mountinfo: No such file or directory
grep: /proc/157281/mountinfo: No such file or directory
root@srv-XXX:~ # nsenter --mount=/run/snapd/ns/lxd.mnt -- mount | grep container
root@srv-XXX:~ # 

root@srv-XXX:~ # lxc storage show default
config:
  source: nvme/lxd
  volatile.initial_source: nvme/lxd
  zfs.pool_name: nvme/lxd
description: ""
name: default
driver: zfs
used_by:
- /1.0/images/c53a85d3e035eb0bc6f3e54c92fedcbe823b112026631d0c11df9211e43d3215
- /1.0/instances/container
...
- /1.0/profiles/default
status: Created
locations:
- none
root@srv-XXX:~ # lxc config show container --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20210616_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20210616_07:42"
  image.type: squashfs
  image.variant: default
  limits.cpu: "4"
  limits.memory: 4GB
  security.nesting: "true"
  volatile.base_image: 2ecdb74f3277ad3c0385fe52a6f869b04d2eb98e46fa0084fbefdca45d95f08e
  volatile.eth301.hwaddr: 00:16:3e:58:53:86
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.uuid: b1c705e1-2d6f-4848-96d1-aa644f1b78f6
devices:
  eth301:
    name: eth301
    nictype: bridged
    parent: br_vlan301
    type: nic
  root:
    path: /
    pool: default
    size: 10GB
    type: disk
ephemeral: false
profiles:
- net-vlan301
- cpu-4-cores
- mem-4GB
stateful: false
description: ""

Great.

I would suggest enabling debug logging temporarily using:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

If you see it reoccur, then send over the contents of /var/snap/lxd/common/lxd/logs/lxd.log to help with debugging the order of actions that causes it.
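
If the log grows large, a rough grep (nothing LXD-specific) can pull out just the relevant lines before sending it over:

grep -n 'In use' /var/snap/lxd/common/lxd/logs/lxd.log | tail -n 50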

Thanks

I am coming back to you because the problem was still present on another server. I did a first reload of LXD and it still didn’t work. I then enabled debug mode along with the reload and the problem was solved immediately.

I will keep these commands handy in case the problem occurs again. Thanks a lot for the quick help.

Thanks. We would like to know the order of the steps that reproduce it, so if you see it occur again that information would be useful to help us fix it.


I have the exact same problem on LXD 5.10 compiled from source.
The debug option is enabled, and I have a 1.9 GB log file; are you still interested in these logs to help understand/debug this problem?

Yes that would be useful to see how it got into that state.

Do you have the commands you ran too?

Here is a link to the log file. The first occurrence of the message appears on line 2296594 (yes, it is a very large file, because the error does not appear very quickly).

time="2023-03-07T08:27:46Z" level=debug msg="Instance operation lock finished" action=restore err="In use" instance=ynh-appci-bullseye-amd64-stable-test-6 project=default reusable=false

For context, we use LXD in an internal CI/CD pipeline, which runs application installations and tries different actions on them before restoring the container to run the next ones.

The error occurs after stopping a container with lxc stop $container, and running lxc restore $container snap0 (in this case “ynh-appci-bullseye-amd64-stable-test-6”).
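
A minimal sketch of that sequence (the variable and the final start are illustrative; the instance and snapshot names are the ones from the log line above):

container=ynh-appci-bullseye-amd64-stable-test-6
lxc stop "$container"
lxc restore "$container" snap0   # the step that intermittently fails with "Error: In use"
lxc start "$container"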

We have had this error for quite a few versions already, and indeed a reload of the LXD service does clear it, but if we can help find a proper solution, even better.
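
As a stop-gap, the retry we could script looks roughly like this (illustrative only; the unit name is the snap one mentioned earlier in the thread and would be different for our source build):

if ! lxc restore "$container" snap0; then
    # restore failed (e.g. with "Error: In use"): reload the daemon and try once more
    sudo systemctl reload snap.lxd.daemon
    lxc restore "$container" snap0
fi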

What storage driver is the instance on? It sounds like restoring from the snapshot is taking a long time.

We are using dir

With the dir driver, snapshot restores can take a long time, since there are no filesystem-level snapshots and the data has to be copied back.
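
If it helps to confirm that, simply timing the restore shows how long it takes (nothing LXD-specific, just the shell's time builtin):

time lxc restore "$container" snap0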

This PR may help in this situation, by keeping the operation lock held until the restore has finished:

Thank you for your time. I will wait for this PR to be merged, and I will come back if I still have the problem.