lxc restore reports this error: "Error: In use" (LXD 4.0.7)

Hello,

Required information

  • Distribution: Ubuntu
  • Distribution version: 20.04.2 LTS
  • The output of “lxc info” or if that fails:
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- resources_system
- usedby_consistency
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- storage_rsync_compression
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_state_vlan
- gpu_sriov
- migration_stateful
- disk_state_quota
- storage_ceph_features
- gpu_mig
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.XX.XX.XX:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB9TCCAXugAwIBAgIRAJTLnHpytJsUJNw0OX1s23swCgYIKoZIzj0EAwMwNTEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBzcnYt
    MzA0MB4XDTE5MTAwMzE0MzEyNFoXDTI5MDkzMDE0MzEyNFowNTEcMBoGA1UEChMT
    bGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBzcnYtMzA0MHYwEAYH
    KoZIzj0CAQYFK4EEACIDYgAEs/Yanxasioy7YGzZIJRNAP0+TQBuX5BAF/UNI+o0
    abNC1S1BOwEK2g4P0hRA5NxOFPnXhHfJNoaLyXoO30j7IdQOP+Pevf347bjV1gFM
    CipXsIFuvIaoV9PNYT7F44Jyo08wTTAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAww
    CgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAYBgNVHREEETAPggdzcnYtMzA0hwQK
    JFpNMAoGCCqGSM49BAMDA2gAMGUCMQDmmjFku+mDKOQLDDNY6chVjsB28wDsbtf3
    baT2zbQbkFTzzkI9qgPi6n0KbXjdjz4CMBe+RnIQF853bwaAsWXCOMivR9qsX+oZ
    GEat/mPOEGoTgZEfLf1gpJLQVZxrHbDxLA==
    -----END CERTIFICATE-----
  certificate_fingerprint: 1830fb98b7c3920f0e3f3ad1762b02905d2c98418bee29fb3cf3d99eb60f5497
  driver: lxc | qemu
  driver_version: 4.0.10 | 5.2.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-74-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: srv-304
  server_pid: 7836
  server_version: 4.0.7
  storage: zfs
  storage_version: 0.8.3-1ubuntu12.7
  storage_supported_drivers:
  - name: btrfs
    version: 4.15.1
    remote: false
  - name: cephfs
    version: 12.2.13
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.02.176(2) (2017-11-03) / 1.02.145 (2017-11-03) / 4.41.0
    remote: false
  - name: zfs
    version: 0.8.3-1ubuntu12.7
    remote: false
  - name: ceph
    version: 12.2.13
    remote: true

Issue description

When I try to restore a container I get the error “Error: In use”. I’ve been doing this rollback for about two weeks for internal testing and I haven’t changed anything.

The problem seems to be strongly related to the upgrade from version 4.0.6 to version 4.0.7.

Steps to reproduce

  1. lxc snapshot container snapshot
  2. lxc restore container snapshot
  3. The restore returns “Error: In use” and the container remains stopped

Information to attach

  • Container log (lxc info NAME --show-log)
lxc info container --show-log
Name: container
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/06/17 15:44 UTC
Status: Stopped
Type: container
Profiles: net-vlan301, cpu-4-cores, mem-4GB
Snapshots:
  after_ansible_2 (taken at 2021/06/21 12:47 UTC) (stateless)

Log:

lxc container 20210722131513.485 WARN     conf - conf.c:lxc_map_ids:3389 - newuidmap binary is missing
lxc container 20210722131513.485 WARN     conf - conf.c:lxc_map_ids:3395 - newgidmap binary is missing
lxc container 20210722131513.487 WARN     conf - conf.c:lxc_map_ids:3389 - newuidmap binary is missing
lxc container 20210722131513.487 WARN     conf - conf.c:lxc_map_ids:3395 - newgidmap binary is missing
lxc container 20210722131513.488 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1296 - No such file or directory - Failed to fchownat(43, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc container 20210722131643.667 WARN     conf - conf.c:lxc_map_ids:3389 - newuidmap binary is missing
lxc container 20210722131643.668 WARN     conf - conf.c:lxc_map_ids:3395 - newgidmap binary is missing
root@srv-XXX:~ # grep containers/container /proc/*/mountinfo
/proc/7247/mountinfo:624 1580 0:93 / /var/snap/lxd/common/lxd/storage-pools/default/containers/container rw,noatime shared:342 - zfs nvme/lxd/containers/container rw,xattr,posixacl
/proc/7836/mountinfo:624 1580 0:93 / /var/snap/lxd/common/lxd/storage-pools/default/containers/container rw,noatime shared:342 - zfs nvme/lxd/containers/container rw,xattr,posixacl
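
For reference, those PIDs can be mapped back to processes with a quick check (the PIDs are the ones from the grep output above; 7836 matches the server_pid reported by lxc info, i.e. the LXD daemon itself):

ps -o pid,comm -p 7247,7836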

Out of interest, does reloading LXD solve it?

What is the command to reload LXD with a snap installation?

sudo systemctl reload snap.lxd.daemon

Also, you haven’t said which storage pool you’re using; please can you post the output of:

lxc storage show default

And

lxc config show container --expanded

I’m assuming zfs from the mount output though.

I’ve not been able to reproduce immediately:

snap install lxd --channel=4.0/stable
lxd init --auto
lxc storage create zfs zfs
lxc launch images:ubuntu/focal c1 -s zfs
lxc snapshot c1
lxc restore c1 snap0 # container is running 
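
To double-check the state after the restore, listing the instance is enough (c1 being the test instance from the commands above):

lxc list c1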

Also, can you give the output of:

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- mount | grep container

After reloading LXD and restarting the container.

Thanks

OK, the problem seems to be solved by reloading LXD:

root@srv-XXX:~ # systemctl reload snap.lxd.daemon
root@srv-XXX:~ # 
root@srv-XXX:~ # lxc restore container after_ansible_2 
root@srv-XXX:~ # 

We can see that the mount points are no longer present either:

root@srv-XXX:~ # grep containers/container /proc/*/mountinfo
grep: /proc/157278/mountinfo: No such file or directory
grep: /proc/157280/mountinfo: No such file or directory
grep: /proc/157281/mountinfo: No such file or directory
root@srv-XXX:~ # nsenter --mount=/run/snapd/ns/lxd.mnt -- mount | grep container
root@srv-XXX:~ # 

root@srv-XXX:~ # lxc storage show default
config:
  source: nvme/lxd
  volatile.initial_source: nvme/lxd
  zfs.pool_name: nvme/lxd
description: ""
name: default
driver: zfs
used_by:
- /1.0/images/c53a85d3e035eb0bc6f3e54c92fedcbe823b112026631d0c11df9211e43d3215
- /1.0/instances/container
...
- /1.0/profiles/default
status: Created
locations:
- none
root@srv-XXX:~ # lxc config show container --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20210616_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20210616_07:42"
  image.type: squashfs
  image.variant: default
  limits.cpu: "4"
  limits.memory: 4GB
  security.nesting: "true"
  volatile.base_image: 2ecdb74f3277ad3c0385fe52a6f869b04d2eb98e46fa0084fbefdca45d95f08e
  volatile.eth301.hwaddr: 00:16:3e:58:53:86
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.uuid: b1c705e1-2d6f-4848-96d1-aa644f1b78f6
devices:
  eth301:
    name: eth301
    nictype: bridged
    parent: br_vlan301
    type: nic
  root:
    path: /
    pool: default
    size: 10GB
    type: disk
ephemeral: false
profiles:
- net-vlan301
- cpu-4-cores
- mem-4GB
stateful: false
description: ""

Great.

I would suggest enabling debug logging temporarily using:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

If you see it reoccur, then send over the contents of /var/snap/lxd/common/lxd/logs/lxd.log to help with debugging the order of actions that causes it.
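
If the log grows large, a rough grep (nothing LXD-specific) can pull out just the relevant lines before sending it over:

grep -n 'In use' /var/snap/lxd/common/lxd/logs/lxd.log | tail -n 50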

Thanks

I am coming back to you because the problem was still present on another server. I did a first reload of LXD and it still didn’t work. I then enabled debug mode along with the reload and the problem was solved immediately.

I will keep these commands handy in case the problem occurs again. Thanks a lot for the quick help.

Thanks. We would like to know the order of the steps that reproduce it, so if you see it occur again that information would be useful to help us fix it.


I have the exact same problem on LXD 5.10 compiled from source.
The debug option is enabled, and I have a 1.9 GB log file; are you still interested in these logs to help understand/debug this problem?

Yes that would be useful to see how it got into that state.

Do you have the commands you ran too?

Here is a link to the log file. The first occurrence of the message appears on line 2296594 (yes, it is a very large file, because the error does not appear very quickly).

time="2023-03-07T08:27:46Z" level=debug msg="Instance operation lock finished" action=restore err="In use" instance=ynh-appci-bullseye-amd64-stable-test-6 project=default reusable=false

For context, we use LXD in an internal CI/CD pipeline, which runs application installations and tries different actions on them before restoring the container to run the next ones.

The error occurs after stopping a container with lxc stop $container, and running lxc restore $container snap0 (in this case “ynh-appci-bullseye-amd64-stable-test-6”).
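
A minimal sketch of that sequence (the variable and the final start are illustrative; the instance and snapshot names are the ones from the log line above):

container=ynh-appci-bullseye-amd64-stable-test-6
lxc stop "$container"
lxc restore "$container" snap0   # the step that intermittently fails with "Error: In use"
lxc start "$container"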

We have had this error for quite a few versions already, and indeed a reload of the LXD service does clear it, but if we can help find a proper solution, even better.
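
As a stop-gap, the retry we could script looks roughly like this (illustrative only; the unit name is the snap one mentioned earlier in the thread and would be different for our source build):

if ! lxc restore "$container" snap0; then
    # restore failed (e.g. with "Error: In use"): reload the daemon and try once more
    sudo systemctl reload snap.lxd.daemon
    lxc restore "$container" snap0
fi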

What storage driver is the instance on? It sounds like restoring from the snapshot is taking a long time.

We are using dir

With the dir driver, snapshot restores can take a long time, since there are no filesystem-level snapshots and the data has to be copied back.
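
If it helps to confirm that, simply timing the restore shows how long it takes (nothing LXD-specific, just the shell's time builtin):

time lxc restore "$container" snap0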

This PR may help in this situation, by keeping the operation lock held until the restore has finished:

Thank you for your time. I will wait for this PR to be merged, and I will come back if I still have the problem.