Container cannot be stopped

Host: Ubuntu 20.04
Incus 0.1
zfs storage pool

Steps to reproduce:
incus launch images:ubuntu/jammy jammy1
incus stop --force jammy1
The stop takes forever; the IPv4 address vanishes from incus list, but the container still shows as RUNNING.
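(For reference, the reported state can also be checked from the host with something like the following; ns4 selects the name, state and IPv4 columns.)

incus list jammy1 -c ns4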

incus info --show-log jammy1

Name: jammy1
Status: RUNNING
Type: container
Architecture: x86_64
PID: 291865
Created: 2023/10/25 12:31 CEST
Last Used: 2023/10/25 12:32 CEST

Resources:
  Processes: 1
  Disk usage:
    root: 3.49MiB
  CPU usage:
    CPU usage (in seconds): 3
  Memory usage:
    Memory (current): 48.13MiB
    Memory (peak): 77.12MiB

Log:

lxc jammy1 20231025103204.136 WARN conf - …/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc jammy1 20231025103204.136 WARN conf - …/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc jammy1 20231025103204.138 WARN conf - …/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc jammy1 20231025103204.138 WARN conf - …/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc jammy1 20231025103204.139 WARN cgfsng - …/src/lxc/cgroups/cgfsng.c:fchowmodat:1619 - No such file or directory - Failed to fchownat(40, memory.oom.group, 65536, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc jammy1 20231025103204.325 WARN mainloop - …/src/lxc/mainloop.c:__lxc_mainloop_io_uring:290 - Received unexpected return value -95 in cqe for "signal_handler" handler
lxc jammy1 20231025103204.325 WARN mainloop - …/src/lxc/mainloop.c:__lxc_mainloop_io_uring:290 - Received unexpected return value -22 in cqe for "lxc_terminal_ptx_io_handler" handler

incus config show jammy1 --expanded

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20231024_07:42)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20231024_07:42"
  image.type: squashfs
  image.variant: default
  volatile.base_image: a65818d75b6d83703084aeda3e8f1e5e4d820f343583d6a8e24844433956bf02
  volatile.cloud-init.instance-id: db241416-91a7-47de-95cf-dead15e8c4ba
  volatile.eth0.host_name: veth7dac77cc
  volatile.eth0.hwaddr: 00:16:3e:89:87:a1
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 3a90561b-71b7-46fa-ac84-d926a0869f80
  volatile.uuid.generation: 3a90561b-71b7-46fa-ac84-d926a0869f80
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: pl
    size: 1GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

I'd appreciate any help or a hint about what to look for.

Can you show the output of dmesg | tail -n 300?

dmesg

Nothing obviously wrong in there that would explain the hang. What does ps fauxww look like on the host?

ps fauxww

I migrated using lxd-to-incus, which was not very smooth. Some old config keys in my profiles and instance configs caused errors; for example, limits.network.priority had to be changed to limits.priority. The custom volumes referenced by

storage.backups_volume
storage.images_volume

had to be unmounted manually and their leftover folders deleted (roughly as sketched below).
I'm not sure the migration went through completely, considering those interruptions.
I also have a lot of accumulated settings in sysctl.conf; they stacked up over time as LXD evolved and bugs got fixed.
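For illustration, the manual cleanup was roughly of this shape; the profile name and the paths below are placeholders, not my exact setup:

lxc profile unset default limits.network.priority          # drop the old key that triggered the error
# ...then re-add the equivalent limits.priority setting where it now lives in your config
umount /path/to/backups-volume                             # hypothetical mount points of the two custom volumes
umount /path/to/images-volume
rmdir /path/to/leftover/backups /path/to/leftover/images   # hypothetical leftover folders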

Just to confirm, the limits.network.priority did give you a pre-migration error, right?

As for storage.backups_volume and storage.images_volume, I'll put down an item to do extra testing on those, as I can definitely see how they may be problematic.
There has been a fix in that area since Incus 0.1, basically making sure the two don't exist as folders on the target so they can be symlinked again, but that probably deserves some more testing.
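For context, that fix amounts to roughly the following done by hand on the target; the /var/lib/incus path and the incus unit name assume a non-snap package and are worth double-checking on your system:

systemctl stop incus
rmdir /var/lib/incus/backups /var/lib/incus/images   # remove the leftover empty directories blocking the symlinks
systemctl start incus                                # the daemon can then symlink them to the custom volumes again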

Can you show uname -a and also cat /proc/279418/stack?

The odd hang looks to me like it could be io_uring related, as io_uring has been hitting kernel issues on and off with some Ubuntu kernels.

Can you also show zfs version?
That should help me replicate this environment.

Lastly, I’d say to try kill -9 279418 to see if that properly stops the container.
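For reference, 279418 is taken from your ps output and should be the PID of the container's [lxc monitor] process. In general, something like the following finds it, where the container name is just an example:

pgrep -af 'lxc monitor.*jammy1'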

Yes, it did.
lxd-to-incus exited prematurely at that point; I fixed those issues manually and restarted the incus daemon, as I didn't know how else to deal with the premature exit.
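For what it's worth, restarting the daemon here means something like the following; the exact unit name depends on how Incus was packaged:

systemctl restart incus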

uname -a
Linux srv4 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/279418/stack
[<0>] io_cqring_wait+0x17c/0x1c0
[<0>] __x64_sys_io_uring_enter+0x19a/0x2c0
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1

zfs version
zfs-0.8.3-1ubuntu12.15
zfs-kmod-0.8.3-1ubuntu12.14

kill -9 279418
It terminated the lxc monitor process of container alp2, and indeed I could then stop it immediately.

Okay, so that’s definitely an io_uring kernel issue then…

Your kernel is quite outdated, you’re running 5.4.0-137 when the current Ubuntu 20.04 kernel is 5.4.0-165. Ubuntu 20.04 also has a HWE kernel of 5.15.0-87 available as an alternative.

For now, I’d recommend you apply updates on your system and reboot, then let us know if the issue still occurs with the 5.4.0-165 kernel. If it does, then it’d be great if you could install linux-generic-hwe-20.04 to test that 5.15.0-87 kernel. That would give us a clear picture of the state of the kernel on Ubuntu.
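Concretely, something along these lines should do it; these are the standard Ubuntu packages, so adjust to your usual update workflow:

apt update && apt full-upgrade        # brings in the current 5.4.0 GA kernel
reboot
# only if the hang still happens on 5.4.0-165:
apt install linux-generic-hwe-20.04   # pulls in the 5.15 HWE kernel
reboot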

As an alternative, maybe for later: if you want the kernel that we do our testing on, you could use the Ubuntu 20.04 build from zabbly/linux (https://github.com/zabbly/linux), though note that as a ZFS user you'll also need the matching OpenZFS builds from zabbly/zfs (https://github.com/zabbly/zfs).

Indeed, a newer kernel (the upgrade to the 5.15.0-87 HWE kernel) did the trick. Everything works fine now.
Thanks very much for your patience and support.

No worries, glad that worked!

Doing some testing on the upcoming version of lxd-to-incus.

root@incus-migrate:~# ./lxd-to-incus 
=> Looking for source server
==> Detected: snap package
=> Looking for target server
=> Connecting to source server
=> Connecting to the target server
=> Checking server versions
==> Source version: 5.19
==> Target version: 0.1
=> Validating version compatibility
=> Checking that the source server isn't empty
=> Checking that the target server is empty
=> Checking that the source server isn't clustered
=> Validating source server configuration

The migration is now ready to proceed.
At this point, the source server and all its instances will be stopped.
Instances will come back online once the migration is complete.

Proceed with the migration? [default=no]: yes
=> Stopping the source server
=> Stopping the target server
=> Wiping the target server
=> Migrating the data
=> Migrating database
=> Cleaning up target paths
=> Starting the target server
=> Checking the target server
Uninstall the LXD package? [default=no]: yes
=> Uninstalling the source server
root@incus-migrate:~# incus config show
config:
  core.https_address: :8443
  storage.backups_volume: zfs-pool/backups
  storage.images_volume: zfs-pool/images
root@incus-migrate:~# 

So that shows that the handling of storage volumes for images and backups appears to be working now.
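A quick way to double-check that on the target, assuming the non-snap /var/lib/incus layout:

incus config get storage.backups_volume
incus config get storage.images_volume
ls -l /var/lib/incus/backups /var/lib/incus/images   # both should now be symlinks into the custom volumes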