Instance locked out, can't start, can't edit

Not sure what’s happened, but I have three ongoing operations against an instance, two restarts and one update; all say running and none can be cancelled. I’ve tried a reboot to no avail. Everything else “seems” to be OK, but I can’t “do” anything to this instance.

Any ideas how I can recover from this?
I could sustain deleting the instance as I have backups… but I suspect that, as nothing else is working, a delete is going to freeze too.
Nothing of note in the logs so far as I can see… cluster, OVN Raft, etc. all showing 100%.
??
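
For reference, this is roughly how I’ve been poking at the stuck operations; the UUID below is just a placeholder for whatever `incus operation list` reports:

# List all running operations and note their UUIDs
incus operation list

# Inspect one of them (UUID is a placeholder)
incus operation show 4f1b3c2a-aaaa-bbbb-cccc-000000000000

# Attempt to cancel it (this is the part that fails for me)
incus operation delete 4f1b3c2a-aaaa-bbbb-cccc-000000000000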

# incus config show matrix-forum
architecture: aarch64
config:
  image.architecture: arm64
  image.description: Debian bookworm arm64 (20250302_05:24)
  image.os: Debian
  image.release: bookworm
  image.serial: "20250302_05:24"
  image.type: squashfs
  image.variant: cloud
  volatile.base_image: 4fd2b4b8284393d029362b1a348dbe5a58018725a8ce57287c7d6b87563aff70
  volatile.cloud-init.instance-id: 723f76c4-b784-453c-9b19-4171bca5ab0a
  volatile.eth-1.hwaddr: 10:66:6a:f5:8b:37
  volatile.eth-1.last_state.ip_addresses: 10.4.0.11
  volatile.eth-1.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 69615e6f-e7d7-45ef-9867-db63530ffb3f
  volatile.uuid.generation: 69615e6f-e7d7-45ef-9867-db63530ffb3f
devices:
  ssh:
    connect: tcp:127.0.0.1:22
    listen: tcp:0.0.0.0:2254
    type: proxy
ephemeral: false
profiles:
- default
stateful: false

OK, found it. It looks like another node was experiencing an issue in the background (the hung-task problem reported elsewhere), which was causing “some” operations on my node to get stuck. On rebooting the node with the stuck kernel task, my operations cleared.
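
For anyone else hitting this, here is a rough sketch of how to spot the offending node: check each node for hung-task reports in the kernel log and for processes stuck in uninterruptible sleep (D state).

# Any hung-task reports in the kernel ring buffer?
dmesg | grep -i 'hung task'

# Any processes stuck in uninterruptible sleep (state D)?
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'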

So a problem on one node seems able to lock up another node doing unrelated things (?)

[23321.971052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23321.971054] task:incusd          state:D stack:0     pid:9144  tgid:2970  ppid:1      flags:0x00000004
[23321.971062] Call trace:
[23321.971064]  __switch_to+0xf0/0x150
[23321.971073]  __schedule+0x38c/0xdd8
[23321.971077]  schedule+0x3c/0x148
[23321.971080]  grab_super+0x158/0x1c0
[23321.971087]  sget+0x150/0x268
[23321.971091]  zpl_mount+0x134/0x2f8 [zfs]
[23321.971340]  legacy_get_tree+0x38/0x70
[23321.971346]  vfs_get_tree+0x30/0x100
[23321.971350]  path_mount+0x410/0xa98
[23321.971355]  __arm64_sys_mount+0x194/0x2c0
[23321.971360]  invoke_syscall+0x50/0x120
[23321.971367]  el0_svc_common.constprop.0+0x48/0xf0
[23321.971372]  do_el0_svc+0x24/0x38
[23321.971377]  el0_svc+0x30/0xd0
[23321.971383]  el0t_64_sync_handler+0x100/0x130
[23321.971388]  el0t_64_sync+0x190/0x198
[23321.971392] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

So a problem on one node seems able to lock up another node doing unrelated things (?)

Yes. Welcome to clustering.

The alternative (which I use) is to run standalone Incus servers without clustering. You can add each one as a separate remote and still manage and move instances across all of them.
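
Roughly like this; the server names and addresses below are just placeholders:

# Add each standalone server as a remote on your client
incus remote add node1 https://node1.example.net:8443
incus remote add node2 https://node2.example.net:8443

# Manage instances on any of them
incus list node1:

# Copy or move an instance between servers
incus copy node1:web node2:web
incus move node1:web node2: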

However, if you have a shared storage backend, and you want to do live migration of VMs, then I think you’ll still need clustering.


Mmm, no, I’m all containers… the reason I want clustering is so that I can use OVN to move traffic transparently between various points, rather than having to use fixed tunnels. That way I can have a reverse proxy on my edge (in the cloud) that points directly at an address in the cluster providing the associated service. When you have hundreds of containers, not having to worry about where they all are becomes a big win, especially when they migrate between nodes… with OVN they always keep the same address. I can get half-way there with bridging, and indeed I may go back to that, but OVN feels like the “right” way to do it…
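
For context, the kind of setup I mean is roughly this; the network, instance, and device names are placeholders, and “UPLINK” stands for whatever uplink network the cluster already has:

# Create an OVN network on top of an existing uplink
incus network create ovn0 --type=ovn network=UPLINK

# Attach a container to it and pin its address so it stays stable across nodes
incus config device add web eth0 nic network=ovn0 ipv4.address=10.4.0.11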

Just FYI: this seems to be an issue in later kernels; it “looks” like this is the one I’m seeing:

https://github.com/openzfs/zfs/issues/17138

It looks vaguely like it might be fixed from 6.13-ish onwards… I’m currently trying 6.6, as it’s not something I saw prior to the recent upgrades (I’ve been on 6.12, which is the latest stable for the RPi).

Mmm, bad news: it’s also broken on the previous kernel… (6.6.78)

[61263.129520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[61263.129522] task:incusd          state:D stack:0     pid:12347 ppid:1      flags:0x00000005
[61263.129528] Call trace:
[61263.129530]  __switch_to+0xe0/0x120
[61263.129538]  __schedule+0x37c/0xd60
[61263.129542]  schedule+0x64/0x108
[61263.129545]  grab_super_dead+0xec/0x160
[61263.129552]  sget+0x150/0x208
[61263.129556]  zpl_mount+0x134/0x2f8 [zfs]
[61263.129769]  legacy_get_tree+0x38/0x70
[61263.129774]  vfs_get_tree+0x30/0xf8
[61263.129779]  path_mount+0x410/0xa90
[61263.129783]  __arm64_sys_mount+0x1e8/0x2d0
[61263.129787]  invoke_syscall+0x50/0x128
[61263.129793]  el0_svc_common.constprop.0+0x48/0xf0
[61263.129797]  do_el0_svc+0x24/0x38
[61263.129801]  el0_svc+0x38/0xd0
[61263.129806]  el0t_64_sync_handler+0x100/0x130
[61263.129810]  el0t_64_sync+0x190/0x198

As far as I can see, this is triggered by pinning a static IP address on an OVN network and then restarting the container.
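
For reference, the rough sequence that seems to trigger it for me; the NIC device name is an assumption (it’s whatever the profile calls it, “eth-1” in my case):

# Pin a static address on the instance's OVN NIC (device name assumed from the profile)
incus config device override matrix-forum eth-1 ipv4.address=10.4.0.11

# Restarting the container is what then appears to wedge things
incus restart matrix-forum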