Cannot restore to a container snapshot - Instance "restore" operation timed out after 30s

amartin33 · February 23, 2023, 1:27pm

I’m running LXD 5.0.2 on Ubuntu 22.04 using ZFS as the storage backend. Sometimes, when attempting to restore an LXD container to a snapshot, I get the following timeout:

$ lxc restore mycontainer mysnapshot
Error: Instance "restore" operation timed out after 30s

What can I do to debug this and force reverting to this snapshot? Thanks!

tomp · February 23, 2023, 1:39pm

Is it always the instance and/or snapshot?

tomp · February 23, 2023, 1:39pm

How loaded is your system and storage pool?

amartin33 · February 23, 2023, 1:43pm

Can you clarify what you’re asking here?

Not very - disk usage is low and CPU load was ~15 but on a 64-core system that is not a concern.

tomp · February 23, 2023, 2:33pm

The missing word there was “same”

amartin33 · February 23, 2023, 3:30pm

We’ve tested this 8 times over the course of the past week. We are always using the same instance and trying to lxc restore to the same snapshot. Of the 8 runs, 6 have succeeded, 1 failed with the above error, and the last 1 failed with the following error:

Error: In use

Is there a way to force the restore or otherwise make it more reliably succeed?

tomp · February 23, 2023, 3:36pm

If you’re getting stuck on “In Use” then reloading LXD will clear the reference counters without restarting the instances.

sudo systemctl reload snap.lxd.daemon

Ideally if you can reboot the host machine to get a fresh state though and then try doing a restore in one window whilst in another window run:

lxc monitor --type=logging --pretty

And post the output here, as it sounds like the ZFS restore is starting, but for some reason taking a long time and then LXD is giving up, with it still running in the background.

Also if you could get a sudo ps aux output while you’re running the restore process, that would be helpful too, along with the name of the instance and the snapshot name.

amartin33 · February 23, 2023, 8:40pm

We’re working on reproducing the error so we can gather the above debug info.

amartin33 · February 27, 2023, 12:57pm

We’re still working on reproducing the timeout error but did successfully reproduce the Error: In use error and got the following output from lxc monitor:

DEBUG  [2023-02-24T07:03:07-06:00] Start finished                                instance=mycontainer instanceType=container project=default stateful=false
DEBUG  [2023-02-24T07:03:07-06:00] Instance operation lock finished              action=start err="<nil>" instance=mycontainer project=default reusable=false
DEBUG  [2023-02-24T07:03:07-06:00] Event listener server handler stopped         listener=7d99f914-c88a-4ec4-a332-0d0d2d9433fd local=/var/snap/lxd/common/lxd/unix.socket remote=@
ERROR  [2023-02-24T07:03:07-06:00] balance: Unable to set cpuset                 err="setting cgroup item for the container failed" name=mycontainer2 value="1,15,23,26,30,42,50,56"


DEBUG  [2023-02-24T07:03:55-06:00] Container stopped, cleaning up                instance=mycontainer instanceType=container project=default
DEBUG  [2023-02-24T07:03:55-06:00] Stopping device                               device=root instance=mycontainer instanceType=container project=default type=disk
DEBUG  [2023-02-24T07:03:55-06:00] UnmountInstance started                       instance=mycontainer project=default
DEBUG  [2023-02-24T07:03:55-06:00] UnmountInstance finished                      instance=mycontainer project=default
DEBUG  [2023-02-24T07:03:55-06:00] Skipping unmount as in use                    driver=zfs pool=default refCount=1 volName=mycontainer
INFO   [2023-02-24T07:03:55-06:00] Restoring container                           created="2023-02-23 13:23:42.598150347 +0000 UTC" ephemeral=false instance=mycontainer instanceType=container project=default source=mycontainer/mysnap used="2023-02-24 13:03:06.869029524 +0000 UTC"
DEBUG  [2023-02-24T07:03:55-06:00] Instance operation lock finished              action=restore err="<nil>" instance=mycontainer project=default reusable=false
DEBUG  [2023-02-24T07:03:55-06:00] Scheduler: container mycontainer stopped: re-balancing
DEBUG  [2023-02-24T07:03:55-06:00] Instance operation lock created               action=restore instance=mycontainer project=default reusable=false
DEBUG  [2023-02-24T07:03:55-06:00] Stop finished                                 instance=mycontainer instanceType=container project=default stateful=false
DEBUG  [2023-02-24T07:03:55-06:00] MountInstance started                         instance=mycontainer project=default
DEBUG  [2023-02-24T07:03:55-06:00] Mounting instance to check for CRIU state path existence  instance=mycontainer instanceType=container project=default
DEBUG  [2023-02-24T07:03:55-06:00] UnmountInstance started                       instance=mycontainer project=default
DEBUG  [2023-02-24T07:03:55-06:00] MountInstance finished                        instance=mycontainer project=default
DEBUG  [2023-02-24T07:03:55-06:00] UnmountInstance finished                      instance=mycontainer project=default
DEBUG  [2023-02-24T07:03:55-06:00] Skipping unmount as in use                    driver=zfs pool=default refCount=1 volName=mycontainer
DEBUG  [2023-02-24T07:03:55-06:00] Failure for operation                         class=task description="Restoring snapshot" err="In use" operation=d9b513b0-ec3e-45f3-a3e6-4e43da0842cc project=default
DEBUG  [2023-02-24T07:03:55-06:00] Instance operation lock finished              action=restore err="In use" instance=mycontainer project=default reusable=false
DEBUG  [2023-02-24T07:03:55-06:00] Event listener server handler stopped         listener=61acc074-12b3-4b9e-8adb-9cdd93f49688 local=/var/snap/lxd/common/lxd/unix.socket remote=@
ERROR  [2023-02-24T07:03:55-06:00] balance: Unable to set cpuset                 err="setting cgroup item for the container failed" name=mycontainer2 value="0,11,20,39,42,54,55,6"

tomp · February 27, 2023, 2:11pm

How did you get it into this state? Do you have reproducer steps?

amartin33 · February 27, 2023, 3:10pm

We’re still working on reproducer steps (it isn’t failing frequently enough to get a concrete set of steps). Stay tuned…