One of my production clusters has been in a broken state for around 3 days.
All containers are available and working but the lxd daemon and commands are not working.
The cluster has 14 nodes and all of them were tracking the latest/stable channel. It uses local storage (ZFS).
I would like to bring it back up without stopping the containers (production mail servers).
I have a detailed report for all 14 servers with the output of the following commands:
snap info lxd | egrep '(installed|refresh-date|snap-id|tracking):'
snap list --all lxd
snap changes lxd
snap change <num> (if the previous command produced a result)
sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM nodes;"
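The per-node report described above could be gathered with a small wrapper such as the following. This is only a sketch: `collect_lxd_report` is a hypothetical helper name, the commands are the ones listed above (with `grep -E` in place of `egrep`), and it assumes `snap` and `sqlite3` still respond on the node.

```shell
#!/usr/bin/env bash
# Sketch: gather the LXD snap diagnostics listed above into one report.
# collect_lxd_report is a hypothetical helper name, not part of snap or LXD.
collect_lxd_report() {
    local db=/var/snap/lxd/common/lxd/database/global/db.bin
    local id

    echo "== host: $(hostname) =="

    echo "-- snap info lxd --"
    snap info lxd | grep -E '(installed|refresh-date|snap-id|tracking):'

    echo "-- snap list --all lxd --"
    snap list --all lxd

    echo "-- snap changes lxd --"
    snap changes lxd

    # Show the task list of every change reported for lxd.
    # Only rows whose first column is a numeric change ID are used.
    for id in $(snap changes lxd 2>/dev/null | awk '$1 ~ /^[0-9]+$/ { print $1 }'); do
        echo "-- snap change $id --"
        snap change "$id"
    done

    echo "-- nodes table --"
    sqlite3 "$db" "SELECT * FROM nodes;"
}

# Usage (run as root on each node):
#   collect_lxd_report > "lxd-report-$(hostname).txt" 2>&1
```

If snapd itself is unresponsive, as it later turned out to be on ws-vh-05, the `snap` calls may block, so the report is best collected in a separate terminal session.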
@tomp any idea how I can tell snap to run the changes again on ws-vh-05?
Would a reboot of the host help? I am afraid that the containers may not come up after the reboot.
A reboot of this single host is not out of the question, but I will need some time to move the data from the containers to somewhere else first.
OK, so here is the current status. I managed to move the most important data from the containers on that host elsewhere.
snap status
root@ws-vh-05:~# snap list --all
Name Version Rev Tracking Publisher Notes
core18 20210611 2074 latest/stable canonical✓ base,disabled
core18 20210722 2128 latest/stable canonical✓ base
core20 20210702 1081 latest/stable canonical✓ base,disabled
core20 20210928 1169 latest/stable canonical✓ base
lxd 4.18 21497 latest/stable canonical✓ disabled,in-cohort
lxd 4.19 21624 latest/stable canonical✓ disabled,in-cohort
snapd 2.51.7 13170 latest/stable canonical✓ snapd,disabled
snapd 2.52 13270 latest/stable canonical✓ snapd,disabled
root@ws-vh-05:~# snap changes
ID Status Spawn Ready Summary
61 Abort 6 days ago, at 07:57 CEST - Refresh "lxd" snap
62 Doing yesterday at 15:10 CEST - Auto-refresh snap "snapd"
root@ws-vh-05:~# snap tasks 62
Status Spawn Ready Summary
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Ensure prerequisites for "snapd" are available
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Download snap "snapd" (13640) from channel "latest/stable"
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Fetch and check assertions for snap "snapd" (13640)
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Mount snap "snapd" (13640)
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Run pre-refresh hook of "snapd" snap if present
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Stop snap "snapd" services
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Remove aliases for snap "snapd"
Done yesterday at 15:10 CEST yesterday at 15:10 CEST Make current revision for snap "snapd" unavailable
Doing yesterday at 15:10 CEST - Copy snap "snapd" data
Do yesterday at 15:10 CEST - Setup snap "snapd" (13640) security profiles
Do yesterday at 15:10 CEST - Make snap "snapd" (13640) available to the system
Do yesterday at 15:10 CEST - Automatically connect eligible plugs and slots of snap "snapd"
Do yesterday at 15:10 CEST - Set automatic aliases for snap "snapd"
Do yesterday at 15:10 CEST - Setup snap "snapd" aliases
Do yesterday at 15:10 CEST - Run post-refresh hook of "snapd" snap if present
Do yesterday at 15:10 CEST - Start snap "snapd" (13640) services
Do yesterday at 15:10 CEST - Remove data for snap "snapd" (13170)
Do yesterday at 15:10 CEST - Remove snap "snapd" (13170) from the system
Do yesterday at 15:10 CEST - Clean up "snapd" (13640) install
Do yesterday at 15:10 CEST - Run health check of "snapd" snap
Doing yesterday at 15:10 CEST - Handling re-refresh of "snapd" as needed
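To spot this kind of stall quickly across many nodes, the `snap changes` and `snap tasks` output can be filtered for entries that never reached `Done`. The two helpers below are hypothetical names for a minimal sketch; they only parse the text format shown in the transcript above and assume the status is the first word of each task line (and the second column of each change line).

```shell
#!/usr/bin/env bash
# pending_changes: hypothetical helper. Reads `snap changes` output on stdin
# and prints the IDs of changes whose Status column is "Doing" or "Do".
pending_changes() {
    awk '$1 ~ /^[0-9]+$/ && ($2 == "Doing" || $2 == "Do") { print $1 }'
}

# first_unfinished_task: hypothetical helper. Reads `snap tasks <id>` output
# on stdin and prints the first task line whose status is "Doing" or "Do".
first_unfinished_task() {
    awk '$1 == "Doing" || $1 == "Do" { print; exit }'
}

# Usage:
#   snap changes | pending_changes
#   snap tasks 62 | first_unfinished_task
```

Applied to the output above, this would point at change 62 and at its `Copy snap "snapd" data` task as the place where the refresh stopped making progress.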
So even the snapd update itself seems to be stuck. Now I will try your command.
root@ws-vh-05:~# systemctl restart snapd
It hung for a while and ended with the message:
Job for snapd.service canceled.
… In syslog I can see the following:
Oct 21 13:55:08 ws-vh-05 systemd[1]: Stopping Snap Daemon...
Oct 21 13:55:08 ws-vh-05 snapd[3692394]: main.go:155: Exiting on terminated signal.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Killing process 3692394 (snapd) with signal SIGKILL.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Failed with result 'timeout'.
Oct 21 13:56:38 ws-vh-05 systemd[1]: Stopped Snap Daemon.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Found left-over process 569368 (sync) in control group while starting unit. Ignoring.
Oct 21 13:56:38 ws-vh-05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Found left-over process 237710 (sync) in control group while starting unit. Ignoring.
Oct 21 13:56:38 ws-vh-05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: Starting Snap Daemon...
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: AppArmor status: apparmor is enabled and all features are available
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: AppArmor status: apparmor is enabled and all features are available
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: daemon.go:242: started snapd/2.52 (series 16; classic) ubuntu/20.04 (amd64) linux/5.4.0-81-generic.
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: daemon.go:335: adjusting startup timeout by 50s (pessimistic estimate of 30s plus 5s per snap)
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: helpers.go:236: removed stale connections: lxd:lxd-support core:lxd-support, lxd:network core:network, lxd:network-bind core:network-bind, lxd:system-observe core:system-observe
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: main.go:155: Exiting on terminated signal.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Killing process 2166343 (snapd) with signal SIGKILL.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Failed with result 'timeout'.
Oct 21 13:58:09 ws-vh-05 systemd[1]: Stopped Snap Daemon.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
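The log shows left-over `sync` processes in the snapd control group and a stop that only completes via SIGKILL. One possible explanation, which is an assumption and not something confirmed in this thread, is that those processes are blocked in uninterruptible I/O (process state `D`). A minimal sketch for checking that, with `dstate_filter` and `list_dstate` as hypothetical helper names:

```shell
#!/usr/bin/env bash
# dstate_filter: hypothetical helper. Reads `ps -eo pid,stat,wchan:32,comm`
# output on stdin and keeps only processes whose state starts with "D"
# (uninterruptible sleep, usually waiting on I/O).
dstate_filter() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^D/'
}

# list_dstate: hypothetical helper. Runs ps and applies the filter above.
list_dstate() {
    ps -eo pid,stat,wchan:32,comm | dstate_filter
}

# Usage:
#   list_dstate
```

If the `sync` PIDs from the log (569368 and 237710) show up in that list, snapd is probably waiting on storage rather than on anything snap-specific, which would be useful detail to include when reporting the problem.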
If you have time, it would be good to flag this issue over on the snapd forum at https://forum.snapcraft.io/ and see if they have any recommendations on how to solve it without a reboot.
I’d be interested to know the best way to avoid that situation and/or how to get out of it.
I will now read your recommendations in "Managing the LXD snap" and see which strategy would be best for my client.
I have two other clusters which I have already switched to the 4.19/stable channel, but even that would not have avoided this problem.
This is the first time in a few years of running various LXD clusters that I have run into this kind of issue.
I will also have to evaluate breaking up the cluster and managing each of the nodes separately. I would lose a bit of functionality, but it would not be complicated to replicate in a separate application-level layer.