LXD cluster broken since last snap update (lxd 4.19 21723)

Hi,

One of my production clusters has been in a broken state for around three days.
All containers are available and working, but the LXD daemon and the CLI commands are not.

The cluster has 14 nodes, all of which were tracking the latest/stable channel. It uses local storage (ZFS).

I would like to bring it back up without stopping the containers (production mail servers).

I have a detailed report from all 14 servers with the output of the following commands:

  • snap info lxd | egrep '(installed|refresh-date|snap-id|tracking):'
  • snap list --all lxd
  • snap changes lxd
  • snap change <num> (if the previous command produced a result)
  • sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM nodes;"

The report is available here: http://159.69.12.22/ws-vh-lxd-report.txt
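
(In case it is useful to anyone else: a minimal sketch of how a report like this can be collected, assuming passwordless ssh as root to each node; the conditional snap change <num> step is omitted for brevity.)

    for h in ws-vh-{00..09} ws-vh-{11..14}; do
        echo "===== $h ====="
        ssh "root@$h" '
            snap info lxd | egrep "(installed|refresh-date|snap-id|tracking):"
            snap list --all lxd
            snap changes lxd
            sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM nodes;"
        '
    done > ws-vh-lxd-report.txt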

Thanks for your help and kind regards

Atif

Looks like ws-vh-05 is stuck on an old revision, 21624:

ws-vh-05 > snap info lxd | egrep '(installed|refresh-date|snap-id|tracking):'
-------------------------------------------------------------------------------
snap-id:  J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
installed:          4.19                   (21624) 76MB disabled,in-cohort
-------------------------------------------------------------------------------


ws-vh-05 > snap list --all lxd
--------------------------------
Name  Version  Rev    Tracking       Publisher   Notes
lxd   4.18     21497  latest/stable  canonical*  disabled,in-cohort
lxd   4.19     21624  latest/stable  canonical*  disabled,in-cohort
--------------------------------


ws-vh-05 > snap changes lxd
-----------------------------
ID   Status  Spawn                      Ready  Summary
61   Abort   5 days ago, at 07:57 CEST  -      Refresh "lxd" snap

-----------------------------

Can you try this:

Also, ws-vh-00, ws-vh-01, ws-vh-03, and ws-vh-08 don't appear to be in the cluster cohort, so please switch them to the cohort using:

snap switch lxd --cohort=+ 

followed by snap refresh lxd
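
For example, to hit all four in one pass (just a sketch, assuming ssh access as root to the nodes):

    for h in ws-vh-00 ws-vh-01 ws-vh-03 ws-vh-08; do
        echo "== $h =="
        ssh "root@$h" 'snap switch lxd --cohort=+ && snap refresh lxd'
    done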

Hi tomp,

I tried it on two of the servers.

On ws-vh-00, I get the following:

root@ws-vh-00:~# snap switch lxd --cohort=+
"lxd" switched to the "+" cohort

root@ws-vh-00:~# snap refresh lxd
snap "lxd" has no updates available

On ws-vh-09, I get the following:

root@ws-vh-09:~# snap switch lxd --cohort=+
error: snap "lxd" has "refresh-snap" change in progress

Make sure all nodes are in the cohort and enabled, as some nodes may be waiting for the others to get up to the same revision.
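
A quick way to check is the Notes column of snap list lxd on every node: the active revision should show in-cohort and must not show disabled. A sketch, assuming the same root ssh access as before:

    for h in ws-vh-{00..09} ws-vh-{11..14}; do
        printf '%-10s ' "$h"
        ssh "root@$h" 'snap list lxd | tail -n +2'   # skip the header line
    done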


How can I bring them to enabled?
Most of them are set as disabled.
Full list here: http://159.69.12.22/ws-vh-snap-list-all-lxd.txt

No, they are not.

Only ws-vh-05 shows as (21624) 76MB disabled,in-cohort in snap info lxd.

There are also others that show (21723) 76MB - those should show (21723) 76MB in-cohort.

I don't know why ws-vh-05 shows as disabled though, I'm afraid.

Any ideas @stgraber?

@tomp here is the output of snap switch on all hosts.

ws-vh-00: snap switch lxd --cohort=+
--------------------------------------------------
"lxd" switched to the "+" cohort

ws-vh-01: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "auto-refresh" change in progress

ws-vh-02: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-03: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-04: snap switch lxd --cohort=+
--------------------------------------------------
"lxd" switched to the "+" cohort

ws-vh-05: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-06: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-07: snap switch lxd --cohort=+
--------------------------------------------------
No change switch (no-op)
"lxd" switched to the "+" cohort

ws-vh-08: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "auto-refresh" change in progress

ws-vh-09: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-11: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-12: snap switch lxd --cohort=+
--------------------------------------------------
No change switch (no-op)
"lxd" switched to the "+" cohort

ws-vh-13: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-14: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

@tomp any idea how I can tell snap to run the changes again on ws-vh-05?
Would a reboot of the host help? I am afraid that the containers may not come up after the reboot.
A reboot of this single host is not out of the question, but I will need some time to move the data from the containers elsewhere first.

thanks.

A reboot would likely fix it, but I am hesitant to suggest it in case it doesn’t and your containers then don’t start.

I’d like to get snap working without a reboot if possible.

Can you try systemctl restart snapd first?
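
That is, something along these lines, then watch whether the stuck change starts moving (the journalctl call is only there to look for errors; adjust the line count as needed):

    systemctl restart snapd      # restart the snapd daemon itself
    snap changes lxd             # does change 61 move out of "Abort"?
    journalctl -u snapd -n 50    # recent snapd log lines, to spot errors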

OK, so, current status: I managed to move the most important data from the containers on that host elsewhere.

Snap status:

root@ws-vh-05:~# snap list --all
Name    Version   Rev    Tracking       Publisher   Notes
core18  20210611  2074   latest/stable  canonical✓  base,disabled
core18  20210722  2128   latest/stable  canonical✓  base
core20  20210702  1081   latest/stable  canonical✓  base,disabled
core20  20210928  1169   latest/stable  canonical✓  base
lxd     4.18      21497  latest/stable  canonical✓  disabled,in-cohort
lxd     4.19      21624  latest/stable  canonical✓  disabled,in-cohort
snapd   2.51.7    13170  latest/stable  canonical✓  snapd,disabled
snapd   2.52      13270  latest/stable  canonical✓  snapd,disabled


root@ws-vh-05:~# snap changes
ID   Status  Spawn                      Ready  Summary
61   Abort   6 days ago, at 07:57 CEST  -      Refresh "lxd" snap
62   Doing   yesterday at 15:10 CEST    -      Auto-refresh snap "snapd"


root@ws-vh-05:~# snap tasks 62
Status  Spawn                    Ready                    Summary
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Ensure prerequisites for "snapd" are available
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Download snap "snapd" (13640) from channel "latest/stable"
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Fetch and check assertions for snap "snapd" (13640)
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Mount snap "snapd" (13640)
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Run pre-refresh hook of "snapd" snap if present
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Stop snap "snapd" services
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Remove aliases for snap "snapd"
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Make current revision for snap "snapd" unavailable
Doing   yesterday at 15:10 CEST  -                        Copy snap "snapd" data
Do      yesterday at 15:10 CEST  -                        Setup snap "snapd" (13640) security profiles
Do      yesterday at 15:10 CEST  -                        Make snap "snapd" (13640) available to the system
Do      yesterday at 15:10 CEST  -                        Automatically connect eligible plugs and slots of snap "snapd"
Do      yesterday at 15:10 CEST  -                        Set automatic aliases for snap "snapd"
Do      yesterday at 15:10 CEST  -                        Setup snap "snapd" aliases
Do      yesterday at 15:10 CEST  -                        Run post-refresh hook of "snapd" snap if present
Do      yesterday at 15:10 CEST  -                        Start snap "snapd" (13640) services
Do      yesterday at 15:10 CEST  -                        Remove data for snap "snapd" (13170)
Do      yesterday at 15:10 CEST  -                        Remove snap "snapd" (13170) from the system
Do      yesterday at 15:10 CEST  -                        Clean up "snapd" (13640) install
Do      yesterday at 15:10 CEST  -                        Run health check of "snapd" snap
Doing   yesterday at 15:10 CEST  -                        Handling re-refresh of "snapd" as needed

So even the snapd update itself seems a bit broken. Now I'll try your command.

root@ws-vh-05:~# systemctl restart snapd

It hung for a while and ended with the message:

Job for snapd.service canceled.

In syslog I can see the following:

Oct 21 13:55:08 ws-vh-05 systemd[1]: Stopping Snap Daemon...
Oct 21 13:55:08 ws-vh-05 snapd[3692394]: main.go:155: Exiting on terminated signal.

Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Killing process 3692394 (snapd) with signal SIGKILL.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Failed with result 'timeout'.
Oct 21 13:56:38 ws-vh-05 systemd[1]: Stopped Snap Daemon.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Found left-over process 569368 (sync) in control group while starting unit. Ignoring.
Oct 21 13:56:38 ws-vh-05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Found left-over process 237710 (sync) in control group while starting unit. Ignoring.
Oct 21 13:56:38 ws-vh-05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: Starting Snap Daemon...
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: AppArmor status: apparmor is enabled and all features are available
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: AppArmor status: apparmor is enabled and all features are available
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: daemon.go:242: started snapd/2.52 (series 16; classic) ubuntu/20.04 (amd64) linux/5.4.0-81-generic.
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: daemon.go:335: adjusting startup timeout by 50s (pessimistic estimate of 30s plus 5s per snap)
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: helpers.go:236: removed stale connections: lxd:lxd-support core:lxd-support, lxd:network core:network, lxd:network-bind core:network-bind, lxd:system-observe core:system-observe
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: main.go:155: Exiting on terminated signal.


Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Killing process 2166343 (snapd) with signal SIGKILL.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Failed with result 'timeout'.
Oct 21 13:58:09 ws-vh-05 systemd[1]: Stopped Snap Daemon.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Triggering OnFailure= dependencies.

Does anything in these messages help?

So, I finally rebooted and then power-cycled the server. Let's see what happens.

After the power cycle, I get the following state:

root@ws-vh-05:~# snap list --all lxd
Name  Version  Rev    Tracking       Publisher   Notes
lxd   4.18     21497  latest/stable  canonical✓  disabled,in-cohort
lxd   4.19     21624  latest/stable  canonical✓  in-cohort

So now I run

root@ws-vh-05:~# snap refresh lxd
lxd 4.19 from Canonical✓ refreshed

and finally:

root@ws-vh-05:~# lxc cluster list
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
|   NAME   |             URL             |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-00 | https://192.168.250.16:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-01 | https://192.168.250.17:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-02 | https://192.168.250.18:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-03 | https://192.168.250.19:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-04 | https://192.168.250.20:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-05 | https://192.168.250.27:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-06 | https://192.168.250.28:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-07 | https://192.168.250.21:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-08 | https://192.168.250.22:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-09 | https://192.168.250.23:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-11 | https://192.168.250.25:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-12 | https://192.168.250.26:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-13 | https://192.168.250.29:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-14 | https://192.168.250.30:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+

That's good to hear.

If you have time, it would be good to flag this issue over on the snapd forum (https://forum.snapcraft.io/) and see if they have any recommendations on how to solve it without a reboot.

I’d be interested to know the best way to avoid that situation and/or how to get out of it.

Yes, it was scary.

I will now read your recommendations at Managing the LXD snap and see which strategy would be best for my client.
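
For my own notes, the snap-level knobs those recommendations revolve around look roughly like this (a sketch; the exact channel and refresh window are just examples):

    # Pin every cluster member to one channel and keep them in the same
    # cohort, so they all refresh to identical revisions:
    snap switch lxd --channel=4.19/stable --cohort=+
    snap refresh lxd

    # Optionally restrict when snapd may auto-refresh, e.g. to a
    # Monday-morning maintenance window:
    snap set system refresh.timer=mon,02:00-04:00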

I have two other clusters which I have already switched to the 4.19/stable channel, but even that would not have avoided this problem.

This is the first time in several years of running various LXD clusters that I have run into this kind of issue.
I will also have to evaluate breaking up the cluster and managing each of the nodes separately. I would lose a bit of functionality, but it would not be complicated to duplicate it in another application-level layer.

kind regards