LXD cluster broken since last snap update: ( lxd 4.19 21723 )

Atif_Ghaffar · October 20, 2021, 9:15am

Hi,

One of my production clusters has been in a broken state for around 3 days.
All containers are available and working but the lxd daemon and commands are not working.

The cluster has 14 nodes and all of them were tracking latest/stable channel. It uses local storage ( zfs ).

I would like to bring it up without stopping the containers ( production mail servers ).

I have a detailed report of all 14 servers with the output of the following commands.

snap info lxd | egrep '(installed|refresh-date|snap-id|tracking):'
snap list --all lxd
snap changes lxd
snap change num ( if the previous cmd produced a result )
sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM nodes;"

The report is available here. http://159.69.12.22/ws-vh-lxd-report.txt
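
A report like this could be gathered with a small loop over the cluster members. The following is only a sketch, not the script that produced the report above: it assumes root SSH access to each host, and the function names collect_report and run_on_host are made up for illustration.

```shell
# Sketch: run the diagnostics listed above on each cluster member.
# Assumption: passwordless SSH as root to every host.

# Run one command string on one host. Kept separate so a different
# transport can be swapped in.
run_on_host() {
    ssh -n "root@$1" "$2"
}

# Print a labelled section per host and command.
collect_report() {
    for member in "$@"; do
        for cmd in \
            "snap info lxd | grep -E '(installed|refresh-date|snap-id|tracking):'" \
            "snap list --all lxd" \
            "snap changes lxd" \
            "sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin 'SELECT * FROM nodes;'"
        do
            printf '%s > %s\n' "$member" "$cmd"
            run_on_host "$member" "$cmd" 2>&1
            printf '\n'
        done
    done
}

# Usage:
#   collect_report ws-vh-00 ws-vh-01 ws-vh-02 > lxd-report.txt
```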

Thanks for your help and kind regards

–

Atif

tomp · October 20, 2021, 9:21am

Looks like ws-vh-05 is stuck on an old revision 21624:

ws-vh-05 > snap info lxd | egrep '(installed|refresh-date|snap-id|tracking):'
-------------------------------------------------------------------------------
snap-id:  J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
installed:          4.19                   (21624) 76MB disabled,in-cohort
-------------------------------------------------------------------------------


ws-vh-05 > snap list --all lxd
--------------------------------
Name  Version  Rev    Tracking       Publisher   Notes
lxd   4.18     21497  latest/stable  canonical*  disabled,in-cohort
lxd   4.19     21624  latest/stable  canonical*  disabled,in-cohort
--------------------------------


ws-vh-05 > snap changes lxd
-----------------------------
ID   Status  Spawn                      Ready  Summary
61   Abort   5 days ago, at 07:57 CEST  -      Refresh "lxd" snap

-----------------------------
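
The condition shown above, where every installed revision carries the "disabled" note, can be detected mechanically. This is a minimal sketch that assumes the column layout of snap list --all as printed in this thread (revision in the third column, notes in the last); active_revision is a hypothetical helper name.

```shell
# Read `snap list --all <snap>` output on stdin and print the revision
# that is currently enabled. Returns non-zero when every installed
# revision is marked "disabled".
active_revision() {
    awk 'NR > 1 && NF >= 6 && $NF !~ /(^|,)disabled(,|$)/ {
             print $3
             found = 1
         }
         END { exit (found ? 0 : 1) }'
}

# Usage:
#   snap list --all lxd | active_revision || echo "no enabled lxd revision"
```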

tomp · October 20, 2021, 9:22am

Can you try this:

tomp · October 20, 2021, 9:25am

Also ws-vh-00, ws-vh-01, ws-vh-03, and ws-vh-08 don’t appear to be in the cluster cohort, so please switch them to the cohort using:

snap switch lxd --cohort=+

followed by snap refresh lxd
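
The two steps can be wrapped so that the refresh is only attempted if the switch succeeded. A small sketch; the function name is made up for illustration, and it would need to be run as root on each affected host.

```shell
# Move this host's LXD snap into the cluster cohort, then refresh.
# Stops at the first failure, for example when snapd reports that
# another change is already in progress for the snap.
join_cohort_and_refresh() {
    snap switch lxd --cohort=+ || return 1
    snap refresh lxd
}

# Usage:
#   join_cohort_and_refresh || echo "switch or refresh failed on $(hostname)"
```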

Atif_Ghaffar · October 20, 2021, 9:25am

Hi tomp,

I tried on two of the servers.

on ws-vh-00

I get the following

root@ws-vh-00:~# snap switch lxd --cohort=+
"lxd" switched to the "+" cohort

root@ws-vh-00:~# snap refresh lxd
snap "lxd" has no updates available

on ws-vh-09, I get the following

root@ws-vh-09:~# snap switch lxd --cohort=+
error: snap "lxd" has "refresh-snap" change in progress

tomp · October 20, 2021, 9:27am

Make sure all nodes are in the cohort and enabled, as the other nodes may be waiting for the others to get up to the same revision.
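
One way to check this per node is to classify the "installed:" line of snap info lxd. The sketch below assumes the line format shown in this thread (revision in parentheses as the third field, notes as the last field); cohort_state is a hypothetical helper name.

```shell
# Read `snap info lxd` output on stdin and print the installed revision
# followed by one of: ok, not-in-cohort, disabled.
cohort_state() {
    awk '/^installed:/ {
        rev = $3
        gsub(/[()]/, "", rev)          # "(21624)" -> "21624"
        notes = $NF                    # e.g. "disabled,in-cohort" or "-"
        if (notes ~ /(^|,)disabled(,|$)/)       state = "disabled"
        else if (notes ~ /(^|,)in-cohort(,|$)/) state = "ok"
        else                                     state = "not-in-cohort"
        print rev, state
    }'
}

# Usage:
#   snap info lxd | cohort_state
```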

tomp · October 20, 2021, 9:30am

Appears to be related to

Atif_Ghaffar · October 20, 2021, 9:37am

How can I bring them to enabled?
Most of them are set as disabled.
Full list here.
http://159.69.12.22/ws-vh-snap-list-all-lxd.txt

tomp · October 20, 2021, 9:40am

No they are not.

Only ws-vh-05 shows as (21624) 76MB disabled,in-cohort in snap info lxd.

There also are others that show (21723) 76MB - that should show (21723) 76MB in-cohort.

tomp · October 20, 2021, 9:40am

I don’t know why ws-vh-05 shows as disabled though, I’m afraid.

Any ideas @stgraber?

Atif_Ghaffar · October 20, 2021, 9:45am

@tomp here is the output of snap switch on all hosts.

ws-vh-00: snap switch lxd --cohort=+
--------------------------------------------------
"lxd" switched to the "+" cohort

ws-vh-01: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "auto-refresh" change in progress

ws-vh-02: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-03: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-04: snap switch lxd --cohort=+
--------------------------------------------------
"lxd" switched to the "+" cohort

ws-vh-05: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-06: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-07: snap switch lxd --cohort=+
--------------------------------------------------
No change switch (no-op)
"lxd" switched to the "+" cohort

ws-vh-08: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "auto-refresh" change in progress

ws-vh-09: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-11: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-12: snap switch lxd --cohort=+
--------------------------------------------------
No change switch (no-op)
"lxd" switched to the "+" cohort

ws-vh-13: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress

ws-vh-14: snap switch lxd --cohort=+
--------------------------------------------------
error: snap "lxd" has "refresh-snap" change in progress
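
The "change in progress" errors mean snapd still has an unfinished change recorded for the snap on those hosts. A sketch for listing such changes from snap changes output follows; the set of statuses treated as unfinished is an assumption based on the ones visible in this thread (Abort, Doing) plus closely related ones, and pending_changes is a hypothetical helper name.

```shell
# Read `snap changes [<snap>]` output on stdin and print the ID and
# status of every change that has not reached a final state.
# Assumption: Do, Doing, Abort, Undo and Undoing count as unfinished.
pending_changes() {
    awk 'NR > 1 && $2 ~ /^(Do|Doing|Abort|Undo|Undoing)$/ { print $1, $2 }'
}

# Usage:
#   snap changes lxd | pending_changes
```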

Atif_Ghaffar · October 20, 2021, 9:28pm

@tomp any idea how I can tell snap to run the changes again on ws-vh-05?
Would a reboot of the host help? I am afraid that the containers may not come up after the reboot.
A reboot of this single host is not out of the question though, but I will just need some time to move the data from the containers to somewhere else.

thanks.

tomp · October 21, 2021, 10:06am

A reboot would likely fix it, but I am hesitant to suggest it in case it doesn’t and your containers then don’t start.

I’d like to get snap working without a reboot if possible.

Can you try systemctl restart snapd first?

Atif_Ghaffar · October 21, 2021, 11:59am

OK, so here is the current status. I managed to move the most important data from the containers on that host elsewhere.

snap status

root@ws-vh-05:~# snap list --all
Name    Version   Rev    Tracking       Publisher   Notes
core18  20210611  2074   latest/stable  canonical✓  base,disabled
core18  20210722  2128   latest/stable  canonical✓  base
core20  20210702  1081   latest/stable  canonical✓  base,disabled
core20  20210928  1169   latest/stable  canonical✓  base
lxd     4.18      21497  latest/stable  canonical✓  disabled,in-cohort
lxd     4.19      21624  latest/stable  canonical✓  disabled,in-cohort
snapd   2.51.7    13170  latest/stable  canonical✓  snapd,disabled
snapd   2.52      13270  latest/stable  canonical✓  snapd,disabled


root@ws-vh-05:~# snap changes
ID   Status  Spawn                      Ready  Summary
61   Abort   6 days ago, at 07:57 CEST  -      Refresh "lxd" snap
62   Doing   yesterday at 15:10 CEST    -      Auto-refresh snap "snapd"


root@ws-vh-05:~# snap tasks 62
Status  Spawn                    Ready                    Summary
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Ensure prerequisites for "snapd" are available
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Download snap "snapd" (13640) from channel "latest/stable"
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Fetch and check assertions for snap "snapd" (13640)
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Mount snap "snapd" (13640)
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Run pre-refresh hook of "snapd" snap if present
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Stop snap "snapd" services
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Remove aliases for snap "snapd"
Done    yesterday at 15:10 CEST  yesterday at 15:10 CEST  Make current revision for snap "snapd" unavailable
Doing   yesterday at 15:10 CEST  -                        Copy snap "snapd" data
Do      yesterday at 15:10 CEST  -                        Setup snap "snapd" (13640) security profiles
Do      yesterday at 15:10 CEST  -                        Make snap "snapd" (13640) available to the system
Do      yesterday at 15:10 CEST  -                        Automatically connect eligible plugs and slots of snap "snapd"
Do      yesterday at 15:10 CEST  -                        Set automatic aliases for snap "snapd"
Do      yesterday at 15:10 CEST  -                        Setup snap "snapd" aliases
Do      yesterday at 15:10 CEST  -                        Run post-refresh hook of "snapd" snap if present
Do      yesterday at 15:10 CEST  -                        Start snap "snapd" (13640) services
Do      yesterday at 15:10 CEST  -                        Remove data for snap "snapd" (13170)
Do      yesterday at 15:10 CEST  -                        Remove snap "snapd" (13170) from the system
Do      yesterday at 15:10 CEST  -                        Clean up "snapd" (13640) install
Do      yesterday at 15:10 CEST  -                        Run health check of "snapd" snap
Doing   yesterday at 15:10 CEST  -                        Handling re-refresh of "snapd" as needed
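
In a long task list like this, the tasks still marked Doing are the ones worth looking at first. A minimal sketch for pulling out their summaries, assuming columns are separated by two or more spaces as in the output above; running_tasks is a hypothetical helper name.

```shell
# Read `snap tasks <change-id>` output on stdin and print the summary
# of every task whose status is "Doing".
# Assumption: columns are separated by runs of two or more spaces.
running_tasks() {
    awk -F'  +' '$1 == "Doing" { print $NF }'
}

# Usage:
#   snap tasks 62 | running_tasks
```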

So even the snapd update itself seems a bit broken. Now I will try your command.

root@ws-vh-05:~# systemctl restart snapd

It hung for a while and ended with the message

Job for snapd.service canceled.

… In syslog I can see the following

Oct 21 13:55:08 ws-vh-05 systemd[1]: Stopping Snap Daemon...
Oct 21 13:55:08 ws-vh-05 snapd[3692394]: main.go:155: Exiting on terminated signal.

Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Killing process 3692394 (snapd) with signal SIGKILL.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Failed with result 'timeout'.
Oct 21 13:56:38 ws-vh-05 systemd[1]: Stopped Snap Daemon.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Found left-over process 569368 (sync) in control group while starting unit. Ignoring.
Oct 21 13:56:38 ws-vh-05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: snapd.service: Found left-over process 237710 (sync) in control group while starting unit. Ignoring.
Oct 21 13:56:38 ws-vh-05 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 21 13:56:38 ws-vh-05 systemd[1]: Starting Snap Daemon...
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: AppArmor status: apparmor is enabled and all features are available
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: AppArmor status: apparmor is enabled and all features are available
Oct 21 13:56:38 ws-vh-05 snapd[2166343]: daemon.go:242: started snapd/2.52 (series 16; classic) ubuntu/20.04 (amd64) linux/5.4.0-81-generic.
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: daemon.go:335: adjusting startup timeout by 50s (pessimistic estimate of 30s plus 5s per snap)
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: helpers.go:236: removed stale connections: lxd:lxd-support core:lxd-support, lxd:network core:network, lxd:network-bind core:network-bind, lxd:system-observe core:system-observe
Oct 21 13:56:39 ws-vh-05 snapd[2166343]: main.go:155: Exiting on terminated signal.


Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Killing process 2166343 (snapd) with signal SIGKILL.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Failed with result 'timeout'.
Oct 21 13:58:09 ws-vh-05 systemd[1]: Stopped Snap Daemon.
Oct 21 13:58:09 ws-vh-05 systemd[1]: snapd.service: Triggering OnFailure= dependencies.

Does anything in these messages help?
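
The log mentions left-over sync processes in the snapd control group. One possible explanation, which is only a guess and is not confirmed anywhere in this thread, is that they are blocked on I/O. Processes in uninterruptible sleep (state D) can be listed as sketched below; blocked_processes is a hypothetical helper name.

```shell
# Read `ps -e -o pid= -o stat= -o comm=` output on stdin and print the
# PID and command name of every process whose state starts with "D"
# (uninterruptible sleep, usually waiting on I/O).
blocked_processes() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage:
#   ps -e -o pid= -o stat= -o comm= | blocked_processes
```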

Atif_Ghaffar · October 21, 2021, 12:22pm

So, I finally rebooted and then power cycled the server. Let’s see what it gives.

Atif_Ghaffar · October 21, 2021, 12:36pm

After the power cycle, I get the following state

root@ws-vh-05:~# snap list --all lxd
Name  Version  Rev    Tracking       Publisher   Notes
lxd   4.18     21497  latest/stable  canonical✓  disabled,in-cohort
lxd   4.19     21624  latest/stable  canonical✓  in-cohort

So now I run

root@ws-vh-05:~# snap refresh lxd
lxd 4.19 from Canonical✓ refreshed

and finally

root@ws-vh-05:~# lxc cluster list
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
|   NAME   |             URL             |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-00 | https://192.168.250.16:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-01 | https://192.168.250.17:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-02 | https://192.168.250.18:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-03 | https://192.168.250.19:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-04 | https://192.168.250.20:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-05 | https://192.168.250.27:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-06 | https://192.168.250.28:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-07 | https://192.168.250.21:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-08 | https://192.168.250.22:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-09 | https://192.168.250.23:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-11 | https://192.168.250.25:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-12 | https://192.168.250.26:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-13 | https://192.168.250.29:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| ws-vh-14 | https://192.168.250.30:8443 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+----------+-----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
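
With 14 members, a table like this is easier to verify with a filter than by eye. The sketch below assumes the default table layout of lxc cluster list in LXD 4.19 as printed above (NAME in the first column, STATE in the seventh); offline_members is a hypothetical helper name.

```shell
# Read the table output of `lxc cluster list` on stdin and print the
# name of every member whose STATE column is not ONLINE.
offline_members() {
    awk -F'|' 'NF > 1 && $2 !~ /NAME/ {
        name = $2
        state = $8
        gsub(/ /, "", name)
        gsub(/ /, "", state)
        if (state != "ONLINE") print name
    }'
}

# Usage:
#   lxc cluster list | offline_members
```

No output means every member reports ONLINE.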

tomp · October 21, 2021, 2:46pm

That’s good to hear.

If you have time it would be good to flag this issue over on the snapd forum https://forum.snapcraft.io/ and see if they have any recommendations on how to solve it without a reboot.

I’d be interested to know the best way to avoid that situation and/or how to get out of it.

Atif_Ghaffar · October 21, 2021, 3:38pm

Yes, it was scary.

I will now read your recommendations at Managing the LXD snap and see which strategy would be best for my client.

I have two other clusters which I have already switched to the 4.19/stable channel, but this problem could not have been avoided even with that.

It is the first time in a few years of running various LXD clusters that I have run into this kind of issue.
I will also have to evaluate breaking up the cluster and managing each of the nodes separately. I would lose a bit of functionality, but that is not complicated to duplicate in another application-level layer.

kind regards