Complete cluster failure after snap auto refresh

I have a 5-node cluster and all of the nodes are now showing:

time="2022-06-01T20:10:37Z" level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"

I’ve rebooted the nodes and had to do some snap abort <id>, and snap refresh lxd just hangs with that error.
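Roughly the sequence I’ve been running on each node (the change ID comes from snap changes):

snap changes          # find the stuck lxd refresh change
snap abort <id>       # abort it
reboot
snap refresh lxd      # then it hangs with the warning above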

I’m not sure what to do at this point besides a complete rebuild.

Any thoughts?

Hi @nateybobo,
I’m not completely sure, but you should check the LXD version on every node; if I remember correctly, all cluster members must be running the same version of LXD.

sudo snap refresh lxd --channel=latest/stable
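To confirm they all match, something like this should show the installed revision on every member (node1…node5 are just placeholders for your hostnames):

for host in node1 node2 node3 node4 node5; do
    echo "== $host =="
    ssh "$host" snap list lxd
done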

Regards.

That was the original problem. All my nodes “auto” refreshed at once. During that time, my primary and secondary DNS servers were knocked offline, and then the others failed.

So, a chain of events caused a complete failure of all nodes to update. Now, even if I hardcode 8.8.8.8 as the DNS server and then run the refresh, I still can’t get snap to start LXD.

It just hangs with:
level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"
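For reference, by “hardcode 8.8.8.8” I mean something like this on each node (assuming systemd-resolved; eno1 is just an example interface name):

resolvectl dns eno1 8.8.8.8     # point the link straight at Google DNS
resolvectl status eno1          # confirm the override took effect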

Maybe you can check the status of the service with systemctl status snap.lxd.daemon; restarting it may solve the issue.
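For example:

systemctl status snap.lxd.daemon         # check whether the daemon is running
sudo systemctl restart snap.lxd.daemon   # restart it via systemd
sudo snap restart lxd                    # or equivalently via snap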
Regards.

Same issue I suppose.
https://discuss.linuxcontainers.org/t/snap-lxd-auto-refresh-stuck-on-copy-snap-lxd-data/14238/2

Unfortunately for me, a reboot hasn’t solved my issues :frowning:

Can you show snap list lxd on all systems, as well as lxd sql global "SELECT * FROM nodes"?

All nodes are at:
installed: 5.2-79c3c3b (23155) 106MB -

All nodes reflect:

root@nuc-server-1:~# lxd sql global "SELECT * FROM nodes"
+----+--------------+-------------+--------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| id |     name     | description |      address       | schema | api_extensions |           heartbeat            | state | arch | failure_domain_id |
+----+--------------+-------------+--------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| 30 | nuc-server-4 |             | 192.168.98.20:8443 | 60     | 313            | 2022-06-01T22:55:35.381531796Z | 0     | 2    | 4                 |
| 32 | dellt30      |             | 192.168.98.21:8443 | 60     | 313            | 2022-06-01T22:55:29.238058616Z | 0     | 2    | 2                 |
| 36 | nuc-server-2 |             | 192.168.98.18:8443 | 60     | 312            | 2022-05-30T18:59:20.071988675Z | 0     | 2    | <nil>             |
| 37 | nuc-server-3 |             | 192.168.98.19:8443 | 60     | 313            | 2022-06-01T22:55:34.815804603Z | 0     | 2    | <nil>             |
| 38 | nuc-server-1 |             | 192.168.98.17:8443 | 60     | 313            | 2022-06-01T22:55:35.279264703Z | 0     | 2    | <nil>             |
| 39 | p700         |             | 192.168.98.16:8443 | 60     | 313            | 2022-06-01T22:55:36.036947164Z | 0     | 2    | <nil>             |
+----+--------------+-------------+--------------------+--------+----------------+--------------------------------+-------+------+-------------------+

It looks like the nodes are actually stuck mid-refresh… hmmm

root@nuc-server-3:~# snap changes
ID   Status  Spawn               Ready               Summary
82   Done    today at 16:20 UTC  today at 16:20 UTC  Running service command
83   Done    today at 16:21 UTC  today at 16:21 UTC  Running service command
84   Done    today at 16:36 UTC  today at 16:36 UTC  Change configuration of "core" snap
85   Done    today at 16:36 UTC  today at 16:36 UTC  Change configuration of "core" snap
86   Undone  today at 16:36 UTC  today at 17:36 UTC  Refresh "lxd" snap
87   Done    today at 17:40 UTC  today at 17:40 UTC  Running service command
88   Done    today at 17:41 UTC  today at 17:41 UTC  Running service command
89   Done    today at 17:42 UTC  today at 17:42 UTC  Change configuration of "core" snap
90   Done    today at 17:42 UTC  today at 17:42 UTC  Change configuration of "core" snap
91   Undone  today at 17:44 UTC  today at 19:16 UTC  Refresh "lxd" snap
92   Done    today at 17:57 UTC  today at 17:57 UTC  Change configuration of "core" snap
93   Done    today at 17:57 UTC  today at 17:57 UTC  Change configuration of "core" snap
94   Done    today at 18:42 UTC  today at 18:45 UTC  Auto-refresh snaps "core20", "snapd"
95   Done    today at 19:22 UTC  today at 20:22 UTC  Refresh "lxd" snap
96   Done    today at 20:43 UTC  today at 20:52 UTC  Revert "lxd" snap
97   Done    today at 20:56 UTC  today at 20:56 UTC  Running service command
98   Done    today at 20:56 UTC  today at 20:56 UTC  Running service command
99   Done    today at 20:57 UTC  today at 20:57 UTC  Running service command
100  Done    today at 21:16 UTC  today at 21:16 UTC  Running service command
101  Done    today at 22:11 UTC  today at 22:11 UTC  Running service command
102  Doing   today at 22:19 UTC  -                   Refresh "lxd" snap
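Drilling into the in-flight change should show which task it’s sitting on (102 is the change ID from the list above):

snap change 102     # list the individual tasks of the change and their statuses
snap watch 102      # block until the change finishes (or keeps hanging)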

The refresh won’t complete until it’s completed on all members.

Can you also confirm you don’t have a lxd.debug build binary installed on any members?
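If memory serves, a sideloaded debug binary lives in the snap’s common directory, so something like this on each member should confirm (path from memory, treat it as an assumption):

ls -la /var/snap/lxd/common/lxd.debug 2>/dev/null || echo "no lxd.debug on this member"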

It seems that I’m stuck in a chicken-egg situation.

The refresh won’t finish until other cluster nodes upgrade, but I can’t upgrade the other cluster nodes until I can get over this error.

level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"

Would lxd.debug be manually installed? I don’t think I have it, and I never purposely installed it.


Are you running refresh on them all?

Since half of my nodes auto refreshed (and broke), I manually refreshed the ones that missed it. So, yes?

It’s indeed normal for the refresh to hang.

In your case, this system is behind:

| 36 | nuc-server-2 |             | 192.168.98.18:8443 | 60     | 312            | 2022-05-30T18:59:20.071988675Z | 0     | 2    | <nil>             |

What’s going on with that system? It looks like it hasn’t been able to reach the others in the past day or so.
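A narrower query makes the straggler easier to spot, e.g.:

lxd sql global "SELECT name, api_extensions, heartbeat FROM nodes ORDER BY heartbeat"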

The update killed my DNS server containers, which reside on my cluster, and when the refresh happened it couldn’t do any DNS lookups…

I’m tempted to hack at the sql database just to get things running again…

At this point, I can’t place that node in maintenance mode.

What’s the lack of DNS breaking on that machine? It can’t download the updated snap?

I’m also wondering why the container died in the first place. Updates don’t take down instances.
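You could sanity-check that the snap store is reachable from that node with something like (using your hardcoded DNS):

nslookup api.snapcraft.io 8.8.8.8           # resolve the store via Google DNS
snap download lxd --channel=latest/stable   # downloads the snap to the current directory if the store is reachable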

Perhaps this was my issue:

-- Subject: A stop job for unit snap.lxd.daemon.service has finished
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit snap.lxd.daemon.service has finished.
--
-- The job identifier is 2831 and the job result is done.
Jun 02 04:29:57 nuc-server-2 systemd[1]: Started Service for snap application lxd.daemon.
-- Subject: A start job for unit snap.lxd.daemon.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit snap.lxd.daemon.service has finished successfully.
--
-- The job identifier is 2831.
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: => Preparing the system (23155)
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Loading snap configuration
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Setting up mntns symlink (mnt:[4026532328])
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Setting up kmod wrapper
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Preparing /boot
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Preparing a clean copy of /run
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Preparing /run/bin
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Preparing a clean copy of /etc
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Preparing a clean copy of /usr/share/misc
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Setting up ceph configuration
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Setting up LVM configuration
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Setting up OVN configuration
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4478]: ==> Rotating logs
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4598]: error: Compressing program wrote following message to stderr when compressing log /var/snap/lxd/common/lxd/logs/lxd.log.1:
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4598]: gzip: stdin: warning: file timestamp out of range for gzip format
Jun 02 04:29:57 nuc-server-2 lxd.daemon[4598]: error: failed to compress log /var/snap/lxd/common/lxd/logs/lxd.log.1
Jun 02 04:29:57 nuc-server-2 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE

I moved the logs out of the way and it finally started up:
mv /var/snap/lxd/common/lxd/logs/lxd.log* logs/
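In full, the workaround was roughly this (logs/ is just a scratch directory I created beforehand):

mkdir -p logs
mv /var/snap/lxd/common/lxd/logs/lxd.log* logs/   # move the files gzip refuses to compress out of the way
systemctl restart snap.lxd.daemon                 # then restart the daemon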

Weird