Snap Update to 4.1 Broke my cluster

Hi there. No need to start a new discussion, I have the same issue.
I suspect the cluster did not break at the last update (it was still fine last Friday) but at the last restart, which happened between Friday the 29th of May and today, the 2nd of June.
Anyway, the problem now is that none of the LXD daemons on any of the three machines will start, and "snap set lxd daemon.debug=true && systemctl restart snap.lxd.daemon" never reaches a usable end.
The daemon still refuses to start on any of the three machines of my cluster…

@geodb27 please paste the lxd.log file of all three machines.

The three machines keep displaying:

Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=x.x.x.1
Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=x.x.x.2
Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=x.x.x.3

In a loop. There is nothing more at all…
And, as I said before:
"snap set lxd daemon.debug=true && systemctl restart snap.lxd.daemon"
never gives back the prompt, so they all seem to be locked in an infinite loop.

@geodb27 please attach the full lxd.log of all three machines.

@geodb27 thanks for sending me the logs by email (as per IRC conversation).

Please can you paste the output of:

sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"

run on all three nodes?

There shouldn’t be any sensitive information there (you can obfuscate the IPs if that matters).

Please also enable debug logging on all three nodes (snap set lxd daemon.debug=true), then restart all three nodes and send me all three logs again. Only one of the logs you sent me has debug turned on.
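For example, something along these lines on each node (assuming the usual snap paths; adjust if your setup differs):

# Dump the local raft configuration (read-only, safe to run)
sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"

# Enable debug logging and restart the daemon
snap set lxd daemon.debug=true
systemctl restart snap.lxd.daemon

# Then send me /var/snap/lxd/common/lxd/logs/lxd.log from each node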

Results of sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
On the first node:

1|172.26.194.40:8443|0
2|172.26.194.41:8443|0
3|172.26.194.42:8443|0

On the second one:

1|172.26.194.40:8443|0
2|172.26.194.41:8443|0
3|172.26.194.42:8443|0

And on the third:

1|172.26.194.40:8443|0
2|172.26.194.41:8443|0
3|172.26.194.42:8443|0

You will receive an email soon with the logs as requested, provided I can turn debugging on, since the quoted command still does not give back the prompt.

Unfortunately the logs still don’t contain debug messages. I suspect systemctl is not killing the process. Please can you try to kill the process by hand and then restart the daemon? Something like pkill -9 lxd, or ps aux|grep lxd to get the PID and then kill -9 PID. Make sure the lxd process is no longer there, then restart and send me the logs.
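For example, something like this (the exact order matters less than making sure no lxd process survives before the restart):

systemctl stop snap.lxd.daemon
pkill -9 lxd
ps aux | grep lxd    # confirm nothing is left besides the grep itself
systemctl restart snap.lxd.daemon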

@freeekanayaka

Here are the startup debug logs from all three of my servers.
lxdhome01 - https://pastebin.com/Q1Nng27A
lxdlab01 - https://pastebin.com/Ye2HwbV1
lxdlab02 - https://pastebin.com/uZ1gwJ8Q

In order to be able to restart all three lxd processes, I had to do the following on each machine:
$ snap stop lxd
(wait a while)
$ ps auxf | grep lx
Then I killed all the relevant processes.
After that, I could issue on each machine (the ps command above no longer reported any process):
$ snap set lxd daemon.debug=true && systemctl restart snap.lxd.daemon

On machines 2 and 3 this went "fine" (the set of commands gave back the prompt after a while); it did not on machine 1.
On each of the three machines, here is now the output of the above ps command:

root 56857 0.0 0.0 4640 1740 ? Ss 17:00 0:00 /bin/sh /snap/lxd/15223/commands/daemon.start
root 56973 49.0 2.4 1559628 98556 ? Sl 17:00 4:32 _ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd --debug
root 56974 0.0 0.7 1157528 32200 ? Sl 17:00 0:00 _ lxd waitready
root 56975 0.0 0.0 4640 992 ? S 17:00 0:00 _ /bin/sh /snap/lxd/15223/commands/daemon.start
root 56961 0.0 0.0 97808 1644 ? Sl 17:00 0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid

So, I guess that all three of the processes are now in debug mode.

As before, I’ll issue the tar command and send the three logs by mail.

Again, thanks for your kind help!

I’m not entirely sure, but you have probably been hit by a bug that should now be fixed in LXD 4.2 and that might affect clusters that had only 2 voters.

You could try to roughly follow:

https://lxd.readthedocs.io/en/latest/clustering/#recover-from-quorum-loss

but with a few more repair steps than what is described there.

Specifically:

  1. Back up the /var/snap/lxd/common/lxd/database directory on each of the nodes
  2. Shut down all three LXD daemons and make sure the process is really gone (e.g. ps aux|grep lxd)
  3. Pick one of the two voters you have (e.g. 192.168.100.91:8443), and then run lxd cluster recover-from-quorum-loss on it.
  4. Start the LXD daemon again on the node you picked, then run lxd sql global "DELETE FROM nodes_roles WHERE node_id != (SELECT id FROM nodes WHERE address = '192.168.100.91:8443')". Make sure to use the address of the node you picked in step 3 as the parameter for that query.
  5. Confirm that the nodes_roles table has just one entry, matching your selected node: lxd sql global "SELECT node_id, role, address FROM nodes_roles JOIN nodes ON node_id=nodes.id"
  6. Remove the /var/snap/lxd/common/lxd/database/global directory on the other two nodes
  7. Restart the other two nodes

Hopefully that should bring the cluster back.
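
As a rough shell outline of the above (using 192.168.100.91:8443 as the picked node and an arbitrary backup location; adapt addresses and paths to your cluster):

# 1. On every node: back up the database directory
cp -a /var/snap/lxd/common/lxd/database /root/lxd-database-backup

# 2. On every node: stop LXD and check that no process is left
systemctl stop snap.lxd.daemon
ps aux | grep lxd

# 3. + 4. On the picked voter only (192.168.100.91 here)
lxd cluster recover-from-quorum-loss
systemctl start snap.lxd.daemon
lxd sql global "DELETE FROM nodes_roles WHERE node_id != (SELECT id FROM nodes WHERE address = '192.168.100.91:8443')"

# 5. Still on the picked node: should return exactly one row
lxd sql global "SELECT node_id, role, address FROM nodes_roles JOIN nodes ON node_id=nodes.id"

# 6. + 7. On the other two nodes: drop the global database and restart
rm -rf /var/snap/lxd/common/lxd/database/global
systemctl start snap.lxd.daemon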

From the logs it indeed seems you are in a similar situation to @hobiga. I’d suggest waiting for him to try out the recovery procedure I described above; if it’s successful you could then go through the same steps (you’d only have to replace the address, e.g. use 172.26.194.40:8443 instead of 192.168.100.91:8443).

@freeekanayaka

Thank you, those instructions did indeed bring back my cluster. My only remaining concern is that I don’t see the database being propagated to the other cluster hosts.

lxc cluster list
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| NAME      | URL                         | DATABASE | STATE  | MESSAGE           | ARCHITECTURE |
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| lxdhome01 | https://192.168.100.92:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| lxdlab01  | https://192.168.100.91:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| lxdlab02  | https://192.168.100.93:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------+-----------------------------+----------+--------+-------------------+--------------+

I ran the commands on the .91 host (lxdlab01) and in those logs I’m seeing these messages:

Jun 2 11:13:59 lxdlab01 lxd.daemon[28423]: t=2020-06-02T11:13:59-0600 lvl=dbug msg="Starting heartbeat round"
Jun 2 11:13:59 lxdlab01 lxd.daemon[28423]: t=2020-06-02T11:13:59-0600 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:3 Address:192.168.100.91:8443 Role:voter} {ID:1 Address:192.168.100.93:8443 Role:stand-by}]"
Jun 2 11:13:59 lxdlab01 lxd.daemon[28423]: t=2020-06-02T11:13:59-0600 lvl=info msg="Upgrading 1 nodes not part of raft configuration"

and on my .93 host (lxdlab02) I see:
Jun 2 11:15:05 lxdlab02 lxd.daemon[28349]: t=2020-06-02T11:15:05-0600 lvl=dbug msg="Rebalance cluster"
Jun 2 11:15:11 lxdlab02 lxd.daemon[28349]: t=2020-06-02T11:15:11-0600 lvl=dbug msg="Found cert" name=0
Jun 2 11:15:11 lxdlab02 lxd.daemon[28349]: t=2020-06-02T11:15:11-0600 lvl=dbug msg="Replace current raft nodes with [{ID:3 Address:192.168.100.91:8443 Role:voter} {ID:1 Address:192.168.100.93:8443 Role:stand-by}]"

And it’s just constantly looping those messages. I’m not sure whether they are expected and the database will eventually propagate; I’m just trying to make sure I get everything back to a normal configuration at this point.

This doesn’t look normal. I’ll give it a closer look tomorrow.

Well, thank you so much for your kind help @freeekanayaka.
I followed the steps you gave cautiously, and now, my cluster is just fine.
There are just two things I’d like to comment on:

1. The verification query from step 5 did not return any result at any point, neither before nor after entering the commands from the 3 earlier points. However, lxc cluster list now shows things back online:

+-----------+----------------------------+----------+--------+-------------------+--------------+
| NAME      | URL                        | DATABASE | STATE  | MESSAGE           | ARCHITECTURE |
+-----------+----------------------------+----------+--------+-------------------+--------------+
| vm-lxd1-1 | https://172.26.194.40:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------+----------------------------+----------+--------+-------------------+--------------+
| vm-lxd1-2 | https://172.26.194.41:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------+----------------------------+----------+--------+-------------------+--------------+
| vm-lxd1-3 | https://172.26.194.42:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------+----------------------------+----------+--------+-------------------+--------------+

2. All three of my lxd daemons are in debug mode, and on all three, here are the logs I get:

t=2020-06-03T09:10:55+0200 lvl=dbug msg="Starting heartbeat round"
t=2020-06-03T09:10:55+0200 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:1 Address:172.26.194.40:8443 Role:voter} {ID:2 Address:172.26.194.41:8443 Role:voter} {ID:3 Address:172.26.194.42:8443 Role:voter}]"
t=2020-06-03T09:10:57+0200 lvl=dbug msg="Rebalance cluster"
t=2020-06-03T09:10:58+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.41:8443"
t=2020-06-03T09:10:58+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.41:8443"
t=2020-06-03T09:10:58+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.41:8443"
t=2020-06-03T09:10:59+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.42:8443"
t=2020-06-03T09:10:59+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.42:8443"
t=2020-06-03T09:11:00+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.42:8443"
t=2020-06-03T09:11:00+0200 lvl=dbug msg="Completed heartbeat round"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Starting heartbeat round"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:1 Address:172.26.194.40:8443 Role:voter} {ID:2 Address:172.26.194.41:8443 Role:voter} {ID:3 Address:172.26.194.42:8443 Role:voter}]"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Rebalance cluster"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.42:8443"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.42:8443"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.42:8443"
t=2020-06-03T09:11:07+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.41:8443"
t=2020-06-03T09:11:07+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.41:8443"
t=2020-06-03T09:11:08+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.41:8443"
t=2020-06-03T09:11:08+0200 lvl=dbug msg="Completed heartbeat round"

So, again, thanks a lot for your kind help.

Best regards.

This is quite weird. I don’t see how you can get "YES" in the "DATABASE" column above and not have rows in the nodes_roles table.

Please can you try again to run:

lxd sql global "SELECT * FROM nodes; SELECT * FROM nodes_roles"

I would expect 2 rows to be returned, one for 172.26.194.41 and one for 172.26.194.42.

Assuming that is the case, you are missing a row for 172.26.194.40. You can add it manually with:

lxd sql global "INSERT INTO nodes_roles(node_id, role) VALUES((SELECT id FROM nodes WHERE address = '172.26.194.40:8443'), 0)"

and after that you should be back to normal.
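
Once the INSERT has been run, the same verification query from step 5 of the recovery procedure should show a role row for every node, e.g.:

lxd sql global "SELECT node_id, role, address FROM nodes_roles JOIN nodes ON node_id=nodes.id"
# expect three rows, one per node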

I’m so sorry…
The command you gave returns the exact expected output AFTER all three daemons are running!
Here is what I get:

lxd sql global "SELECT * FROM nodes; SELECT * FROM nodes_roles"
=> Query 0:

+----+-----------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+
| id | name      | description | address            | schema | api_extensions | heartbeat                           | pending | arch |
+----+-----------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+
| 1  | vm-lxd1-1 |             | 172.26.194.40:8443 | 28     | 189            | 2020-06-03T10:50:22.47597775+02:00  | 0       | 2    |
| 2  | vm-lxd1-2 |             | 172.26.194.41:8443 | 28     | 189            | 2020-06-03T10:50:22.476973064+02:00 | 0       | 2    |
| 3  | vm-lxd1-3 |             | 172.26.194.42:8443 | 28     | 189            | 2020-06-03T10:50:22.477721488+02:00 | 0       | 2    |
+----+-----------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+

=> Query 1:

+---------+------+
| node_id | role |
+---------+------+
| 3       | 0    |
| 2       | 0    |
+---------+------+

So, I’ll issue the command you gave and everything should be fine once and for all!

Thank you so much!

Please can you paste (or mail me) the full lxd.log of all three nodes? Make sure debug mode is turned on (snap set lxd daemon.debug=true).

To all users that might hit this: the issue here is very likely the one fixed by https://github.com/canonical/raft/pull/142, which will be included in LXD 4.2 and backported to 4.0.

If you experience it during the upgrade, one thing you can attempt before the more heavy-weight recovery procedure described above is this:

  1. Stop all your lxd daemons (e.g. pkill -9 lxd; systemctl stop snap.lxd.daemon; systemctl stop snap.lxd.daemon.unix.socket)
  2. Run sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes WHERE role=0"
  3. Pick one and only one of the nodes returned by the query in step 2, and on that node run mv /var/snap/lxd/common/lxd/database/global/metadata1 /var/snap/lxd/common/lxd/database/global/metadata2 /your/backup/dir
  4. Restart each node (e.g. systemctl start snap.lxd.daemon.unix.socket; systemctl start snap.lxd.daemon)

That should bring your cluster back.
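
As a shell sketch of those four steps (the backup directory here is just an example, put the metadata files wherever you like):

# 1. On every node: stop LXD completely
pkill -9 lxd
systemctl stop snap.lxd.daemon
systemctl stop snap.lxd.daemon.unix.socket

# 2. On every node: list the voters recorded locally
sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes WHERE role=0"

# 3. On ONE of those voters only: move the raft metadata out of the way
mkdir -p /root/lxd-raft-backup
mv /var/snap/lxd/common/lxd/database/global/metadata1 /var/snap/lxd/common/lxd/database/global/metadata2 /root/lxd-raft-backup/

# 4. On every node: start LXD again
systemctl start snap.lxd.daemon.unix.socket
systemctl start snap.lxd.daemon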

If things are still stuck, there might be a pending snap refresh that never completed. You can figure out its ID with snap changes lxd and cancel it with snap abort <ID>. Then run snap refresh lxd again.
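
For example:

snap changes lxd      # note the ID of any change that never reached "Done"
snap abort <ID>       # replace <ID> with that number
snap refresh lxd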

snap refresh lxd --candidate will get you on the candidate channel with LXD 4.2, which we intend to release on Monday.