Snap Update to 4.1 Broke my cluster

I have a 3-host cluster that was running great until the latest snap update. Now I cannot start LXD on any of my 3 hosts. I’ve tried rebooting all 3 servers, but still no luck.

Here is my cluster info.

lxdhome01 192.168.100.92
lxdlab01 192.168.100.91
lxdlab02 192.168.100.93

The database is currently on 2 of the 3 hosts (lxdlab01 and lxdlab02)

lxd cluster list-database
+---------------------+
|       ADDRESS       |
+---------------------+
| 192.168.100.91:8443 |
+---------------------+
| 192.168.100.93:8443 |
+---------------------+

This might have something to do with the raft node role of my lxdhome01 (.92) server in the database (its role column is 2, while the other two nodes have role 0):
CREATE TABLE raft_nodes (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
address TEXT NOT NULL, role INTEGER NOT NULL DEFAULT 0,
UNIQUE (address)
);
INSERT INTO raft_nodes VALUES(1,'192.168.100.92:8443',2);
INSERT INTO raft_nodes VALUES(3,'192.168.100.91:8443',0);
INSERT INTO raft_nodes VALUES(4,'192.168.100.93:8443',0);

From what I’ve read, however, I should be able to start up the cluster with 2 database nodes, so I believe something went wrong with the recent snap update. When the LXD daemon is starting, I’m seeing these errors.

On lxdlab02 I see this:

Jun 1 20:54:50 lxdlab02 lxd.daemon[23562]: t=2020-06-01T20:54:50-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node's version address=192.168.100.92:8443 attempt=2"
Jun 1 20:54:50 lxdlab02 lxd.daemon[23562]: t=2020-06-01T20:54:50-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=192.168.100.93:8443 attempt=2"

On lxdlab01 I see this:

Jun 1 20:21:39 lxdlab01 lxd.daemon[2822]: t=2020-06-01T20:21:39-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node's version address=192.168.100.92:8443 attempt=7"
Jun 1 20:21:39 lxdlab01 lxd.daemon[2822]: t=2020-06-01T20:21:39-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=192.168.100.93:8443 attempt=7"

I’ve checked the snaps on all 3 of my hosts and they are all the same.

lxdhome01:
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 3 days ago, at 06:31 MDT
channels:
latest/stable: 4.1 2020-05-29 (15223) 72MB -

lxdlab01:
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 4 days ago, at 19:18 MDT
channels:
latest/stable: 4.1 2020-05-29 (15223) 72MB -

lxdlab02:
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 4 days ago, at 21:54 MDT
channels:
latest/stable: 4.1 2020-05-29 (15223) 72MB -

So the update order looks to have been lxdlab01, lxdlab02, lxdhome01, which should be fine as the database is on lxdlab01 and lxdlab02. Looking at the database dumps I do see some differences, which could be why the log is saying that some nodes are behind. Here are sqlite dumps from my hosts.

lxdhome01: https://pastebin.com/3AsSZJqg
lxdlab01: https://pastebin.com/D5fJUDsp
lxdlab02: https://pastebin.com/Tn04QLEq

I’m really at a loss as to how to further troubleshoot this issue. Hoping someone can provide some assistance so I can get my cluster back up and running.

Thanks

@freeekanayaka are you able to help with this please?

@hobiga please can you turn on debug logging (snap set lxd daemon.debug=true), then restart all three LXD daemons (systemctl restart snap.lxd.daemon) and paste here the content of the /var/snap/lxd/common/lxd/logs/lxd.log file on each of the nodes?
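
Concretely, that boils down to something like this on each of the three nodes (a rough sketch; the log path is the standard one for the LXD snap):

snap set lxd daemon.debug=true               # turn on debug logging
systemctl restart snap.lxd.daemon            # restart the LXD daemon
cat /var/snap/lxd/common/lxd/logs/lxd.log    # this is the content to paste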

Hi there. No need to start a new discussion, as I have the same issue.
I suspect the cluster broke not with the last update itself, since it was fine last Friday, but with the last restart, which happened somewhere between Friday, May 29th and today, June 2nd.
Anyway, the problem now is that none of the LXD daemons on any of the three machines will fire up, so "snap set lxd daemon.debug=true && systemctl restart snap.lxd.daemon" never reaches a usable end.
The daemon still refuses to start on any of the three machines of my cluster…

@geodb27 please paste the lxd.log file of all three machines.

The three machines keep displaying:

Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=x.x.x.1
Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=x.x.x.2
Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=x.x.x.3

In a loop. There is nothing more at all…
And, as I said before:
"snap set lxd daemon.debug=true && systemctl restart snap.lxd.daemon"
never gives back the prompt, so they are all stuck in an infinite loop.

@geodb27 please attach the full lxd.log of all three machines.

@geodb27 thanks for sending me the logs by email (as per IRC conversation).

Please can you paste the output of:

sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"

run on all three nodes?

There shouldn’t be any sensitive information there (you can obfuscate the IPs if that matters).

Also, please enable debug logging on all three nodes (snap set lxd daemon.debug=true), then restart all three nodes and send me all three logs again. Only one of the logs you sent me has debug turned on.
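
Putting those two requests together, the per-node sequence is roughly (nothing new here, just the commands quoted above in one place):

sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"   # prints id|address|role
snap set lxd daemon.debug=true                                                  # make sure debug is on
systemctl restart snap.lxd.daemon                                               # restart so it takes effect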

Results of sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
On the first node:

1|172.26.194.40:8443|0
2|172.26.194.41:8443|0
3|172.26.194.42:8443|0

On the second one:

1|172.26.194.40:8443|0
2|172.26.194.41:8443|0
3|172.26.194.42:8443|0

And on the third:

1|172.26.194.40:8443|0
2|172.26.194.41:8443|0
3|172.26.194.42:8443|0

You will receive an email soon with the logs as requested, if I can manage to turn debugging on despite the quoted command never giving back the prompt.

Unfortunately the logs still don’t contain debug messages. I suspect systemctl is not killing the process. Please can you try to kill the process by hand and then restart the daemon? Something like pkill -9 lxd, or ps aux | grep lxd to get the PID and then kill -9 PID. Make sure the lxd process is no longer there, then restart and send me the logs.
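
A rough per-node sequence for that (assuming the snap-packaged LXD, as elsewhere in this thread):

systemctl stop snap.lxd.daemon     # ask systemd to stop LXD first
pkill -9 lxd                       # force-kill anything left over
ps aux | grep [l]xd                # confirm no lxd process remains (the [l] keeps grep itself out of the list)
systemctl start snap.lxd.daemon    # start again, with debug still enabled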

@freeekanayaka

Here are the startup debug logs from all three of my servers.
lxdhome01 - https://pastebin.com/Q1Nng27A
lxdlab01 - https://pastebin.com/Ye2HwbV1
lxdlab02 - https://pastebin.com/uZ1gwJ8Q

In order to be able to restart all three LXD processes, I had to do the following on each machine:
$ snap stop lxd
(wait a while)
$ ps auxf | grep lx
Kill all the relevant processes.
After that, I could issue on each machine (the ps command above did not report any process):
$ snap set lxd daemon.debug=true && systemctl restart snap.lxd.daemon

On machines 2 and 3 this went "fine" (this set of commands gave back the prompt after a while; it did not on machine 1).
On each of the three machines, here is now the output of the above ps command:

root 56857 0.0 0.0 4640 1740 ? Ss 17:00 0:00 /bin/sh /snap/lxd/15223/commands/daemon.start
root 56973 49.0 2.4 1559628 98556 ? Sl 17:00 4:32 _ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd --debug
root 56974 0.0 0.7 1157528 32200 ? Sl 17:00 0:00 _ lxd waitready
root 56975 0.0 0.0 4640 992 ? S 17:00 0:00 _ /bin/sh /snap/lxd/15223/commands/daemon.start
root 56961 0.0 0.0 97808 1644 ? Sl 17:00 0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid

So I guess all three of the processes are now in debug mode.

As before, I’ll tar up the three logs and send them by email.

Again, thanks for your kind help!

I’m not entirely sure about this, but you have probably been hit by a bug that should now be fixed in LXD 4.2 and that might affect clusters that had only 2 voters.

You could try to roughly follow:

https://lxd.readthedocs.io/en/latest/clustering/#recover-from-quorum-loss

but with a bit more repair work than what is described there.

Specifically:

  1. Back up the /var/snap/lxd/common/lxd/database directory on each of the nodes
  2. Shut down all three LXD daemons and make sure the process is really gone (e.g. ps aux | grep lxd)
  3. Pick one of the two voters you have (e.g. 192.168.100.91:8443), and then run lxd cluster recover-from-quorum-loss on it.
  4. Start the LXD daemon again on the node you picked, then run lxd sql global "DELETE FROM nodes_roles WHERE node_id != (SELECT id FROM nodes WHERE address = '192.168.100.91:8443')"; make sure to use the address of the node you picked in step 3 as the parameter for that query.
  5. Confirm that the nodes_roles table has just one entry, matching your selected node: lxd sql global "SELECT node_id, role, address FROM nodes_roles JOIN nodes ON node_id=nodes.id"
  6. Remove the /var/snap/lxd/common/lxd/database/global directory on the other two nodes
  7. Restart the other two nodes

Hopefully that should bring the cluster back.
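
Put together as one rough command sequence (using 192.168.100.91:8443 as the picked voter, as in the steps above; the backup destination is just an example, adjust to taste):

# 1. on every node: back up the database directory
cp -a /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.bak

# 2. on every node: stop LXD and make sure the process is really gone
systemctl stop snap.lxd.daemon
ps aux | grep [l]xd

# 3. on the chosen voter only
lxd cluster recover-from-quorum-loss

# 4. start the chosen voter again and remove the stale role entries
systemctl start snap.lxd.daemon
lxd sql global "DELETE FROM nodes_roles WHERE node_id != (SELECT id FROM nodes WHERE address = '192.168.100.91:8443')"

# 5. confirm only the chosen node is left in nodes_roles
lxd sql global "SELECT node_id, role, address FROM nodes_roles JOIN nodes ON node_id=nodes.id"

# 6. on the other two nodes: drop the stale global database, then restart
rm -rf /var/snap/lxd/common/lxd/database/global
systemctl start snap.lxd.daemon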

From the logs it indeed seems you are in a similar situation to @hobiga’s. I’d suggest waiting for him to try out the recovery procedure I described above; if it’s successful, you could go through the same steps (you’d only have to replace the address, e.g. use 172.26.194.40:8443 instead of 192.168.100.91:8443).

@freeekanayaka

Thank you, those instructions did indeed bring back my cluster. My final concern is that I don’t see the database being propagated to the other cluster hosts.

lxc cluster list
+-----------+-----------------------------+----------+--------+-------------------+--------------+
|   NAME    |             URL             | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| lxdhome01 | https://192.168.100.92:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| lxdlab01  | https://192.168.100.91:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------+-----------------------------+----------+--------+-------------------+--------------+
| lxdlab02  | https://192.168.100.93:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------+-----------------------------+----------+--------+-------------------+--------------+

I ran the commands on the .91 host (lxdlab01) and in those logs I’m seeing these messages:

Jun 2 11:13:59 lxdlab01 lxd.daemon[28423]: t=2020-06-02T11:13:59-0600 lvl=dbug msg="Starting heartbeat round"
Jun 2 11:13:59 lxdlab01 lxd.daemon[28423]: t=2020-06-02T11:13:59-0600 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:3 Address:192.168.100.91:8443 Role:voter} {ID:1 Address:192.168.100.93:8443 Role:stand-by}]"
Jun 2 11:13:59 lxdlab01 lxd.daemon[28423]: t=2020-06-02T11:13:59-0600 lvl=info msg="Upgrading 1 nodes not part of raft configuration"

and on my .93 host (lxdlab02) I see:
Jun 2 11:15:05 lxdlab02 lxd.daemon[28349]: t=2020-06-02T11:15:05-0600 lvl=dbug msg="Rebalance cluster"
Jun 2 11:15:11 lxdlab02 lxd.daemon[28349]: t=2020-06-02T11:15:11-0600 lvl=dbug msg="Found cert" name=0
Jun 2 11:15:11 lxdlab02 lxd.daemon[28349]: t=2020-06-02T11:15:11-0600 lvl=dbug msg="Replace current raft nodes with [{ID:3 Address:192.168.100.91:8443 Role:voter} {ID:1 Address:192.168.100.93:8443 Role:stand-by}]"

And it’s just constantly looping those messages. I’m not sure if those are expected and the database will eventually propagate; I’m just trying to ensure that I get everything back to a normal configuration at this point.

This doesn’t look normal. I’ll give it a closer look tomorrow.

Well, thank you so much for your kind help @freeekanayaka.
I followed the steps you gave carefully, and now my cluster is just fine.
There are just two things I’d like to comment on:
1.

This command did not return any result at any point, neither before nor after entering the commands from the 3 earlier steps. However, now
lxc cluster list shows things back online:

+-----------+----------------------------+----------+--------+-------------------+--------------+
|   NAME    |            URL             | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+-----------+----------------------------+----------+--------+-------------------+--------------+
| vm-lxd1-1 | https://172.26.194.40:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------+----------------------------+----------+--------+-------------------+--------------+
| vm-lxd1-2 | https://172.26.194.41:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------+----------------------------+----------+--------+-------------------+--------------+
| vm-lxd1-3 | https://172.26.194.42:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------+----------------------------+----------+--------+-------------------+--------------+

  2. All three of my lxd daemons are in debug mode, and on all three, here are the logs I get:

t=2020-06-03T09:10:55+0200 lvl=dbug msg="Starting heartbeat round"
t=2020-06-03T09:10:55+0200 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:1 Address:172.26.194.40:8443 Role:voter} {ID:2 Address:172.26.194.41:8443 Role:voter} {ID:3 Address:172.26.194.42:8443 Role:voter}]"
t=2020-06-03T09:10:57+0200 lvl=dbug msg="Rebalance cluster"
t=2020-06-03T09:10:58+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.41:8443"
t=2020-06-03T09:10:58+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.41:8443"
t=2020-06-03T09:10:58+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.41:8443"
t=2020-06-03T09:10:59+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.42:8443"
t=2020-06-03T09:10:59+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.42:8443"
t=2020-06-03T09:11:00+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.42:8443"
t=2020-06-03T09:11:00+0200 lvl=dbug msg="Completed heartbeat round"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Starting heartbeat round"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:1 Address:172.26.194.40:8443 Role:voter} {ID:2 Address:172.26.194.41:8443 Role:voter} {ID:3 Address:172.26.194.42:8443 Role:voter}]"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Rebalance cluster"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.42:8443"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.42:8443"
t=2020-06-03T09:11:05+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.42:8443"
t=2020-06-03T09:11:07+0200 lvl=dbug msg="Sending heartbeat to 172.26.194.41:8443"
t=2020-06-03T09:11:07+0200 lvl=dbug msg="Sending heartbeat request to 172.26.194.41:8443"
t=2020-06-03T09:11:08+0200 lvl=dbug msg="Successful heartbeat for 172.26.194.41:8443"
t=2020-06-03T09:11:08+0200 lvl=dbug msg="Completed heartbeat round"

So, again, thanks a lot for your kind help.

Best regards.

This is quite weird. I don’t see a way you can get "YES" in the "DATABASE" column above and not have rows in the nodes_roles table.

Please can you try again to run:

lxd sql global "SELECT * FROM nodes; SELECT * FROM nodes_roles"

I would expect 2 rows to be returned, one for 172.26.194.41 and one for 172.26.194.42.

Assuming that is the case, you are missing a row for 172.26.194.40. You can add it manually with:

lxd sql global "INSERT INTO nodes_roles(node_id, role) VALUES((SELECT id FROM nodes WHERE address = '172.26.194.40:8443'), 0)"

and after that you should be back to normal.
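
A quick way to double-check afterwards is to re-run the query from above:

lxd sql global "SELECT * FROM nodes_roles"   # should now show one row per node, including the one for 172.26.194.40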

I’m so sorry…
The command you gave returns the exact expected output AFTER all three daemons are running!
Here is what I get:

lxd sql global "SELECT * FROM nodes; SELECT * FROM nodes_roles"
=> Query 0:

+----+-----------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+
| id |   name    | description |      address       | schema | api_extensions |              heartbeat              | pending | arch |
+----+-----------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+
| 1  | vm-lxd1-1 |             | 172.26.194.40:8443 | 28     | 189            | 2020-06-03T10:50:22.47597775+02:00  | 0       | 2    |
| 2  | vm-lxd1-2 |             | 172.26.194.41:8443 | 28     | 189            | 2020-06-03T10:50:22.476973064+02:00 | 0       | 2    |
| 3  | vm-lxd1-3 |             | 172.26.194.42:8443 | 28     | 189            | 2020-06-03T10:50:22.477721488+02:00 | 0       | 2    |
+----+-----------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+

=> Query 1:

+---------+------+
| node_id | role |
+---------+------+
| 3       | 0    |
| 2       | 0    |
+---------+------+

So, I’ll issue the command you gave and should be fine once and for all!

Thank you so much!