I have a 3-host cluster that was running great until the latest snap update. Now I cannot start LXD on any of my 3 hosts. I’ve tried rebooting all 3 servers but still no luck.
Here is my cluster info.
lxdhome01 192.168.100.92
lxdlab01 192.168.100.91
lxdlab02 192.168.100.93
The database is currently on 2 of the 3 hosts (lxdlab01 and lxdlab02)
lxd cluster list-database
+---------------------+
|       ADDRESS       |
+---------------------+
| 192.168.100.91:8443 |
+---------------------+
| 192.168.100.93:8443 |
+---------------------+
This might have something to do with the raft node role of my lxdhome01 (.92) server in the database:
CREATE TABLE raft_nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    address TEXT NOT NULL,
    role INTEGER NOT NULL DEFAULT 0,
    UNIQUE (address)
);
INSERT INTO raft_nodes VALUES(1,'192.168.100.92:8443',2);
INSERT INTO raft_nodes VALUES(3,'192.168.100.91:8443',0);
INSERT INTO raft_nodes VALUES(4,'192.168.100.93:8443',0);
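To make the roles easier to read, here is a minimal Python sketch that loads the rows above into an in-memory SQLite database and prints each address with a role name. The role names (0 = voter, 1 = stand-by, 2 = spare) are an assumption on my part, taken from the go-dqlite node role constants; the schema itself does not name them. The function name `describe_raft_nodes` is just for illustration.

```python
import sqlite3

# Assumed mapping, based on the go-dqlite node role constants;
# the raft_nodes schema itself does not name the roles.
ROLE_NAMES = {0: "voter", 1: "stand-by", 2: "spare"}

# Schema and rows copied from the dump above.
SCHEMA_AND_ROWS = """
CREATE TABLE raft_nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    address TEXT NOT NULL,
    role INTEGER NOT NULL DEFAULT 0,
    UNIQUE (address)
);
INSERT INTO raft_nodes VALUES(1,'192.168.100.92:8443',2);
INSERT INTO raft_nodes VALUES(3,'192.168.100.91:8443',0);
INSERT INTO raft_nodes VALUES(4,'192.168.100.93:8443',0);
"""


def describe_raft_nodes(sql_dump):
    """Load the dump into an in-memory database and return (address, role name) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_dump)
        rows = conn.execute(
            "SELECT address, role FROM raft_nodes ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    # Fall back to a placeholder for any role value outside the assumed mapping.
    return [(address, ROLE_NAMES.get(role, f"unknown({role})")) for address, role in rows]


if __name__ == "__main__":
    for address, role in describe_raft_nodes(SCHEMA_AND_ROWS):
        print(f"{address}  {role}")
```

If that mapping is right, .92 is recorded as a spare while .91 and .93 are the two voters, which matches the `lxd cluster list-database` output.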
From what I’ve read, I should be able to start the cluster with 2 database nodes, but I believe something went wrong during the recent snap update. When the LXD daemon starts, I see these errors.
On lxdlab02 I see this:
Jun 1 20:54:50 lxdlab02 lxd.daemon[23562]: t=2020-06-01T20:54:50-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node's version address=192.168.100.92:8443 attempt=2"
Jun 1 20:54:50 lxdlab02 lxd.daemon[23562]: t=2020-06-01T20:54:50-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=192.168.100.93:8443 attempt=2"
On lxdlab01 I see this:
Jun 1 20:21:39 lxdlab01 lxd.daemon[2822]: t=2020-06-01T20:21:39-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node's version address=192.168.100.92:8443 attempt=7"
Jun 1 20:21:39 lxdlab01 lxd.daemon[2822]: t=2020-06-01T20:21:39-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=192.168.100.93:8443 attempt=7"
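To summarize which peer fails with which error, here is a rough Python sketch that groups these warnings by target address. The regular expression only assumes the `err=... address=... attempt=...` layout visible in the lines above, and the names `LOG_PATTERN` and `summarize` are made up for this example. The sample input is the two lxdlab02 lines.

```python
import re
from collections import defaultdict

# Matches the syslog prefix (month, day, time, host) and the
# "err=... address=... attempt=N" tail seen in the Dqlite warnings.
LOG_PATTERN = re.compile(
    r"^\w+\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s.*?"
    r"err=(?P<err>.+?)\s+address=(?P<address>\S+)\s+attempt=(?P<attempt>\d+)"
)

# The two warnings logged on lxdlab02, copied from the post.
SAMPLE = """\
Jun 1 20:54:50 lxdlab02 lxd.daemon[23562]: t=2020-06-01T20:54:50-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node's version address=192.168.100.92:8443 attempt=2"
Jun 1 20:54:50 lxdlab02 lxd.daemon[23562]: t=2020-06-01T20:54:50-0600 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=192.168.100.93:8443 attempt=2"
"""


def summarize(lines):
    """Map each target address to the set of (reporting host, reason) pairs."""
    summary = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # Not a Dqlite connection warning in the expected layout.
        # Keep only the last part of the error chain, e.g. "503 Service Unavailable".
        reason = match.group("err").rsplit(": ", 1)[-1]
        summary[match.group("address")].add((match.group("host"), reason))
    return dict(summary)


if __name__ == "__main__":
    for address, failures in sorted(summarize(SAMPLE.splitlines()).items()):
        for host, reason in sorted(failures):
            print(f"{host} -> {address}: {reason}")
```

On both lab hosts the pattern is the same: .92 is rejected as being behind on version, and .93 answers with 503.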
I’ve checked the snaps on all 3 of my hosts and they are all the same.
lxdhome01:
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 3 days ago, at 06:31 MDT
channels:
latest/stable: 4.1 2020-05-29 (15223) 72MB -
lxdlab01:
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 4 days ago, at 19:18 MDT
channels:
latest/stable: 4.1 2020-05-29 (15223) 72MB -
lxdlab02:
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 4 days ago, at 21:54 MDT
channels:
latest/stable: 4.1 2020-05-29 (15223) 72MB -
So the update order appears to have been lxdlab01, lxdlab02, lxdhome01, which should be fine since the database is on lxdlab01 and lxdlab02. Looking at the database dumps, I do see some differences, which could be why the log says that some nodes are behind. Here are the sqlite dumps from my hosts.
lxdhome01: https://pastebin.com/3AsSZJqg
lxdlab01: https://pastebin.com/D5fJUDsp
lxdlab02: https://pastebin.com/Tn04QLEq
I’m really at a loss as to how to further troubleshoot this issue. Hoping someone can provide some assistance so I can get my cluster back up and running.
Thanks