I have a cluster with 5 nodes. After the automatic snap update to 3.21, something is not quite right:
root@gklab-176-219:~# snap set lxd daemon.debug=true
root@gklab-176-219:~# systemctl reload snap.lxd.daemon
root@gklab-176-219:~# tail -f /var/snap/lxd/common/lxd/logs/lxd.log
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated
t=2020-02-26T16:21:07+0100 lvl=info msg="LXD 3.21 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-02-26T16:21:07+0100 lvl=info msg="Kernel uid/gid map:"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - u 0 0 4294967295"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - g 0 0 4294967295"
t=2020-02-26T16:21:07+0100 lvl=info msg="Configured LXD uid/gid map:"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-02-26T16:21:07+0100 lvl=info msg="Kernel features:"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - uevent injection: no"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - seccomp listener: no"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - cgroup layout: hybrid"
t=2020-02-26T16:21:07+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - shiftfs support: disabled"
t=2020-02-26T16:21:07+0100 lvl=info msg="Initializing local database"
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Initializing database gateway"
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Start database node" address=172.28.176.219:8443 id=3 role=spare
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Detected stale unix socket, deleting"
t=2020-02-26T16:21:07+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-26T16:21:07+0100 lvl=info msg="REST API daemon:"
t=2020-02-26T16:21:07+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-26T16:21:07+0100 lvl=info msg=" - binding TCP socket" socket=172.28.176.219:8443
t=2020-02-26T16:21:07+0100 lvl=info msg="Initializing global database"
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=0"
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=0"
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=0"
t=2020-02-26T16:21:08+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=1"
t=2020-02-26T16:21:08+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=1"
t=2020-02-26T16:21:08+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=1"
On each node it looks the same (the only difference I noticed is that on the database nodes the node's role is reported as “spare”, while on the others it is “voter”). The IP addresses that report “503 Service Unavailable” in the logs are those of the nodes without a database.
The logs from the first post were taken with debug already enabled (I enabled it on all nodes, hoping that one of them would tell me something more), but just in case:
root@gklab-176-016:~# snap set lxd daemon.debug=true
root@gklab-176-016:~# service snap.lxd.daemon reload
root@gklab-176-016:~# tail -f /var/snap/lxd/common/lxd/logs/lxd.log
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated
t=2020-02-27T00:27:25+0100 lvl=info msg="LXD 3.21 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-02-27T00:27:25+0100 lvl=info msg="Kernel uid/gid map:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - u 0 0 4294967295"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - g 0 0 4294967295"
t=2020-02-27T00:27:25+0100 lvl=info msg="Configured LXD uid/gid map:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-02-27T00:27:25+0100 lvl=info msg="Kernel features:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - uevent injection: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - seccomp listener: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - cgroup layout: hybrid"
t=2020-02-27T00:27:25+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - shiftfs support: disabled"
t=2020-02-27T00:27:25+0100 lvl=info msg="Initializing local database"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Initializing database gateway"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Start database node" address=172.28.176.16:8443 id=4 role=spare
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Detected stale unix socket, deleting"
t=2020-02-27T00:27:25+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-27T00:27:25+0100 lvl=info msg="REST API daemon:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-27T00:27:25+0100 lvl=info msg=" - binding TCP socket" socket=172.28.176.16:8443
t=2020-02-27T00:27:25+0100 lvl=info msg="Initializing global database"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=0"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=0"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=0"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=1"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=1"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=1"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Found cert" name=0
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=2"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=2"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=2"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Found cert" name=0
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=3"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=3"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=3"
t=2020-02-27T00:27:27+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=4"
t=2020-02-27T00:27:27+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=4"
t=2020-02-27T00:27:27+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=4"
t=2020-02-27T00:27:28+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=5"
t=2020-02-27T00:27:28+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=5"
And so on (on every node)…
I don’t see any network issues. I also checked connectivity with telnet/curl and sniffed the traffic between the nodes with tcpdump - everything seems fine:
root@gklab-176-016:~# telnet 172.28.176.177 8443
Trying 172.28.176.177...
Connected to 172.28.176.177.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@gklab-176-016:~# telnet 172.28.176.231 8443
Trying 172.28.176.231...
Connected to 172.28.176.231.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@gklab-176-016:~# curl -k https://172.28.176.177:8443/internal
{"error":"not found","error_code":404,"type":"error"}
root@gklab-176-016:~# curl -k https://172.28.176.231:8443/internal
{"error":"not found","error_code":404,"type":"error"}
root@gklab-176-016:~#
I think I see the problem, but I’d like to verify the correct way to fix it.
According to the logs (and the raft_nodes table) I have 2 voter nodes and 3 spare nodes. As far as I can see in the dqlite client code, spare nodes do not participate in the quorum, so there is no way to elect a leader. I believe there should be 3 voter nodes here, and the rest should be stand-by nodes.
So my idea is to prepare an SQL patch to override the roles in raft_nodes.
I also thought the database was pinned to exactly 3 specific nodes, but if I see correctly that changed with the 3.20 release, which introduced the additional node roles - is that right?
update raft_nodes set role=0 where id=1;
update raft_nodes set role=0 where id=2;
update raft_nodes set role=0 where id=3;
update raft_nodes set role=1 where id=4;
update raft_nodes set role=1 where id=5;
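For reference, here is a sketch of what those statements do, demonstrated on a throwaway copy of the table rather than the live database (assumptions: the dqlite role encoding is 0 = voter, 1 = stand-by, 2 = spare, and the addresses below are dummies; on a real snap node the statements would instead be run against /var/snap/lxd/common/lxd/database/local.db, e.g. via a patch.local.sql file that LXD picks up at daemon start):

```shell
# Throwaway demonstration of the role-override patch (dummy addresses).
# Starting state mirrors the broken cluster: 2 voters (role 0), 3 spares (role 2).
db=$(mktemp -d)/local.db
sqlite3 "$db" "CREATE TABLE raft_nodes (id INTEGER PRIMARY KEY, address TEXT, role INTEGER);
               INSERT INTO raft_nodes (id, address, role) VALUES
                 (1, 'n1:8443', 0),
                 (2, 'n2:8443', 0),
                 (3, 'n3:8443', 2),
                 (4, 'n4:8443', 2),
                 (5, 'n5:8443', 2);"
# The patch itself: make ids 1-3 voters and ids 4-5 stand-bys.
sqlite3 "$db" "UPDATE raft_nodes SET role = 0 WHERE id IN (1, 2, 3);
               UPDATE raft_nodes SET role = 1 WHERE id IN (4, 5);"
sqlite3 "$db" "SELECT id, role FROM raft_nodes ORDER BY id;"
# prints: 1|0  2|0  3|0  4|1  5|1 (one row per line)
```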
Then I ran systemctl reload snap.lxd.daemon on all nodes simultaneously. After that, things started working - lxc list/start/stop/etc. all work.
In the meantime, I created a fresh LXD cluster on the side and looked into its database. I noticed that the new cluster (with the same LXD version from the snap) has a different schema version and api_extensions count in the “node” table (25 and 167) - could that be the issue?
Do you have any suggestions on what I should do next?
Now the question is: how can I recover from this situation to a fully healthy database cluster? The documentation says:
Among the listed members, pick the one that has survived and log into it (if it differs from the one you have run the command on).
Now make sure the LXD daemon is not running and then issue the command:
lxd cluster recover-from-quorum-loss
At this point you can restart the LXD daemon and the database should be back online.
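As I read it, the documented procedure boils down to something like the following, run only on the surviving database node (a sketch, using the same systemd unit name as elsewhere in this thread):

```shell
# On the surviving database node only (sketch of the documented procedure):
systemctl stop snap.lxd.daemon          # make sure the LXD daemon is not running
lxd cluster recover-from-quorum-loss    # reset the raft configuration to this node
systemctl start snap.lxd.daemon         # restart the daemon; the database should come back
```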
I’m not sure I understand this correctly - should I stop the LXD daemon only on the node with the database, while all the other nodes keep running? And does this affect any of the containers running on the other nodes in the cluster?
I’m not sure I fully understand the sequence of events here; however, if lxc cluster list-database already returns a single node you should be good - there’s no need to run lxd cluster recover-from-quorum-loss.
However, the bad news is that this seems to be a situation we don’t handle quite well in the code, since we are not promoting new nodes in this case (this is actually an unexpected state, and I don’t know how you could have gotten into it). We’ll need to fix this in LXD and roll out a new release, since there’s currently no way to manually promote nodes. Until we release the fix you’ll have to make sure that server does not go down, otherwise the entire cluster will be unavailable.
Sorry, I might actually be wrong, depending on your actual state. lxc cluster list-database is basically just sqlite3 /var/snap/lxd/current/lxd/database/local.db "SELECT * FROM raft_nodes", so it’s reporting just 1 node because that table has just one entry with role 0.
Since the raft_nodes table is just a cache of the actual raft configuration in the raft log, your actual node roles might be different.
Assuming you were able to bring the cluster back to an operational state (e.g. lxc cluster list works), just let it run for about 30 seconds; the raft_nodes table should then be refreshed, and at that point you can be sure what the actual raft configuration really is.
If it turns out that you do have only one voter node, then you’ll need to wait for the fix in the PR I linked above.
Otherwise, you might actually have 3 voters, in which case you just need to check whether those nodes are still up and healthy, or remove them by hand with lxc cluster remove --force.
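To make that check concrete, the numeric role column in the cached raft_nodes table can be decoded with a plain sqlite3 query. A self-contained sketch on a throwaway table with dummy addresses (assumed encoding: 0 = voter, 1 = stand-by, 2 = spare; on a live node you would point sqlite3 at /var/snap/lxd/current/lxd/database/local.db instead):

```shell
# Throwaway copy of the raft_nodes layout (dummy addresses, one node per role).
db=$(mktemp -d)/local.db
sqlite3 "$db" "CREATE TABLE raft_nodes (id INTEGER PRIMARY KEY, address TEXT, role INTEGER);
               INSERT INTO raft_nodes VALUES (1, 'n1:8443', 0),
                                             (2, 'n2:8443', 1),
                                             (3, 'n3:8443', 2);"
# Decode the numeric role column into a readable name.
sqlite3 "$db" "SELECT id, address,
                      CASE role WHEN 0 THEN 'voter'
                                WHEN 1 THEN 'stand-by'
                                ELSE 'spare' END AS role_name
               FROM raft_nodes ORDER BY id;"
# prints: 1|n1:8443|voter  2|n2:8443|stand-by  3|n3:8443|spare (one row per line)
```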
Yes, the cluster has been operational (for several hours now) with one voter node. raft_nodes still looks exactly the same (one voter, 2 stand-bys and 2 spares). I see the following in the log file every few dozen seconds:
Earlier (yesterday) I also tried to promote the stand-by nodes to voters by applying the SQL patch again to override their roles and reloading the LXD service, but it always ends up back in the current state. I think the most relevant part of the log files on those nodes is this:
t=2020-03-02T18:49:55+0100 lvl=info msg="Initializing local database"
t=2020-03-02T18:49:55+0100 lvl=info msg="Updating the LXD database schema. Backup made as \"local.db.bak\""
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Running pre-update queries from file for local DB schema"
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Initializing database gateway"
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Start database node" address=172.28.176.16:8443 id=4 role=voter
t=2020-03-02T18:49:55+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2020-03-02T18:49:55+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-03-02T18:49:55+0100 lvl=info msg="REST API daemon:"
t=2020-03-02T18:49:55+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-03-02T18:49:55+0100 lvl=info msg=" - binding TCP socket" socket=172.28.176.16:8443
t=2020-03-02T18:49:55+0100 lvl=info msg="Initializing global database"
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Dqlite: connected address=172.28.176.177:8443 id=1 attempt=0"
t=2020-03-02T18:49:56+0100 lvl=dbug msg="Found cert" name=0
t=2020-03-02T18:50:02+0100 lvl=dbug msg="Found cert" name=0
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Database error: failed to update node version info: database is locked"
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Retry failed db interaction (failed to update node version info: database is locked)"
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Found cert" name=0
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.177:8443 of state changes"
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.215:8443 of state changes"
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.219:8443 of state changes"
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.231:8443 of state changes"
If the upcoming fix resolves the issue, I’ll just wait for it.
Honestly, I’m not exactly sure why it ended up in this state. This cluster has been operational since LXD 3.5 or so. Upgrades were usually seamless, though sometimes I needed to run systemctl reload snap.lxd.daemon on all nodes to unstick a node that was waiting for another to finish upgrading - but that’s all. The last time I did that was with the upgrade to 3.20.
If you need more logs, debug output, database dumps or any other information, I’ll do my best.
I upgraded LXD to 3.22 (from the “candidate” channel) and it seems to have resolved almost all of the issues, thanks!
The only issue I still see is a difference between the logs (or lxd cluster list-database, or the entries in raft_nodes) and the lxc cluster list output. According to the logs/raft_nodes there are 3 voters: