Snap update lxd 3.20 to 3.21 hangs

Hi,

I have a cluster with 5 nodes - after the auto-update to 3.21 something is not quite right:

root@gklab-176-219:~# snap set lxd daemon.debug=true
root@gklab-176-219:~# systemctl reload snap.lxd.daemon
root@gklab-176-219:~# tail -f /var/snap/lxd/common/lxd/logs/lxd.log
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated
t=2020-02-26T16:21:07+0100 lvl=info msg="LXD 3.21 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-02-26T16:21:07+0100 lvl=info msg="Kernel uid/gid map:" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - u 0 0 4294967295" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - g 0 0 4294967295" 
t=2020-02-26T16:21:07+0100 lvl=info msg="Configured LXD uid/gid map:" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - u 0 1000000 1000000000" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - g 0 1000000 1000000000" 
t=2020-02-26T16:21:07+0100 lvl=info msg="Kernel features:" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - netnsid-based network retrieval: no" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - uevent injection: no" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - seccomp listener: no" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - seccomp listener continue syscalls: no" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - cgroup layout: hybrid" 
t=2020-02-26T16:21:07+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - shiftfs support: disabled" 
t=2020-02-26T16:21:07+0100 lvl=info msg="Initializing local database" 
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Initializing database gateway" 
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Start database node" address=172.28.176.219:8443 id=3 role=spare
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Detected stale unix socket, deleting" 
t=2020-02-26T16:21:07+0100 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-26T16:21:07+0100 lvl=info msg="REST API daemon:" 
t=2020-02-26T16:21:07+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-26T16:21:07+0100 lvl=info msg=" - binding TCP socket" socket=172.28.176.219:8443
t=2020-02-26T16:21:07+0100 lvl=info msg="Initializing global database" 
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=0" 
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=0" 
t=2020-02-26T16:21:07+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=0" 
t=2020-02-26T16:21:08+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=1" 
t=2020-02-26T16:21:08+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=1" 
t=2020-02-26T16:21:08+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=1" 

It looks the same on each node (the only difference I noticed is that on the db nodes the node’s role is reported as “spare”, and on the others as “voter”). The IP addresses that return “503 Service Unavailable” in the logs are the nodes without a database.

Do you have any suggestions?

Can you make sure all 5 nodes are on the same snap revision, and then maybe try systemctl reload snap.lxd.daemon on all of them to unstick things?
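
Something along these lines on each member, for example (just a sketch):

# check which revision of the lxd snap this member is on
snap list lxd

# then reload the LXD daemon
systemctl reload snap.lxd.daemon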

If it’s still stuck, it could be caused by a pending node in the cluster table or by network issues.

I should have mentioned it - that was one of the first things I did:

root@gklab-176-177:~# snap refresh
All snaps up to date.
root@gklab-176-177:~# snap list
Name  Version    Rev    Tracking  Publisher   Notes
core  16-2.43.3  8689   stable    canonical✓  core
lxd   3.21       13522  stable    canonical✓  -
root@gklab-176-177:~# systemctl reload snap.lxd.daemon
root@gklab-176-177:~#

Still the same…

I didn’t notice any network issues (that was also one of the first things I checked), but maybe I overlooked something; I’ll check this again…

The only “suspicious” thing I see is this - on every node except one, the nodes table in global/db.bin looks like this:

root@gklab-176-016:~# sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "select * from nodes;"
1|gklab-176-177||172.28.176.177:8443|24|165|2020-02-17 23:00:40.294363551+01:00|0|2
2|gklab-176-215||172.28.176.215:8443|24|165|2020-02-17 23:00:40.294532247+01:00|0|2
3|gklab-176-016||172.28.176.16:8443|24|165|2020-02-17 23:00:40.293765232+01:00|0|2
4|gklab-176-219||172.28.176.219:8443|24|165|2020-02-17 23:00:40.294009339+01:00|0|2
5|gklab-176-231||172.28.176.231:8443|24|165|2020-02-17 23:00:40.294193208+01:00|0|2

The different one (different time):

root@gklab-176-215:~# sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "select * from nodes;"
1|gklab-176-177||172.28.176.177:8443|24|165|2020-02-17 23:06:01.622361189+01:00|0|2
2|gklab-176-215||172.28.176.215:8443|24|165|2020-02-17 23:06:01.621551008+01:00|0|2
3|gklab-176-016||172.28.176.16:8443|24|165|2020-02-17 23:06:01.621802785+01:00|0|2
4|gklab-176-219||172.28.176.219:8443|24|165|2020-02-17 23:06:01.622000118+01:00|0|2
5|gklab-176-231||172.28.176.231:8443|24|165|2020-02-17 23:06:01.622183674+01:00|0|2

That shouldn’t be a problem; it’s just the last recorded heartbeat.

Can you check the raft_nodes table in local.db to see if there’s anything odd there?

@freeekanayaka

Sure:

# sqlite3 /var/snap/lxd/common/lxd/database/local.db "select * from raft_nodes;"
1|172.28.176.177:8443|0
2|172.28.176.215:8443|2
3|172.28.176.219:8443|2
4|172.28.176.16:8443|2
5|172.28.176.231:8443|0

It looks the same on every node.

Okay, I think we’ll need @freeekanayaka to look into this.

One last thing that may help is:

  • snap set lxd daemon.debug=true
  • systemctl reload snap.lxd.daemon

This will make LXD a lot more verbose which may give some hints to @freeekanayaka as to what’s going on and how to recover from it.

The logs in the first post already have debug enabled (I enabled it on all nodes, wishfully thinking that maybe one of them would tell me something more :wink: ), but just in case:

root@gklab-176-016:~# snap set lxd daemon.debug=true
root@gklab-176-016:~# service snap.lxd.daemon reload
root@gklab-176-016:~# tail -f /var/snap/lxd/common/lxd/logs/lxd.log
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated
t=2020-02-27T00:27:25+0100 lvl=info msg="LXD 3.21 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-02-27T00:27:25+0100 lvl=info msg="Kernel uid/gid map:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - u 0 0 4294967295"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - g 0 0 4294967295"
t=2020-02-27T00:27:25+0100 lvl=info msg="Configured LXD uid/gid map:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-02-27T00:27:25+0100 lvl=info msg="Kernel features:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - uevent injection: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - seccomp listener: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - cgroup layout: hybrid"
t=2020-02-27T00:27:25+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - shiftfs support: disabled"
t=2020-02-27T00:27:25+0100 lvl=info msg="Initializing local database"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Initializing database gateway"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Start database node" address=172.28.176.16:8443 id=4 role=spare
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Detected stale unix socket, deleting"
t=2020-02-27T00:27:25+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-27T00:27:25+0100 lvl=info msg="REST API daemon:"
t=2020-02-27T00:27:25+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-27T00:27:25+0100 lvl=info msg=" - binding TCP socket" socket=172.28.176.16:8443
t=2020-02-27T00:27:25+0100 lvl=info msg="Initializing global database"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=0"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=0"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=0"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=1"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=1"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=1"
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Found cert" name=0
t=2020-02-27T00:27:25+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=2"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=2"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=2"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Found cert" name=0
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=3"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=3"
t=2020-02-27T00:27:26+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=3"
t=2020-02-27T00:27:27+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=4"
t=2020-02-27T00:27:27+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=4"
t=2020-02-27T00:27:27+0100 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=4"
t=2020-02-27T00:27:28+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.177:8443 id=1 attempt=5"
t=2020-02-27T00:27:28+0100 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.28.176.231:8443 id=1 attempt=5"

And so on (on every node)…

I don’t see any network issues. I also checked connectivity with telnet/curl and sniffed traffic between the nodes with tcpdump (sketched below, after the telnet/curl output) - everything seems fine:

root@gklab-176-016:~# telnet 172.28.176.177 8443
Trying 172.28.176.177...
Connected to 172.28.176.177.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@gklab-176-016:~# telnet 172.28.176.231 8443
Trying 172.28.176.231...
Connected to 172.28.176.231.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@gklab-176-016:~# curl -k https://172.28.176.177:8443/internal
{"error":"not found","error_code":404,"type":"error"}

root@gklab-176-016:~# curl -k https://172.28.176.231:8443/internal
{"error":"not found","error_code":404,"type":"error"}

root@gklab-176-016:~#
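
The tcpdump check was roughly like this (a sketch from memory; the interface name is just an example):

# capture cluster traffic between this node and one of the members returning 503
tcpdump -ni eth0 host 172.28.176.177 and tcp port 8443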

I think I see the problem, but I’d like to verify the correct path to fix it.

According to the logs (and raft_nodes) I have 2 voter nodes and 3 spare nodes - as far as I can see in the dqlite client code, spare nodes do not participate in the quorum, so there is no way to elect a leader. I think there should be 3 voter nodes here and the rest should be stand-by nodes?

So my idea is to prepare an SQL patch to override the roles in raft_nodes.

I also thought the database was pinned to exactly 3 specific nodes, but if I understand correctly, that changed with the 3.20 release, which introduced the additional node roles - is that right?
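
For reference, this is how I read the roles out of raft_nodes (a sketch; the 0/1/2 to voter/stand-by/spare mapping is my understanding of the dqlite roles):

sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "SELECT id, address,
          CASE role WHEN 0 THEN 'voter' WHEN 1 THEN 'stand-by' WHEN 2 THEN 'spare' END AS role
   FROM raft_nodes;"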

OK, I have partial success (and hope that I haven’t made it worse).

I’ve prepared /var/snap/lxd/common/lxd/database/patch.local.sql:

update raft_nodes set role=0 where id=1;
update raft_nodes set role=0 where id=2;
update raft_nodes set role=0 where id=3;
update raft_nodes set role=1 where id=4;
update raft_nodes set role=1 where id=5;

And I called systemctl reload snap.lxd.daemon simultaneously on all nodes. After that, things started to work - lxc list/start/stop/etc. are working.
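
Roughly, the rollout looked like this (a sketch, assuming root SSH access to every member; hostnames as above):

NODES="gklab-176-177 gklab-176-215 gklab-176-016 gklab-176-219 gklab-176-231"

# copy the patch file to every member
for h in $NODES; do
  scp patch.local.sql root@$h:/var/snap/lxd/common/lxd/database/patch.local.sql
done

# reload LXD on all members as close to simultaneously as possible
for h in $NODES; do
  ssh root@$h systemctl reload snap.lxd.daemon &
done
wait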

But lxc cluster list looks like this:

# lxc cluster list
+---------------+-----------------------------+----------+--------+-------------------+--------------+
|     NAME      |             URL             | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-016 | https://172.28.176.16:8443  | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-177 | https://172.28.176.177:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-215 | https://172.28.176.215:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-219 | https://172.28.176.219:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-231 | https://172.28.176.231:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+

raft_nodes looks like this:

1|172.28.176.177:8443|0
2|172.28.176.215:8443|1
3|172.28.176.219:8443|2
4|172.28.176.16:8443|1
5|172.28.176.231:8443|2

And nodes from global/db.bin:

1|gklab-176-177||172.28.176.177:8443|24|165|2020-02-17 23:00:40.294363551+01:00|0|2
2|gklab-176-215||172.28.176.215:8443|24|165|2020-02-17 23:00:40.294532247+01:00|0|2
3|gklab-176-016||172.28.176.16:8443|24|165|2020-02-17 23:00:40.293765232+01:00|0|2
4|gklab-176-219||172.28.176.219:8443|24|165|2020-02-17 23:00:40.294009339+01:00|0|2
5|gklab-176-231||172.28.176.231:8443|24|165|2020-02-17 23:00:40.294193208+01:00|0|2

I’m a little afraid of where the database is now. :wink:

In the meantime, I created a separate new LXD cluster and looked into its database. I noticed that the new cluster (with the same LXD version from the snap) has different schema and api_extensions versions in the “nodes” table (25 and 167) - could that be the issue?

Do you have any suggestions about what I should do next?

After further investigation (and reading the documentation), it turns out I was wrong about the schema and api_extensions versions:

# lxd sql global 'select * from nodes'
+----+---------------+-------------+---------------------+--------+----------------+-------------------------------------+---------+------+
| id |     name      | description |       address       | schema | api_extensions |              heartbeat              | pending | arch |
+----+---------------+-------------+---------------------+--------+----------------+-------------------------------------+---------+------+
| 1  | gklab-176-177 |             | 172.28.176.177:8443 | 25     | 167            | 2020-03-02T18:57:09.246375051+01:00 | 0       | 2    |
| 2  | gklab-176-215 |             | 172.28.176.215:8443 | 25     | 167            | 2020-03-02T18:57:09.246506692+01:00 | 0       | 2    |
| 3  | gklab-176-016 |             | 172.28.176.16:8443  | 25     | 167            | 2020-03-02T18:57:09.246604193+01:00 | 0       | 2    |
| 4  | gklab-176-219 |             | 172.28.176.219:8443 | 25     | 167            | 2020-03-02T18:57:09.246694044+01:00 | 0       | 2    |
| 5  | gklab-176-231 |             | 172.28.176.231:8443 | 25     | 167            | 2020-03-02T18:57:09.246779778+01:00 | 0       | 2    |
+----+---------------+-------------+---------------------+--------+----------------+-------------------------------------+---------+------+

And I found the database!

# lxd cluster list-database
+---------------------+
|       ADDRESS       |
+---------------------+
| 172.28.176.177:8443 |
+---------------------+

And indeed, after running lxd sql global '.sync' the global/db.bin on this node looks more accurate:

# sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "select * from nodes"
1|gklab-176-177||172.28.176.177:8443|25|167|2020-03-02 18:49:49.156011328+01:00|0|2
2|gklab-176-215||172.28.176.215:8443|25|167|2020-03-02 18:49:49.156258816+01:00|0|2
3|gklab-176-016||172.28.176.16:8443|25|167|2020-03-02 18:49:49.155049934+01:00|0|2
4|gklab-176-219||172.28.176.219:8443|25|167|2020-03-02 18:49:49.155410943+01:00|0|2
5|gklab-176-231||172.28.176.231:8443|25|167|2020-03-02 18:49:49.155755194+01:00|0|2

Now the question is, how can I recover from this situation to a fully healthy database cluster? The documentation says:

Among the listed members, pick the one that has survived and log into it (if it differs from the one you have run the command on).

Now make sure the LXD daemon is not running and then issue the command:

lxd cluster recover-from-quorum-loss

At this point you can restart the LXD daemon and the database should be back online.
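
As I read it, the sequence on the surviving database member would be something like this (my interpretation of the docs, not something I have run yet):

# on the surviving database node only
systemctl stop snap.lxd.daemon
lxd cluster recover-from-quorum-loss
systemctl start snap.lxd.daemon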

I’m not sure if I understand this correctly - should I stop the LXD daemon only on the node with the database? Should all the other nodes keep running? Will it affect any of the containers running on the other nodes in the cluster?

I’m not sure I fully understand the sequence of events here; however, if lxd cluster list-database already returns a single node you should be good - there’s no need to run lxd cluster recover-from-quorum-loss.

However, the bad news is that this seems to be a situation we don’t handle quite well in the code, since we are not promoting new nodes in this case (this is actually an unexpected state that I don’t know how you could have got into). We’ll need to fix this in LXD and roll out a new release, since there’s currently no way to manually promote nodes. Until we release the fix you’ll have to make sure that server does not go down, otherwise the entire cluster will be unavailable.

Sorry, actually I might be wrong, depending on your actual state. lxd cluster list-database is basically just sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes", so it’s reporting just 1 node because that table has just one entry with role 0.

Since the raft_nodes table is just a cache of the raft configuration stored in the raft log, your actual node roles might be different.

Assuming you were able to bring the cluster back to an operational state (e.g. lxc cluster list works), just let it run for 30 seconds: the raft_nodes table should then be refreshed, and at that point you can be sure what the actual raft configuration is.
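
For example, something like this lets you watch the cached view refresh (just a sketch):

# re-check the cached raft configuration after ~30 seconds
watch -n 5 'sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"'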

If it turns out that you do have only one voter node, then you’ll need to wait for the fix in the PR I linked above.

Otherwise, you might actually have 3 voters, and in that case you just need to check if those nodes are still up and healthy, or remove them by hand with lxc cluster remove --force.
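
That removal would look something like this (the member name is a placeholder):

lxc cluster remove <member-name> --force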

Thank you for your answer.

Yes, the cluster is operational (it’s been several hours now) with one voter node. raft_nodes looks exactly the same (one voter, 2 stand-bys and 2 spares). I see the following in the log file every few dozen seconds:

(...)
t=2020-03-03T12:49:11+0100 lvl=dbug msg="Completed heartbeat round" 
t=2020-03-03T12:49:11+0100 lvl=dbug msg="Rebalance cluster" 
t=2020-03-03T12:49:16+0100 lvl=dbug msg="Starting heartbeat round" 
t=2020-03-03T12:49:16+0100 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:1 Address:172.28.176.177:8443 Role:voter} {ID:2 Address:172.28.176.215:8443 Role:stand-by} {ID:4 Address:172.28.176.16:8443 Role:stand-by} {ID:3 Address:172.28.176.219:8443 Role:spare} {ID:5 Address:172.28.176.231:8443 Role:spare}]"

Earlier (yesterday) I also tried to promote the stand-by ones to voters by applying the SQL patch file again to override their roles and reloading the LXD service, but it always ended up back in the current state. I think the most relevant part of the log files on these nodes is this:

t=2020-03-02T18:49:55+0100 lvl=info msg="Initializing local database" 
t=2020-03-02T18:49:55+0100 lvl=info msg="Updating the LXD database schema. Backup made as \"local.db.bak\"" 
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Running pre-update queries from file for local DB schema" 
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Initializing database gateway" 
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Start database node" address=172.28.176.16:8443 id=4 role=voter
t=2020-03-02T18:49:55+0100 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-03-02T18:49:55+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-03-02T18:49:55+0100 lvl=info msg="REST API daemon:" 
t=2020-03-02T18:49:55+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-03-02T18:49:55+0100 lvl=info msg=" - binding TCP socket" socket=172.28.176.16:8443
t=2020-03-02T18:49:55+0100 lvl=info msg="Initializing global database" 
t=2020-03-02T18:49:55+0100 lvl=dbug msg="Dqlite: connected address=172.28.176.177:8443 id=1 attempt=0" 
t=2020-03-02T18:49:56+0100 lvl=dbug msg="Found cert" name=0
t=2020-03-02T18:50:02+0100 lvl=dbug msg="Found cert" name=0
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Database error: failed to update node version info: database is locked" 
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Retry failed db interaction (failed to update node version info: database is locked)" 
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Found cert" name=0
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.177:8443 of state changes" 
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.215:8443 of state changes" 
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.219:8443 of state changes" 
t=2020-03-02T18:50:05+0100 lvl=dbug msg="Notify node 172.28.176.231:8443 of state changes" 

If the fix resolves the issue, I think I’ll just wait.

Honestly, I’m not exactly sure how it ended up in this state. This cluster has been operational since LXD 3.5 or thereabouts. Usually upgrades were seamless; sometimes I needed to run systemctl reload snap.lxd.daemon on all nodes to unstick one of the nodes waiting for another to finish upgrading, but that’s all. The last time I did that was with the upgrade to 3.20.

If you need more logs, debug output, database dumps or any other information, I’ll do my best.

I upgraded LXD to 3.22 (from the “candidate” channel) and it seems to have resolved almost all the issues, thanks!

The only issue I see now is the difference between the logs (or lxd cluster list-database, or the entries in raft_nodes) and the lxc cluster list output. According to the logs/raft_nodes there are 3 voters:

t=2020-03-11T12:47:57+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.28.176.177:8443 Role:voter} {ID:2 Address:172.28.176.215:8443 Role:voter} {ID:4 Address:172.28.176.16:8443 Role:voter} {ID:3 Address:172.28.176.219:8443 Role:stand-by} {ID:5 Address:172.28.176.231:8443 Role:stand-by}]"

But lxc cluster list displays only 2 “database nodes”:

# lxc cluster list
+---------------+-----------------------------+----------+--------+-------------------+--------------+
|     NAME      |             URL             | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-016 | https://172.28.176.16:8443  | YES      | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-177 | https://172.28.176.177:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-215 | https://172.28.176.215:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-219 | https://172.28.176.219:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
| gklab-176-231 | https://172.28.176.231:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------+-----------------------------+----------+--------+-------------------+--------------+
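
For completeness, these are the three views I’m comparing (same commands as above):

# raft view, as cached in local.db
sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"

# database members as LXD reports them
lxd cluster list-database

# cluster view
lxc cluster list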