Snap automatic update of cluster 4.0.1 > 4.0.2 failed terribly

Hello everybody !
I hope you are doing well.

I have a cluster of 3 hosts that went through an automatic snap update, which completely crashed the cluster. This is probably the 5th time a version change has crashed my cluster. Using some pointers from "Snap Update to 4.1 Broke my cluster", I was able to restart the cluster (it felt more like magic than anything else).

However, I am seeing a large number of unexpected log messages, different on each node.
"Unaccounted raft node" and "Could not rebalance cluster member roles" show up constantly. I also see an IP address that doesn’t belong to any node showing up constantly in the local raft_nodes table.

On one node I see this:

t=2020-08-13T18:40:09+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:40:09+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:40:10+0000 lvl=dbug msg="Starting heartbeat round"
t=2020-08-13T18:40:10+0000 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:3 Address:172.30.2.2:8443 Role:voter} {ID:4 Address:172.30.4.7:8443 Role:spare} {ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:40:10+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[172.30.2.2:8443:{ID:3 Address:172.30.2.2:8443 Role:voter} 172.30.4.7:8443:{ID:4 Address:172.30.4.7:8443 Role:spare}]"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Sending heartbeat to mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Sending heartbeat request to mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Successful heartbeat for mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Sending heartbeat to mgnt-lxd-01.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Sending heartbeat request to mgnt-lxd-01.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Successful heartbeat for mgnt-lxd-01.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T18:40:11+0000 lvl=dbug msg="Completed heartbeat round"
t=2020-08-13T18:40:14+0000 lvl=warn msg="Could not rebalance cluster member roles: Failed to assign role: no connection to remote server available (1)"
t=2020-08-13T18:40:16+0000 lvl=info msg="Found node mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 whose role needs to be changed to voter"
t=2020-08-13T18:40:16+0000 lvl=dbug msg="Connecting to a remote LXD over HTTPs"
t=2020-08-13T18:40:16+0000 lvl=dbug msg="Sending request to LXD" etag= method=POST url=https://mgnt-lxd-02.metal.dsi.ic.ac.uk:8443/internal/cluster/assign
t=2020-08-13T18:40:16+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"172.30.4.7:8443\",\n\t\t\t\t\"role\": 2\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T18:40:16+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:40:16+0000 lvl=dbug msg="Found cert" name=0
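The escaped raft_nodes payload in the debug line above is hard to read in its raw form. As a minimal sketch (Python; the function name is made up for illustration, and the stand-by label for role 1 is an assumption based on dqlite's role numbering, since only 0 and 2 are paired with "voter" and "spare" in these logs), it can be decoded like this:

```python
import json
import re

# 0 and 2 appear next to "voter" and "spare" in the heartbeat lines above;
# 1 = stand-by is an assumption based on dqlite's role numbering.
ROLE_NAMES = {0: "voter", 1: "stand-by", 2: "spare"}

# Captures the quoted msg value, allowing backslash-escaped characters inside.
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')

# One debug line copied from the log above, split only for readability.
SAMPLE = (
    r't=2020-08-13T18:40:16+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": ['
    r'\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",'
    r'\n\t\t\t\t\"role\": 0\n\t\t\t},'
    r'\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"172.30.4.7:8443\",'
    r'\n\t\t\t\t\"role\": 2\n\t\t\t},'
    r'\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": '
    r'\"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}'
    r'\n\t\t]\n\t}"'
)


def decode_raft_nodes(line):
    """Return [(id, address, role_name), ...] for one log line, or []."""
    match = MSG_RE.search(line)
    if not match:
        return []
    try:
        # The msg value uses \n, \t and \" escapes, which is also a valid
        # JSON string body: decode it once to get the inner JSON text.
        inner = json.loads('"' + match.group(1) + '"')
        payload = json.loads(inner)
    except json.JSONDecodeError:
        return []  # an ordinary message such as "Found cert"
    if not isinstance(payload, dict):
        return []
    return [
        (n["id"], n["address"], ROLE_NAMES.get(n["role"], str(n["role"])))
        for n in payload.get("raft_nodes", [])
    ]


for node in decode_raft_nodes(SAMPLE):
    print(node)
```

Read this way, the request asks for ID 5 to become a voter while still listing 172.30.4.7:8443 as a spare.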

On the second:

t=2020-08-13T18:38:18+0000 lvl=dbug msg=Handling ip=172.30.2.2:32916 method=POST url=/internal/cluster/assign user=
t=2020-08-13T18:38:18+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"172.30.4.7:8443\",\n\t\t\t\t\"role\": 2\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T18:38:18+0000 lvl=info msg="Changing dqlite raft role" address=mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 id=5 role=voter
t=2020-08-13T18:38:21+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:21+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:38:27+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:27+0000 lvl=dbug msg=Handling ip=172.30.2.2:32930 method=POST url=/internal/cluster/assign user=
t=2020-08-13T18:38:27+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"172.30.4.7:8443\",\n\t\t\t\t\"role\": 2\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T18:38:27+0000 lvl=info msg="Changing dqlite raft role" address=mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 id=5 role=voter
t=2020-08-13T18:38:35+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:35+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:38:41+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:41+0000 lvl=dbug msg=Handling ip=172.30.2.2:32944 method=POST url=/internal/cluster/assign user=
t=2020-08-13T18:38:41+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"172.30.4.7:8443\",\n\t\t\t\t\"role\": 2\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T18:38:41+0000 lvl=info msg="Changing dqlite raft role" address=mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 id=5 role=voter
t=2020-08-13T18:38:44+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:44+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:38:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:53+0000 lvl=dbug msg=Handling ip=172.30.2.2:32960 method=POST url=/internal/cluster/assign user=
t=2020-08-13T18:38:53+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"172.30.4.7:8443\",\n\t\t\t\t\"role\": 2\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T18:38:53+0000 lvl=info msg="Changing dqlite raft role" address=mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 id=5 role=voter
t=2020-08-13T18:38:55+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:55+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:38:58+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:58+0000 lvl=dbug msg=Handling ip=172.30.5.1:53584 method=GET url="/1.0/instances?instance-type=container&project=default&recursion=1" user=
t=2020-08-13T18:38:58+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:38:58+0000 lvl=dbug msg=Handling ip=172.30.5.1:53590 method=GET url="/1.0/instances?instance-type=container&project=default&recursion=1" user=

The third seemed better behaved:

t=2020-08-13T18:53:57+0000 lvl=dbug msg="Found cert" name=39f18526097465b8e7429257897715238785b36241c880e417aa98a90a265b89
t=2020-08-13T18:53:57+0000 lvl=dbug msg=Handling ip=172.30.0.61:39856 method=GET url="/1.0/cluster/members?recursion=1" user=39f18526097465b8e7429257897715238785b36241c880e417aa98a90a265b89
t=2020-08-13T18:54:03+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:54:03+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:54:14+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:54:14+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:54:26+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:54:26+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:54:37+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:54:37+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:54:42+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:54:42+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:54:51+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:54:51+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:55:06+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:55:06+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:55:15+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:55:15+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T18:55:23+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T18:55:23+0000 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:spare}]"

I hope someone can help me check the health of my cluster before yet another crash during an upgrade.

Ok, so to clarify, did you manage to unstick things to get the upgrade through or are you still stuck halfway through the upgrade?

On the surface, the cluster seems functional and the version seems to have successfully gone up to 4.0.2 on all hosts, although two hosts show (16740) while the third shows (16099). All containers have come back online.

But I am concerned that something is unhealthy in the DB. For example, the IP 172.30.4.7 in the logs simply does not exist on my network. The command lxd cluster list-database no longer lists anything. The lxc cluster list command only works on one host, but lxc list works on all.

A bit of background on what I did to bring it back online: at first I ran lxd cluster recover-from-quorum-loss, which failed with Error: This LXD instance has no database role, despite my having chosen the only node listed by lxd cluster list-database. I then manually added entries to raft_nodes based on my best guess of its last state. I ran systemctl reload snap.lxd.daemon on each host after doing so, although I had to use kill -9 on some lxd processes.

Thank you for your help! :smiley:

Ok, so yeah, that may have caused quite a bit of a mess to the database…

Can you show:

  • lxd sql global "SELECT * FROM nodes;" from the one where lxc cluster list is working
  • lxd sql local "SELECT * FROM raft_nodes;" on all of them
  • readlink /proc/$(cat /var/snap/lxd/common/lxd.pid)/exe on all of them
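The third command prints the path of the running LXD binary, and for the snap that path contains the snap revision. A minimal sketch for spotting a host that is on a different revision (Python; the function names are hypothetical, and the example paths are the ones reported further down in this thread) could look like this:

```python
from collections import Counter
import re

# readlink output per host, as reported later in this thread.
RUNNING_BINARIES = {
    "mgnt-lxd-01": "/snap/lxd/16740/bin/lxd",
    "mgnt-lxd-02": "/snap/lxd/16740/bin/lxd",
    "mgnt-lxd-03": "/snap/lxd/16099/bin/lxd",
}


def snap_revision(exe_path):
    """Extract the revision from a path of the form /snap/lxd/<rev>/..."""
    match = re.match(r"/snap/lxd/([^/]+)/", exe_path)
    return match.group(1) if match else None


def mismatched_hosts(binaries):
    """Return {host: revision} for hosts not on the most common revision."""
    revisions = {host: snap_revision(path) for host, path in binaries.items()}
    if not revisions:
        return {}
    majority, _ = Counter(revisions.values()).most_common(1)[0]
    return {host: rev for host, rev in revisions.items() if rev != majority}


print(mismatched_hosts(RUNNING_BINARIES))
```

With only three hosts the comparison is easy to do by eye; the sketch is mostly useful for larger clusters.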

Yes of course:
Here is what I am getting:

root@mgnt-lxd-02# lxd sql global "SELECT * FROM nodes"
+----+-------+-------------+-------------------------------------+--------+----------------+--------------------------------+---------+------+
| id | name  | description |               address               | schema | api_extensions |           heartbeat            | pending | arch |
+----+-------+-------------+-------------------------------------+--------+----------------+--------------------------------+---------+------+
| 3  | WT-02 |             | mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 | 30     | 186            | 2020-08-13T16:49:27.650676591Z | 0       | 2    |
| 4  | OD-06 |             | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 30     | 186            | 2020-08-13T16:49:27.651042358Z | 0       | 2    |
| 5  | OD-05 |             | mgnt-lxd-01.metal.dsi.ic.ac.uk:8443 | 30     | 186            | 2020-08-13T16:49:27.65132993Z  | 0       | 0    |
+----+-------+-------------+-------------------------------------+--------+----------------+--------------------------------+---------+------+
root@mgnt-lxd-02# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------------+------+
| id |               address               | role |
+----+-------------------------------------+------+
| 4  | mgnt-lxd-01.metal.dsi.ic.ac.uk:8443 | 2    |
| 5  | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 0    |
+----+-------------------------------------+------+
root@mgnt-lxd-02:/home/ubuntu# readlink /proc/$(cat /var/snap/lxd/common/lxd.pid)/exe
/snap/lxd/16740/bin/lxd
root@mgnt-lxd-01# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------------+------+
| id |               address               | role |
+----+-------------------------------------+------+
| 3  | mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 | 0    |
| 4  | mgnt-lxd-01.metal.dsi.ic.ac.uk:8443 | 1    |
| 5  | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 2    |
+----+-------------------------------------+------+
root@mgnt-lxd-01:/home/ubuntu# readlink /proc/$(cat /var/snap/lxd/common/lxd.pid)/exe
/snap/lxd/16740/bin/lxd
root@mgnt-lxd-03# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------------+------+
| id |               address               | role |
+----+-------------------------------------+------+
| 3  | 172.30.2.2:8443                     | 0    |
| 4  | 172.30.4.7:8443                     | 2    |
| 5  | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 2    |
+----+-------------------------------------+------+
root@mgnt-lxd-03:/home/ubuntu# readlink /proc/$(cat /var/snap/lxd/common/lxd.pid)/exe
/snap/lxd/16099/bin/lxd

There is a /etc/hosts file on all hosts containing:

172.30.5.1      mgnt-lxd-01.metal.dsi.ic.ac.uk
172.30.5.0      mgnt-lxd-02.metal.dsi.ic.ac.uk
172.30.2.2      mgnt-lxd-03.metal.dsi.ic.ac.uk
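The tables above can be cross-checked mechanically. The following is a minimal sketch (Python, with the table contents and /etc/hosts entries copied in by hand; the function names and the status labels "ok", "ip-alias" and "unknown" are made up for illustration). It assumes, based on the "Unaccounted raft node(s)" error, that addresses are compared as plain strings, so an IP address and a DNS name for the same host count as a mismatch:

```python
# /etc/hosts entries, as listed above.
HOSTS = {
    "172.30.5.1": "mgnt-lxd-01.metal.dsi.ic.ac.uk",
    "172.30.5.0": "mgnt-lxd-02.metal.dsi.ic.ac.uk",
    "172.30.2.2": "mgnt-lxd-03.metal.dsi.ic.ac.uk",
}

# Addresses from the global "nodes" table.
NODES = {
    "mgnt-lxd-01.metal.dsi.ic.ac.uk:8443",
    "mgnt-lxd-02.metal.dsi.ic.ac.uk:8443",
    "mgnt-lxd-03.metal.dsi.ic.ac.uk:8443",
}

# Addresses from the local "raft_nodes" table on each host.
RAFT_NODES = {
    "mgnt-lxd-01": [
        "mgnt-lxd-03.metal.dsi.ic.ac.uk:8443",
        "mgnt-lxd-01.metal.dsi.ic.ac.uk:8443",
        "mgnt-lxd-02.metal.dsi.ic.ac.uk:8443",
    ],
    "mgnt-lxd-02": [
        "mgnt-lxd-01.metal.dsi.ic.ac.uk:8443",
        "mgnt-lxd-02.metal.dsi.ic.ac.uk:8443",
    ],
    "mgnt-lxd-03": [
        "172.30.2.2:8443",
        "172.30.4.7:8443",
        "mgnt-lxd-02.metal.dsi.ic.ac.uk:8443",
    ],
}


def resolve(address, hosts):
    """Replace an IP known to /etc/hosts with its name, keeping the port."""
    host, _, port = address.rpartition(":")
    return f"{hosts.get(host, host)}:{port}"


def classify(address, nodes, hosts):
    """'ok' if listed as-is, 'ip-alias' if only its name is listed, else 'unknown'."""
    if address in nodes:
        return "ok"
    if resolve(address, hosts) in nodes:
        return "ip-alias"
    return "unknown"


def suspicious_entries(raft_nodes, nodes, hosts):
    """Return (host, address, status) for every raft_nodes entry that is not 'ok'."""
    return [
        (host, address, status)
        for host, addresses in raft_nodes.items()
        for address in addresses
        if (status := classify(address, nodes, hosts)) != "ok"
    ]


def missing_members(addresses, nodes, hosts):
    """Cluster members with no matching raft_nodes entry, even after resolving IPs."""
    return sorted(nodes - {resolve(a, hosts) for a in addresses})


for entry in suspicious_entries(RAFT_NODES, NODES, HOSTS):
    print(entry)
for host, addresses in RAFT_NODES.items():
    print(host, "is missing", missing_members(addresses, NODES, HOSTS))
```

On this data the sketch flags 172.30.2.2:8443 on mgnt-lxd-03 as an IP form of a known member and 172.30.4.7:8443 as matching nothing at all.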

Thank you for your very prompt reply !

I am also seeing multiple of these:

t=2020-08-13T21:47:39+0000 lvl=dbug msg="Database error: protocol.Error{Code:5, Message:\"database is locked\"}"
t=2020-08-13T21:47:39+0000 lvl=dbug msg="Retry failed db interaction (database is locked)"

Likewise, I’ve witnessed some sort of stuttering where LXD restarts on its own, comes up in a failed state, and then restarts again and succeeds:

t=2020-08-13T22:16:47+0000 lvl=dbug msg="Listening for events on node mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T22:16:47+0000 lvl=dbug msg="Connecting to a remote LXD over HTTPs"
t=2020-08-13T22:16:47+0000 lvl=dbug msg="Connected to the websocket: wss://mgnt-lxd-01.metal.dsi.ic.ac.uk:8443/1.0/events?project=%2A"
t=2020-08-13T22:16:47+0000 lvl=dbug msg="Listening for events on node mgnt-lxd-01.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T22:16:50+0000 lvl=dbug msg="Starting heartbeat round"
t=2020-08-13T22:16:50+0000 lvl=dbug msg="Heartbeat updating local raft nodes to [{ID:3 Address:172.30.2.2:8443 Role:voter} {ID:5 Address:mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 Role:voter} {ID:4 Address:mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 Role:spare}]"
t=2020-08-13T22:16:50+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[172.30.2.2:8443:{ID:3 Address:172.30.2.2:8443 Role:voter}]"
t=2020-08-13T22:16:51+0000 lvl=info msg="Found node mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 whose role needs to be changed to voter"
t=2020-08-13T22:16:51+0000 lvl=dbug msg="Connecting to a remote LXD over HTTPs"
t=2020-08-13T22:16:51+0000 lvl=dbug msg="Sending request to LXD" etag= method=POST url=https://mgnt-lxd-03.metal.dsi.ic.ac.uk:8443/internal/cluster/assign
t=2020-08-13T22:16:51+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"mgnt-lxd-03.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T22:16:51+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:51+0000 lvl=dbug msg=Handling ip=172.30.2.2:44494 method=POST url=/internal/cluster/assign user=
t=2020-08-13T22:16:51+0000 lvl=dbug msg="\n\t{\n\t\t\"raft_nodes\": [\n\t\t\t{\n\t\t\t\t\"id\": 3,\n\t\t\t\t\"address\": \"172.30.2.2:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 5,\n\t\t\t\t\"address\": \"mgnt-lxd-02.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"id\": 4,\n\t\t\t\t\"address\": \"mgnt-lxd-03.metal.dsi.ic.ac.uk:8443\",\n\t\t\t\t\"role\": 0\n\t\t\t}\n\t\t]\n\t}"
t=2020-08-13T22:16:51+0000 lvl=info msg="Changing dqlite raft role" address=mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 id=4 role=voter
t=2020-08-13T22:16:51+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:51+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:51+0000 lvl=dbug msg="Found cert" name=0
tail: /var/snap/lxd/common/lxd/logs/lxd.log: file truncated
t=2020-08-13T22:16:53+0000 lvl=info msg="LXD 4.0.2 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-08-13T22:16:53+0000 lvl=info msg="Kernel uid/gid map:"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - u 0 0 4294967295"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - g 0 0 4294967295"
t=2020-08-13T22:16:53+0000 lvl=info msg="Configured LXD uid/gid map:"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-08-13T22:16:53+0000 lvl=info msg="Kernel features:"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - netnsid-based network retrieval: yes"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - pidfds: no"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - uevent injection: yes"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - seccomp listener: yes"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - seccomp listener continue syscalls: yes"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - cgroup layout: hybrid"
t=2020-08-13T22:16:53+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored"
t=2020-08-13T22:16:53+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - shiftfs support: disabled"
t=2020-08-13T22:16:53+0000 lvl=info msg="Initializing local database"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Initializing database gateway"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Start database node" address=mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 id=4 role=voter
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Connecting to a local LXD over a Unix socket"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Sending request to LXD" etag= method=GET url=http://unix.socket/1.0
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Detected stale unix socket, deleting"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Detected stale unix socket, deleting"
t=2020-08-13T22:16:53+0000 lvl=info msg="Starting cluster handler:"
t=2020-08-13T22:16:53+0000 lvl=info msg="Starting /dev/lxd handler:"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-08-13T22:16:53+0000 lvl=info msg="REST API daemon:"
t=2020-08-13T22:16:53+0000 lvl=info msg=" - binding Unix socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-08-13T22:16:53+0000 lvl=info msg=" - binding TCP socket" socket=172.30.2.2:8443
t=2020-08-13T22:16:53+0000 lvl=info msg="Initializing global database"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:53+0000 lvl=warn msg="Dqlite: attempt 0: server 172.30.2.2:8443: no known leader"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: connect to reported leader mgnt-lxd-03.metal.dsi.ic.ac.uk:8443"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:53+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: reported leader server is not the leader"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:53+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:53+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-13T22:16:53+0000 lvl=warn msg="Dqlite: attempt 1: server 172.30.2.2:8443: no known leader"
t=2020-08-13T22:16:54+0000 lvl=dbug msg="Found cert" name=0

Yeah, that’s indeed quite inconsistent and shows that some changes have certainly been done by hand :slight_smile:

It also confirms that 03 is on a much much older revision of LXD.
I’d strongly recommend starting by solving that.

See what snap refresh lxd does on it. If that fails, look at snap changes for anything stuck, you can use snap abort ID to hopefully unblock that and re-try the refresh.
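Finding the change ID to pass to snap abort can be scripted. This is a minimal sketch (Python; the function name is hypothetical, and the sample text follows the column layout of the snap changes output posted in this thread) that lists changes still in the Doing state:

```python
import re

# Sample in the layout of `snap changes` output from this thread.
SAMPLE = """\
ID   Status  Spawn               Ready               Summary
77   Done    today at 16:09 UTC  today at 16:09 UTC  Running service command
79   Done    today at 22:33 UTC  today at 22:33 UTC  Change configuration of "core" snap
80   Doing   today at 22:51 UTC  -                   Refresh "lxd" snap
"""


def in_progress_changes(text, status="Doing"):
    """Return [(change_id, summary), ...] for rows with the given status."""
    found = []
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by two or more spaces; cap the split at
        # five fields so the summary stays in one piece.
        fields = re.split(r"\s{2,}", line.strip(), maxsplit=4)
        if len(fields) == 5 and fields[1] == status:
            found.append((fields[0], fields[4]))
    return found


for change_id, summary in in_progress_changes(SAMPLE):
    # Only prints the suggested command; nothing is aborted automatically.
    print(f"snap abort {change_id}  # {summary}")
```

Printing the command rather than running it is deliberate here, since aborting a refresh is something to decide case by case.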

So this stayed hanging:

root@mgnt-lxd-03# snap refresh lxd
Start snap "lxd" (16740) services 
root@mgnt-lxd-03# snap changes
ID   Status  Spawn               Ready               Summary
77   Done    today at 16:09 UTC  today at 16:09 UTC  Running service command
78   Done    today at 16:32 UTC  today at 16:32 UTC  Running service command
79   Done    today at 22:33 UTC  today at 22:33 UTC  Change configuration of "core" snap
80   Doing   today at 22:51 UTC  -                   Refresh "lxd" snap

I aborted it after 10 minutes, and nothing happened.
ps still shows the refresh process running. The shell in which I triggered the refresh froze.

root@mgnt-lxd-03# ps x | grep lxd
1599697 ?        Sl     2:21 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
1647878 ?        Ss     0:00 SCREEN -S lxd-recovery
1651652 ?        Ss     0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers dsi-gateway-02
1651846 ?        Ss     0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers dsi-smtp-01
1652115 ?        Ss     0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers juju-9469cb-2
1652453 ?        Ss     0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers maas-controller-02
1652752 ?        Ss     0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers mongo-config-03
1680753 pts/0    S+     0:00 screen -rf lxd-recovery
1684573 pts/1    Sl+    0:19 snap refresh lxd
1685119 ?        S      0:00 systemctl start snap.lxd.activate.service
1685120 ?        Ss     0:00 /bin/sh /snap/lxd/16740/commands/daemon.activate
1685182 ?        SLl    0:01 lxd activateifneeded
1685200 ?        Ss     0:00 /bin/sh /snap/lxd/16740/commands/daemon.start
1685334 ?        SLl    3:17 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd --debug
1685335 ?        SLl    0:00 lxd waitready
1685336 ?        S      0:00 /bin/sh /snap/lxd/16740/commands/daemon.start
1687457 pts/3    S+     0:00 grep --color=auto lxd

Ok, looks like that one isn’t too willing to start back up.

What’s in journalctl -u snap.lxd.daemon -n 300 and in the last few /var/snap/lxd/common/lxd/logs files?
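Journal output from a daemon that cannot reach the database tends to be very repetitive, so it can help to condense it before posting. A minimal sketch (Python; the function name is made up, and the sample lines are in the format of the lxd.daemon journal entries shown in this thread) that counts Dqlite connection warnings per server and reason:

```python
from collections import Counter
import re

# Matches e.g. msg="Dqlite: attempt 6: server host:8443: no known leader"
DQLITE_RE = re.compile(r'msg="Dqlite: attempt \d+: server (\S+?): ([^"]+)"')

# Sample lines in the format of the journal entries from this thread.
SAMPLE_LINES = [
    'Aug 14 07:22:36 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:36+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"',
    'Aug 14 07:22:36 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:36+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"',
    'Aug 14 07:22:37 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:37+0000 lvl=warn msg="Failed connecting to global database (attempt 2310): failed to create dqlite connection: no available dqlite leader server found"',
    'Aug 14 07:22:39 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:39+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"',
]


def summarize(lines):
    """Count Dqlite warnings, keyed by (server address, reason)."""
    counts = Counter()
    for line in lines:
        match = DQLITE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


for (server, reason), count in summarize(SAMPLE_LINES).most_common():
    print(f"{count:4d}  {server}  {reason}")
```

The same function could be fed the lines of sys.stdin to summarize real journalctl output piped into the script.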

I’m not totally sure what happened here, surely there’s a mix of IP node addresses and DNS names that shouldn’t be there.

Please also note that lxd cluster recover-from-quorum-loss should only be used:

if you are *absolutely* certain that this is
the only database node left in your cluster AND that other database nodes will
never come back (i.e. their LXD daemon won't ever be started again).

which is the message printed by the command before doing anything. There might be cases where it’s still possible to run that command, but please ask here before doing so.

If you experience an upgrade issue, please let us know as soon as it happens so we can provide better assistance and possibly fix bugs.

As @stgraber says, the first step is to upgrade the snap version on that node. Then we can force a new consistent state from there.

Since I attempted the refresh, none of the hosts are responding to lxc commands (but all my containers are still up, which is a consolation). The logs show the following:

For the journalctl:

root@mgnt-lxd-01#journalctl -u snap.lxd.daemon -n 300
Aug 14 07:22:32 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:32+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:32 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:32+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:33 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:33+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:33 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:33+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:34 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:34+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:34 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:34+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:35 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:35+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:35 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:35+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:36 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:36+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:36 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:36+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:37 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:37+0000 lvl=warn msg="Failed connecting to global database (attempt 2310): failed to create dqlite connection: no available dqlite leader server found"
Aug 14 07:22:39 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:39+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:39 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:39+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:39 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:39+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:39 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:39+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:40 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:40+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:40 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:40+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:41 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:41+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:41 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:41+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:42 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:42+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:42 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:42+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:43 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:43+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:43 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:43+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:44 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:44+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:44 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:44+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:45 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:45+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:45 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:45+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:46 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:46+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:46 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:46+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:48 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:48+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:48 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:48+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:49 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:49+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:49 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:49+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:52 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:52+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:52 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:52+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:52 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:52+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:52 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:52+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:53 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:53+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:53 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:53+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:54 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:54+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:54 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:54+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:55 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:55+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:55 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:55+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:56 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:56+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:56 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:56+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:57 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:57+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:57 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:57+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:58 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:58+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:58 od-05 lxd.daemon[366470]: t=2020-08-14T07:22:58+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
root@mgnt-lxd-02# journalctl -u snap.lxd.daemon -n 300
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:47 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:47+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Database error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server foun>
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:44:48 od-06 lxd.daemon[2429546]: t=2020-08-14T07:44:48+0000 lvl=dbug msg="Found cert" name=0
root@mgnt-lxd-03# journalctl -u snap.lxd.daemon -n 300
Aug 14 07:22:19 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:19+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:19 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:19+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=warn msg="Dqlite: attempt 5: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:20 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:20+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:21 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:21+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:21 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:21+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=warn msg="Dqlite: attempt 6: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:22 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:22+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=warn msg="Dqlite: attempt 7: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:23 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:23+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:24 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:24+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:24 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:24+0000 lvl=warn msg="Dqlite: attempt 8: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:24 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:24+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:24 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:24+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:24 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:24+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:24 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:24+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:25 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:25+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:25 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:25+0000 lvl=warn msg="Dqlite: attempt 9: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:25 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:25+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:25 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:25+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:25 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:25+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:25 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:25+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:26 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:26+0000 lvl=warn msg="Failed connecting to global database (attempt 2430): failed to create dqlite connection: no available dqlite leader server found"
Aug 14 07:22:26 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:26+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:27 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:27+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:27 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:27+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:27 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:27+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:27 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:27+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:28 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:28+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:28 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:28+0000 lvl=warn msg="Dqlite: attempt 0: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:28 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:28+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:28 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:28+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:28 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:28+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:28 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:28+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=warn msg="Dqlite: attempt 1: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=warn msg="Dqlite: attempt 2: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
Aug 14 07:22:29 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:29+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:30 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:30+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:30 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:30+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:30 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:30+0000 lvl=dbug msg="Found cert" name=0
Aug 14 07:22:30 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:30+0000 lvl=warn msg="Dqlite: attempt 3: server 172.30.2.2:8443: no known leader"
Aug 14 07:22:30 wt-02 lxd.daemon[1688568]: t=2020-08-14T07:22:30+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"

For tail:

root@mgnt-lxd-01# tail -n 40 /var/snap/lxd/common/lxd/logs/lxd.log
t=2020-08-14T07:28:22+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:22+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:23+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:23+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:24+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:24+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:25+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:25+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:26+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:26+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:27+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:27+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:28+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:29+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:30+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:30+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:35+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:35+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:36+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:36+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:37+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:37+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:38+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:38+0000 lvl=warn msg="Dqlite: attempt 5: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:39+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:39+0000 lvl=warn msg="Dqlite: attempt 6: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:40+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:40+0000 lvl=warn msg="Dqlite: attempt 7: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:41+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:41+0000 lvl=warn msg="Dqlite: attempt 8: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:42+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:42+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:43+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:44+0000 lvl=warn msg="Dqlite: attempt 10: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
root@mgnt-lxd-02# tail -n 40 /var/snap/lxd/common/lxd/logs/lxd.log
t=2020-08-14T07:43:46+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:43:47+0000 lvl=dbug msg="Failed to fetch leader address from mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
root@mgnt-lxd-03# tail -n 40 /var/snap/lxd/common/lxd/logs/lxd.log
t=2020-08-14T07:28:28+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:28+0000 lvl=warn msg="Dqlite: attempt 9: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:28+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:29+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:29+0000 lvl=warn msg="Failed connecting to global database (attempt 2460): failed to create dqlite connection: no available dqlite leader server found"
t=2020-08-14T07:28:30+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:30+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:31+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:31+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:31+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:31+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 0: server 172.30.2.2:8443: no known leader"
t=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:31+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:32+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 1: server 172.30.2.2:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 2: server 172.30.2.2:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 2: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:32+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:33+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:33+0000 lvl=warn msg="Dqlite: attempt 3: server 172.30.2.2:8443: no known leader"
t=2020-08-14T07:28:33+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:33+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:33+0000 lvl=warn msg="Dqlite: attempt 3: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:34+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:34+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:34+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 4: server 172.30.2.2:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"
t=2020-08-14T07:28:34+0000 lvl=dbug msg="Found cert" name=0
t=2020-08-14T07:28:34+0000 lvl=warn msg="Dqlite: attempt 4: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"
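
Since these logs are very repetitive, a small script can make it easier to see which server addresses each node is actually trying to reach. The following is only a minimal sketch in Python: the function name count_no_leader and the sample list are made up for illustration, and the regular expression simply follows the message format visible in the lines above.

```python
import re
from collections import Counter

# Matches the dqlite warnings shown in the logs, for example:
#   msg="Dqlite: attempt 4: server host:8443: no known leader"
NO_LEADER = re.compile(
    r'msg="Dqlite: attempt \d+: server (?P<server>\S+): no known leader"'
)


def count_no_leader(lines):
    """Return a Counter mapping server address -> number of warnings seen."""
    counts = Counter()
    for line in lines:
        match = NO_LEADER.search(line)
        if match:
            counts[match.group("server")] += 1
    return counts


# A few lines copied from the mgnt-lxd-03 log above.
sample = [
    't=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 0: server 172.30.2.2:8443: no known leader"',
    't=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"',
    't=2020-08-14T07:28:31+0000 lvl=dbug msg="Found cert" name=0',
    't=2020-08-14T07:28:31+0000 lvl=warn msg="Dqlite: attempt 0: server mgnt-lxd-03.metal.dsi.ic.ac.uk:8443: no known leader"',
    't=2020-08-14T07:28:32+0000 lvl=warn msg="Dqlite: attempt 1: server mgnt-lxd-02.metal.dsi.ic.ac.uk:8443: no known leader"',
]

for server, hits in sorted(count_no_leader(sample).items()):
    print(f"{server}: {hits}")
```

To run it against a real log, pass an open file instead of the sample list, for example count_no_leader(open(path)) with the lxd.log path used in the tail commands above. In the excerpts posted here, the 172.30.2.2:8443 address only shows up in the output from mgnt-lxd-03.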

The raft_nodes table also seems to have slightly changed with this:

root@mgnt-lxd-01#  lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------------+------+
| id |               address               | role |
+----+-------------------------------------+------+
| 4  | mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 | 2    |
| 5  | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 0    |
+----+-------------------------------------+------+
root@mgnt-lxd-02#  lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------------+------+
| id |               address               | role |
+----+-------------------------------------+------+
| 4  | mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 | 2    |
| 5  | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 0    |
+----+-------------------------------------+------+
root@mgnt-lxd-03#  lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------------+------+
| id |               address               | role |
+----+-------------------------------------+------+
| 3  | 172.30.2.2:8443                     | 0    |
| 4  | mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 | 0    |
| 5  | mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 | 0    |
+----+-------------------------------------+------+

On the last one, node IDs 3 and 4 have the addresses 172.30.2.2 and mgnt-lxd-03.metal.dsi.ic.ac.uk, but both are the same machine. I also checked which nodes started as database nodes and what their IDs were:

root@mgnt-lxd-01# grep -rn 'Start database node' /var/snap/lxd/common/lxd/logs/*
root@mgnt-lxd-02# grep -rn 'Start database node' /var/snap/lxd/common/lxd/logs/*
/var/snap/lxd/common/lxd/logs/lxd.log:21:t=2020-08-13T22:33:58+0000 lvl=dbug msg="Start database node" address=mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 id=5 role=spare
/var/snap/lxd/common/lxd/logs/lxd.log.1:21:t=2020-08-13T22:22:27+0000 lvl=dbug msg="Start database node" address=mgnt-lxd-02.metal.dsi.ic.ac.uk:8443 id=5 role=spare
root@mgnt-lxd-03# grep -rn 'Start database node' /var/snap/lxd/common/lxd/logs/*
/var/snap/lxd/common/lxd/logs/lxd.log:21:t=2020-08-13T23:12:40+0000 lvl=dbug msg="Start database node" address=mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 id=4 role=voter
/var/snap/lxd/common/lxd/logs/lxd.log.1:21:t=2020-08-13T22:51:43+0000 lvl=dbug msg="Start database node" address=mgnt-lxd-03.metal.dsi.ic.ac.uk:8443 id=4 role=voter

Consequently it looks like the raft_nodes table is not being properly populated.
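
To make the mismatch easier to see, the three raft_nodes tables above can be compared side by side. This is only a rough sketch: the rows are copied from the lxd sql output above, the helper find_disagreements is a made-up name, and the role mapping (0 = voter, 1 = stand-by, 2 = spare) is an assumption based on the dqlite role constants rather than something confirmed in this thread.

```python
# Assumed mapping of the numeric role column (from the dqlite role constants).
ROLE_NAMES = {0: "voter", 1: "stand-by", 2: "spare"}

LXD_02 = "mgnt-lxd-02.metal.dsi.ic.ac.uk:8443"
LXD_03 = "mgnt-lxd-03.metal.dsi.ic.ac.uk:8443"

# raft_nodes rows as reported by each node: (id, address, role).
TABLES = {
    "mgnt-lxd-01": [(4, LXD_03, 2), (5, LXD_02, 0)],
    "mgnt-lxd-02": [(4, LXD_03, 2), (5, LXD_02, 0)],
    "mgnt-lxd-03": [(3, "172.30.2.2:8443", 0), (4, LXD_03, 0), (5, LXD_02, 0)],
}


def find_disagreements(tables):
    """Return {raft_id: {host: (address, role) or None}} where hosts differ."""
    all_ids = sorted({row[0] for rows in tables.values() for row in rows})
    result = {}
    for raft_id in all_ids:
        views = {}
        for host, rows in tables.items():
            row = {r[0]: r for r in rows}.get(raft_id)
            if row is None:
                views[host] = None  # this host has no entry for the id
            else:
                views[host] = (row[1], ROLE_NAMES.get(row[2], str(row[2])))
        # Keep only the ids on which the hosts do not all agree.
        if len(set(views.values())) > 1:
            result[raft_id] = views
    return result


for raft_id, views in find_disagreements(TABLES).items():
    print(f"raft id {raft_id}:")
    for host, view in views.items():
        print(f"  {host}: {view}")
```

With the data above, IDs 3 and 4 are reported: ID 3 exists only on mgnt-lxd-03, and ID 4 has role 2 on two nodes but role 0 on mgnt-lxd-03. ID 5 is identical everywhere.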

I will be waiting for your guidance.
Thank you very much for your help !

Hello @freeekanayaka,

Thank you very much for your guidance.
Regarding the refresh of the snap: after attempting and aborting it many times, I believe I finally managed to push it through by killing the lxd waitready process, which was triggering an endless loop of sleep 1. Not exactly the greatest thing, but the lxd process is now running the correct version.

Great. Is the cluster somehow functional? I.e. can you run lxc cluster list on at least one node?

Since attempting the snap refresh lxd as suggested by @stgraber, none of the nodes respond to lxc cluster list anymore.

One (mgnt-lxd-02) responded Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found, the other two hang forever.

Okay.

Please can you paste the output of:

sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT value FROM config WHERE key = 'core.https_address'"
sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT value FROM config WHERE key = 'cluster.https_address'"
sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"

on each node?
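
If it is more convenient, the same information can be collected with a short Python script instead of three sqlite3 calls. This is just a sketch under a few assumptions: read_local_state is a made-up helper name, the config table is assumed to have key and value columns, and raft_nodes is assumed to have id, address and role columns. The database is opened read-only so nothing gets modified.

```python
import sqlite3


def read_local_state(db_path):
    """Open local.db read-only and return (addresses, raft_nodes).

    addresses  -- dict with the core/cluster https_address config values
    raft_nodes -- list of (id, address, role) tuples ordered by id
    """
    # mode=ro makes SQLite refuse any write to the file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        addresses = dict(
            conn.execute(
                "SELECT key, value FROM config "
                "WHERE key IN ('core.https_address', 'cluster.https_address')"
            )
        )
        raft_nodes = conn.execute(
            "SELECT id, address, role FROM raft_nodes ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    return addresses, raft_nodes


# Example call (the path has to point at the node's own local.db):
#   addresses, raft_nodes = read_local_state("/path/to/local.db")
#   print(addresses)
#   print(raft_nodes)
```

Running it on each node and comparing the results gives the same picture as the sqlite3 commands.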

Of course, here is the output that I get :

root@mgnt-lxd-01# sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config"
3|core.https_address|mgnt-lxd-01.metal.dsi.ic.ac.uk:8443
4|cluster.https_address|mgnt-lxd-01.metal.dsi.ic.ac.uk:8443

root@mgnt-lxd-01# sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
4|mgnt-lxd-03.metal.dsi.ic.ac.uk:8443|2
5|mgnt-lxd-02.metal.dsi.ic.ac.uk:8443|0
root@mgnt-lxd-02# sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config"
20|cluster.https_address|mgnt-lxd-02.metal.dsi.ic.ac.uk:8443
21|core.https_address|mgnt-lxd-02.metal.dsi.ic.ac.uk:8443

root@mgnt-lxd-02# sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
4|mgnt-lxd-03.metal.dsi.ic.ac.uk:8443|2
5|mgnt-lxd-02.metal.dsi.ic.ac.uk:8443|0

root@mgnt-lxd-03# sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config"
1|core.https_address|mgnt-lxd-03.metal.dsi.ic.ac.uk:8443
2|cluster.https_address|mgnt-lxd-03.metal.dsi.ic.ac.uk:8443

root@mgnt-lxd-03# sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
3|172.30.2.2:8443|0
4|mgnt-lxd-03.metal.dsi.ic.ac.uk:8443|0
5|mgnt-lxd-02.metal.dsi.ic.ac.uk:8443|0

Thank you for the support !
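
For anyone comparing these dumps by hand, here is a minimal sketch that lists raft_nodes addresses which do not match any address the nodes report in their config table. The function name unknown_raft_addresses and both input file names are hypothetical; the inputs are assumed to be plain text files saved from the sqlite3 commands above.

```shell
#!/bin/sh
# Hypothetical helper, not part of LXD.
# $1: file with one known cluster address per line
#     (the cluster.https_address values collected from every node)
# $2: file with raft_nodes rows as printed by sqlite3: id|address|role
# Prints each raft_nodes address that is not in the known list.
unknown_raft_addresses() {
    known="$1"
    rows="$2"
    # Field 2 is the address; -F -x compares whole lines literally,
    # -v keeps only the addresses that have no match in the known list.
    cut -d'|' -f2 "$rows" | sort -u | grep -F -x -v -f "$known" || true
}

# Example usage (file names are placeholders):
#   sqlite3 /var/snap/lxd/common/lxd/database/local.db \
#       "SELECT * FROM raft_nodes" > raft_rows.txt
#   unknown_raft_addresses known_addresses.txt raft_rows.txt
```

Fed with the three cluster.https_address values above and the raft_nodes rows from mgnt-lxd-03, this prints 172.30.2.2:8443. It only reports the mismatch; it does not say which side is wrong.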

Please try to:

  1. Shut down all lxd processes (if systemctl stop snap.lxd.daemon is not enough, kill them by hand).
  2. Back up /var/snap/lxd/common/lxd/database on all three nodes.
  3. Run rm -r /var/snap/lxd/common/lxd/database/global on mgnt-lxd-01 and on mgnt-lxd-03.
  4. Run lxd cluster recover-from-quorum-loss on mgnt-lxd-02.
  5. Restart lxd on mgnt-lxd-02 and confirm it's working fine.
  6. Restart lxd on mgnt-lxd-01 and mgnt-lxd-03 as well.
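
Since step 3 deletes data, the backup in step 2 is worth double-checking before going further. The following is only a sketch of one way to take that backup; the function name backup_database and the timestamped destination naming are assumptions, not something LXD provides.

```shell
#!/bin/sh
# Hypothetical helper, not part of LXD.
# $1: directory to back up (the LXD database directory)
# $2: existing directory in which to place the copy
# Copies $1 to $2/<name>.bak-<timestamp> and prints the new path.
backup_database() {
    src="$1"
    dest_parent="$2"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$dest_parent/$(basename "$src").bak-$stamp"
    # -R copies recursively, -p preserves ownership, mode and timestamps.
    cp -Rp "$src" "$dest" && echo "$dest"
}

# Example usage, run as root on each node after lxd has been stopped.
# The source path is the one used in the sqlite3 commands earlier in
# this thread; /root is an arbitrary destination:
#   backup_database /var/snap/lxd/common/lxd/database /root
```

Checking that the printed directory contains both local.db and the global directory before running the rm -r in step 3 is a cheap sanity test.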

Before doing this, my understanding is that this will effectively kill all my active containers. As this is a production cluster and the containers on it are critical to business, can I confirm with you this is indeed what will happen ?

If that is the case I will have to attempt this during a low-traffic period. Is there any documentation out there about recovering containers from a node that used to be part of a cluster (i.e. is there any failsafe that would let me start containers on these nodes regardless of whether the cluster is working)? That would be a sort of intermediate step towards full recovery.

Thank you for your guidance.

Yes, that procedure will shut down your containers too (although they will be restarted automatically when the procedure completes).

@stgraber is there a way to stop the daemon without stopping containers? I thought that pkill -TERM lxd would do that, but then systemd wants to restart the daemon. Maybe the systemd unit needs to be disabled as well?

You can trick systemd/snapd into this by doing:

  • echo shutdown > /var/snap/lxd/common/state
  • systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket