LXD daemon fails to start with "Dqlite: attempt 0: server XX.XX.XX.XX:8443: no known leader"

leopaul36 · October 10, 2020, 6:20pm

Hello,

Since the LXD snap release 4.6 (rev. 17629), I am unable to start the LXD daemon.

I have a 3-node cluster, and the master node’s LXD daemon fails to start with:

Oct 10 18:05:06 n4 lxd.daemon[6442]: t=2020-10-10T18:05:06+0000 lvl=warn msg="Dqlite: attempt 0: server 51.178.21.158:8443: no known leader"

The IP address is the one from the local node which seems to be my LXD cluster master:

sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
1|51.178.21.158:8443|0
2|51.178.21.160:8443|0
3|51.178.21.159:8443|0

sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config"
1|cluster.https_address|51.178.21.158:8443
2|core.https_address|51.178.21.158:8443

Here is what I gathered from journalctl -u snap.lxd.daemon at the time of the snap update:

Reloaded LXCFS
 => Starting LXD
 t=2020-10-07T19:00:55+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored"
 t=2020-10-07T19:00:55+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
 t=2020-10-07T19:00:55+0000 lvl=warn msg="Dqlite: attempt 0: server 51.178.21.158:8443: no known leader"
 t=2020-10-07T19:01:35+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: call exec-sql (budget 0s): receive: header: EOF"
 panic: runtime error: invalid memory address or nil pointer dereference
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x95ce40]
 goroutine 450 [running]:
 github.com/lxc/lxd/lxd/db.(*Cluster).Transaction(0x0, 0xc000c2fcf0, 0x0, 0x0)
         /build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/db/db.go:349 +0x40
 github.com/lxc/lxd/lxd/cluster.(*Gateway).heartbeat(0xc000130dc0, 0x1756c60, 0xc000464440, 0x0)
         /build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/heartbeat.go:268 +0x627
 github.com/lxc/lxd/lxd/cluster.HeartbeatTask.func1.1(0xc000130dc0, 0x1756c60, 0xc000464440, 0xc0008a2180)
         /build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/heartbeat.go:177 +0x45
 created by github.com/lxc/lxd/lxd/cluster.HeartbeatTask.func1
         /build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxd/cluster/heartbeat.go:176 +0x9c
 => LXD failed to start

So far I have only tried restarting the LXD daemon on the master node, without success. All the containers on this node are now down, so I did not restart the other two nodes.

How can I get LXD back up?

Thanks for your help.

Léo

stgraber · October 11, 2020, 2:21am

There is no such concept as master with dqlite.

For the database to be functional, you need a quorum of the database servers.
You have a cluster with three database servers, therefore you need a minimum of two to be started for the database to be accessible.
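
As a rough illustration of the arithmetic above, here is a minimal sketch that assumes quorum is a simple majority of the database servers (consistent with the three-servers, two-required figure). The `quorum` function name and the `voters` variable are made up for this example and are not part of LXD or dqlite.

```shell
# Hypothetical helper: smallest number of database servers that forms a
# simple majority of n servers (integer division, then add one).
quorum() {
  n=$1
  echo $(( n / 2 + 1 ))
}

# Example with the three database servers from this thread.
voters=3
echo "voters=$voters quorum=$(quorum "$voters")"
```

With three database servers this gives two, so a single restarted node cannot bring the database up on its own while the other two daemons are unavailable.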

wdavidw · October 11, 2020, 9:02am

Thank you Stéphane. I work with Leo and I looked at the issue this morning. I was so cautious that I didn’t touch the other nodes for some time. Once I came upon your answer and reloaded every LXD instance with systemctl reload snap.lxd.daemon, the quorum was satisfied and everything was back on track.

leopaul36 · October 11, 2020, 10:52am

Indeed, as @wdavidw said, we were cautious about restarting the other two LXD daemons because the first restart did not go smoothly, but running systemctl reload snap.lxd.daemon on all three cluster nodes did the trick.

Thanks for your help @stgraber