I've broken my cluster!

I created a physical network for OVN and broke networking on one of my nodes. LXD won’t start on that node, but fortunately the other 2 nodes seem to have escaped the problem.

Am I safe just deleting the lxd global database on that node and restarting (as well as the OVN NB & SB databases)? Will that re-create the database and sync it from the other 2 OK nodes, or is it more complex than that?
(Is the cluster config for that node in the local database?)

… or is there a better way, for example deleting lines from the database using sqlite?

Thanks
David

What’s failing exactly?
Can you do:

  • systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
  • lxd --debug --group lxd

On the affected system?

Any snap lxd or lxc/d command just hangs on the affected node. lxc commands on the other nodes hang. They’re complaining that the broken node isn’t replying.

journalctl on one of the other nodes:

May 13 18:38:13.092186 grantham lxd.daemon[231394]: t=2021-05-13T18:38:13+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.213:8443: no known leader"
May 13 18:38:13.223716 grantham lxd.daemon[231394]: t=2021-05-13T18:38:13+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.214:8443: reported leader server is not the leader"
May 13 18:38:18.224032 grantham lxd.daemon[231394]: t=2021-05-13T18:38:18+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: i/o timeout"
May 13 18:38:18.494346 grantham lxd.daemon[231394]: t=2021-05-13T18:38:18+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.213:8443: no known leader"
May 13 18:38:18.556216 grantham lxd.daemon[231394]: t=2021-05-13T18:38:18+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.214:8443: no known leader"
May 13 18:38:23.018790 grantham lxd.daemon[231394]: t=2021-05-13T18:38:23+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: i/o timeout"

on the broken node:

$ ps -Aly x | grep '\blx[cd]\b'
S     0    2558       1  0  80   0  1772  1161 -      ?          0:00 /bin/sh /snap/lxd/20309/commands/daemon.start
S     0    2697       1  0  80   0  1576 24453 -      ?          0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
S     0    2708    2558  1  80   0 88896 363541 -     ?          3:48 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
S     0    2709    2558  0  80   0 31836 289477 -     ?          0:00 lxd waitready
S     0    2710    2558  0  80   0  1132  1161 -      ?          0:03 /bin/sh /snap/lxd/20309/commands/daemon.start

# systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
<< hangs >>

# lxd --debug --group lxd
DBUG[05-13|18:36:04] Connecting to a local LXD over a Unix socket 
DBUG[05-13|18:36:04] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
<< hangs >>

Run the systemctl stop again and, while it hangs, manually kill the lxd and lxd waitready processes. That should make the stop succeed, at which point lxd --debug --group lxd will show more useful output.

Your log suggests network connectivity issues, but that will be easier to see when LXD is run again directly.
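To make a log like the one above easier to read, the Dqlite warnings can be grouped by peer and failure reason. The following is only a rough sketch: the function name `summarize_dqlite_warnings` and the regular expression are assumptions written against the log format pasted in this thread, not anything shipped with LXD.

```python
import re
from collections import Counter

# Assumed pattern, based only on the warning lines quoted in this thread.
DQLITE_RE = re.compile(
    r"Dqlite: attempt \d+: server (?P<server>[\d.]+:\d+): (?P<reason>.*)"
)


def summarize_dqlite_warnings(lines):
    """Count (server, short reason) pairs across Dqlite warning lines."""
    counts = Counter()
    for line in lines:
        match = DQLITE_RE.search(line)
        if not match:
            continue  # not a Dqlite warning line
        # journalctl lines end with a closing quote, lxd --debug lines do not.
        reason = match.group("reason").strip().rstrip('"')
        # Keep only the last colon-separated part, e.g. "i/o timeout".
        short_reason = reason.rsplit(": ", 1)[-1]
        counts[(match.group("server"), short_reason)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'lvl=warn msg="Dqlite: attempt 0: server 10.1.0.213:8443: no known leader"',
        'lvl=warn msg="Dqlite: attempt 0: server 10.1.0.215:8443: dial: Failed to connect '
        'to HTTP endpoint: dial tcp 10.1.0.215:8443: i/o timeout"',
    ]
    for (server, reason), count in summarize_dqlite_warnings(sample).items():
        print(f"{server}: {reason} (x{count})")
```

A peer that only ever shows up with "i/o timeout" or "no route to host" points at the network rather than at the database.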

I didn’t need to kill the processes … the stop worked immediately.

$ ps -Aly x | grep '\blx[cd]\b'
<<nothing >>

# lxd --debug --group lxd
DBUG[05-13|20:48:20] Connecting to a local LXD over a Unix socket
DBUG[05-13|20:48:20] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[05-13|20:48:20] LXD 4.13 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[05-13|20:48:20] Kernel uid/gid map:
INFO[05-13|20:48:20]  - u 0 0 4294967295
INFO[05-13|20:48:20]  - g 0 0 4294967295
INFO[05-13|20:48:20] Configured LXD uid/gid map:
INFO[05-13|20:48:20]  - u 0 1000000 1000000000
INFO[05-13|20:48:20]  - g 0 1000000 1000000000
INFO[05-13|20:48:20] Kernel features:
INFO[05-13|20:48:20]  - closing multiple file descriptors efficiently: no
INFO[05-13|20:48:20]  - netnsid-based network retrieval: yes
INFO[05-13|20:48:20]  - pidfds: yes
INFO[05-13|20:48:20]  - uevent injection: yes
INFO[05-13|20:48:20]  - seccomp listener: yes
INFO[05-13|20:48:20]  - seccomp listener continue syscalls: yes
INFO[05-13|20:48:20]  - seccomp listener add file descriptors: no
INFO[05-13|20:48:20]  - attach to namespaces via pidfds: no
INFO[05-13|20:48:20]  - safe native terminal allocation : yes
INFO[05-13|20:48:20]  - unprivileged file capabilities: yes
INFO[05-13|20:48:20]  - cgroup layout: hybrid
WARN[05-13|20:48:20]  - Couldn't find the CGroup blkio.weight, disk priority will be ignored
INFO[05-13|20:48:20]  - shiftfs support: yes
INFO[05-13|20:48:20] Initializing local database
DBUG[05-13|20:48:20] Initializing database gateway
DBUG[05-13|20:48:20] Start database node                      role=voter id=1 address=10.1.0.215:8443
DBUG[05-13|20:48:21] Connecting to a local LXD over a Unix socket
DBUG[05-13|20:48:21] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
DBUG[05-13|20:48:21] Detected stale unix socket, deleting
DBUG[05-13|20:48:21] Detected stale unix socket, deleting
INFO[05-13|20:48:21] Starting cluster handler:
INFO[05-13|20:48:21] Starting /dev/lxd handler:
INFO[05-13|20:48:21]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-13|20:48:21] REST API daemon:
INFO[05-13|20:48:21]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-13|20:48:21]  - binding TCP socket                    socket=10.1.0.215:8443
INFO[05-13|20:48:21] Initializing global database
WARN[05-13|20:48:24] Dqlite: attempt 0: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: no route to host
WARN[05-13|20:48:27] Dqlite: attempt 0: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: no route to host
DBUG[05-13|20:48:27] Found cert                               name=0
DBUG[05-13|20:48:27] Triggering an out of schedule hearbeat   address=10.1.0.215:8443
WARN[05-13|20:48:27] Dqlite: attempt 0: server 10.1.0.215:8443: no known leader
WARN[05-13|20:48:30] Dqlite: attempt 1: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: no route to host
WARN[05-13|20:48:31] Dqlite: attempt 1: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: i/o timeout
WARN[05-13|20:48:31] Dqlite: attempt 1: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: i/o timeout
DBUG[05-13|20:48:31] Failed connecting to global database (attempt 0): failed to create dqlite connection: no available dqlite leader server found
WARN[05-13|20:48:33] Dqlite: attempt 0: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: no route to host
WARN[05-13|20:48:36] Dqlite: attempt 0: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: no route to host 
DBUG[05-13|20:48:36] Found cert                               name=0
DBUG[05-13|20:48:36] Triggering an out of schedule hearbeat   address=10.1.0.215:8443
WARN[05-13|20:48:36] Dqlite: attempt 0: server 10.1.0.215:8443: no known leader 
WARN[05-13|20:48:39] Dqlite: attempt 1: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: no route to host 
WARN[05-13|20:48:42] Dqlite: attempt 1: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: no route to host 
DBUG[05-13|20:48:42] Found cert                               name=0
DBUG[05-13|20:48:42] Triggering an out of schedule hearbeat   address=10.1.0.215:8443
WARN[05-13|20:48:42] Dqlite: attempt 1: server 10.1.0.215:8443: no known leader 
WARN[05-13|20:48:43] Dqlite: attempt 2: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: i/o timeout 
WARN[05-13|20:48:43] Dqlite: attempt 2: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: i/o timeout 
WARN[05-13|20:48:43] Dqlite: attempt 2: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: i/o timeout 
DBUG[05-13|20:48:44] Failed connecting to global database (attempt 1): failed to create dqlite connection: no available dqlite leader server found 
WARN[05-13|20:48:48] Dqlite: attempt 0: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: no route to host 
WARN[05-13|20:48:51] Dqlite: attempt 0: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: no route to host 
DBUG[05-13|20:48:51] Found cert                               name=0
DBUG[05-13|20:48:51] Triggering an out of schedule hearbeat   address=10.1.0.215:8443
WARN[05-13|20:48:51] Dqlite: attempt 0: server 10.1.0.215:8443: no known leader 
WARN[05-13|20:48:54] Dqlite: attempt 1: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: no route to host 
WARN[05-13|20:48:56] Dqlite: attempt 1: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: i/o timeout 
WARN[05-13|20:48:56] Dqlite: attempt 1: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: i/o timeout 
... etc

Yes, it’s a networking issue:

$ ip l show dmz0
3: dmz0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7152 qdisc fq_codel master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether b8:a3:86:70:cc:e6 brd ff:ff:ff:ff:ff:ff

$ ip a show dmz0
3: dmz0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7152 qdisc fq_codel master ovs-system state UP group default qlen 1000
    link/ether b8:a3:86:70:cc:e6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.157.215/24 brd 192.168.157.255 scope global dmz0
       valid_lft forever preferred_lft forever
    inet 10.1.0.215/24 brd 10.1.0.255 scope global dmz0
       valid_lft forever preferred_lft forever
    inet 10.2.0.215/24 brd 10.2.0.255 scope global dmz0
       valid_lft forever preferred_lft forever
    inet 10.3.0.215/24 brd 10.3.0.255 scope global dmz0
       valid_lft forever preferred_lft forever

… that master ovs-system shouldn’t be there …
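For anyone wanting to script this check, a minimal sketch follows. It assumes the text layout of the `ip l show` output pasted above, and `parse_master` is a hypothetical helper name, not part of any tool.

```python
import re


def parse_master(ip_link_output):
    """Return the master device named in `ip link show` output, or None."""
    match = re.search(r"\bmaster (\S+)", ip_link_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    # First line of the `ip l show dmz0` output quoted in this thread.
    line = (
        "3: dmz0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7152 qdisc fq_codel "
        "master ovs-system state UP mode DEFAULT group default qlen 1000"
    )
    print(parse_master(line))
```

If the result is `ovs-system` for the NIC that carries the cluster address, the interface has been enslaved to Open vSwitch.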

Ah yeah, not sure what happens in this case. It looks like your NIC is indeed part of an OVS bridge or something, which may then break normal connectivity and cause those issues.

LXD looks to work correctly, it’s just unable to reach any of the peers and so can’t bring the DB and daemon up. Resolving the network issue should get it back online.
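One quick way to confirm whether a member can reach its peers is a plain TCP connect to each cluster address. This is a sketch only: `can_connect` and `check_peers` are hypothetical helpers, and a successful TCP connect says nothing about TLS or the database itself, it only rules out the "no route to host" and "i/o timeout" class of failures.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable routes.
        return False


def check_peers(peers, timeout=3.0):
    """Map each "host:port" string to whether a TCP connect succeeded."""
    results = {}
    for peer in peers:
        host, port = peer.rsplit(":", 1)
        results[peer] = can_connect(host, int(port), timeout=timeout)
    return results
```

On the cluster in this thread it could be called as `check_peers(["10.1.0.213:8443", "10.1.0.214:8443", "10.1.0.215:8443"])` from each member in turn.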

Make sure to only use unused interfaces as the basis of an LXD physical network, because when one is used as an uplink for an OVN network it gets added to an OVS switch.

See General question about the supported function of OVN by LXD

Yes, I remembered that I’d read that after I’d done it!

OK. Well, for anyone else reading, deleting that database & the ovn & openvswitch databases didn’t work (well, it got networking back up - but not LXD). After stops & starts (the two “healthy” nodes first, then the broken one), the two good nodes are still flooding the journal with complaints about no leader & cannot connect … and nothing’s changed. :frowning:

No instructions I can see about recovery for this situation, so full reinstall. That’ll certainly fix it. :slight_smile:

What were the specific error messages? It still sounds as though some of the members couldn’t communicate with each other.

I’m thinking we should add a check when creating a physical network: if the parent interface has non-link-local IPs configured, we fail with an error saying that an in-use NIC cannot be used as a physical network parent. That would avoid this in the future. We already do something similar when picking a free virtual function in SR-IOV.
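As an illustration of the idea only (this is not LXD's implementation, which would live in its Go code base), the check could look roughly like the sketch below. It parses the text output of `ip a show <iface>` as pasted earlier in the thread; `in_use_addresses` is a hypothetical name.

```python
import ipaddress
import re


def in_use_addresses(ip_addr_output):
    """Return the non-link-local addresses found in `ip addr show` output."""
    found = []
    # Each address line starts with "inet" or "inet6" followed by ADDR/PREFIX.
    for match in re.finditer(r"^\s*inet6?\s+(\S+)", ip_addr_output, re.MULTILINE):
        iface = ipaddress.ip_interface(match.group(1))
        if not iface.ip.is_link_local:
            found.append(str(iface))
    return found


if __name__ == "__main__":
    sample = (
        "3: dmz0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7152\n"
        "    inet 10.1.0.215/24 brd 10.1.0.255 scope global dmz0\n"
    )
    addresses = in_use_addresses(sample)
    if addresses:
        print("refusing to use NIC as physical network parent:", addresses)
```

An interface carrying only link-local addresses (or none) would pass, while the dmz0 interface from this thread would be rejected because of its four global addresses.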

Yeah, that’d be a good safety check against misconfig and typos.

From after the restarts of the 2 good hosts:
(the "Heartbeat round duration greater than heartbeat interval" duration=149.092909ms interval=10ns message is a bit surprising - I’ve never touched any of the heartbeat or timeout settings.)

May 14 06:36:23.640991 uxbridge lxd.daemon[2856]: 2021/05/14 06:36:23 http: TLS handshake error from 10.1.0.213:35128: read tcp 10.1.0.214:8443->10.1.0.213:35128: read: connection reset by peer
May 14 06:36:23.696526 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:23+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: connection refused"
May 14 06:36:23.743287 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:23+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.214:8443: reported leader unavailable err=dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: connection refused"
May 14 06:36:23.744834 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:23+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:23.946211 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:23+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: connection refused"
May 14 06:36:24.000489 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:24+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.214:8443: reported leader unavailable err=dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: connection refused"
May 14 06:36:24.000961 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:24+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:24.402433 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:24+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.213:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: connection refused"
May 14 06:36:24.457978 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:24+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.214:8443: reported leader unavailable err=dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.213:8443: connect: connection refused"
May 14 06:36:24.458508 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:24+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:25.329124 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.213:8443: no known leader"
May 14 06:36:25.447514 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:25.448025 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:26.522839 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:26+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.213:8443: no known leader"
May 14 06:36:26.634605 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:26+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:26.635186 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:26+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:27.701384 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:27+0000 lvl=warn msg="Dqlite: attempt 5: server 10.1.0.213:8443: no known leader"
May 14 06:36:27.822688 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:27+0000 lvl=warn msg="Dqlite: attempt 5: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:27.823254 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:27+0000 lvl=warn msg="Dqlite: attempt 5: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:28.956944 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:28+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.027836 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=73.393226ms interval=10ns
May 14 06:36:29.032054 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.102799 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=72.745306ms interval=10ns
May 14 06:36:29.107211 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.172935 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=67.634223ms interval=10ns
May 14 06:36:29.179111 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.253303 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=77.90671ms interval=10ns
May 14 06:36:29.258778 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.330125 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=74.142156ms interval=10ns
May 14 06:36:29.334661 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.402968 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=70.4517ms interval=10ns
May 14 06:36:29.408138 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.481175 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=75.502421ms interval=10ns
May 14 06:36:29.485634 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.564262 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=80.354124ms interval=10ns
May 14 06:36:29.568618 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.640664 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=74.208395ms interval=10ns
May 14 06:36:29.645163 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.710266 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=67.123356ms interval=10ns
May 14 06:36:29.715090 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.781984 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=69.178812ms interval=10ns
May 14 06:36:29.787063 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.861400 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=76.884467ms interval=10ns
May 14 06:36:29.866376 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:29.935502 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=71.508107ms interval=10ns
May 14 06:36:29.940258 uxbridge lxd.daemon[2856]: t=2021-05-14T06:36:29+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:09.432617 grantham lxd.daemon[2711]: t=2021-05-14T06:27:09+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.213:8443: no known leader"
May 14 06:27:09.433421 grantham lxd.daemon[2711]: t=2021-05-14T06:27:09+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:09.434024 grantham lxd.daemon[2711]: t=2021-05-14T06:27:09+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:09.704535 grantham lxd.daemon[2711]: t=2021-05-14T06:27:09+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.213:8443: no known leader"
May 14 06:27:09.705043 grantham lxd.daemon[2711]: t=2021-05-14T06:27:09+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:09.705524 grantham lxd.daemon[2711]: t=2021-05-14T06:27:09+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:10.180323 grantham lxd.daemon[2711]: t=2021-05-14T06:27:10+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.213:8443: no known leader"
May 14 06:27:10.180775 grantham lxd.daemon[2711]: t=2021-05-14T06:27:10+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:10.181275 grantham lxd.daemon[2711]: t=2021-05-14T06:27:10+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:11.056090 grantham lxd.daemon[2711]: t=2021-05-14T06:27:11+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.213:8443: no known leader"
May 14 06:27:11.056813 grantham lxd.daemon[2711]: t=2021-05-14T06:27:11+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:11.057198 grantham lxd.daemon[2711]: t=2021-05-14T06:27:11+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:12.132506 grantham lxd.daemon[2711]: t=2021-05-14T06:27:12+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.213:8443: no known leader"
May 14 06:27:12.133222 grantham lxd.daemon[2711]: t=2021-05-14T06:27:12+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:12.133764 grantham lxd.daemon[2711]: t=2021-05-14T06:27:12+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:13.205570 grantham lxd.daemon[2711]: t=2021-05-14T06:27:13+0000 lvl=warn msg="Dqlite: attempt 5: server 10.1.0.213:8443: no known leader"
May 14 06:27:13.206204 grantham lxd.daemon[2711]: t=2021-05-14T06:27:13+0000 lvl=warn msg="Dqlite: attempt 5: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:13.206701 grantham lxd.daemon[2711]: t=2021-05-14T06:27:13+0000 lvl=warn msg="Dqlite: attempt 5: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:14.277151 grantham lxd.daemon[2711]: t=2021-05-14T06:27:14+0000 lvl=warn msg="Dqlite: attempt 6: server 10.1.0.213:8443: no known leader"
May 14 06:27:14.277815 grantham lxd.daemon[2711]: t=2021-05-14T06:27:14+0000 lvl=warn msg="Dqlite: attempt 6: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:14.278145 grantham lxd.daemon[2711]: t=2021-05-14T06:27:14+0000 lvl=warn msg="Dqlite: attempt 6: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:27:15.351409 grantham lxd.daemon[2711]: t=2021-05-14T06:27:15+0000 lvl=warn msg="Dqlite: attempt 7: server 10.1.0.213:8443: no known leader"
May 14 06:27:15.352059 grantham lxd.daemon[2711]: t=2021-05-14T06:27:15+0000 lvl=warn msg="Dqlite: attempt 7: server 10.1.0.214:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.214:8443: connect: connection refused"
May 14 06:27:15.352492 grantham lxd.daemon[2711]: t=2021-05-14T06:27:15+0000 lvl=warn msg="Dqlite: attempt 7: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
<snip>
May 14 06:36:24.683335 grantham lxd.daemon[4022]: t=2021-05-14T06:36:24+0000 lvl=warn msg="No local trusted server certificates found, falling back to trusting network certificate"
May 14 06:36:24.990229 grantham lxd.daemon[4022]: t=2021-05-14T06:36:24+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.213:8443: no known leader"
May 14 06:36:25.127514 grantham lxd.daemon[4022]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:25.128036 grantham lxd.daemon[4022]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 0: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:25.399733 grantham lxd.daemon[4022]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.213:8443: no known leader"
May 14 06:36:25.536530 grantham lxd.daemon[4022]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:25.536877 grantham lxd.daemon[4022]: t=2021-05-14T06:36:25+0000 lvl=warn msg="Dqlite: attempt 1: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:26.009135 grantham lxd.daemon[4022]: t=2021-05-14T06:36:26+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.213:8443: no known leader"
May 14 06:36:26.146371 grantham lxd.daemon[4022]: t=2021-05-14T06:36:26+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:26.146831 grantham lxd.daemon[4022]: t=2021-05-14T06:36:26+0000 lvl=warn msg="Dqlite: attempt 2: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:27.020774 grantham lxd.daemon[4022]: t=2021-05-14T06:36:27+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.213:8443: no known leader"
May 14 06:36:27.153788 grantham lxd.daemon[4022]: t=2021-05-14T06:36:27+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:27.154295 grantham lxd.daemon[4022]: t=2021-05-14T06:36:27+0000 lvl=warn msg="Dqlite: attempt 3: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:36:28.238070 grantham lxd.daemon[4022]: t=2021-05-14T06:36:28+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.213:8443: no known leader"
May 14 06:36:28.378077 grantham lxd.daemon[4022]: t=2021-05-14T06:36:28+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.214:8443: reported leader server is not the leader"
May 14 06:36:28.378497 grantham lxd.daemon[4022]: t=2021-05-14T06:36:28+0000 lvl=warn msg="Dqlite: attempt 4: server 10.1.0.215:8443: dial: Failed to connect to HTTP endpoint: dial tcp 10.1.0.215:8443: connect: connection refused"
<<snip>>
May 14 06:45:01.729308 grantham lxd.daemon[4022]: 2021/05/14 06:45:01 http: TLS handshake error from 10.1.0.214:39890: EOF
May 14 06:45:07.209606 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: call exec-sql (budget 0s): receive: header: EOF"
May 14 06:45:07.287847 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.362516 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=149.092909ms interval=10ns
May 14 06:45:07.368128 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.446627 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.528866 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=162.592282ms interval=10ns
May 14 06:45:07.534490 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.607153 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=74.694599ms interval=10ns
May 14 06:45:07.612958 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.693192 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=82.34439ms interval=10ns
May 14 06:45:07.699099 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.780019 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=83.022174ms interval=10ns
May 14 06:45:07.786100 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.860610 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=76.904245ms interval=10ns
May 14 06:45:07.865483 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:07.945020 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=81.29149ms interval=10ns
May 14 06:45:07.951190 grantham lxd.daemon[4022]: t=2021-05-14T06:45:07+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.025340 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=76.312777ms interval=10ns
May 14 06:45:08.031868 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.123583 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=94.578243ms interval=10ns
May 14 06:45:08.130292 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.203843 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=76.237684ms interval=10ns
May 14 06:45:08.209629 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.287399 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=79.633856ms interval=10ns
May 14 06:45:08.293169 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.379858 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=88.629242ms interval=10ns
May 14 06:45:08.386003 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.458199 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=74.348068ms interval=10ns
May 14 06:45:08.463888 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.544347 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=82.334137ms interval=10ns
May 14 06:45:08.549819 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.625387 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=77.290341ms interval=10ns
May 14 06:45:08.631255 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.703618 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=74.297682ms interval=10ns
May 14 06:45:08.709317 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.785697 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=78.320874ms interval=10ns
May 14 06:45:08.791218 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.866137 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=76.564904ms interval=10ns
May 14 06:45:08.872117 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"
May 14 06:45:08.952218 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=82.400636ms interval=10ns
May 14 06:45:08.958286 grantham lxd.daemon[4022]: t=2021-05-14T06:45:08+0000 lvl=warn msg="Failed heartbeat" address=10.1.0.215:8443 err="Failed to send heartbeat request: Put \"https://10.1.0.215:8443/internal/database\": dial tcp 10.1.0.215:8443: connect: connection refused"

So I suspect your cluster has become partitioned and now none of them can start up.

The heartbeat interval warning was added recently. When there is no access to the clustered database, it should fall back to the default heartbeat interval, but there is a conversion bug.

This fixes it:

Can you clarify what you mean by “healthy” nodes? They all seem to be problematic now.

Can you stop LXD on all of them and try to start them back up?

The “healthy” nodes are the two I’m assuming were unaffected by the error creating the physical network - they didn’t have the NIC mastered by ovs_system.

Restarting was one of the first things I tried to fix the problem. The sequence of my attempts was:

  1. Restart the damaged host.
  2. Restart all the hosts. The host with the issue is the one that reboots fastest.
  3. Stop then restart LXD: stop all nodes, start the 2 assumed-good nodes, then start the damaged node.
  4. On the damaged host, stop LXD, OVN and OVS, delete the OVS / OVN databases and restart the host.
  5. Stop LXD, OVN and OVS, delete those databases plus the LXD global db on the damaged host, stop all the hosts, start the “good” hosts, then start the damaged host.

This last one fixed the networking but left the LXD cluster broken, and the hoped-for recovery didn’t happen.

At present, LXD is stopped on all hosts.

1 hr to recreate everything.

This PR implements the check: