One of my production clusters is in a broken state.
I first discovered that all containers were still available but the LXD daemon and CLI commands were not working, which is very similar to the issue described here.
However, after I rebooted the machine, neither the containers nor the LXD daemon is working.
I have 3 machines in my cluster, and when I ran snap changes lxd, I found that 2 of them had upgraded automatically:
root@lian-xlab-01:~# snap changes lxd
ID Status Spawn Ready Summary
13 Done yesterday at 18:36 CST yesterday at 19:37 CST Auto-refresh snap "lxd"
14 Done yesterday at 23:41 CST yesterday at 23:41 CST Change configuration of "lxd" snap
root@lian-xlab-02:~# snap changes lxd
ID Status Spawn Ready Summary
17 Done yesterday at 18:38 CST yesterday at 19:39 CST Refresh "lxd" snap
18 Done yesterday at 23:41 CST yesterday at 23:41 CST Change configuration of "lxd" snap
And when I ran snap info lxd, both of these machines reported version 6.1-90889b0:
installed: 6.1-90889b0 (29398) 108MB in-cohort
The other machine, however, didn't upgrade. When I ran snap switch lxd --cohort=+
and snap refresh lxd
following here, it reported snap "lxd" has no updates available
. Then I used sudo snap refresh lxd --channel=6.1/stable --classic
to upgrade to version 6.1 manually, and that change is still in the Doing state now.
root@lian-xlab-00:~# snap changes lxd
ID Status Spawn Ready Summary
14 Done yesterday at 18:38 CST yesterday at 18:38 CST Refresh "lxd" snap
15 Done yesterday at 22:36 CST yesterday at 22:36 CST Change configuration of "lxd" snap
16 Done yesterday at 22:53 CST yesterday at 22:53 CST Switch "lxd" snap to cohort "+"
17 Done yesterday at 22:54 CST yesterday at 22:54 CST Switch "lxd" snap to cohort "+"
18 Done yesterday at 22:55 CST yesterday at 22:55 CST Switch "lxd" snap to cohort "+"
19 Doing yesterday at 22:56 CST - Refresh "lxd" snap from "6.1/stable" channel
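In case it helps diagnose the stuck refresh: as far as I know, snapd can show the individual tasks of a change, and a wedged change can be aborted. I have not aborted it yet; a sketch of what I believe the commands would be:

```shell
# Show the individual tasks of change 19 (the stuck refresh)
# to see which step it is blocked on
snap tasks 19

# If the refresh is truly wedged, abort the change
# (not run yet -- only considering this)
sudo snap abort 19
```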
I also ran snap set lxd daemon.debug=true
to turn on debug logging.
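(To follow the daemon log after enabling debug output, I tailed the snap unit's journal; the unit name below assumes the snap installation:)

```shell
# Follow the LXD daemon log for a snap-installed LXD
journalctl -u snap.lxd.daemon -f
```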
In the logs, all three machines seem to be trying to connect to each other, and they all report errors like no known leader:
time="2024-07-16T00:22:02+08:00" level=info msg="Dqlite proxy started" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47958"
time="2024-07-16T00:22:02+08:00" level=warning msg="Dqlite: attempt 3: server lian-xlab-01:8443: no known leader"
time="2024-07-16T00:22:02+08:00" level=info msg="Dqlite proxy stopped" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47958"
time="2024-07-16T00:22:02+08:00" level=warning msg="Dqlite: attempt 3: server lian-xlab-02:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-02:8443\": dial tcp 192.168.3.152:8443: connect: connection refused"
time="2024-07-16T00:22:03+08:00" level=warning msg="Dqlite: attempt 4: server lian-xlab-00:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-00:8443\": dial tcp 192.168.3.150:8443: connect: connection refused"
time="2024-07-16T00:22:03+08:00" level=info msg="Dqlite connected outbound" local="127.0.0.1:47970" name=dqlite remote="127.0.1.1:8443"
time="2024-07-16T00:22:03+08:00" level=debug msg="Matched trusted cert" fingerprint=76089127cc18ad7c146ebb5c902c45f9c2825db0b2a3648df7fd61d133caa1b8 subject="CN=root@lian-xlab-01,O=LXD"
time="2024-07-16T00:22:03+08:00" level=info msg="Dqlite proxy started" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47970"
time="2024-07-16T00:22:03+08:00" level=warning msg="Dqlite: attempt 4: server lian-xlab-01:8443: no known leader"
time="2024-07-16T00:22:03+08:00" level=info msg="Dqlite proxy stopped" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47970"
time="2024-07-16T00:22:03+08:00" level=warning msg="Dqlite: attempt 4: server lian-xlab-02:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-02:8443\": dial tcp 192.168.3.152:8443: connect: connection refused"
time="2024-07-16T00:22:04+08:00" level=warning msg="Dqlite: attempt 5: server lian-xlab-00:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-00:8443\": dial tcp 192.168.3.150:8443: connect: connection refused"
time="2024-07-16T00:22:04+08:00" level=info msg="Dqlite connected outbound" local="127.0.0.1:47984" name=dqlite remote="127.0.1.1:8443"
time="2024-07-16T00:22:04+08:00" level=debug msg="Matched trusted cert" fingerprint=76089127cc18ad7c146ebb5c902c45f9c2825db0b2a3648df7fd61d133caa1b8 subject="CN=root@lian-xlab-01,O=LXD"
time="2024-07-16T00:22:04+08:00" level=info msg="Dqlite proxy started" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47984"
time="2024-07-16T00:22:04+08:00" level=warning msg="Dqlite: attempt 5: server lian-xlab-01:8443: no known leader"
time="2024-07-16T00:22:04+08:00" level=info msg="Dqlite proxy stopped" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47984"
time="2024-07-16T00:22:04+08:00" level=warning msg="Dqlite: attempt 5: server lian-xlab-02:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-02:8443\": dial tcp 192.168.3.152:8443: connect: connection refused"
I also tried
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
3|lian-xlab-01:8443|0|lian-xlab-01
11|lian-xlab-02:8443|0|lian-xlab-02
13|lian-xlab-00:8443|0|lian-xlab-00
and
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config"
1|cluster.https_address|lian-xlab-01:8443
2|core.https_address|lian-xlab-01:8443
All three machines run Ubuntu 22.04.4 LTS with kernel 5.15.0-101.
It is also worth noting that every lxc
command I have tried, such as lxc list
, lxc cluster list
and lxc export
, simply hangs and does nothing.
I want to know whether there is anything I can do to recover my cluster from this disaster, or at least export the data and rebuild the cluster.
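I have read that LXD has a documented procedure for recovering from quorum loss, where the database is re-formed on the member with the freshest data. I am not sure whether it applies here while the snap refresh is still stuck, so please correct me if this sketch is wrong:

```shell
# Stop the LXD daemon on EVERY cluster member first
sudo snap stop lxd

# On the member with the most up-to-date database,
# re-form a single-node database from the local data
sudo lxd cluster recover-from-quorum-loss

# Then start LXD again on that member
sudo snap start lxd
```

Is this the right direction, or would it make things worse given the half-finished refresh on lian-xlab-00?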
I'm replicating the ZFS datasets across the three machines, but I'm not sure how to re-import the containers from ZFS alone (rather than from the .tar.gz package exported with lxc export
).
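For re-importing from the ZFS datasets alone, I'm wondering whether the interactive lxd recover disaster-recovery tool would work after rebuilding a fresh cluster, since as I understand it, it scans existing storage pools and recreates the database records for the instances it finds:

```shell
# Interactive disaster-recovery tool: scans storage pools
# (e.g. an existing ZFS pool) and recreates the instance
# database entries from what it finds on disk
sudo lxd recover
```

Has anyone used this with a ZFS pool carried over from a broken cluster?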