One of my production clusters is in a broken state.
I first discovered that all containers were still available but the LXD daemon and CLI commands were not working, which is very similar to the issue described here.
However, after I rebooted the machine, neither the containers nor the LXD daemon is working.
I have 3 machines in my cluster, and when I ran snap changes lxd, I found that 2 of them had upgraded automatically:
root@lian-xlab-01:~# snap changes lxd
ID Status Spawn Ready Summary
13 Done yesterday at 18:36 CST yesterday at 19:37 CST Auto-refresh snap "lxd"
14 Done yesterday at 23:41 CST yesterday at 23:41 CST Change configuration of "lxd" snap
root@lian-xlab-02:~# snap changes lxd
ID Status Spawn Ready Summary
17 Done yesterday at 18:38 CST yesterday at 19:39 CST Refresh "lxd" snap
18 Done yesterday at 23:41 CST yesterday at 23:41 CST Change configuration of "lxd" snap
And when I ran snap info lxd, both of these machines reported version 6.1-90889b0:
installed: 6.1-90889b0 (29398) 108MB in-cohort
The other machine, however, didn't upgrade. When I ran snap switch lxd --cohort=+
and snap refresh lxd
following here, it reported snap "lxd" has no updates available
. Then I used sudo snap refresh lxd --channel=6.1/stable --classic
to upgrade to version 6.1 manually, and that change is still in the Doing state now.
root@lian-xlab-00:~# snap changes lxd
ID Status Spawn Ready Summary
14 Done yesterday at 18:38 CST yesterday at 18:38 CST Refresh "lxd" snap
15 Done yesterday at 22:36 CST yesterday at 22:36 CST Change configuration of "lxd" snap
16 Done yesterday at 22:53 CST yesterday at 22:53 CST Switch "lxd" snap to cohort "+"
17 Done yesterday at 22:54 CST yesterday at 22:54 CST Switch "lxd" snap to cohort "+"
18 Done yesterday at 22:55 CST yesterday at 22:55 CST Switch "lxd" snap to cohort "+"
19 Doing yesterday at 22:56 CST - Refresh "lxd" snap from "6.1/stable" channel
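In case it helps diagnose the stuck refresh: as far as I know, snapd can show the individual tasks of a change, and a wedged change can be aborted. I have not aborted it yet; a sketch of what I believe the commands would be:

```shell
# Show the individual tasks of change 19 (the stuck refresh)
# to see which step it is blocked on
snap tasks 19

# If the refresh is truly wedged, abort the change
# (not run yet -- only considering this)
sudo snap abort 19
```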
I also ran snap set lxd daemon.debug=true
to turn on debug logging.
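(To follow the daemon log after enabling debug output, I tailed the snap unit's journal; the unit name below assumes the snap installation:)

```shell
# Follow the LXD daemon log for a snap-installed LXD
journalctl -u snap.lxd.daemon -f
```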
In the logs, all three machines seem to be trying to connect to each other, and they all report errors like no known leader:
time="2024-07-16T00:22:02+08:00" level=info msg="Dqlite proxy started" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47958"
time="2024-07-16T00:22:02+08:00" level=warning msg="Dqlite: attempt 3: server lian-xlab-01:8443: no known leader"
time="2024-07-16T00:22:02+08:00" level=info msg="Dqlite proxy stopped" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47958"
time="2024-07-16T00:22:02+08:00" level=warning msg="Dqlite: attempt 3: server lian-xlab-02:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-02:8443\": dial tcp 192.168.3.152:8443: connect: connection refused"
time="2024-07-16T00:22:03+08:00" level=warning msg="Dqlite: attempt 4: server lian-xlab-00:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-00:8443\": dial tcp 192.168.3.150:8443: connect: connection refused"
time="2024-07-16T00:22:03+08:00" level=info msg="Dqlite connected outbound" local="127.0.0.1:47970" name=dqlite remote="127.0.1.1:8443"
time="2024-07-16T00:22:03+08:00" level=debug msg="Matched trusted cert" fingerprint=76089127cc18ad7c146ebb5c902c45f9c2825db0b2a3648df7fd61d133caa1b8 subject="CN=root@lian-xlab-01,O=LXD"
time="2024-07-16T00:22:03+08:00" level=info msg="Dqlite proxy started" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47970"
time="2024-07-16T00:22:03+08:00" level=warning msg="Dqlite: attempt 4: server lian-xlab-01:8443: no known leader"
time="2024-07-16T00:22:03+08:00" level=info msg="Dqlite proxy stopped" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47970"
time="2024-07-16T00:22:03+08:00" level=warning msg="Dqlite: attempt 4: server lian-xlab-02:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-02:8443\": dial tcp 192.168.3.152:8443: connect: connection refused"
time="2024-07-16T00:22:04+08:00" level=warning msg="Dqlite: attempt 5: server lian-xlab-00:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-00:8443\": dial tcp 192.168.3.150:8443: connect: connection refused"
time="2024-07-16T00:22:04+08:00" level=info msg="Dqlite connected outbound" local="127.0.0.1:47984" name=dqlite remote="127.0.1.1:8443"
time="2024-07-16T00:22:04+08:00" level=debug msg="Matched trusted cert" fingerprint=76089127cc18ad7c146ebb5c902c45f9c2825db0b2a3648df7fd61d133caa1b8 subject="CN=root@lian-xlab-01,O=LXD"
time="2024-07-16T00:22:04+08:00" level=info msg="Dqlite proxy started" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47984"
time="2024-07-16T00:22:04+08:00" level=warning msg="Dqlite: attempt 5: server lian-xlab-01:8443: no known leader"
time="2024-07-16T00:22:04+08:00" level=info msg="Dqlite proxy stopped" local="127.0.1.1:8443" name=dqlite remote="127.0.0.1:47984"
time="2024-07-16T00:22:04+08:00" level=warning msg="Dqlite: attempt 5: server lian-xlab-02:8443: dial: Failed connecting to HTTP endpoint \"lian-xlab-02:8443\": dial tcp 192.168.3.152:8443: connect: connection refused"
I also tried
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes"
3|lian-xlab-01:8443|0|lian-xlab-01
11|lian-xlab-02:8443|0|lian-xlab-02
13|lian-xlab-00:8443|0|lian-xlab-00
and
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM config"
1|cluster.https_address|lian-xlab-01:8443
2|core.https_address|lian-xlab-01:8443
All three machines run Ubuntu 22.04.4 LTS with kernel 5.15.0-101.
It is also worth noting that every lxc
command I have tried, such as lxc list
, lxc cluster list
and lxc export
, simply hangs and does nothing.
I want to know whether there is anything I can do to recover my cluster from this disaster, or at least export the data and rebuild the cluster.
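I have read that LXD has a documented procedure for recovering from quorum loss, where the database is re-formed on the member with the freshest data. I am not sure whether it applies here while the snap refresh is still stuck, so please correct me if this sketch is wrong:

```shell
# Stop the LXD daemon on EVERY cluster member first
sudo snap stop lxd

# On the member with the most up-to-date database,
# re-form a single-node database from the local data
sudo lxd cluster recover-from-quorum-loss

# Then start LXD again on that member
sudo snap start lxd
```

Is this the right direction, or would it make things worse given the half-finished refresh on lian-xlab-00?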
I'm replicating the ZFS datasets across the three machines, but I'm not sure how to re-import the containers from ZFS alone (rather than from the .tar.gz package exported with lxc export
).
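For re-importing from the ZFS datasets alone, I'm wondering whether the interactive lxd recover disaster-recovery tool would work after rebuilding a fresh cluster, since as I understand it, it scans existing storage pools and recreates the database records for the instances it finds:

```shell
# Interactive disaster-recovery tool: scans storage pools
# (e.g. an existing ZFS pool) and recreates the instance
# database entries from what it finds on disk
sudo lxd recover
```

Has anyone used this with a ZFS pool carried over from a broken cluster?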