I have been running a 7-node LXD cluster for a few weeks now, and things were working as expected until an hour ago. The cluster was installed with LXD 4.0, and the snap has since automatically updated to 4.0.1 (I don’t know whether the revisions are silently updated as well).
The problem:
lxc cluster list shows that the cluster nodes stopped exchanging heartbeat messages about an hour ago, and as a result reports all of the nodes as OFFLINE. All the containers/VMs I have running in the cluster remain accessible.
lxc list shows only the local containers/VMs, and reports “ERROR” status for non-local instances (container or VM).
When I run an lxc command that requires connecting to a remote node’s lxd daemon, it reports an issue interacting with the remote cluster member:
Error: Missing event connection with target cluster member
Have restarted the lxd daemons on all of the cluster nodes - no change.
Don’t see anything unusual in the daemon logs on any of the nodes.
Found another place where the log files may be stored - /var/snap/lxd/common/lxd/logs/lxd.log …
On one of the cluster nodes found the following messages:
t=2020-05-05T22:27:35+0000 lvl=warn msg="Failed to get events from node 10.30.30.205:8443: Unable to connect to: 10.30.30.205:8443"
t=2020-05-05T22:27:36+0000 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:10.30.30.197:8443 RaftID:2 RaftRole:0 Raft:true LastHeartbeat:2020-05-05 22:14:49.859025896 +0000 UTC Online:false updated:false}"
t=2020-05-05T22:27:36+0000 lvl=warn msg="Excluding offline node from refresh: {ID:3 Address:10.30.30.215:8443 RaftID:3 RaftRole:2 Raft:true LastHeartbeat:2020-05-05 22:14:49.859373204 +0000 UTC Online:false updated:false}"
t=2020-05-05T22:27:36+0000 lvl=info msg="Refreshing forkdns peers for lxdfan0"
t=2020-05-05T22:27:36+0000 lvl=eror msg="Error refreshing forkdns: Get \"https://10.30.30.205:8443/1.0/networks/lxdfan0/state\": Unable to connect to: 10.30.30.205:8443"
t=2020-05-05T22:27:36+0000 lvl=warn msg="Failed to get events from node 10.30.30.205:8443: Unable to connect to: 10.30.30.205:8443"
t=2020-05-05T22:27:38+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"
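For anyone triaging similar symptoms, the warn/error entries can be separated from routine info chatter with a simple filter. This is only a sketch: the sample lines below are abridged copies of the excerpt above standing in for the real file, which on a snap install lives at the log path mentioned earlier.

```shell
#!/bin/sh
# Filter warn/error entries out of an lxd.log excerpt.
# Sample lines (abridged from the excerpt above) stand in for the real file,
# /var/snap/lxd/common/lxd/logs/lxd.log on a snap install.
cat <<'EOF' > /tmp/lxd-sample.log
t=2020-05-05T22:27:35+0000 lvl=warn msg="Failed to get events from node 10.30.30.205:8443"
t=2020-05-05T22:27:36+0000 lvl=info msg="Refreshing forkdns peers for lxdfan0"
t=2020-05-05T22:27:36+0000 lvl=eror msg="Error refreshing forkdns"
EOF

# Note: LXD's log levels are fixed-width, so "error" is printed as "eror".
grep -E 'lvl=(warn|eror)' /tmp/lxd-sample.log
```

Running this prints only the warn and eror lines, which makes it easier to spot the connectivity failures among the periodic refresh messages.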
name: lxd
summary: System container manager and API
publisher: Canonical*
store-url: https://snapcraft.io/lxd
contact: https://github.com/lxc/lxd/issues
license: unset
description: |
**LXD is a system container manager**
With LXD you can run hundreds of containers of a variety of Linux
distributions, apply resource limits, pass in directories, USB devices
or GPUs and setup any network and storage you want.
LXD containers are lightweight, secure by default and a great
alternative to running Linux virtual machines.
**Run any Linux distribution you want**
Pre-made images are available for Ubuntu, Alpine Linux, ArchLinux,
CentOS, Debian, Fedora, Gentoo, OpenSUSE and more.
A full list of available images can be found here:
https://images.linuxcontainers.org
Can't find the distribution you want? It's easy to make your own images
too, either using our `distrobuilder` tool or by assembling your own image
tarball by hand.
**Containers at scale**
LXD is network aware and all interactions go through a simple REST API,
making it possible to remotely interact with containers on remote
systems, copying and moving them as you wish.
Want to go big? LXD also has built-in clustering support,
letting you turn dozens of servers into one big LXD server.
**Configuration options**
Supported options for the LXD snap (`snap set lxd KEY=VALUE`):
- criu.enable: Enable experimental live-migration support [default=false]
- daemon.debug: Increases logging to debug level [default=false]
- daemon.group: Group of users that can interact with LXD [default=lxd]
- ceph.builtin: Use snap-specific ceph configuration [default=false]
- openvswitch.builtin: Run a snap-specific OVS daemon [default=false]
Documentation: https://lxd.readthedocs.io
commands:
- lxd.benchmark
- lxd.buginfo
- lxd.check-kernel
- lxd.lxc
- lxd.lxc-to-lxd
- lxd
- lxd.migrate
services:
lxd.activate: oneshot, enabled, inactive
lxd.daemon: simple, enabled, active
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: 8 days ago, at 15:50 UTC
channels:
latest/stable: 4.0.1 2020-05-06 (14954) 72MB -
latest/candidate: 4.0.1 2020-05-05 (14954) 72MB -
latest/beta: ^
latest/edge: git-23f392d 2020-05-05 (14962) 72MB -
4.0/stable: 4.0.1 2020-05-06 (14954) 72MB -
4.0/candidate: 4.0.1 2020-05-05 (14954) 72MB -
4.0/beta: ^
4.0/edge: ^
3.23/stable: 3.23 2020-03-30 (14133) 70MB -
3.23/candidate: 3.23 2020-03-30 (14133) 70MB -
3.23/beta: ^
3.23/edge: ^
3.22/stable: 3.22 2020-03-18 (13901) 70MB -
3.22/candidate: 3.22 2020-03-19 (13911) 70MB -
3.22/beta: ^
3.22/edge: ^
3.21/stable: 3.21 2020-02-24 (13522) 69MB -
3.21/candidate: 3.21 2020-03-04 (13588) 69MB -
3.21/beta: ^
3.21/edge: ^
3.20/stable: 3.20 2020-02-06 (13300) 69MB -
3.20/candidate: 3.20 2020-02-06 (13300) 69MB -
3.20/beta: ^
3.20/edge: ^
3.19/stable: 3.19 2020-01-27 (13162) 67MB -
3.19/candidate: 3.19 2020-01-27 (13162) 67MB -
3.19/beta: ^
3.19/edge: ^
3.18/stable: 3.18 2019-12-02 (12631) 57MB -
3.18/candidate: 3.18 2019-12-02 (12631) 57MB -
3.18/beta: ^
3.18/edge: ^
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ^
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.11 2019-10-10 (8023) 28MB -
2.0/candidate: 2.0.11 2019-10-10 (8023) 28MB -
2.0/beta: ^
2.0/edge: git-160221d 2020-01-13 (12854) 27MB -
installed: 4.0.1 (14890) 72MB -
Also, for what it is worth, I found the following when I ran lxc info (on the subset of hosts in the cluster where the client can still connect to the server): is the driver version 4.0.2 expected? (Everything else is 4.0.1.)
One possibility is that creating a new connection in those cases, which we observed to work, does not actually succeed immediately but only after some time. In that case the Go sql package would retry 3 times in quick succession, but would still propagate an error if all three attempts failed.
Yes, that’s expected. It’s the version of the liblxc library that LXD uses.
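To see where each version string comes from, the environment section of the lxc info output can be inspected. A sketch, with a sample environment block inlined in place of the live command (the values are taken from this thread; on a real node you would simply pipe lxc info into grep):

```shell
#!/bin/sh
# On a live node:  lxc info | grep -E 'driver|server_version'
# A sample environment block (values from this thread) stands in for it here.
cat <<'EOF' | grep -E 'driver|server_version'
environment:
  driver: lxc
  driver_version: 4.0.2
  kernel: Linux
  server_version: "4.0.1"
EOF
```

Here driver_version reports the bundled liblxc (4.0.2), while server_version is LXD itself (4.0.1), which is why the two numbers legitimately differ.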
Please can you try to:
1. Stop all your LXD daemons: systemctl stop snap.lxd.daemon should be enough, but double-check with ps aux | grep lxd or similar.
2. Run lxd cluster list-database on a few nodes, check what it returns and see if it’s the same output across nodes.
3. Assuming step 2 returns the same output, start just the nodes in that list, and monitor lxd.log. After a few seconds you should see that the cluster is up (check it with lxc cluster list).
4. Try to start the remaining nodes too.
If anything in these steps goes wrong, please follow up here with the details.
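The steps above can be sketched as a script. This is a dry run that only echoes each command rather than executing it; the node names are placeholders, and swapping the run function for an ssh invocation would make it act on a real cluster. The lxd cluster list-database output still has to be compared across nodes by eye.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps above: commands are echoed, not run.
# NODES is a placeholder list; replace run() with e.g. ssh "$node" "$@" to act.
NODES="node1 node2 node3"

run() { echo "+ [$node] $*"; }

# 1. Stop all LXD daemons on every cluster member.
for node in $NODES; do
  run systemctl stop snap.lxd.daemon
done

# 2. Collect the database member list from a few nodes and compare.
for node in $NODES; do
  run lxd cluster list-database
done

# 3. Start the database nodes first, then watch lxd.log and verify
#    membership with:  lxc cluster list
for node in $NODES; do
  run systemctl start snap.lxd.daemon
done
```

Each echoed line is prefixed with the node it would target, so the full sequence can be reviewed before running anything for real.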
Okay, this morning when I checked the cluster … everything seems to be back to normal. lxc cluster list shows all of the cluster nodes online, and I am able to carry out standard lxc ... operations from each of the nodes.
Strange. I am curious to understand both why the system went off the rails yesterday evening and how it repaired itself overnight. I am concerned that the system entered the bad state on its own, without any apparent external trigger …
The only change I can see is that overnight, snap appears to have refreshed the lxd version to 4.0.1, rev 14954. When I reported the issue, the version on the nodes was lxd 4.0.1, rev 14890.
The system entering a bad state on its own is unlikely, at least in our experience. Most probably some race condition occurred during the upgrade, which led to the issue, and that’s where we need to investigate.
I came across the very same error on my 3-node cluster. However, I believe that in my case it’s caused by a faulty HDD in that server, as one of its 5 HDDs has been warning me of its malfunction. My temporary solution is to add a 4th node and back up the apps, to give me time to troubleshoot the problem while keeping the cluster running normally.