LXD 4.0.1, Rev 14890 - Cluster heartbeats stopped

I have been running a 7-node LXD cluster for a few weeks now, and things were working as expected until an hour ago. The cluster was installed with LXD 4.0, and the snap has since automatically updated to 4.0.1 (I don't know whether revisions are silently updated as well).

The problem:

  • lxc cluster list shows that the cluster nodes stopped exchanging heartbeat messages as of an hour ago, and as a result all of the nodes are shown in the OFFLINE state. All the containers/VMs I have running in the cluster remain accessible.
  • lxc list only shows details for the local containers/VMs, and reports an "ERROR" status for non-local instances (container or VM).
  • When I run an lxc command that requires connecting to a remote node's lxd, it reports a problem interacting with the remote cluster member:
Error: Missing event connection with target cluster member

I have restarted the lxd daemons on all of the cluster nodes - no change.
I don't see anything unusual in the daemon logs on any of the nodes.

Any suggestions for how to recover from this?

Thanks

On some of the cluster nodes, lxc list errors out with the following message:

Error: failed to begin transaction: not an error

The daemon logs don’t show anything off on these nodes.

I found another place where the log files are stored: /var/snap/lxd/common/lxd/logs/lxd.log
On one of the cluster nodes I found the following messages:

t=2020-05-05T22:27:35+0000 lvl=warn msg="Failed to get events from node 10.30.30.205:8443: Unable to connect to: 10.30.30.205:8443" 
t=2020-05-05T22:27:36+0000 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:10.30.30.197:8443 RaftID:2 RaftRole:0 Raft:true LastHeartbeat:2020-05-05 22:14:49.859025896 +0000 UTC Online:false updated:false}" 
t=2020-05-05T22:27:36+0000 lvl=warn msg="Excluding offline node from refresh: {ID:3 Address:10.30.30.215:8443 RaftID:3 RaftRole:2 Raft:true LastHeartbeat:2020-05-05 22:14:49.859373204 +0000 UTC Online:false updated:false}" 
t=2020-05-05T22:27:36+0000 lvl=info msg="Refreshing forkdns peers for lxdfan0" 
t=2020-05-05T22:27:36+0000 lvl=eror msg="Error refreshing forkdns: Get \"https://10.30.30.205:8443/1.0/networks/lxdfan0/state\": Unable to connect to: 10.30.30.205:8443" 
t=2020-05-05T22:27:36+0000 lvl=warn msg="Failed to get events from node 10.30.30.205:8443: Unable to connect to: 10.30.30.205:8443" 
t=2020-05-05T22:27:38+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"

Can you show snap info lxd?

The "not an error" failure shouldn't be possible anymore, but we've somehow still been seeing it pop up lately…

name:      lxd
summary:   System container manager and API
publisher: Canonical✓
store-url: https://snapcraft.io/lxd
contact:   https://github.com/lxc/lxd/issues
license:   unset
description: |
  **LXD is a system container manager**
  
  With LXD you can run hundreds of containers of a variety of Linux
  distributions, apply resource limits, pass in directories, USB devices
  or GPUs and setup any network and storage you want.
  
  LXD containers are lightweight, secure by default and a great
  alternative to running Linux virtual machines.
  
  
  **Run any Linux distribution you want**
  
  Pre-made images are available for Ubuntu, Alpine Linux, ArchLinux,
  CentOS, Debian, Fedora, Gentoo, OpenSUSE and more.
  A full list of available images can be found here:
  https://images.linuxcontainers.org
  
  Can't find the distribution you want? It's easy to make your own images
  too, either using our `distrobuilder` tool or by assembling your own image
  tarball by hand.
  
  
  **Containers at scale**
  
  LXD is network aware and all interactions go through a simple REST API,
  making it possible to remotely interact with containers on remote
  systems, copying and moving them as you wish.
  
  Want to go big? LXD also has built-in clustering support,
  letting you turn dozens of servers into one big LXD server.
  
  
  **Configuration options**
  
  Supported options for the LXD snap (`snap set lxd KEY=VALUE`):
   - criu.enable: Enable experimental live-migration support [default=false]
   - daemon.debug: Increases logging to debug level [default=false]
   - daemon.group: Group of users that can interact with LXD [default=lxd]
   - ceph.builtin: Use snap-specific ceph configuration [default=false]
   - openvswitch.builtin: Run a snap-specific OVS daemon [default=false]
  
  Documentation: https://lxd.readthedocs.io
commands:
  - lxd.benchmark
  - lxd.buginfo
  - lxd.check-kernel
  - lxd.lxc
  - lxd.lxc-to-lxd
  - lxd
  - lxd.migrate
services:
  lxd.activate: oneshot, enabled, inactive
  lxd.daemon:   simple, enabled, active
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: 8 days ago, at 15:50 UTC
channels:
  latest/stable:    4.0.1       2020-05-06 (14954) 72MB -
  latest/candidate: 4.0.1       2020-05-05 (14954) 72MB -
  latest/beta:      ^                                   
  latest/edge:      git-23f392d 2020-05-05 (14962) 72MB -
  4.0/stable:       4.0.1       2020-05-06 (14954) 72MB -
  4.0/candidate:    4.0.1       2020-05-05 (14954) 72MB -
  4.0/beta:         ^                                   
  4.0/edge:         ^                                   
  3.23/stable:      3.23        2020-03-30 (14133) 70MB -
  3.23/candidate:   3.23        2020-03-30 (14133) 70MB -
  3.23/beta:        ^                                   
  3.23/edge:        ^                                   
  3.22/stable:      3.22        2020-03-18 (13901) 70MB -
  3.22/candidate:   3.22        2020-03-19 (13911) 70MB -
  3.22/beta:        ^                                   
  3.22/edge:        ^                                   
  3.21/stable:      3.21        2020-02-24 (13522) 69MB -
  3.21/candidate:   3.21        2020-03-04 (13588) 69MB -
  3.21/beta:        ^                                   
  3.21/edge:        ^                                   
  3.20/stable:      3.20        2020-02-06 (13300) 69MB -
  3.20/candidate:   3.20        2020-02-06 (13300) 69MB -
  3.20/beta:        ^                                   
  3.20/edge:        ^                                   
  3.19/stable:      3.19        2020-01-27 (13162) 67MB -
  3.19/candidate:   3.19        2020-01-27 (13162) 67MB -
  3.19/beta:        ^                                   
  3.19/edge:        ^                                   
  3.18/stable:      3.18        2019-12-02 (12631) 57MB -
  3.18/candidate:   3.18        2019-12-02 (12631) 57MB -
  3.18/beta:        ^                                   
  3.18/edge:        ^                                   
  3.0/stable:       3.0.4       2019-10-10 (11348) 55MB -
  3.0/candidate:    3.0.4       2019-10-10 (11348) 55MB -
  3.0/beta:         ^                                   
  3.0/edge:         git-81b81b9 2019-10-10 (11362) 55MB -
  2.0/stable:       2.0.11      2019-10-10  (8023) 28MB -
  2.0/candidate:    2.0.11      2019-10-10  (8023) 28MB -
  2.0/beta:         ^                                   
  2.0/edge:         git-160221d 2020-01-13 (12854) 27MB -
installed:          4.0.1                  (14890) 72MB -

Also, for what it is worth, I found the following when I ran lxc info (on the subset of hosts in the cluster where the client is able to connect to the server): is driver version 4.0.2 expected? (Everything else is 4.0.1.)

environment:
  addresses:
  - 10.30.30.204:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICCTCCAY+gAwIBAgIQXIw2fbbsZeh7aXRX6kUsrTAKBggqhkjOPQQDAzA2MRww
    GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRYwFAYDVQQDDA1yb290QGM0YWty
    aTAxMB4XDTIwMDMyNzIyMTQyOVoXDTMwMDMyNTIyMTQyOVowNjEcMBoGA1UEChMT
    bGludXhjb250YWluZXJzLm9yZzEWMBQGA1UEAwwNcm9vdEBjNGFrcmkwMTB2MBAG
    ByqGSM49AgEGBSuBBAAiA2IABNU2wxRxynV1Fw/ZZyViOhvKbMSlpkiFils4ALsm
    aBbTivARA5k9QRlDkCbxmUrdGBfX9yE9gWR72NHegtjvCfOWJHRkTMJjhLdQkGaq
    GJkFlDR2P3CsAQAY2hsETLUB/KNiMGAwDgYDVR0PAQH/BAQDAgWgMBMGA1UdJQQM
    MAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwKwYDVR0RBCQwIoIIYzRha3JpMDGH
    BH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIwUs9/Ah4T
    TsmHkpPnd5rYo9FZQpX7AsNq2RAlGgumcs5Ze9Xzo+Ndl6+dEAt0dsiYAjEA2PXX
    ydXDlIn0l5aA1aTCMG3R6TTUGJjiwGk39N2wCzOlhek0rfv2Oo0d0Q9M9EWy
    -----END CERTIFICATE-----
  certificate_fingerprint: 76c43449250e367dd5dba79c9c04a2b70dc3a362a70cab31ccd3f70bcc9bcd6d
  driver: lxc
  driver_version: 4.0.2
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "false"
    seccomp_listener: "false"
    seccomp_listener_continue: "false"
    shiftfs: "false"
    uevent_injection: "false"
    unpriv_fscaps: "true"
  kernel_version: 4.15.0-91-generic
  lxc_features:
    cgroup2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    seccomp_notify: "true"
  os_name: Ubuntu
  os_version: "18.04"
  project: default
  server: lxd
  server_clustered: true
  server_name: c4akri01
  server_pid: 22651
  server_version: 4.0.1
  storage: zfs
  storage_version: 0.7.5-1ubuntu16.7

One possibility is that the behavior we observed, where creating a new connection in those cases works, does not actually take effect immediately but only after some time. In that case the Go sql package would retry 3 times in fast succession, but would still propagate "not an error" if all three attempts fail.
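
To make that concrete, here is a minimal sketch in Go (illustrative only, not the actual database/sql or LXD code; the names and retry count are assumptions based on the description above) of that failure mode: the retries happen back to back, so a connection that only becomes usable after some delay fails all of them, and the last error is what the caller ends up seeing as "failed to begin transaction: not an error".

package main

import (
	"errors"
	"fmt"
)

// errNotAnError stands in for the misleading SQLite error string seen above.
var errNotAnError = errors.New("failed to begin transaction: not an error")

// begin simulates starting a transaction on a connection that only becomes
// usable after a certain number of attempts, i.e. "after some time".
func begin(attempt, attemptsUntilUsable int) error {
	if attempt < attemptsUntilUsable {
		return errNotAnError
	}
	return nil
}

// beginWithRetries mimics the retry loop described above: a fixed number of
// attempts in fast succession, propagating the last error if all of them fail.
func beginWithRetries(maxAttempts, attemptsUntilUsable int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = begin(attempt, attemptsUntilUsable); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	// Three fast attempts against a connection that needs five to recover:
	// the caller still sees the propagated "not an error".
	fmt.Println(beginWithRetries(3, 5))
}

The point of the sketch is just that the retry loop has no delay in it, so any recovery that takes longer than the loop itself is invisible to it.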

Yes, that’s expected. It’s the version of the liblxc library that LXD uses.

Please can you try to:

  1. Stop all your LXD daemons: systemctl stop snap.lxd.daemon should be enough, but double check with ps aux | grep lxd or similar.

  2. Run lxd cluster list-database on a few nodes, check what it returns and see if it’s the same output across nodes.

  3. Assuming 2) returns the same output on every node, start just the nodes in that list and monitor lxd.log. After a few seconds you should see that the cluster is up (check it with lxc cluster list).

  4. Try to start the remaining nodes too.

If anything in these steps goes wrong, please follow up here with the details.

Okay, this morning when I checked the cluster … everything seems to be back to normal. lxc cluster list shows all of the cluster nodes online, and I am able to carry out standard lxc ... operations from each of the nodes.

Strange. I am curious to understand both why the system went off the rails yesterday evening and how it repaired itself overnight. I am concerned that the system entered the bad state all on its own, without any apparent external trigger …

The only change I can see is that overnight, snap appears to have refreshed the lxd version to 4.0.1, rev 14954. When I reported the issue, the version on the nodes was lxd 4.0.1, rev 14890.

The system entering a bad state on its own is unlikely, at least in our experience. Most probably some race condition occurred during the upgrade, which led to the issue, and that is where we need to investigate.

I came across the very same error on my 3-node cluster. However, I believe that in my case it is caused by a faulty HDD on that server, as one of its 5 HDDs has been warning me of an impending failure. My temporary solution is to add a 4th node and back up the apps, so as to leave me time to troubleshoot the problem while keeping the cluster running normally.