LXD 4.19 has been released

Introduction

The LXD team is very excited to announce the release of LXD 4.19!

This is a release that’s very busy on the bugfixing front with a lot of improvements around clustering, including improved shutdown logic, easier disaster recovery, improved logging and better handling of a variety of network setups.

There are also a number of fixes and minor improvements to the recently added network forwards feature, which now properly integrates with BGP and gains a new lxc network forward get command.

The headline feature for this release is the addition of instance metrics: a new /1.0/metrics API endpoint which exposes metrics in the OpenMetrics text format, suitable for scraping with tools like Prometheus.

Enjoy!

New features and highlights

Instance metrics

A frequent request over the years has been for a better way to track instance resource usage. This becomes particularly critical on busy systems with many projects or even multiple clustered servers.

To handle this, LXD 4.19 introduces a new /1.0/metrics API endpoint which exposes metrics in the OpenMetrics text format, suitable for use with Prometheus and similar tools.

As it stands it provides a variety of metrics related to:

  • CPU
  • Memory
  • Disk
  • Network
  • Processes

In general, we’ve tried to keep the metric names aligned with those of node-exporter which should then make adapting existing dashboards and tooling pretty easy.

The endpoint is always available to authenticated users, but it can also be configured to listen on an additional address with core.metrics_address. Additional trusted certificates can also be added which are restricted to the metrics interface only (lxc config trust add --type metrics).
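As a rough sketch of how this might be wired up (the listen address, port and certificate file names below are placeholders for illustration, not defaults):

# Optionally expose the metrics endpoint on a dedicated address
lxc config set core.metrics_address "[::]:8444"

# Trust a client certificate for the metrics interface only
lxc config trust add metrics.crt --type metrics

# Scrape the endpoint manually using that certificate
curl -k --cert metrics.crt --key metrics.key https://localhost:8444/1.0/metrics

In a Prometheus setup, the same URL and client certificate would typically go into a scrape job configuration.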

Example output at: https://gist.github.com/stgraber/ab7f204fb4bf53dbe134f6460bf41470

Specification: [LXD] Metric exporter for instances
Documentation: Instance metrics exporter | LXD

Reworked output for lxc cluster list

The lxc cluster list output was changed from showing a simple YES/NO value in a DATABASE column to instead showing a list of roles.

Currently the roles are database or database-standby but more will be added in the future. This makes it easier to understand exactly what each clustered server is doing.

stgraber@dakara:~$ lxc cluster list s-dcmtl-cluster:
+---------+-------------------------------------+----------+--------------+----------------+----------------------+--------+-------------------+
|  NAME   |                 URL                 |  ROLES   | ARCHITECTURE | FAILURE DOMAIN |     DESCRIPTION      | STATE  |      MESSAGE      |
+---------+-------------------------------------+----------+--------------+----------------+----------------------+--------+-------------------+
| abydos  | https://[2602:fd23:8:200::100]:8443 | database | x86_64       | default        | HIVE - top server    | ONLINE | Fully operational |
+---------+-------------------------------------+----------+--------------+----------------+----------------------+--------+-------------------+
| langara | https://[2602:fd23:8:200::101]:8443 | database | x86_64       | default        | HIVE - middle server | ONLINE | Fully operational |
+---------+-------------------------------------+----------+--------------+----------------+----------------------+--------+-------------------+
| orilla  | https://[2602:fd23:8:200::102]:8443 | database | x86_64       | default        | HIVE - bottom server | ONLINE | Fully operational |
+---------+-------------------------------------+----------+--------------+----------------+----------------------+--------+-------------------+

Export of block custom storage volumes

It’s now possible to export block custom storage volumes using lxc storage volume export just as it is for filesystem volumes.

Note however that block custom storage volumes tend to end up being significantly larger than the filesystem ones and so can take quite a bit of resources to export and import.
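As a rough example (the pool and volume names here are placeholders), the workflow mirrors the existing one for filesystem volumes:

# Export a custom block volume to a backup tarball
lxc storage volume export default my-block-vol ./my-block-vol.tar.gz

# Re-import it later, optionally under a new name
lxc storage volume import default ./my-block-vol.tar.gz my-block-vol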

Complete changelog

Here is a complete list of all changes in this release:

Full commit list
  • lxd/util/net: Update CanonicalNetworkAddress to return canconical IP
  • lxd/util/net: Update IsAddressCovered to use net.IP when comparing IP equality
  • lxd/endpoints/cluster: Improve error message in ClusterUpdateAddress
  • lxd/endpoints/network: Improve error message in NetworkUpdateAddress
  • lxd/util/net: Improve comment in CanonicalNetworkAddress
  • lxd/main/init/interactive: Use util.CanonicalNetworkAddress in askClustering
  • lxd/main/init: Use util.CanonicalNetworkAddress when constructing address from join token
  • lxd/main/init: Ensure config.Cluster.ServerAddress and config.Cluster.ClusterAddress are in canonical form
  • doc: Adds network forwards to left hand nav
  • doc/server: Fix incorrect default for routerid
  • lxd/endpoints/endpoints: require set network listener before checking coverage
  • test/suites/clustering: add enable clustering test on lxd reload
  • lxd/resources/network: send not-found error instead of internal error
  • shared/util: rename DefaultPort to HTTPSDefaultPort
  • lxd/util/net: specify default port to CanonicalNetworkAddress
  • lxd/util/net: specify default port to CanonicalNetworkAddressFromAddressAndPort
  • shared/util: add HTTPDefaultPort
  • lxd/endpoints/pprof: use HTTP port instead of HTTPS for debug address
  • lxd/node/config: Canonicalize core.debug_address
  • lxd/daemon: Move ahead startTime
  • lxd/warnings: Add ResolveWarningsOlderThan
  • lxd/daemon: Resolve warnings earlier than startTime
  • lxc: Fix aliases containing @ARGS@
  • lxd/db/raft: rename RemoteRaftNode to RemoveRaftNode
  • lxd/db/node/update: Add updateFromV41
  • lxd/db/node/schema: update schema
  • lxd/db/raft: add Name field to RaftNode
  • lxd/storage/driver/zfs: Fix ListVolumes with custom zpool
  • lxd/node/raft: use empty Name if not yet clustered
  • lxd/cluster: handle Name field for RaftNode
  • lxd/cluster/gateway: populate RaftNode Name from global database
  • lxd/api/cluster: add Name field to internalRaftNode struct
  • lxd/main/cluster: add name to ‘lxd cluster show/edit’
  • lxd/test: add Name field to RaftNode tests
  • lxd/cluster/recover: append to patch.global.sql if exists
  • lxd/main/cluster: make segmentID a comment instead of struct field
  • doc/clustering: update ‘lxd cluster edit’ docs
  • lxd: Fix swagger definitions to avoid conflicts
  • doc/rest-api: Refresh swagger YAML
  • doc/instances: Clarify default CPU/RAM for VMs
  • lxd/networks: Handle stateful DHCPv6 leases
  • lxd/networks: Add EUI64 records to leases
  • lxd/device/nic: ensure instance device IP is different from parent network
  • lxd/network/driver/common: Adds bgpNextHopAddress function
  • lxd/network/driver/common: Reduce duplication of logic in bgpSetupPrefixes and uses bgpNextHopAddress
  • lxd/network/driver/common: Removes unnecessary function n.bgpClearPrefixes
  • lxd/network/driver/common: Improve errors in bgpSetup
  • lxd/network/driver/common: Clear address forward BGP prefixes in bgpClear
  • lxd/network/driver/bridge: Setup BGP prefix export in forwardsSetup
  • lxd/daemon/storage: unmount all storage pools on shutdown
  • lxd/project: Change restrictions check function in CheckClusterTargetRestriction
  • lxd/network/network/interface: Adds clientType arg to Forward management functions
  • lxd/network/driver: Add clientType to Forward management functions
  • lxd/network/driver/common: Remove empty newline
  • lxd/network/forwards: Pass clientType into Forward management functions
  • lxd/network/driver/ovn: Update Forward management functions to only apply changes for ClientTypeNormal requests
  • lxd/network/forwards: Removes duplicate record check from networkForwardsPost
  • lxd/network/driver: Moves duplicate forward record check into drivers
  • lxd/network/driver/ovn: Adds cluster member notification to Forward management functions
  • lxd/network/driver/ovn: Refresh BGP prefixes on Forward management
  • lxd/network/driver/common: Include exporting forward addresses in bgpSetup
  • lxd/network/driver/bridge: Remove BGP forward address refresh from forwardSetup
  • lxd/network/driver/bridge: Rename forwardsSetup to forwardSetupFirewall
  • test: Adds BGP prefix export checks to forward tests
  • lxd/cluster/heartbeat: Adds Name field to APIHeartbeatMember
  • lxd/cluster/heartbeat: Preallocate raftNodeMap in Update
  • lxd/cluster/heartbeat: Populate Name in Update
  • lxd/cluster/gateway: Update currentRaftNodes to use a single query to get cluster member info
  • lxd/cluster/gateway: Preallocate raftNodes slice for efficiency
  • lxd/cluster/gateway: Do not query leader cluster DB to enrich raft member name in HandlerFuncs
  • lxd/cluster/recover: Preallocate nodes in Reconfigure
  • lxd/util: Respect modprobe configuration
  • shared/instance: don’t allow ‘limits.memory’ to be 0
  • lxd/cgroup: Add GetMemoryStats
  • lxd/cgroup: Add GetIOStats
  • lxd/cgroup: Add GetCPUAcctUsageAll
  • lxd/cgroup: Add GetTotalProcesses
  • lxd/response: Add SyncResponsePlain
  • lxd/storage/filesystem: Add FSTypeToName
  • lxd/network/openvswitch/ovn: Work around a bug in lr-nat-del in ovn-nbctl in LogicalRouterDNATSNATAdd
  • shared/api/network/forward: Fix api extension references
  • lxd/network/forwards: Use consistent terminology in network address forward swagger comments
  • doc/rest-api: Refresh swagger YAML
  • test: Remove restart tests that don’t use --force
  • lxd/daemon/storage: Skip unmounting LVM pools in daemonStorageUnmount
  • lxc: Cleanup LXD client imports
  • lxd: Cleanup LXD client imports
  • lxc-to-lxd: Cleanup LXD client imports
  • lxc/cluster: Show roles instead of database column
  • tests: Support for showing roles by
  • i18n: Update translation templates
  • doc: update link to rest-api.yaml
  • Typo
  • lxd/device/tpm: Require path only for containers
  • lxd/instance: Fix response for patch
  • swagger: Fix return code for operations
  • doc/rest-api: Refresh swagger YAML
  • lxd/endpoints/network: Specify protocol version for 0.0.0.0 address
  • doc: Document recently added architectures
  • seccomp: Add riscv64 syscall mappings
  • shared/api: Add CertificateTypeMetrics
  • lxd/db: Add CertificateTypeMetrics
  • lxd: Check metrics certificates
  • lxc/config_trust: Allow adding metrics certificates
  • lxd/metrics: Add API types
  • lxd/metrics: Add types
  • lxd/metrics: Add helper functions
  • lxd: Add metrics related fields to daemon
  • lxd: Add /1.0/metrics endpoint
  • lxd/instance/drivers: Add Metrics function
  • lxd-agent: Add metrics endpoint
  • api: Add metrics API extension
  • i18n: Update translation templates
  • doc/rest-api: Refresh swagger YAML
  • doc: Add metrics.md
  • doc: Mention core.metrics_address
  • test/suites: Add lxd/metrics to static analysis
  • shared/util: Add HTTPSMetricsDefaultPort
  • lxd/node: Add core.metrics_address config key
  • lxd/endpoints: Add metrics endpoint
  • lxd: Handle metrics server
  • test: Add metrics test
  • lxd/daemon/storage: Renames daemonStorageUnmount to daemonStorageVolumesUnmount
  • lxd/daemon: Rename numRunningContainers numRunningInstances
  • Fix documented HTTP return code in console POST
  • doc/rest-api: Refresh swagger YAML
  • lxd/main/daemon: Rework cmdDaemon shutdown process
  • lxd/storage/drivers/driver/lvm: Fix Unmount to be more reliable
  • lxd/storage/drivers/driver/lvm: Fix Mount to be more reliable
  • lxd/main/daemon: Removes LVM shutdown unmount workaround
  • doc/rest-api: Add missing entry for 112 (error)
  • lxd/instance/drivers: Move raw.lxc config load to separate function
  • lxd/instance/drivers: Fix raw.lxc handling for shutdown/stop
  • lxd/storage/filesystem: Removes duplicated constants from unix package
  • lxd/storage/filesystem/fs: Removes duplicated constants from unix package
  • lxd/storage/filesystem/fs: Update FSTypeToName to work on 32bit platforms
  • lxd/instance/drivers/driver/lxc: filesystem.FSTypeToName usage
  • lxd-agent/metrics: filesystem.FSTypeToName usage
  • lxd/storage/drivers/driver/lvm: Skip unmount
  • lxd/cgroup: Implement CPU usage for cgroup v2
  • shared/json: Removes DebugJson from shared
  • lxd/cgroup: Fix logging in cgroup init
  • lxd/util/http: Adds DebugJSON function
  • lxd/util/http: Adds debugLogger arg to WriteJSON
  • lxd/main: Set response debug mode based on --debug flag
  • lxd/response/response: Reworks syncResponse to use util.WriteJSON
  • lxd/response/response: Adds util.DebugJSON support to errorResponse
  • lxd/operations/response: Adds util.WriteJSON support to operationResponse
  • lxd/operations/response: Adds util.WriteJSON support to forwardedOperationResponse
  • lxd/endpoints/endpoints/test: util.WriteJSON usage
  • lxd/cluster/notify/test: util.WriteJSON usage
  • lxd/devlxd: Adds util.WriteJSON support to hoistReq
  • lxd-agent/devlxd: Add util.WriteJSON support to hoistReq
  • lxd-agent/server: util.DebugJSON usage
  • lxd/daemon: Clearer logging of API requests in createCmd
  • lxd/daemon: util.DebugJSON usage in createCmd
  • lxd/cluster/gateway: util.WriteJSON usage
  • lxd/response/response: Use api.ResponseRaw in error response
  • client/interfaces: Corrects typo in GetNetworkForward
  • lxd/db/network/forwards: Fix error handling in GetNetworkForward
  • lxd/instances: containerStopList → instanceStopList
  • lxd/instances: Handle VMs in instancesOnDisk
  • lxd/instances: s/containers/instances/
  • lxd/instances: Rename old container variables
  • lxd/instances: Check DB before calling VolatileSet
  • lxc/network/forward: Add lxc network forward get command
  • i18n: Update translation templates
  • lxd/util: Handle ‘:8443’ syntax in ListenAddresses
  • lxd/util/http: Improve comment on ListenAddresses
  • lxd/util/http: Improve argument name in configListenAddress
  • lxd/util/http: Use net.JoinHostPort in ListenAddresses rather than wrapping IPv6 addresses in []
  • lxd/util/http: Improve ListenAddresses by breaking the parsing into phases
  • lxd/util/http/test: Adds ExampleListenAddresses function
  • lxd: Remove public facing errors that mention cluster “node”
  • shared/api/url: Adds URL builder type and functions
  • lxd/network/network/utils: Updates UsedBy to use api.URLBuild
  • doc/metrics: typo fix
  • lxc/file: use flagMkdir to create dirs on lxc pull
  • lxc/file: add DirMode constant for ‘lxc file’
  • lxd/api/cluster: only change member role from leader
  • test/suites/clustering: wait for node shutdown to propagate to members
  • lxd/storage/drivers: Support generic custom block volume backup/restore
  • lxd/storage/drivers/zfs: Drop restriction on custom block volume backup/restore
  • lxd/storage/drivers/btrfs: Drop restriction on custom block volume backup/restore
  • lxd/main/shutdown: Updates cmdShutdown to handle /internal/shutdown being synchronous
  • lxd/api/internal: Updates shutdown request to wait for d.shutdownDoneCtx
  • lxd/main/daemon: Call d.shutdownDoneCancel when daemon function ends
  • lxd/daemon: Adds shutdownDoneCtx context to indicate shutdown has finished
  • lxd: d.shutdownCtx usage
  • lxd/main/daemon: d.shutdownCancel usage in daemon function
  • lxc/config_trust: Delete only works on fingerprints
  • i18n: Update translation templates
  • test: Log PID of process being killed
  • test: Require node removal to succeed in test_clustering_remove_leader
  • lxd/storage/drivers: Checks that mount refCount is zero in all drivers
  • lxd/storage/drivers/driver/cephfs/volumes: Adds mount ref counting
  • lxd/device/disk: Use errors.Is() when checking for storageDrivers.ErrInUse in Update
  • lxd/device/disk: Ignore storageDrivers.ErrInUse error from pool.UnmountCustomVolume in postStop
  • lxd/storage/drivers: Log volName in UnmountVolume
  • lxd/instance/drivers: Add instance type to metrics
  • lxd: add core scheduling support
  • lxd/response/response: Adds manualResponse type
  • lxd/api/cluster: Removes arbitrary 3s wait in clusterPutDisable which was causing test issues
  • test: Wait for daemons to exit in test_clustering_remove_leader
  • lxd/api/cluster: Add logging to clusterPutDisable
  • test: Detect if clustering network needs removing
  • lxd/qemu: Disable large decrementor on ppc64le
  • lxd/daemon: Reworks shutdown sequence
  • lxd/daemon: Reworks Stop
  • lxd/api/cluster: d.shutdownCtx.Err usage
  • lxd/api/internal: d.shutdownCtx.Err usage
  • lxd: daemon.Stop usage
  • lxd/operations: Updates waitForOperations to accept context
  • lxd/main/shutdown: Require valid response from /internal/shutdown in cmdShutdown
  • lxd: db.OpenCluster usage
  • lxd/cluster/membership: Update notifyNodesUpdate to wait until all heartbeats have been sent
  • lxd/db/db: Replace clusterMu and closing with closingCtx in OpenCluster
  • lxd/api/cluster: Improves logging
  • lxd/api/internal: Rework internalShutdown to return valid response as LXD is shutdown
  • lxd/daemon: db.OpenCluster usage in init
  • lxd/daemon: Improved logging and error handling in init
  • lxd/main/daemon: Reworks cmdDaemon to use d.shutdownDoneCh and call d.Stop()
  • test: Increase timeouts on ping tests
  • lxd/daemon: Adds daemon started log
  • lxd/daemon: Whitespace in NodeRefreshTask
  • lxd/api/cluster: Improve logging in handoverMemberRole
  • lxd/api/cluster: Adds cluster logging
  • test: Addition test logging
  • lxd/cluster/membership: Improve logging in Rebalance
  • lxd/daemon: Stop clustering tasks during Stop
  • lxd/api/cluster: Improve logging in clusterNodeDelete
  • test: Try and kill LXD daemon that fails to start
  • lxd/dameon: Removes unnecessary go routines in NodeRefreshTask
  • lxd/db/db: Use db.PingContext in OpenCluster
  • lxd/db/db: Rework logging and error handling in OpenCluster
  • lxc/file: Fix file push help message
  • lxd/storage/drivers: Handle symlinks when walking file tree
  • test/suites/backup: Add cephfs
  • test/suites/backup: Check file content for storage volume backups
  • i18n: Update translation templates
  • lxd/cgroup: Fix GetIOStats on cgroup2
  • lxd/endpoints/network/test: Test tcp4 interface and request via IPv6
  • lxd/endpoints/network/test: Test tcp4 connection with configured 0.0.0.0 network address
  • i18n: Update translations from weblate
  • gomod: Update dependencies

Try it for yourself

This new LXD release is already available for you to try on our demo service.

Downloads

The release tarballs can be found on our download page.

Binary builds are also available for:

  • Linux: snap install lxd
  • MacOS: brew install lxc
  • Windows: choco install lxc

This is now rolling out to our stable snap users.

Note that this is the first LXD release where we’re using phased rollout, primarily to avoid creating a lot of stress on the infrastructure as everyone updates. The full rollout is expected to take up to 48h but we may speed it up if we see everything going smoothly.

Looks like I have a problem: last night one of the nodes in my cluster tried to update to LXD 4.19 (via snap). However, the other nodes did not update, so the node fell off the cluster. Since then I have not been able to put it back. I have reverted lxd to version 4.18, but LXD does not start, giving me the error:

lxd.daemon[200097]: Error: Failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade

Looks like snap still thinks that a new version is available:

$ snap refresh --list
Name  Version  Rev    Publisher   Notes
lxd   4.19     21624  canonical✓  -

However, version 4.19 is not currently listed in the stable channel which is being tracked:

$ snap info lxd
...
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable/ubuntu-20.04
refresh-date: today at 16:18 UTC
channels:
  latest/stable:    4.18        2021-09-13 (21497) 75MB -
  latest/candidate: 4.19        2021-10-04 (21624) 76MB -
  latest/beta:      ↑                                   
  latest/edge:      git-e6523c3 2021-10-05 (21636) 76MB -
...
installed:          4.18                   (21497) 75MB in-cohort

The other nodes on version 4.18 show identical snap info output except for the in-cohort at the end of the installed line. What does it mean? They do not list any updates in snap refresh --list.

Also, after reverting to 4.18, the node can no longer update to 4.19: refreshing gets stuck.

Any idea how to recover?

So the fact that you have servers that aren’t in the cohort would explain the version difference; it’s puzzling that they don’t have the cohort key though…

To recover, run on all servers:

  • snap switch lxd --cohort=+
  • snap refresh lxd

This should get them all on 4.19.


I just tried to restart lxd on one of the other nodes:

sudo snap disable lxd
sudo snap enable lxd

After that this node also starts seeing the update:

$ snap refresh --list
Name  Version  Rev    Publisher   Notes
lxd   4.19     21624  canonical✓  -

However, 4.19 is still listed in the latest/candidate channel. Confused…

It’s normal that you don’t see 4.19 in stable yet; that’s because of the phased rollout. The in-cohort line should however appear on all clustered servers, otherwise you can end up in a situation where some get the new release and some don’t.

Can you show a journalctl -u snap.lxd.daemon -n 500 of a server which didn’t have the in-cohort listed?

Thanks! Adding --cohort=+ helped bring the update to 4.19 to 3 out of 4 servers. However, one server still does not see the update. I guess I need to wait?

Here you are (the hostname has been redacted):

-- Logs begin at Sat 2021-02-20 12:59:16 UTC, end at Tue 2021-10-05 17:07:19 UTC. --
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   3: fd:   9: perf_event
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   4: fd:  10: freezer
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   5: fd:  11: rdma
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   6: fd:  12: net_cls,net_prio
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   7: fd:  13: pids
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   8: fd:  14: cpu,cpuacct
Oct 05 16:36:13 server1 lxd.daemon[3386139]:   9: fd:  15: blkio
Oct 05 16:36:13 server1 lxd.daemon[3386139]:  10: fd:  16: hugetlb
Oct 05 16:36:13 server1 lxd.daemon[3386139]:  11: fd:  17: memory
Oct 05 16:36:13 server1 lxd.daemon[3386139]:  12: fd:  19: cpuset
Oct 05 16:36:13 server1 lxd.daemon[3386139]: Kernel supports pidfds
Oct 05 16:36:13 server1 lxd.daemon[3386139]: Kernel does not support swap accounting
Oct 05 16:36:13 server1 lxd.daemon[3386139]: api_extensions:
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - cgroups
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - sys_cpu_online
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_cpuinfo
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_diskstats
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_loadavg
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_meminfo
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_stat
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_swaps
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - proc_uptime
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - shared_pidns
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - cpuview_daemon
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - loadavg_daemon
Oct 05 16:36:13 server1 lxd.daemon[3386139]: - pidfds
Oct 05 16:36:13 server1 lxd.daemon[3386139]: Reloaded LXCFS
Oct 05 16:36:13 server1 lxd.daemon[3387820]: => Re-using existing LXCFS
Oct 05 16:36:13 server1 lxd.daemon[3387820]: ==> Setting snap cohort
Oct 05 16:36:13 server1 lxd.daemon[3387820]: => Starting LXD
Oct 05 16:36:13 server1 lxd.daemon[3387994]: t=2021-10-05T16:36:13+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
Oct 05 16:36:13 server1 lxd.daemon[3387994]: t=2021-10-05T16:36:13+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
Oct 05 16:36:13 server1 lxd.daemon[3387994]: t=2021-10-05T16:36:13+0000 lvl=warn msg="Dqlite: attempt 1: server 134.60.40.196:8443: no known leader"
Oct 05 16:36:13 server1 lxd.daemon[3387994]: t=2021-10-05T16:36:13+0000 lvl=eror msg="Failed to start the daemon: Failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrad>
Oct 05 16:36:13 server1 lxd.daemon[3387994]: Error: Failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade
Oct 05 16:36:14 server1 lxd.daemon[3387820]: => LXD failed to start
Oct 05 16:36:14 server1 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Oct 05 16:36:14 server1 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Oct 05 16:36:14 server1 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 9.
Oct 05 16:36:14 server1 systemd[1]: Stopped Service for snap application lxd.daemon.
Oct 05 16:36:14 server1 systemd[1]: Started Service for snap application lxd.daemon.
Oct 05 16:36:14 server1 lxd.daemon[3388042]: => Preparing the system (21497)
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Setting snap cohort
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Loading snap configuration
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Setting up mntns symlink (mnt:[4026532610])
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Setting up kmod wrapper
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Preparing /boot
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Preparing a clean copy of /run
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Preparing /run/bin
Oct 05 16:36:14 server1 lxd.daemon[3388042]: ==> Preparing a clean copy of /etc
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Preparing a clean copy of /usr/share/misc
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Setting up ceph configuration
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Setting up LVM configuration
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Rotating logs
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Setting up ZFS (0.8)
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Escaping the systemd cgroups
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ====> Detected cgroup V1
Oct 05 16:36:15 server1 lxd.daemon[3388042]: ==> Escaping the systemd process resource limits

Oh, looks like this server has meanwhile updated by itself. Now everything is back online. Many thanks!

Glad you’re back online. It’s a bit odd that some servers didn’t have the cohort key set. We attempt to set it every time LXD starts so you’d think it would have caught it by now…

I’ll add some extra logic to ensure that this is always set before an in-cluster refresh; hopefully that helps avoid this.

Maybe it is a coincidence, but I just noticed that the node that fell off the cluster has the database-standby role. The other three nodes have the database role. As far as I understand, this means that this node does not participate in database voting.

Should just be a coincidence. LXD itself has no interaction with snapd and those database roles dynamically move around the cluster. Most likely what happened is that since that one server was offline, the roles were reshuffled so the 3 that are online would act as database voters.