I have now seen in the past two days snap auto-refreshes either hang indefinitely and/or error on two separate nodes in a cluster. In both cases, the snap change log shows a problem at the ‘Start snap “lxd” (14623) services’ step. When I traced that through to the logs under /var/snap/lxd/common/lxd/logs on the first machine yesterday it showed:
lvl=eror msg=“Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000911734-0000000000911987: found 253 entries (expected 254)”
Based on reading from other threads I aborted the snap auto-refresh and removed the last (corrupt) segment. After doing so the snap refresh succeeded and the cluster was back to all nodes being ONLINE/fully operational.
Today, I checked the cluster list to find that a second cluster node (different machine from yesterday) was not doing heartbeats. Checking ‘snap services’ on the machine shows the lxd instance isn’t running:
$ snap services
Service Startup Current Notes
lxd.activate enabled inactive -
lxd.daemon enabled inactive socket-activated
Checking ‘snap changes’ on the machine shows two snap auto-refresh errors:
109 Error today at 11:53 PDT today at 11:54 PDT Auto-refresh snap “lxd”
110 Error today at 12:43 PDT today at 12:44 PDT Auto-refresh snap “lxd”
Checking the last change log again shows the issue is at the Start/Stop snap “lxd” services steps:
$ snap change 110
Status Spawn Ready Summary
Done today at 12:43 PDT today at 12:44 PDT Ensure prerequisites for “lxd” are available
Undone today at 12:43 PDT today at 12:44 PDT Download snap “lxd” (14623) from channel “latest/stable”
Done today at 12:43 PDT today at 12:44 PDT Fetch and check assertions for snap “lxd” (14623)
Undone today at 12:43 PDT today at 12:44 PDT Mount snap “lxd” (14623)
Undone today at 12:43 PDT today at 12:44 PDT Run pre-refresh hook of “lxd” snap if present
Error today at 12:43 PDT today at 12:44 PDT Stop snap “lxd” services
Undone today at 12:43 PDT today at 12:44 PDT Remove aliases for snap “lxd”
Undone today at 12:43 PDT today at 12:44 PDT Make current revision for snap “lxd” unavailable
Undone today at 12:43 PDT today at 12:44 PDT Copy snap “lxd” data
Undone today at 12:43 PDT today at 12:44 PDT Setup snap “lxd” (14623) security profiles
Undone today at 12:43 PDT today at 12:44 PDT Make snap “lxd” (14623) available to the system
Undone today at 12:43 PDT today at 12:44 PDT Automatically connect eligible plugs and slots of snap “lxd”
Undone today at 12:43 PDT today at 12:44 PDT Set automatic aliases for snap “lxd”
Undone today at 12:43 PDT today at 12:44 PDT Setup snap “lxd” aliases
Undone today at 12:43 PDT today at 12:44 PDT Run post-refresh hook of “lxd” snap if present
Error today at 12:43 PDT today at 12:44 PDT Start snap “lxd” (14623) services
Hold today at 12:43 PDT today at 12:44 PDT Remove data for snap “lxd” (14594)
Hold today at 12:43 PDT today at 12:44 PDT Remove snap “lxd” (14594) from the system
Hold today at 12:43 PDT today at 12:44 PDT Clean up “lxd” (14623) install
Hold today at 12:43 PDT today at 12:44 PDT Run configure hook of “lxd” snap if present
Hold today at 12:43 PDT today at 12:44 PDT Run health check of “lxd” snap
Done today at 12:43 PDT today at 12:44 PDT Consider re-refresh of “lxd”
Following that through to the lxd log shows the same problem as the first node did yesterday:
lvl=eror msg=“Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000913597-0000000000914725: found 1128 entries (expected 1129)”
So, it seems, somehow that the way the snap auto-refresh is happening is prone to corrupting the raft DB segments, possibly do to an unclean shutdown?
Both machines have /var/snap on a two disk raid1 MD device. In both cases, the MD raid is health and the underlying disks don’t show any concerning symptoms in SMART data.
Let me know what other information I can gather to help isolate the issue here.