Raft segment corruption on two separate cluster nodes during snap auto-refresh

In the past two days I have seen snap auto-refreshes either hang indefinitely or error out on two separate nodes in a cluster. In both cases, the snap change log shows a problem at the ‘Start snap “lxd” (14623) services’ step. When I traced that through to the logs under /var/snap/lxd/common/lxd/logs on the first machine yesterday, it showed:

lvl=eror msg=“Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000911734-0000000000911987: found 253 entries (expected 254)”

Based on other threads, I aborted the snap auto-refresh and removed the last (corrupt) segment. After doing so, the snap refresh succeeded and the cluster was back to all nodes being ONLINE and fully operational.
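For reference, here is roughly the sequence I used, in case it helps anyone hitting the same error. If I understand the segment naming correctly, the file name encodes the first and last raft entry indices, so for 0000000000911734-0000000000911987 dqlite expects 911987 - 911734 + 1 = 254 entries but only found 253, i.e. the last entry looks truncated. This is a sketch rather than exact history: substitute the change ID from your own ‘snap changes’ output and the segment name from your own error message; the database directory below is where the segments live on my nodes:

$ sudo snap abort <change-id>          # ID of the stuck/failed auto-refresh change
$ sudo systemctl stop snap.lxd.daemon  # make sure the daemon really is stopped (it was already inactive here)
$ sudo mv /var/snap/lxd/common/lxd/database/global/0000000000911734-0000000000911987 /root/  # move (don't delete) the last, corrupt closed segment
$ sudo snap refresh lxd                # retry the refresh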

Today, I checked the cluster list and found that a second cluster node (a different machine from yesterday’s) was not sending heartbeats. Checking ‘snap services’ on that machine shows the lxd instance isn’t running:

$ snap services
Service       Startup  Current   Notes
lxd.activate  enabled  inactive  -
lxd.daemon    enabled  inactive  socket-activated
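(I believe those two services map to systemd units named snap.lxd.activate and snap.lxd.daemon, so the same state, plus any recent daemon output, can also be checked via systemd directly:)

$ systemctl status snap.lxd.daemon
$ sudo journalctl -u snap.lxd.daemon --since today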

Checking ‘snap changes’ on the machine shows two snap auto-refresh errors:

109 Error today at 11:53 PDT today at 11:54 PDT Auto-refresh snap “lxd”
110 Error today at 12:43 PDT today at 12:44 PDT Auto-refresh snap “lxd”

Checking the last change log again shows the issue is at the Start/Stop snap “lxd” services steps:

$ snap change 110
Status  Spawn               Ready               Summary
Done    today at 12:43 PDT  today at 12:44 PDT  Ensure prerequisites for “lxd” are available
Undone  today at 12:43 PDT  today at 12:44 PDT  Download snap “lxd” (14623) from channel “latest/stable”
Done    today at 12:43 PDT  today at 12:44 PDT  Fetch and check assertions for snap “lxd” (14623)
Undone  today at 12:43 PDT  today at 12:44 PDT  Mount snap “lxd” (14623)
Undone  today at 12:43 PDT  today at 12:44 PDT  Run pre-refresh hook of “lxd” snap if present
Error   today at 12:43 PDT  today at 12:44 PDT  Stop snap “lxd” services
Undone  today at 12:43 PDT  today at 12:44 PDT  Remove aliases for snap “lxd”
Undone  today at 12:43 PDT  today at 12:44 PDT  Make current revision for snap “lxd” unavailable
Undone  today at 12:43 PDT  today at 12:44 PDT  Copy snap “lxd” data
Undone  today at 12:43 PDT  today at 12:44 PDT  Setup snap “lxd” (14623) security profiles
Undone  today at 12:43 PDT  today at 12:44 PDT  Make snap “lxd” (14623) available to the system
Undone  today at 12:43 PDT  today at 12:44 PDT  Automatically connect eligible plugs and slots of snap “lxd”
Undone  today at 12:43 PDT  today at 12:44 PDT  Set automatic aliases for snap “lxd”
Undone  today at 12:43 PDT  today at 12:44 PDT  Setup snap “lxd” aliases
Undone  today at 12:43 PDT  today at 12:44 PDT  Run post-refresh hook of “lxd” snap if present
Error   today at 12:43 PDT  today at 12:44 PDT  Start snap “lxd” (14623) services
Hold    today at 12:43 PDT  today at 12:44 PDT  Remove data for snap “lxd” (14594)
Hold    today at 12:43 PDT  today at 12:44 PDT  Remove snap “lxd” (14594) from the system
Hold    today at 12:43 PDT  today at 12:44 PDT  Clean up “lxd” (14623) install
Hold    today at 12:43 PDT  today at 12:44 PDT  Run configure hook of “lxd” snap if present
Hold    today at 12:43 PDT  today at 12:44 PDT  Run health check of “lxd” snap
Done    today at 12:43 PDT  today at 12:44 PDT  Consider re-refresh of “lxd”

Following that through to the lxd log shows the same problem as on the first node yesterday:

lvl=eror msg=“Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000913597-0000000000914725: found 1128 entries (expected 1129)”
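(For anyone else chasing this: the line above comes from the daemon log the snap writes under /var/snap/lxd/common/lxd/logs; on my nodes a plain grep of lxd.log in that directory finds it, assuming that is the file name your snap uses too:)

$ sudo grep "Failed to start the daemon" /var/snap/lxd/common/lxd/logs/lxd.log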

So it seems that, somehow, the way the snap auto-refresh happens is prone to corrupting the raft DB segments, possibly due to an unclean shutdown?

Both machines have /var/snap on a two-disk RAID1 MD device. In both cases the MD RAID is healthy and the underlying disks don’t show any concerning symptoms in their SMART data.
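(For completeness, this is the kind of check that conclusion is based on; the md device and disk names are just examples from my setup, substitute your own:)

$ cat /proc/mdstat
$ sudo mdadm --detail /dev/md0
$ sudo smartctl -H /dev/sda
$ sudo smartctl -H /dev/sdb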

Let me know what other information I can gather to help isolate the issue here.

@freeekanayaka

Not sure what version you were upgrading from, but both 4.0.0 and 4.0.1 should have a few fixes in libraft that might help with this issue.

I have looked at the code and, for now, I can’t figure out what could lead to this bug.

Please make sure that all your cluster nodes are now running 4.0.1 and are healthy, and keep an eye on this forum, where we announce new releases. If you hit this again after we push a new release, just follow up here.

We also saw a few failures in our CI related to cluster upgrades. I’ll investigate those too, as it might be a similar issue (although I don’t think we saw the same error there).

Snap info for all three nodes:

$ snap list
Name    Version    Rev    Tracking       Publisher   Notes
core    16-2.44.1  8935   latest/stable  canonical✓  core
core18  20200311   1705   latest/stable  canonical✓  base
lxd     4.0.0      14611  latest/stable  canonical✓  -

$ snap list
Name    Version    Rev    Tracking       Publisher   Notes
core    16-2.44.1  8935   latest/stable  canonical✓  core
core18  20200311   1705   latest/stable  canonical✓  base
lxd     4.0.0      14663  latest/stable  canonical✓  -

$ snap list
Name    Version    Rev    Tracking       Publisher   Notes
core    16-2.44.1  8935   latest/stable  canonical✓  core
core18  20200311   1705   latest/stable  canonical✓  base
lxd     4.0.0      14663  latest/stable  canonical✓  -

The 4.0.0 rev that is different above is on the most recent node to experience the error. So I am guessing it happened between different revs of 4.0.0? Not sure — is there a snap command that will show details of the last version upgrade?
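The closest thing I know of is listing the snap’s past changes and then drilling into the relevant refresh change, though I’m not sure it shows which revision the refresh started from (110 here is just the change ID from my output above):

$ snap changes lxd
$ snap change 110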

I just removed the corrupt segment and did a manual ‘snap refresh lxd’ on the affected node. Now I am seeing the following on two of the three nodes:

$ lxc cluster list
Error: failed to begin transaction: not an error

I had to do a ‘snap restart lxd’ on the two nodes that were showing the following error:

Error: failed to begin transaction: not an error

All nodes are fully operational now. It seems odd that I had to restart LXD on two nodes, though.

One of the restarts was on the node where I removed the corrupted segment; the snap refresh there should have already implicitly restarted the lxd daemon, so why that error immediately after a fresh daemon start?

The second node hasn’t been touched since yesterday and has always shown as ONLINE in ‘lxc cluster list’ on the one node where the command works.

We have received another report of this recently, which suggests there might be a regression. I have a mitigation in mind that should help, and I will work on it next week.

The reason you had to restart is probably that the database connection somehow got into a bad state and could not recover.

In the other report that we received, there was a busted node too, similar to what happened to you, so there’s probably something going on here that we need to investigate.

Ok, let me know if you need extra information from my cluster. It is currently ONLINE and sane at 4.0.0 rev 14663.

I know you mentioned above that I should update all nodes to 4.0.1, but all of them show the following in ‘snap info lxd’:

channels:
  latest/stable: 4.0.0 2020-04-17 (14663) 62MB -

So I’m not sure how to get 4.0.1?

I’m actually not sure. @stgraber, does the minor version show up in snap info? Or do we simply still have to propagate it to the stable channel?

There is no LXD 4.0.1.

14663 is up to date with all the cherry-picks as of yesterday.