Snap refresh to 3.17 failed: "apply->status == RAFT_LEADERSHIPLOST"

fwaggle · September 9, 2019, 9:53pm

Hey folks,

We have a pile of problems similar to this thread: “Here we go again...Upgrade to 3.17 Causing - Error: Get http://unix.socket/1.0: EOF”

However, we’re not running clustering, and the upgrade still failed on a subset of machines. On some of them, reverting to 3.16 works (and then installing 3.17 again brings it up), but on others lxd segfaults every time on 3.17 and refuses to start on 3.16 because the schema is too new.

I moved database/global/ out, and copied database/global.bak in place of it, then re-ran lxd in debug mode and captured this output:

DBUG[09-09|21:48:25] Connecting to a local LXD over a Unix socket 
DBUG[09-09|21:48:25] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[09-09|21:48:25] LXD 3.17 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[09-09|21:48:25] Kernel uid/gid map: 
INFO[09-09|21:48:25]  - u 0 0 4294967295 
INFO[09-09|21:48:25]  - g 0 0 4294967295 
INFO[09-09|21:48:25] Configured LXD uid/gid map: 
INFO[09-09|21:48:25]  - u 0 1000000 1000000000 
INFO[09-09|21:48:25]  - g 0 1000000 1000000000 
INFO[09-09|21:48:25] Kernel features: 
INFO[09-09|21:48:25]  - netnsid-based network retrieval: no 
INFO[09-09|21:48:25]  - uevent injection: no 
INFO[09-09|21:48:25]  - seccomp listener: no 
INFO[09-09|21:48:25]  - unprivileged file capabilities: yes 
INFO[09-09|21:48:25]  - shiftfs support: no 
INFO[09-09|21:48:25] Initializing local database 
DBUG[09-09|21:48:25] Initializing database gateway 
DBUG[09-09|21:48:25] Start database node                      id=1 address=
DBUG[09-09|21:48:27] Connecting to a local LXD over a Unix socket 
DBUG[09-09|21:48:27] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
DBUG[09-09|21:48:27] Detected stale unix socket, deleting 
DBUG[09-09|21:48:27] Detected stale unix socket, deleting 
INFO[09-09|21:48:27] Starting /dev/lxd handler: 
INFO[09-09|21:48:27]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[09-09|21:48:27] REST API daemon: 
INFO[09-09|21:48:27]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[09-09|21:48:27]  - binding TCP socket                    socket=10.240.1.141:8443
INFO[09-09|21:48:27] Initializing global database 
DBUG[09-09|21:48:27] Dqlite: connected address=1 attempt=0 
INFO[09-09|21:48:27] Updating the LXD global schema. Backup made as "global.bak" 
DBUG[09-09|21:48:28] Updating global DB schema from 14 to 15 
DBUG[09-09|21:48:30] Updating global DB schema from 15 to 16 
apply failed with status 1006613808
lxd: src/replication.c:174: apply: Assertion `apply->status == RAFT_LEADERSHIPLOST' failed.
Aborted (core dumped)

Not sure if that’s helpful? I’ve only a tenuous grasp of raft, but it seems unusual that we’d assert that in a non-clustered setup.

stgraber · September 10, 2019, 11:47am

Can you make a tarball of the database directory and send it to stgraber at ubuntu dot com?

I’ll import it on one of our test systems, see if I can’t unblock that upgrade for you or at least send a good reproducer to @freeekanayaka

fwaggle · September 10, 2019, 8:28pm

Sure thing! We have a few machines that failed to upgrade… they all seemed to do different things. Some machines required a snap refresh back to 3.16, then an upgrade went through without issues. Some core-dump but don’t show the same message.

We have another that does this:

lxd.daemon[67862]: Error: failed to open cluster database: failed to ensure schema: failed to apply update 15: Found snapshot aaa-containername/temp_move_container_ltatk with no associated instance

It looks trivially fixable via the right dqlite commands, I just haven’t messed with it… it’s operating on 3.16 for now so that was good enough and we moved on, but obviously I wouldn’t want to leave these machines there (I think there’s probably 20 or 30 machines that failed to upgrade for one reason or another).

Do you want me to just bomb you with broken databases or is there a way to determine which bugs have already been fixed? Have any fixes been pushed out to 3.17 yet or are they still in the pipe?

Thanks for your time!

stgraber · September 10, 2019, 10:08pm

The one that complains about the snapshot is pretty easy to fix; we have given instructions to a number of users for that. It’s not something we want to fix automatically, as it involves deleting the entry from the database, and in some specific cases it may have been data you cared about (very unlikely though).

Anyway, for that one, you need to create a file at /var/snap/lxd/common/lxd/database/patch.global.sql containing:

DELETE FROM containers WHERE name='aaa-containername/temp_move_container_ltatk';

On startup LXD will then run that, causing the entry to go away and will then continue with the upgrade.
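
If several machines need the same treatment, generating the patch file from the name in the error message avoids quoting slips. The following is only a rough sketch: `write_orphan_patch` is a hypothetical helper, not part of LXD, and it runs against a scratch directory here rather than the real database directory.

```python
import tempfile
from pathlib import Path


def write_orphan_patch(database_dir, entry_name):
    """Write patch.global.sql containing one DELETE for an orphaned entry."""
    # Double any single quote so the name stays a valid SQL string literal.
    literal = entry_name.replace("'", "''")
    patch = Path(database_dir) / "patch.global.sql"
    patch.write_text(f"DELETE FROM containers WHERE name='{literal}';\n")
    return patch


# Dry run against a scratch directory. On a real host the target would be
# /var/snap/lxd/common/lxd/database, written before LXD starts
# (assumption: this likely needs root).
with tempfile.TemporaryDirectory() as scratch:
    patch = write_orphan_patch(scratch, "aaa-containername/temp_move_container_ltatk")
    demo_sql = patch.read_text()
    print(demo_sql, end="")
```

The entry name should be copied verbatim from the "Found snapshot ... with no associated instance" error, since the DELETE only matches an exact name.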

The error comes from the fact that for some reason you had a snapshot without a parent container. This isn’t supposed to be possible, but may have happened due to a deletion issue, the daemon crashing during a delete, or the like. LXD 3.17 switches the database schema to separate tables for snapshots, which in turn means we have a proper FOREIGN KEY between containers and their snapshots. The database can no longer represent a snapshot without a parent, hence the upgrade failure.
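
To illustrate why a foreign key rules out orphaned snapshots, here is a small sketch using Python's built-in `sqlite3` module. The two tables are a simplified, hypothetical stand-in and not the actual LXD schema; the point is only that an insert referencing a missing parent gets rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical, simplified tables (NOT the real LXD schema).
conn.executescript("""
CREATE TABLE instances (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE snapshots (
    id          INTEGER PRIMARY KEY,
    instance_id INTEGER NOT NULL REFERENCES instances(id) ON DELETE CASCADE,
    name        TEXT NOT NULL
);
""")

conn.execute("INSERT INTO instances (id, name) VALUES (1, 'c1')")
# A snapshot whose parent exists is accepted.
conn.execute("INSERT INTO snapshots (instance_id, name) VALUES (1, 'snap0')")

# A snapshot pointing at a parent that does not exist violates the constraint.
try:
    conn.execute("INSERT INTO snapshots (instance_id, name) VALUES (99, 'temp_move')")
    orphan_rejected = False
except sqlite3.IntegrityError as exc:
    orphan_rejected = True
    print("orphan rejected:", exc)
```

With the older single-table layout nothing enforces that relationship, which is presumably how a leftover row could survive until the migration tried to move it into the new tables.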

stgraber · September 10, 2019, 10:10pm

So we’re not interested in database tarballs for those which fail with such orphaned snapshot errors as those are expected to fail, which is why you’re getting that “nice” error message rather than a confusing low level error about foreign keys.

We definitely are interested in actual crashes or hangs though, the one that showed the apply failed followed by the assertion failure is something we’d definitely want to look at and fix as such assertions should never be hit.

fwaggle · September 10, 2019, 10:35pm

Awesome, thanks, I’ll go through the broken machines, and get these together for you… they’re typically about 200MB compressed in my experience so I didn’t want to just fill your inbox.

stgraber · September 10, 2019, 10:37pm

Ah yeah, we really need to get @freeekanayaka to look at what’s going on with the retention policy. Even a busy LXD server shouldn’t have more than 100MB or so of stuff on disk, and that should be far less once compressed, so 200MB compressed suggests that you have many more transaction files than we intend to keep around.

fwaggle · September 11, 2019, 2:40am

ACK, the one I’ve sent so far has been smaller than that, mercifully.

freeekanayaka · September 16, 2019, 3:15pm

I pushed a PR that should fix this issue in the vast majority of cases (including this one). However the proper fix will need a bit more work, and will be tracked here.

fwaggle · September 16, 2019, 10:23pm

Cheers! It doesn’t look like the merged patch has made it into the snap repo yet (still on 11964, and still crashes), however pulling the PRAGMA statement out of the patch, and dropping it into /var/snap/lxd/common/lxd/database/patch.global.sql allows LXD to move past the updates and start repeatedly on 3.17 on my test machine.

I had a couple of other failed updates that had different behaviour, so I’ll check them shortly and see if this fixes those as well, and if not I’ll report back. Is there a way to tell which commits got included in which build number on snap?