Today LXD stopped working with the error: Failed to start the daemon: Failed to start dqlite server: run failed with 13
lxd --debug --group lxd
DBUG[07-18|19:44:13] Connecting to a local LXD over a Unix socket
DBUG[07-18|19:44:13] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[07-18|19:44:13] LXD 3.15 is starting in normal mode path=/var/snap/lxd/common/lxd
INFO[07-18|19:44:13] Kernel uid/gid map:
INFO[07-18|19:44:13] - u 0 0 4294967295
INFO[07-18|19:44:13] - g 0 0 4294967295
INFO[07-18|19:44:13] Configured LXD uid/gid map:
INFO[07-18|19:44:13] - u 0 1000000 1000000000
INFO[07-18|19:44:13] - g 0 1000000 1000000000
WARN[07-18|19:44:13] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[07-18|19:44:13] Kernel features:
INFO[07-18|19:44:13] - netnsid-based network retrieval: no
INFO[07-18|19:44:13] - uevent injection: no
INFO[07-18|19:44:13] - seccomp listener: no
INFO[07-18|19:44:13] - unprivileged file capabilities: yes
INFO[07-18|19:44:13] - shiftfs support: no
INFO[07-18|19:44:13] Initializing local database
DBUG[07-18|19:44:13] Initializing database gateway
DBUG[07-18|19:44:13] Start database node id=1 address=
EROR[07-18|19:44:13] Failed to start the daemon: Failed to start dqlite server: run failed with 13
INFO[07-18|19:44:13] Starting shutdown sequence
DBUG[07-18|19:44:13] Not unmounting temporary filesystems (containers are still running)
Error: Failed to start dqlite server: run failed with 13
How can I fix it?
LXD is installed from the snap candidate channel. Refreshing to the latest candidate or stable does not fix the problem.
I took the database backup from /var/snap/lxd/common/lxd/database/global.bak (which is 2 days old, from Jul 16) and LXD started with it.
I’m lucky that it’s a developers’ server, but what is the best thing to do in such cases?
Should I back up /var/snap/lxd/common/lxd/database/ daily and just switch databases? Is it safe to back up the plain files, or is there a way to dump and restore the database with some tools?
Thanks, I’ve managed to reproduce the issue here. Just to confirm, you’re not actually blocked on this right now, right? Just trying to set priorities on our side as we’re dealing with a few other issues on 3.15.
I’ve forwarded the tarball and instructions to our database guru (@freeekanayaka) so we can track this down and include a fix. So far this is the only report we’ve had of this error, so it doesn’t seem widespread but it’d be good to understand why the migration is failing.
You may be able to use the same workaround as the reporter here. After making that tarball (both so we can debug it and as a backup), look at how old your global.bak directory is. If it’s not too old (you haven’t created new containers since), you can move it back into place as global and start LXD using it. With a bit of luck that old snapshot will be suitable for the 3.15 upgrade.
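For anyone following along, the workaround above could be sketched roughly like this. It is not an official procedure: the function names backup_db and restore_global_bak are made up for illustration, the paths are the ones mentioned in this thread, and keeping the failing copy as global.broken (and copying rather than moving global.bak) is an extra precaution I'm assuming is harmless.

```shell
# Sketch only, not an official LXD tool. Function names are hypothetical.

# Archive the whole database directory into a timestamped tarball.
# $1 = LXD directory (e.g. /var/snap/lxd/common/lxd)
# $2 = directory to write the tarball into
backup_db() {
    lxd_dir=$1
    dest=$2
    archive="$dest/lxd-database-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" -C "$lxd_dir" database || return 1
    # Print the archive path so the caller can keep track of it.
    echo "$archive"
}

# Put the global.bak snapshot back in place as global.
# The current (failing) global directory is kept as global.broken.
# $1 = LXD directory
restore_global_bak() {
    db_dir=$1/database
    if [ ! -d "$db_dir/global.bak" ]; then
        echo "no global.bak found in $db_dir" >&2
        return 1
    fi
    if [ -e "$db_dir/global.broken" ]; then
        echo "$db_dir/global.broken already exists, refusing to overwrite" >&2
        return 1
    fi
    if [ -d "$db_dir/global" ]; then
        mv "$db_dir/global" "$db_dir/global.broken" || return 1
    fi
    # Copy rather than move, so global.bak itself stays untouched.
    cp -a "$db_dir/global.bak" "$db_dir/global"
}
```

With the snap install from this thread, you would presumably stop the daemon first (for example with snap stop lxd), run backup_db /var/snap/lxd/common/lxd /root and then restore_global_bak /var/snap/lxd/common/lxd as root, and start LXD again with snap start lxd. Stopping the daemon before touching the files is my assumption, not something confirmed above.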
Unfortunately I tried copying the files from the database.bak dir to the database dir and I’m still getting the same error. They are dated today so I imagine they contain the same error.
Hmm, indeed the backup directory already contains a structure matching the new dqlite 1.0 format, so you can’t easily revert to that. I’m afraid you’ll need to wait for @freeekanayaka to be around to get that issue sorted (he’s in Europe so should just be 2-3 hours).
You should have mentioned that you’re in a cluster setup though as that likely makes things quite a bit different from the original report.
Ok, then you most likely can blow away the database and have it replicate from the others.
I’ll make sure that this works properly on a test cluster here before you do it though.
Good, so you should be good to go. I’ve still sent your database to @freeekanayaka after confirming I can reproduce the issue here, hopefully having more data will help him track down the issue.
@freeekanayaka we have one of our cluster nodes hitting this issue actually.
If you want to play with it, it’s snap-latest-candidate-02 on vm12. I’m sure we can use the same trick of blowing away the database and having it sync from one of the others, but it may give you some extra data to work from to fix this.