Failed to start dqlite server

Nick_Knutov · July 18, 2019, 4:53pm

Today LXD stopped working with the error: Failed to start the daemon: Failed to start dqlite server: run failed with 13

lxd --debug --group lxd
DBUG[07-18|19:44:13] Connecting to a local LXD over a Unix socket
DBUG[07-18|19:44:13] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[07-18|19:44:13] LXD 3.15 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[07-18|19:44:13] Kernel uid/gid map:
INFO[07-18|19:44:13]  - u 0 0 4294967295
INFO[07-18|19:44:13]  - g 0 0 4294967295
INFO[07-18|19:44:13] Configured LXD uid/gid map:
INFO[07-18|19:44:13]  - u 0 1000000 1000000000
INFO[07-18|19:44:13]  - g 0 1000000 1000000000
WARN[07-18|19:44:13] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[07-18|19:44:13] Kernel features:
INFO[07-18|19:44:13]  - netnsid-based network retrieval: no
INFO[07-18|19:44:13]  - uevent injection: no
INFO[07-18|19:44:13]  - seccomp listener: no
INFO[07-18|19:44:13]  - unprivileged file capabilities: yes
INFO[07-18|19:44:13]  - shiftfs support: no
INFO[07-18|19:44:13] Initializing local database
DBUG[07-18|19:44:13] Initializing database gateway
DBUG[07-18|19:44:13] Start database node                      id=1 address=
EROR[07-18|19:44:13] Failed to start the daemon: Failed to start dqlite server: run failed with 13
INFO[07-18|19:44:13] Starting shutdown sequence
DBUG[07-18|19:44:13] Not unmounting temporary filesystems (containers are still running)
Error: Failed to start dqlite server: run failed with 13

How to fix it?

LXD is installed from the snap candidate channel. Refreshing to the latest candidate or stable does not fix the problem.

Nick_Knutov · July 18, 2019, 5:22pm

I restored the database backup from /var/snap/lxd/common/lxd/database/global.bak (which is 2 days old, from Jul 16) and LXD started with it.

I’m lucky that it’s a development server, but what is the best thing to do in such cases?

Should I back up /var/snap/lxd/common/lxd/database/ daily and just switch databases? Is it safe to back up the plain files, or is there a way to dump and restore the database with some tool?

stgraber · July 18, 2019, 6:43pm

Can you send us the broken database directory so we can reproduce the issue and prepare a fix for it?

You can send it to me at stgraber at ubuntu dot com

Nick_Knutov · July 18, 2019, 7:17pm

Thanks, sent from knutov at gmail dot com

stgraber · July 19, 2019, 2:39am

Thanks, I’ve managed to reproduce the issue here. Just to confirm, you’re not actually blocked on this right now, right? Just trying to set priorities on our side as we’re dealing with a few other issues on 3.15.

stgraber · July 19, 2019, 2:42am

I’ve forwarded the tarball and instructions to our database guru (@freeekanayaka) so we can track this down and include a fix. So far this is the only report we’ve had of this error, so it doesn’t seem widespread but it’d be good to understand why the migration is failing.

hobiga · July 19, 2019, 3:13am

I’m now having this issue as well.

stgraber · July 19, 2019, 3:28am

Could you also send me a database tarball?

You may be able to use the same workaround as the reporter here. After making that tarball (both so we can debug it and as a backup), check how old your global.bak directory is. If it’s not too old (you haven’t created new containers since), you can move it back into place as global and start LXD with it. With a bit of luck, that old snapshot will be suitable for the 3.15 upgrade.
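The workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested procedure from the LXD team: the function name `restore_global_bak` and the `global.broken` and `.broken.tar.gz` names are hypothetical, and the database directory is assumed to be `/var/snap/lxd/common/lxd/database` on a snap install, with LXD stopped before anything is moved.

```shell
#!/bin/sh
# Sketch only. restore_global_bak DB_DIR
# DB_DIR is assumed to be the LXD database directory, e.g.
# /var/snap/lxd/common/lxd/database on a snap install.
# All names introduced here (function, global.broken, tarball name) are hypothetical.
restore_global_bak() {
    db_dir=${1%/}   # drop a trailing slash so the tarball lands next to DB_DIR

    # Refuse to continue without a backup directory to restore from.
    if [ ! -d "$db_dir/global.bak" ]; then
        echo "no global.bak in $db_dir" >&2
        return 1
    fi

    # Do not overwrite an earlier set-aside copy.
    if [ -e "$db_dir/global.broken" ]; then
        echo "$db_dir/global.broken already exists" >&2
        return 1
    fi

    # Keep a tarball of the current state, both for debugging and as a backup.
    tar -czf "$db_dir.broken.tar.gz" \
        -C "$(dirname "$db_dir")" "$(basename "$db_dir")" || return 1

    # Set the broken directory aside rather than deleting it.
    if [ -d "$db_dir/global" ]; then
        mv "$db_dir/global" "$db_dir/global.broken" || return 1
    fi

    # Move the older backup into place as the live database.
    mv "$db_dir/global.bak" "$db_dir/global"
}
```

It would be called as `restore_global_bak /var/snap/lxd/common/lxd/database` while the daemon is not running, and LXD started again afterwards. Checking the age of `global.bak` first (for example with `ls -l`) remains a manual step, since only you know whether containers were created after that date.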

Nick_Knutov · July 19, 2019, 3:35am

@stgraber Right, I’m not blocked on this now; the backup from 2 days ago works for me.

hobiga · July 19, 2019, 3:38am

I have sent the database tarball.

Unfortunately I tried copying the files from the database.bak dir to the database dir and I’m still getting the same error. They are dated today so I imagine they contain the same error.

stgraber · July 19, 2019, 3:45am

Hmm, indeed the backup directory already contains a structure matching the new dqlite 1.0 format, so you can’t easily revert to that. I’m afraid you’ll need to wait for @freeekanayaka to be around to get that issue sorted (he’s in Europe so should just be 2-3 hours).

You should have mentioned that you’re in a cluster setup though as that likely makes things quite a bit different from the original report.

stgraber · July 19, 2019, 3:45am

Are all 3 nodes running into the same issue?

hobiga · July 19, 2019, 3:46am

Sorry, I forgot to mention that I’m in a 3-node cluster.

Actually only one node is having this issue.

stgraber · July 19, 2019, 3:46am

Ok, then you most likely can blow away the database and have it replicate from the others.
I’ll make sure that this works properly on a test cluster here before you do it though.

hobiga · July 19, 2019, 3:48am

I can save you some effort and just try it. I have a backup of the folder already, so if things go south I’ll just restore from that.

EDIT: That did indeed resolve the issue.

stgraber · July 19, 2019, 3:51am

Good to hear, can you confirm that after doing that, the node did get a database/global directory again and it looks populated?

Just want to make sure that it indeed still acts like a database node.
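A quick way to run that check from a script might look like the sketch below. It only tests that `database/global` exists and holds at least one regular file; it does not prove the node is replicating correctly. The function name `global_is_populated` is hypothetical, and the path in the usage note assumes a snap install.

```shell
#!/bin/sh
# Sketch only. global_is_populated DB_DIR
# Succeeds if DB_DIR/global exists and contains at least one regular file.
# The function name is hypothetical; DB_DIR is assumed to be the LXD
# database directory (e.g. /var/snap/lxd/common/lxd/database).
global_is_populated() {
    db_dir=${1%/}

    # The directory must have been recreated by LXD.
    [ -d "$db_dir/global" ] || return 1

    # At least one regular file must be present inside it.
    [ -n "$(find "$db_dir/global" -type f | head -n 1)" ]
}
```

For example: `global_is_populated /var/snap/lxd/common/lxd/database && echo populated`.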

hobiga · July 19, 2019, 3:56am

The directory looks good.

root@lxdlab01:/var/snap/lxd/common/lxd/database/global# ls -altr
total 25924
-rw------- 1 root root 32 Jul 18 21:55 metadata2
-rw------- 1 root root 32 Jul 18 21:55 metadata1
-rw------- 1 root root 112 Jul 18 21:55 snapshot-305-2135640-3686780.meta
-rw------- 1 root root 1356504 Jul 18 21:55 snapshot-305-2135640-3686780
-rw------- 1 root root 8388608 Jul 18 21:55 open-2
-rw------- 1 root root 8388608 Jul 18 21:55 open-3
drwxr-x--- 2 root root 4096 Jul 18 21:55 .
drwx------ 3 root root 4096 Jul 18 22:00 ..
-rw------- 1 root root 8388608 Jul 18 22:00 open-1

stgraber · July 19, 2019, 3:57am

Good, so you should be good to go. I’ve still sent your database to @freeekanayaka after confirming I can reproduce the issue here, hopefully having more data will help him track down the issue.

hobiga · July 19, 2019, 4:01am

I am back and running on that node and learned something new about how the database replication works. Thanks for your help.

stgraber · July 19, 2019, 4:03am

Good to hear you’re back online.

@freeekanayaka we have one of our cluster nodes hitting this issue actually.
If you want to play with it, it’s snap-latest-candidate-02 on vm12. I’m sure we can use the same trick of blowing away the database and having it sync from one of the others, but it may give you some extra data to work from to fix this.