Failed to start dqlite server

@Nick_Knutov you can perform daily backups by simply copying the files, if you wish, but you should shut down the LXD daemon during the copy. Alternatively, you can perform a live dump, as described in the docs. In general you should not need backups, since LXD takes a backup itself when it upgrades (which is the point where issues are most likely to occur).
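
If you go the plain-copy route, something along these lines should work (a sketch assuming the snap package and the systemd unit used elsewhere in this thread; paths differ for other install methods, and the date-stamped destination is just an example):

# stop the daemon so the database files are not being written during the copy
sudo systemctl stop snap.lxd.daemon
# copy the whole LXD state directory
sudo cp -a /var/snap/lxd/common/lxd /var/backups/lxd-$(date +%F)
sudo systemctl start snap.lxd.daemon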

@hobiga fwiw I could fix your database by running:

rm <your-lxd-dir>/database/global/2133593-2133840

where I guess <your-lxd-dir> is /var/snap/lxd/common/lxd.

Anyway, good to know that you sorted that out.

@Nick_Knutov, I could fix your database by running:

rm database/global/4066-4245

as in the case of @hobiga.

It’s not clear what situation leads to this, but if we see more cases like this occurring even after a successful upgrade to 3.15 we’ll need to at least put some code in place that attempts automatic recovery.

I got the same issue on LXD 3.15 (snap stable channel, updated yesterday but LXD would not start today after a system boot).

Here is the error message:
Failed to start the daemon: Failed to start dqlite server: run failed with 13

I got a backup of /var/snap/lxd/common/lxd/database for posterity.

It contains both global/ and global.bak/; global.bak has the timestamp of the boot-up.

I removed the subdirectory named with the two numbers,

rm /var/snap/lxd/common/lxd/database/global/40577-40640/

and then I ran

systemctl start snap.lxd.daemon

And LXD started normally.

Thanks for the report, simos. Do you still have your global.bak around? If so, please send it to me by email. I might try to reproduce it.

Thanks. I sent an email with the global.bak directory to your canonical.com email address.

LXD 3.15: same issue, same error message, and same solution as simos.

Sidenote: the number-thingy was a file on my system, not a directory like it was for simos:

rm /var/snap/lxd/common/lxd/database/global/7133-7217

@freeekanayaka, do you want a copy of database/? If so, I’ll send it to the canonical email.

Thanks all!

Thanks for reporting, @pianoJ. I don’t think the copy of database/ would be useful, essentially because it won’t tell me how the system got into this state. We added some code to automatically recover from this situation, so at least people won’t even notice it anymore and LXD will start normally.

Can someone spell out for a dummy how that is done in practice?

So would that be the suggestion for any situation like this? (In my case it’s /database/global/32824-32868.)

I’d go out on a limb and suggest taking a (verified) backup of the whole database directory, then deleting the global directory and renaming global.bak to global.

I don’t know exactly what the role of these files with UID-like names is, but I’d say that you can attempt it (back up, then delete the files that look like 32824-32868; see the sketch below). A quick test on my test config has shown that it’s not deadly for a working config (and your config doesn’t work anyway).
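
Roughly, something like this, untested beyond that quick check (paths assume the snap; the backup destination is just an example, and the segment name is whatever appears in your error):

# stop LXD, keep a copy of the database, remove the offending segment, restart
sudo systemctl stop snap.lxd.daemon
sudo cp -a /var/snap/lxd/common/lxd/database /root/lxd-database-backup
sudo rm /var/snap/lxd/common/lxd/database/global/32824-32868
sudo systemctl start snap.lxd.daemon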

This plus snap refresh lxd --stable actually worked. Thanks a mil, everybody!

Hello @stgraber and @freeekanayaka,

I appear to have hit this yesterday, which seems odd if you have added code to automatically recover.

$ lxc list
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

$ sudo lxd --debug --group lxd
[sudo] password for aaron: 
DBUG[08-12|22:21:00] Connecting to a local LXD over a Unix socket 
DBUG[08-12|22:21:00] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[08-12|22:21:00] LXD 3.15 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[08-12|22:21:00] Kernel uid/gid map: 
INFO[08-12|22:21:00]  - u 0 0 4294967295 
INFO[08-12|22:21:00]  - g 0 0 4294967295 
INFO[08-12|22:21:00] Configured LXD uid/gid map: 
INFO[08-12|22:21:00]  - u 0 1000000 1000000000 
INFO[08-12|22:21:00]  - g 0 1000000 1000000000 
WARN[08-12|22:21:00] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-12|22:21:00] Kernel features: 
INFO[08-12|22:21:00]  - netnsid-based network retrieval: yes 
INFO[08-12|22:21:00]  - uevent injection: yes 
INFO[08-12|22:21:00]  - seccomp listener: yes 
INFO[08-12|22:21:00]  - unprivileged file capabilities: yes 
INFO[08-12|22:21:00]  - shiftfs support: yes 
INFO[08-12|22:21:00] Initializing local database 
DBUG[08-12|22:21:00] Initializing database gateway 
DBUG[08-12|22:21:00] Start database node                      id=1 address=
EROR[08-12|22:21:00] Failed to start the daemon: Failed to start dqlite server: run failed with 13 
INFO[08-12|22:21:00] Starting shutdown sequence 
DBUG[08-12|22:21:00] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: run failed with 13

I can follow the steps above to try to recover, but thought I would check whether it would be helpful for me to send anything over before I do.

$ lxd --version
3.15

There are several scenarios in which Failed to start dqlite server: run failed with 13 might be returned. One of them now has logic in place to just emit a warning instead of bailing out. Either the snap version you’re running does not have that fix (I guess that’s unlikely), or you are hitting a different problem. Please try to upgrade to 3.16 first, which has further additional error-handling logic. If it’s what I think, then upgrading to 3.16 won’t fix it, but it’s worth a try (in addition, 3.16 will also output some more debugging information). If the problem persists, please either paste here the output of ls -l /var/snap/lxd/common/lxd/database/global, or send me an email with a tarball of /var/snap/lxd/common/lxd/database so I can see what’s wrong.
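
For the tarball, something like this should work (path as above, assuming the snap):

# bundle the database directory for sending by email
sudo tar -czf lxd-database.tar.gz -C /var/snap/lxd/common/lxd database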

Thanks Free,

Done. Didn’t fix it.

$ snap refresh lxd --stable
lxd 3.16 from Canonical✓ refreshed

$ lxc list
Error: Get http://unix.socket/1.0: read unix @->/var/snap/lxd/common/lxd/unix.socket: read: connection reset by peer

$ sudo lxd --debug --group lxd
DBUG[08-13|22:16:19] Connecting to a local LXD over a Unix socket 
DBUG[08-13|22:16:19] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[08-13|22:16:19] LXD 3.16 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[08-13|22:16:19] Kernel uid/gid map: 
INFO[08-13|22:16:19]  - u 0 0 4294967295 
INFO[08-13|22:16:19]  - g 0 0 4294967295 
INFO[08-13|22:16:19] Configured LXD uid/gid map: 
INFO[08-13|22:16:19]  - u 0 1000000 1000000000 
INFO[08-13|22:16:19]  - g 0 1000000 1000000000 
WARN[08-13|22:16:19] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-13|22:16:19] Kernel features: 
INFO[08-13|22:16:19]  - netnsid-based network retrieval: yes 
INFO[08-13|22:16:19]  - uevent injection: yes 
INFO[08-13|22:16:19]  - seccomp listener: yes 
INFO[08-13|22:16:19]  - unprivileged file capabilities: yes 
INFO[08-13|22:16:19]  - shiftfs support: yes 
INFO[08-13|22:16:19] Initializing local database 
DBUG[08-13|22:16:19] Initializing database gateway 
DBUG[08-13|22:16:19] Start database node                      id=1 address=
00:07:28.645 [DEBUG]: data dir: /var/snap/lxd/common/lxd/database/global
00:07:28.645 [DEBUG]: metadata1: version 57, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata2: version 58, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata: version 60, term 7, voted for 1
00:07:28.645 [DEBUG]: I/O: direct 1, block 4096
00:07:28.645 [INFO ]: starting
00:07:28.645 [DEBUG]: ignore .
00:07:28.645 [DEBUG]: ignore ..
00:07:28.645 [DEBUG]: segment 2666-2843
00:07:28.645 [DEBUG]: ignore db.bin
00:07:28.645 [DEBUG]: ignore metadata1
00:07:28.645 [DEBUG]: ignore metadata2
00:07:28.645 [DEBUG]: ignore snapshot-1-1793-1
00:07:28.645 [DEBUG]: snapshot snapshot-1-1793-1.meta
00:07:28.645 [DEBUG]: most recent snapshot at 1793
00:07:28.645 [DEBUG]: most recent closed segment is 2666-2843
00:07:28.645 [ERROR]: found closed segment past last snapshot: 2666-2843
EROR[08-13|22:16:19] Failed to start the daemon: Failed to start dqlite server: run failed with 12 
INFO[08-13|22:16:19] Starting shutdown sequence 
DBUG[08-13|22:16:19] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: run failed with 12
$ sudo ls -l /var/snap/lxd/common/lxd/database/global
total 2872
-rw------- 1 root root 2265864 Aug  7 15:28 2666-2843
-rw------- 1 root root  327680 Aug 13 22:15 db.bin
-rw------- 1 root root      32 Aug 13 22:16 metadata1
-rw------- 1 root root      32 Aug 13 22:16 metadata2
-rw------- 1 root root  327720 Jul 25 23:19 snapshot-1-1793-1
-rw------- 1 root root      52 Jul 25 23:19 snapshot-1-1793-1.meta

I will also email you the database.

Many thanks for your help.

@aaron unfortunately the logs indicate that, like the other users, you have been hit by the same problem.

There’s not much that can be done; your best bet is to run:

sudo rm /var/snap/lxd/common/lxd/database/global/2666-2843

and restart LXD. That means that any database change committed between the 25th of July and now will be lost, but that’s the best we can do to limit the data loss. From this point on you should be good, since you’ll be running 3.16, which has improvements in that area.
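
To restart, the systemd unit used earlier in this thread should do it:

sudo systemctl start snap.lxd.daemon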

Thanks Free. Appreciate your work.

Unfortunately, much of what I wanted happened after that date, and after running that rm command, lxc list gave me no results. The containers are all still there in /var/snap/lxd/common/lxd/containers, but doing an lxc exec into them didn’t work.

Interestingly, doing an lxc launch with the correct image and name seems to recreate the container, and my files are there when I exec into it. I should be able to remember the details well enough to do that for nearly all of them, but is there a more recommended way to recover from something like this?

I wouldn’t recommend the launch approach, as there is a good chance it may overwrite your data or at least mess with it.

The best way to recover is to use lxd import NAME, which exists specifically for cases where you have the container data but no database record. It will read the backup.yaml file that’s part of the on-disk storage of every container and re-create the majority of the database records needed for the container.

That’s covered in our backup documentation here: https://lxd.readthedocs.io/en/latest/backup/#disaster-recovery
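
For example, something like this (a sketch; c1 is a placeholder for one of your container names):

# re-create the database records from the container's backup.yaml, then start it
lxd import c1
lxc start c1

Repeat the import for each container present under the containers/ directory.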

Thanks @stgraber. That sounds great.

I created a new topic here, Mounting container storage volume, with a request for help on that process (to avoid distracting this thread further).

Many thanks for all of your help getting me back up and running properly with that import tip!

If this has a good chance of overwriting or messing with data, would it be worth adding a confirmation prompt? From memory, the creation process just looked like a new container until I started poking around and found my old files.

I’m not sure why launch worked at all; I would have expected it to complain when creating the container storage volume that it already existed. We’re currently doing a big rework of our storage logic, partly to improve error handling, so I would expect what you did to fail in the near future as we fix some of those codepaths.
