Failed to start dqlite server

Thanks for the report, simos. Do you still have your global.bak around? If so, please send it to me by mail. I might try to reproduce it.

Thanks. I sent an email with the global.bak directory to your canonical.com email address.

LXD 3.15, Same issue, same error msg, and same solution as simos.

Sidenote: The number-thingy was a file on my system, not a dir like it was for simos:

rm /var/snap/lxd/common/lxd/database/global/7133-7217

@freeekanayaka, do you want a copy of database/? If so, I’ll send it to the canonical email.

Thanks all!

Thanks for reporting @pianoJ. I don’t think the copy of database/ would be useful, essentially because it won’t tell me how the system got to this state. We added some code to automatically recover from this situation, so at least people won’t even notice this anymore and lxd will start normally.

Can someone explain, for a dummy, how that is done in practice?

So would that be the suggestion for any situation like this? (In my case it’s /database/global/32824-32868.)

I’d go out on a limb and suggest backing up the whole database directory (and verifying the backup), then deleting the global dir and renaming global.bak to global.

I don’t know exactly what the role of these files is, whose names look like a UID, but I’d say that you can attempt it (back up, then delete the files that look like 32824-32868). A quick test on my test config has shown that it’s not deadly for a working config (and your config doesn’t work anyway).
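The backup-then-delete approach suggested above can be sketched as a small shell function. This is only an illustration, not an official recovery procedure: the function name backup_and_remove_segment and the backup location are made up for the example, and it assumes LXD is not running while the files are copied and removed.

```shell
#!/bin/sh
# Sketch only: back up a database directory, verify the copy, then remove
# one named segment. The function name and arguments are illustrative.

backup_and_remove_segment() {
    db_dir=$1       # e.g. /var/snap/lxd/common/lxd/database
    segment=$2      # a name inside "$db_dir/global", e.g. 7133-7217
    backup_dir=$3   # where the copy should go; must not exist yet

    target="$db_dir/global/$segment"
    [ -e "$target" ] || { echo "no such segment: $target" >&2; return 1; }
    [ ! -e "$backup_dir" ] || { echo "backup dir already exists: $backup_dir" >&2; return 1; }

    # Copy everything, preserving ownership, modes and timestamps.
    cp -a "$db_dir" "$backup_dir" || return 1

    # Verify the copy before touching anything in the original.
    diff -r "$db_dir" "$backup_dir" >/dev/null || {
        echo "backup differs from source, aborting" >&2
        return 1
    }

    # The entry was a directory for simos and a plain file for pianoJ;
    # rm -r handles both.
    rm -r -- "$target"
}
```

Run as root, a call such as backup_and_remove_segment /var/snap/lxd/common/lxd/database 7133-7217 /root/lxd-database-backup would correspond to the rm shown earlier in the thread, with a verified copy taken first.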

this + snap refresh lxd --stable worked actually. Thanks a mil everybody

Hello @stgraber and @freeekanayaka,

I appear to have hit this yesterday, which seems odd if you have added code to automatically recover.

$ lxc list
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

$ sudo lxd --debug --group lxd
[sudo] password for aaron: 
DBUG[08-12|22:21:00] Connecting to a local LXD over a Unix socket 
DBUG[08-12|22:21:00] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[08-12|22:21:00] LXD 3.15 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[08-12|22:21:00] Kernel uid/gid map: 
INFO[08-12|22:21:00]  - u 0 0 4294967295 
INFO[08-12|22:21:00]  - g 0 0 4294967295 
INFO[08-12|22:21:00] Configured LXD uid/gid map: 
INFO[08-12|22:21:00]  - u 0 1000000 1000000000 
INFO[08-12|22:21:00]  - g 0 1000000 1000000000 
WARN[08-12|22:21:00] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-12|22:21:00] Kernel features: 
INFO[08-12|22:21:00]  - netnsid-based network retrieval: yes 
INFO[08-12|22:21:00]  - uevent injection: yes 
INFO[08-12|22:21:00]  - seccomp listener: yes 
INFO[08-12|22:21:00]  - unprivileged file capabilities: yes 
INFO[08-12|22:21:00]  - shiftfs support: yes 
INFO[08-12|22:21:00] Initializing local database 
DBUG[08-12|22:21:00] Initializing database gateway 
DBUG[08-12|22:21:00] Start database node                      id=1 address=
EROR[08-12|22:21:00] Failed to start the daemon: Failed to start dqlite server: run failed with 13 
INFO[08-12|22:21:00] Starting shutdown sequence 
DBUG[08-12|22:21:00] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: run failed with 13

I can follow the steps above to try to recover, but thought I would check whether it would be helpful for me to send anything over before I do.

$ lxd --version
3.15

There are several scenarios in which Failed to start dqlite server: run failed with 13 might be returned. One of them now has logic in place to just emit a warning instead of bailing out. Either the snap version you’re running does not have that fix (I guess that’s unlikely), or you are hitting a different problem. Please try upgrading to 3.16 first, which has additional error-handling logic. If it’s what I think, then upgrading to 3.16 won’t fix it, but it’s worth a try (3.16 will also output some more debugging information). If the problem persists, please either paste here the output of ls -l /var/snap/lxd/common/lxd/database/global, or send me an email with a tarball of /var/snap/lxd/common/lxd/database so I can see what’s wrong.
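For anyone unsure how to gather what is being asked for here, the following is a minimal sketch. The function name collect_db_diagnostics and the output file name are hypothetical; the two paths are the ones given in the post above.

```shell
#!/bin/sh
# Sketch only: print the directory listing and create a tarball of the
# database directory. Function name and output path are illustrative.

collect_db_diagnostics() {
    lxd_dir=$1    # e.g. /var/snap/lxd/common/lxd
    out=$2        # tarball to create, e.g. /root/lxd-database.tar.gz

    # The listing requested in the post above.
    ls -l "$lxd_dir/database/global" || return 1

    # Archive the whole database directory, with paths relative to $lxd_dir.
    tar -czf "$out" -C "$lxd_dir" database
}
```

The files are owned by root, so this would need to run as root, for example collect_db_diagnostics /var/snap/lxd/common/lxd /root/lxd-database.tar.gz.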

Thanks Free,

Done. Didn’t fix it.

$ snap refresh lxd --stable
lxd 3.16 from Canonical✓ refreshed

$ lxc list
Error: Get http://unix.socket/1.0: read unix @->/var/snap/lxd/common/lxd/unix.socket: read: connection reset by peer

$ sudo lxd --debug --group lxd
DBUG[08-13|22:16:19] Connecting to a local LXD over a Unix socket 
DBUG[08-13|22:16:19] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[08-13|22:16:19] LXD 3.16 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[08-13|22:16:19] Kernel uid/gid map: 
INFO[08-13|22:16:19]  - u 0 0 4294967295 
INFO[08-13|22:16:19]  - g 0 0 4294967295 
INFO[08-13|22:16:19] Configured LXD uid/gid map: 
INFO[08-13|22:16:19]  - u 0 1000000 1000000000 
INFO[08-13|22:16:19]  - g 0 1000000 1000000000 
WARN[08-13|22:16:19] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-13|22:16:19] Kernel features: 
INFO[08-13|22:16:19]  - netnsid-based network retrieval: yes 
INFO[08-13|22:16:19]  - uevent injection: yes 
INFO[08-13|22:16:19]  - seccomp listener: yes 
INFO[08-13|22:16:19]  - unprivileged file capabilities: yes 
INFO[08-13|22:16:19]  - shiftfs support: yes 
INFO[08-13|22:16:19] Initializing local database 
DBUG[08-13|22:16:19] Initializing database gateway 
DBUG[08-13|22:16:19] Start database node                      id=1 address=
00:07:28.645 [DEBUG]: data dir: /var/snap/lxd/common/lxd/database/global
00:07:28.645 [DEBUG]: metadata1: version 57, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata2: version 58, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata: version 60, term 7, voted for 1
00:07:28.645 [DEBUG]: I/O: direct 1, block 4096
00:07:28.645 [INFO ]: starting
00:07:28.645 [DEBUG]: ignore .
00:07:28.645 [DEBUG]: ignore ..
00:07:28.645 [DEBUG]: segment 2666-2843
00:07:28.645 [DEBUG]: ignore db.bin
00:07:28.645 [DEBUG]: ignore metadata1
00:07:28.645 [DEBUG]: ignore metadata2
00:07:28.645 [DEBUG]: ignore snapshot-1-1793-1
00:07:28.645 [DEBUG]: snapshot snapshot-1-1793-1.meta
00:07:28.645 [DEBUG]: most recent snapshot at 1793
00:07:28.645 [DEBUG]: most recent closed segment is 2666-2843
00:07:28.645 [ERROR]: found closed segment past last snapshot: 2666-2843
EROR[08-13|22:16:19] Failed to start the daemon: Failed to start dqlite server: run failed with 12 
INFO[08-13|22:16:19] Starting shutdown sequence 
DBUG[08-13|22:16:19] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: run failed with 12
$ sudo ls -l /var/snap/lxd/common/lxd/database/global
total 2872
-rw------- 1 root root 2265864 Aug  7 15:28 2666-2843
-rw------- 1 root root  327680 Aug 13 22:15 db.bin
-rw------- 1 root root      32 Aug 13 22:16 metadata1
-rw------- 1 root root      32 Aug 13 22:16 metadata2
-rw------- 1 root root  327720 Jul 25 23:19 snapshot-1-1793-1
-rw------- 1 root root      52 Jul 25 23:19 snapshot-1-1793-1.meta

I will also email you the database.

Many thanks for your help.

@aaron unfortunately the logs indicate that, like the other users, you have been hit by:

There’s not much that can be done; your best bet is to run:

sudo rm /var/snap/lxd/common/lxd/database/global/2666-2843

and restart LXD. That means that any database change committed between the 25th of July and now will be lost, but that’s the most we can do to recover from the data loss. From this point on you should be good, since you’ll be running 3.16, which has improvements in that area.
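To see at a glance how the segment files relate to the latest snapshot before removing anything, a small read-only report can help. This is a sketch that only reads file names; it does not reproduce dqlite’s own loading logic. It assumes, based on the debug output in this thread, that segments are named FIRST-LAST and snapshot metadata files snapshot-TERM-INDEX-TIMESTAMP.meta. The function name segment_report and the status labels are made up for the example.

```shell
#!/bin/sh
# Sketch only: compare segment file names against the newest snapshot index.
# Naming assumptions are inferred from the debug logs in this thread.

segment_report() {
    dir=$1   # e.g. /var/snap/lxd/common/lxd/database/global

    # Highest INDEX among snapshot-TERM-INDEX-TIMESTAMP.meta files (0 if none).
    snap=$(ls "$dir" | awk -F- '
        /^snapshot-[0-9]+-[0-9]+-[0-9]+\.meta$/ { if ($3 + 0 > max) max = $3 + 0 }
        END { print max + 0 }')
    echo "snapshot $snap"

    # Segments sorted numerically by first index, then by last index.
    ls "$dir" | grep -E '^[0-9]+-[0-9]+$' | sort -t- -k1,1n -k2,2n |
    awk -F- -v snap="$snap" '
        {
            first = $1 + 0; last = $2 + 0
            # For the first segment, the snapshot index is the baseline.
            if (NR == 1) prev = (first <= snap + 1) ? first - 1 : snap
            if (first > prev + 1)   status = "gap"
            else if (first <= prev) status = "overlap"
            else                    status = "ok"
            print "segment " $0 " " status
            if (last > prev) prev = last
        }'
}
```

Against the directory listing posted above (snapshot-1-1793-1.meta and the single segment 2666-2843), segment_report /var/snap/lxd/common/lxd/database/global would print "snapshot 1793" followed by "segment 2666-2843 gap", which lines up with the "found closed segment past last snapshot" error in the log.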

Thanks Free. Appreciate your work.

Unfortunately, much of what I wanted happened after that date and after doing that rm command a lxc list gave me no results. The containers are all still there in /var/snap/lxd/common/lxd/containers, but doing a lxc exec into them didn’t work.

Interestingly, doing a lxc launch with the correct image and name seems to recreate the container and my files are there when I exec into it. I should be able to remember the details enough to do that for nearly all of them, but is there a more recommended way to recover from something like this?

I wouldn’t recommend the launch approach, as there’s a good chance it will overwrite your data or at least mess with it.

The best way to recover is to use lxd import NAME which is used specifically for cases where you have the container data but no database record. That will go and read the backup.yaml file that’s part of the on-disk storage of all containers and re-create the majority of database records needed for the container.

That’s covered in our backup documentation here: https://lxd.readthedocs.io/en/latest/backup/#disaster-recovery
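With several containers to re-import, it may help to first list which on-disk containers LXD no longer knows about. The sketch below only prints the lxd import commands and runs nothing itself. The function name print_import_commands and the file of known names are hypothetical, and it assumes each entry under the containers directory is named after its container.

```shell
#!/bin/sh
# Sketch only: print an "lxd import NAME" line for every entry in the
# containers directory that is missing from a list of known names.
# Nothing is imported; the output is meant to be reviewed by hand.

print_import_commands() {
    containers_dir=$1   # e.g. /var/snap/lxd/common/lxd/containers
    known_file=$2       # one container name per line, as LXD currently lists them

    for path in "$containers_dir"/*; do
        # Skip the unexpanded glob when the directory is empty.
        [ -e "$path" ] || [ -L "$path" ] || continue
        name=${path##*/}
        # -x: whole-line match, -F: literal string, -q: quiet.
        if ! grep -qxF -- "$name" "$known_file"; then
            echo "lxd import $name"
        fi
    done
}
```

The list of known names could come from something like lxc list --format csv -c n > known.txt, assuming your lxc version supports those options. After that, print_import_commands /var/snap/lxd/common/lxd/containers known.txt shows the commands to consider running.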

Thanks @stgraber. That sounds great.

I created a new topic, “Mounting container storage volume”, with a request for help on that process (to avoid distracting this thread further).

Many thanks for all of your help getting me back up and running properly with that import tip!

If launch has a good chance of overwriting or messing with data, would it be worth adding a confirmation prompt? From memory, the creation process just looked like a new container until I started poking around and found my old files.

I’m not sure why launch worked at all, I would have expected it to complain when creating the container storage volume that it already existed. We’re currently doing a big rework of our storage logic, partly to improve error handling, so I would expect what you did to fail in the near future as we fix some of those codepaths.

@stgraber I just encountered this issue on LXD 3.18 on Ubuntu 18.04 after what I think was an unclean shutdown of the server.

root@ubuntu-bionic:/var/snap/lxd/common/lxd/logs# /snap/bin/lxd --debug --group lxd
DBUG[10-17|19:15:05] Connecting to a local LXD over a Unix socket 
DBUG[10-17|19:15:05] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[10-17|19:15:05] LXD 3.18 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[10-17|19:15:05] Kernel uid/gid map: 
INFO[10-17|19:15:05]  - u 0 0 4294967295 
INFO[10-17|19:15:05]  - g 0 0 4294967295 
INFO[10-17|19:15:05] Configured LXD uid/gid map: 
INFO[10-17|19:15:05]  - u 0 1000000 1000000000 
INFO[10-17|19:15:05]  - g 0 1000000 1000000000 
WARN[10-17|19:15:05] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[10-17|19:15:05] Kernel features: 
INFO[10-17|19:15:05]  - netnsid-based network retrieval: no 
INFO[10-17|19:15:05]  - uevent injection: no 
INFO[10-17|19:15:05]  - seccomp listener: no 
INFO[10-17|19:15:05]  - unprivileged file capabilities: yes 
INFO[10-17|19:15:05]  - shiftfs support: no 
INFO[10-17|19:15:05] Initializing local database 
DBUG[10-17|19:15:05] Initializing database gateway 
DBUG[10-17|19:15:05] Start database node                      id=1 address=
00:13:04.131 [DEBUG]: data dir: /var/snap/lxd/common/lxd/database/global
00:13:04.131 [DEBUG]: metadata1: version 151, term 24, voted for 1
00:13:04.131 [DEBUG]: metadata2: version 150, term 24, voted for 1
00:13:04.131 [DEBUG]: metadata: version 153, term 24, voted for 1
00:13:04.131 [DEBUG]: I/O: direct 1, block 4096
00:13:04.131 [INFO ]: starting
00:13:04.131 [DEBUG]: ignore .
00:13:04.131 [DEBUG]: ignore ..
00:13:04.131 [DEBUG]: segment 1-1
00:13:04.131 [DEBUG]: segment 1175-1698
00:13:04.131 [DEBUG]: segment 1699-1718
00:13:04.131 [DEBUG]: segment 17-340
00:13:04.131 [DEBUG]: segment 1719-1777
00:13:04.131 [DEBUG]: segment 1778-1797
00:13:04.131 [DEBUG]: segment 1798-1898
00:13:04.131 [DEBUG]: segment 1899-1936
00:13:04.131 [DEBUG]: segment 1937-2012
00:13:04.131 [DEBUG]: segment 2-16
00:13:04.131 [DEBUG]: segment 2013-2111
00:13:04.131 [DEBUG]: segment 2112-2141
00:13:04.131 [DEBUG]: segment 2142-2170
00:13:04.131 [DEBUG]: segment 2171-2205
00:13:04.131 [DEBUG]: segment 2206-2329
00:13:04.131 [DEBUG]: segment 2206-2331
00:13:04.131 [DEBUG]: segment 341-425
00:13:04.131 [DEBUG]: segment 426-441
00:13:04.131 [DEBUG]: segment 442-457
00:13:04.131 [DEBUG]: segment 458-608
00:13:04.131 [DEBUG]: segment 609-712
00:13:04.131 [DEBUG]: segment 713-828
00:13:04.131 [DEBUG]: segment 829-897
00:13:04.131 [DEBUG]: segment 898-1174
00:13:04.131 [DEBUG]: ignore db.bin
00:13:04.131 [DEBUG]: ignore db.bin-wal
00:13:04.131 [DEBUG]: ignore metadata1
00:13:04.131 [DEBUG]: ignore metadata2
00:13:04.131 [DEBUG]: ignore snapshot-11-1024-1384224
00:13:04.131 [DEBUG]: snapshot snapshot-11-1024-1384224.meta
00:13:04.131 [DEBUG]: ignore snapshot-19-2048-1994647
00:13:04.131 [DEBUG]: snapshot snapshot-19-2048-1994647.meta
00:13:04.131 [DEBUG]: most recent snapshot at 2048
00:13:04.131 [DEBUG]: most recent closed segment is 2206-2329
00:13:04.131 [WARN ]: discarding non contiguous segment 2206-2331
00:13:04.131 [ERROR]: found closed segment past last snapshot: 2206-2329
EROR[10-17|19:15:05] Failed to start the daemon: Failed to start dqlite server: failed to start task 
INFO[10-17|19:15:05] Starting shutdown sequence 
DBUG[10-17|19:15:05] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: failed to start task

I have a global.bak that I was able to successfully start LXD with that’s a few days old.

Update:
Removing 2206-2331 seemed to fix the issue; I just wanted to report this here in case it’s still a possible bug.

Thanks for your help!

Thanks for reporting. Previous cases where this happened were due to the system running out of disk space. Could this have happened to your system too?

btw, in my case earlier there was definitely enough space on all partitions.

@freeekanayaka thanks for the quick response!

Looks like I have around 4.7GB available on the partition:

/dev/sda1 9.7G 5.0G 4.7G 52% /
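Since earlier cases were traced to a full disk, the free space on the filesystem holding the database can also be checked from a script. This is a minimal sketch; the function names free_kb and warn_if_low and the threshold are invented for the example.

```shell
#!/bin/sh
# Sketch only: report available space for the filesystem holding a directory.

free_kb() {
    # -P: POSIX single-line output, -k: sizes in 1024-byte blocks.
    # Column 4 of the data line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

warn_if_low() {
    dir=$1       # e.g. /var/snap/lxd/common/lxd/database
    min_kb=$2    # hypothetical threshold, in KiB

    avail=$(free_kb "$dir")
    [ -n "$avail" ] || return 1
    if [ "$avail" -lt "$min_kb" ]; then
        echo "low disk space on $dir: $avail KiB available" >&2
        return 1
    fi
}
```

For example, warn_if_low /var/snap/lxd/common/lxd/database 1048576 would complain when less than 1 GiB is available.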

When this error happened I had LXD running as the foreground process of a terminal window via:
/snap/bin/lxd --debug --group lxd

When I shut the VM down I don’t think it shut down cleanly and unfortunately I can’t remember if I had killed LXD via Ctrl + C before I shut it down or not.

Ironically, right before I was going to post this, my Mac that’s running the VM (that’s running Ubuntu/LXD) froze up and I had to do a forced restart, which did not shut the VM down correctly. I thought this might trigger the bug again, but LXD restarted just fine.

Apologies for the poor debug info on my end!