Failed to start dqlite server

I’d go out on a limb and suggest making a backup of the whole database directory (verified), then deleting the global directory and renaming global.bak to global.

I don’t know exactly what the role of these files with UID-looking names is, but I’d say you can attempt it (back up, then delete the files that look like 32824-32868). A quick test on my test config has shown that it’s not fatal for a working config (and your config doesn’t work anyway).
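
For reference, here’s roughly what that sequence would look like (a sketch only; paths assume the snap package, and keep the backup around until you’re sure everything works):

sudo snap stop lxd
sudo cp -a /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.backup
# either delete the suspect segment files inside global/, or swap in the .bak copy:
sudo mv /var/snap/lxd/common/lxd/database/global /var/snap/lxd/common/lxd/database/global.broken
sudo mv /var/snap/lxd/common/lxd/database/global.bak /var/snap/lxd/common/lxd/database/global
sudo snap start lxd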

This + snap refresh lxd --stable actually worked. Thanks a mil, everybody.

Hello @stgraber and @freeekanayaka,

I appear to have hit this yesterday, which seems odd if you have added code to automatically recover.

$ lxc list
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

$ sudo lxd --debug --group lxd
[sudo] password for aaron: 
DBUG[08-12|22:21:00] Connecting to a local LXD over a Unix socket 
DBUG[08-12|22:21:00] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[08-12|22:21:00] LXD 3.15 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[08-12|22:21:00] Kernel uid/gid map: 
INFO[08-12|22:21:00]  - u 0 0 4294967295 
INFO[08-12|22:21:00]  - g 0 0 4294967295 
INFO[08-12|22:21:00] Configured LXD uid/gid map: 
INFO[08-12|22:21:00]  - u 0 1000000 1000000000 
INFO[08-12|22:21:00]  - g 0 1000000 1000000000 
WARN[08-12|22:21:00] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-12|22:21:00] Kernel features: 
INFO[08-12|22:21:00]  - netnsid-based network retrieval: yes 
INFO[08-12|22:21:00]  - uevent injection: yes 
INFO[08-12|22:21:00]  - seccomp listener: yes 
INFO[08-12|22:21:00]  - unprivileged file capabilities: yes 
INFO[08-12|22:21:00]  - shiftfs support: yes 
INFO[08-12|22:21:00] Initializing local database 
DBUG[08-12|22:21:00] Initializing database gateway 
DBUG[08-12|22:21:00] Start database node                      id=1 address=
EROR[08-12|22:21:00] Failed to start the daemon: Failed to start dqlite server: run failed with 13 
INFO[08-12|22:21:00] Starting shutdown sequence 
DBUG[08-12|22:21:00] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: run failed with 13

I can follow the steps above to try to recover, but thought I would check whether it would be helpful for me to send anything over before I do.

$ lxd --version
3.15

There are several scenarios in which Failed to start dqlite server: run failed with 13 might be returned. One of them now has logic in place to just emit a warning instead of bailing out. Either the snap version you’re running does not have that fix (I guess that’s unlikely), or you are hitting a different problem. Please try to upgrade to 3.16 first, which has additional error-handling logic. If it’s what I think it is, then upgrading to 3.16 won’t fix it, but it’s worth a try (3.16 will also output some more debugging information). If the problem persists, please either paste here the output of ls -l /var/snap/lxd/common/lxd/database/global, or send me an email with a tarball of /var/snap/lxd/common/lxd/database so I can see what’s wrong.
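
For the tarball, something along these lines should work (assuming the standard snap paths):

sudo tar -czf lxd-database.tar.gz -C /var/snap/lxd/common/lxd database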

Thanks Free,

Done. Didn’t fix it.

$ snap refresh lxd --stable
lxd 3.16 from Canonical✓ refreshed

$ lxc list
Error: Get http://unix.socket/1.0: read unix @->/var/snap/lxd/common/lxd/unix.socket: read: connection reset by peer

$ sudo lxd --debug --group lxd
DBUG[08-13|22:16:19] Connecting to a local LXD over a Unix socket 
DBUG[08-13|22:16:19] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[08-13|22:16:19] LXD 3.16 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[08-13|22:16:19] Kernel uid/gid map: 
INFO[08-13|22:16:19]  - u 0 0 4294967295 
INFO[08-13|22:16:19]  - g 0 0 4294967295 
INFO[08-13|22:16:19] Configured LXD uid/gid map: 
INFO[08-13|22:16:19]  - u 0 1000000 1000000000 
INFO[08-13|22:16:19]  - g 0 1000000 1000000000 
WARN[08-13|22:16:19] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-13|22:16:19] Kernel features: 
INFO[08-13|22:16:19]  - netnsid-based network retrieval: yes 
INFO[08-13|22:16:19]  - uevent injection: yes 
INFO[08-13|22:16:19]  - seccomp listener: yes 
INFO[08-13|22:16:19]  - unprivileged file capabilities: yes 
INFO[08-13|22:16:19]  - shiftfs support: yes 
INFO[08-13|22:16:19] Initializing local database 
DBUG[08-13|22:16:19] Initializing database gateway 
DBUG[08-13|22:16:19] Start database node                      id=1 address=
00:07:28.645 [DEBUG]: data dir: /var/snap/lxd/common/lxd/database/global
00:07:28.645 [DEBUG]: metadata1: version 57, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata2: version 58, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata: version 60, term 7, voted for 1
00:07:28.645 [DEBUG]: I/O: direct 1, block 4096
00:07:28.645 [INFO ]: starting
00:07:28.645 [DEBUG]: ignore .
00:07:28.645 [DEBUG]: ignore ..
00:07:28.645 [DEBUG]: segment 2666-2843
00:07:28.645 [DEBUG]: ignore db.bin
00:07:28.645 [DEBUG]: ignore metadata1
00:07:28.645 [DEBUG]: ignore metadata2
00:07:28.645 [DEBUG]: ignore snapshot-1-1793-1
00:07:28.645 [DEBUG]: snapshot snapshot-1-1793-1.meta
00:07:28.645 [DEBUG]: most recent snapshot at 1793
00:07:28.645 [DEBUG]: most recent closed segment is 2666-2843
00:07:28.645 [ERROR]: found closed segment past last snapshot: 2666-2843
EROR[08-13|22:16:19] Failed to start the daemon: Failed to start dqlite server: run failed with 12 
INFO[08-13|22:16:19] Starting shutdown sequence 
DBUG[08-13|22:16:19] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: run failed with 12
$ sudo ls -l /var/snap/lxd/common/lxd/database/global
total 2872
-rw------- 1 root root 2265864 Aug  7 15:28 2666-2843
-rw------- 1 root root  327680 Aug 13 22:15 db.bin
-rw------- 1 root root      32 Aug 13 22:16 metadata1
-rw------- 1 root root      32 Aug 13 22:16 metadata2
-rw------- 1 root root  327720 Jul 25 23:19 snapshot-1-1793-1
-rw------- 1 root root      52 Jul 25 23:19 snapshot-1-1793-1.meta

I will also email you the database.

Many thanks for your help.

@aaron unfortunately the logs indicate that, like the other users, you have been hit by:

There’s not much that can be done; your best bet is to run:

sudo rm /var/snap/lxd/common/lxd/database/global/2666-2843

and restart LXD. That means any database change committed between the 25th of July and now will be lost, but that’s the best we can do to limit the data loss. From this point on you should be good, since you’ll be running 3.16 which has improvements in that area.
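
A minimal sketch of that sequence, keeping a copy of the segment first just in case (commands assume the snap package; the restart can also be done with sudo snap restart lxd):

sudo cp /var/snap/lxd/common/lxd/database/global/2666-2843 /root/2666-2843.bak
sudo rm /var/snap/lxd/common/lxd/database/global/2666-2843
sudo systemctl restart snap.lxd.daemon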

Thanks Free. Appreciate your work.

Unfortunately, much of what I wanted to keep happened after that date, and after running that rm command, lxc list gave me no results. The containers are all still there in /var/snap/lxd/common/lxd/containers, but lxc exec into them didn’t work.

Interestingly, doing an lxc launch with the correct image and name seems to recreate the container, and my files are there when I exec into it. I should be able to remember the details well enough to do that for nearly all of them, but is there a more recommended way to recover from something like this?

I wouldn’t recommend the launch approach, as there’s a good chance it may overwrite your data or at least mess with it.

The best way to recover is to use lxd import NAME, which is designed specifically for cases where you have the container data but no database record. It reads the backup.yaml file that’s part of the on-disk storage of every container and re-creates the majority of database records needed for the container.

That’s covered in our backup documentation here: https://lxd.readthedocs.io/en/latest/backup/#disaster-recovery
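
As a rough illustration (the container name here is just an example, and the linked documentation is the authoritative reference; depending on the storage backend you may need to mount the container’s storage volume first):

sudo ls /var/snap/lxd/common/lxd/containers    # see which containers still exist on disk
sudo lxd import mycontainer                    # re-create the database records from backup.yaml
lxc start mycontainer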

Thanks @stgraber. That sounds great.

I created a new topic here: Mounting container storage volume with a request for help on that process (to avoid distracting this thread further).

Many thanks for all of your help getting me back up and running properly with that import tip!

If this has a good chance of overwriting or messing with data, would it be worth adding a confirmation prompt? From memory, the creation process just looked like a new container until I started poking around and found my old files.

I’m not sure why launch worked at all; I would have expected it to complain that the container storage volume already existed. We’re currently doing a big rework of our storage logic, partly to improve error handling, so I would expect what you did to fail in the near future as we fix some of those codepaths.


@stgraber I just encountered this issue on LXD 3.18 on Ubuntu 18.04 after an (I think) unclean shutdown of the server.

root@ubuntu-bionic:/var/snap/lxd/common/lxd/logs# /snap/bin/lxd --debug --group lxd
DBUG[10-17|19:15:05] Connecting to a local LXD over a Unix socket 
DBUG[10-17|19:15:05] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[10-17|19:15:05] LXD 3.18 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[10-17|19:15:05] Kernel uid/gid map: 
INFO[10-17|19:15:05]  - u 0 0 4294967295 
INFO[10-17|19:15:05]  - g 0 0 4294967295 
INFO[10-17|19:15:05] Configured LXD uid/gid map: 
INFO[10-17|19:15:05]  - u 0 1000000 1000000000 
INFO[10-17|19:15:05]  - g 0 1000000 1000000000 
WARN[10-17|19:15:05] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[10-17|19:15:05] Kernel features: 
INFO[10-17|19:15:05]  - netnsid-based network retrieval: no 
INFO[10-17|19:15:05]  - uevent injection: no 
INFO[10-17|19:15:05]  - seccomp listener: no 
INFO[10-17|19:15:05]  - unprivileged file capabilities: yes 
INFO[10-17|19:15:05]  - shiftfs support: no 
INFO[10-17|19:15:05] Initializing local database 
DBUG[10-17|19:15:05] Initializing database gateway 
DBUG[10-17|19:15:05] Start database node                      id=1 address=
00:13:04.131 [DEBUG]: data dir: /var/snap/lxd/common/lxd/database/global
00:13:04.131 [DEBUG]: metadata1: version 151, term 24, voted for 1
00:13:04.131 [DEBUG]: metadata2: version 150, term 24, voted for 1
00:13:04.131 [DEBUG]: metadata: version 153, term 24, voted for 1
00:13:04.131 [DEBUG]: I/O: direct 1, block 4096
00:13:04.131 [INFO ]: starting
00:13:04.131 [DEBUG]: ignore .
00:13:04.131 [DEBUG]: ignore ..
00:13:04.131 [DEBUG]: segment 1-1
00:13:04.131 [DEBUG]: segment 1175-1698
00:13:04.131 [DEBUG]: segment 1699-1718
00:13:04.131 [DEBUG]: segment 17-340
00:13:04.131 [DEBUG]: segment 1719-1777
00:13:04.131 [DEBUG]: segment 1778-1797
00:13:04.131 [DEBUG]: segment 1798-1898
00:13:04.131 [DEBUG]: segment 1899-1936
00:13:04.131 [DEBUG]: segment 1937-2012
00:13:04.131 [DEBUG]: segment 2-16
00:13:04.131 [DEBUG]: segment 2013-2111
00:13:04.131 [DEBUG]: segment 2112-2141
00:13:04.131 [DEBUG]: segment 2142-2170
00:13:04.131 [DEBUG]: segment 2171-2205
00:13:04.131 [DEBUG]: segment 2206-2329
00:13:04.131 [DEBUG]: segment 2206-2331
00:13:04.131 [DEBUG]: segment 341-425
00:13:04.131 [DEBUG]: segment 426-441
00:13:04.131 [DEBUG]: segment 442-457
00:13:04.131 [DEBUG]: segment 458-608
00:13:04.131 [DEBUG]: segment 609-712
00:13:04.131 [DEBUG]: segment 713-828
00:13:04.131 [DEBUG]: segment 829-897
00:13:04.131 [DEBUG]: segment 898-1174
00:13:04.131 [DEBUG]: ignore db.bin
00:13:04.131 [DEBUG]: ignore db.bin-wal
00:13:04.131 [DEBUG]: ignore metadata1
00:13:04.131 [DEBUG]: ignore metadata2
00:13:04.131 [DEBUG]: ignore snapshot-11-1024-1384224
00:13:04.131 [DEBUG]: snapshot snapshot-11-1024-1384224.meta
00:13:04.131 [DEBUG]: ignore snapshot-19-2048-1994647
00:13:04.131 [DEBUG]: snapshot snapshot-19-2048-1994647.meta
00:13:04.131 [DEBUG]: most recent snapshot at 2048
00:13:04.131 [DEBUG]: most recent closed segment is 2206-2329
00:13:04.131 [WARN ]: discarding non contiguous segment 2206-2331
00:13:04.131 [ERROR]: found closed segment past last snapshot: 2206-2329
EROR[10-17|19:15:05] Failed to start the daemon: Failed to start dqlite server: failed to start task 
INFO[10-17|19:15:05] Starting shutdown sequence 
DBUG[10-17|19:15:05] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: failed to start task

I have a global.bak from a few days ago that I was able to successfully start LXD with.

Update:
Removing 2206-2331 seemed to fix the issue; however, I just wanted to report this here in case it’s still a possible bug.

Thanks for your help!

Thanks for reporting. Previous cases where this happened were due to the system running out of disk space. Could this have happened to your system too?

btw, in my case earlier there was definitely enough space on all partitions.

@freeekanayaka thanks for the quick response!

Looks like I have around 4.7GB available on the partition:

/dev/sda1       9.7G  5.0G  4.7G  52% /

When this error happened I had LXD running as the foreground process of a terminal window via:
/snap/bin/lxd --debug --group lxd

When I shut the VM down, I don’t think it shut down cleanly, and unfortunately I can’t remember whether I had killed LXD via Ctrl + C before shutting it down.

Ironically, right before I was going to post this, my Mac that’s running the VM (which runs Ubuntu/LXD) froze up and I had to do a forced restart, which did not shut the VM down correctly. I was thinking this would possibly trigger the bug again, but LXD restarted just fine.

Apologies for the poor debug info on my end!

Hello!
I’ve just upgraded my Ubuntu to 20.04 (it rebooted during the installation) and I have the same problem:
LXD doesn’t start.
I tried:
/usr/bin/snap run lxd.daemon
And finally got:

EROR[04-27|10:55:30] Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000019244-0000000000019256: entries batch 19 starting at byte 181848: entries count in preamble is zero 
Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000019244-0000000000019256: entries batch 19 starting at byte 181848: entries count in preamble is zero
=> LXD failed to start

Hello,

please back up your /var/snap/lxd/common/lxd/database directory and then try:

sudo rm /var/snap/lxd/common/lxd/database/global/0000000000019244-0000000000019256

and see if the daemon starts fine after that.
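
Spelled out in full, that would be something like (adjust the segment name to whatever your error message reports):

sudo cp -a /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.bak
sudo rm /var/snap/lxd/common/lxd/database/global/0000000000019244-0000000000019256
sudo snap restart lxd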

Thank you @freeekanayaka!
I’ve moved two files out of global after reading the logs:

/var/snap/lxd/common/lxd/database/global/
-rw------- 1 root root   124032 kwi 24 22:42 0000000000019257-0000000000019269
-rw------- 1 root root   184320 kwi 24 22:42 0000000000019244-0000000000019256

And it’s working now with:

lxd --debug --group lxd

or

/usr/bin/snap run lxd.daemon

But the service still doesn’t work:

systemctl status snap.lxd.daemon

And now I can list my containers, but they don’t start:

root@jp-laptop:~# lxc list jp-pss
+----------------+---------+------+------+-----------+-----------+
|      NAME      |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+----------------+---------+------+------+-----------+-----------+
| jp-pss | STOPPED |      |      | CONTAINER | 0         |
+----------------+---------+------+------+-----------+-----------+
root@jp-laptop:~# 
root@jp-laptop:~# lxc start jp-pss
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart jp-pss
 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/jp-pss/lxc.conf: 
Try `lxc info --show-log jp-pss` for more info
root@jp-laptop:~# lxc info --show-log jp-pss
Name: jp-pss
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/01/10 08:52 UTC
Status: Stopped
Type: container
Profiles: default

Log:

lxc jp-pss 20200427101700.195 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1143 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.jp-pss"
lxc jp-pss 20200427101700.197 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1143 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.jp-pss"
lxc jp-pss 20200427101700.204 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1455 - No such file or directory - Failed to fchownat(17, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc jp-pss 20200427101700.285 ERROR    dir - storage/dir.c:dir_mount:152 - No such file or directory - Failed to mount "/var/snap/lxd/common/lxd/containers/jp-pss/rootfs" on "/var/snap/lxd/common/lxc/"
lxc jp-pss 20200427101700.285 ERROR    conf - conf.c:lxc_mount_rootfs:1256 - Failed to mount rootfs "/var/snap/lxd/common/lxd/containers/jp-pss/rootfs" onto "/var/snap/lxd/common/lxc/" with options "(null)"
lxc jp-pss 20200427101700.285 ERROR    conf - conf.c:lxc_setup_rootfs_prepare_root:3178 - Failed to setup rootfs for
lxc jp-pss 20200427101700.285 ERROR    conf - conf.c:lxc_setup:3277 - Failed to setup rootfs
lxc jp-pss 20200427101700.285 ERROR    start - start.c:do_start:1231 - Failed to setup container "jp-pss"
lxc jp-pss 20200427101700.287 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)
lxc jp-pss 20200427101700.295 WARN     network - network.c:lxc_delete_network_priv:3213 - Failed to rename interface with index 0 from "eth0" to its initial name "veth9a338ad4"
lxc jp-pss 20200427101700.295 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:852 - Received container state "ABORTING" instead of "RUNNING"
lxc jp-pss 20200427101700.295 ERROR    start - start.c:__lxc_start:1952 - Failed to spawn container "jp-pss"
lxc jp-pss 20200427101700.295 WARN     start - start.c:lxc_abort:1025 - No such process - Failed to send SIGKILL via pidfd 30 for process 29151
lxc 20200427101700.488 WARN     commands - commands.c:lxc_cmd_rsp_recv:122 - Connection reset by peer - Failed to receive response for command "get_state"

Assuming that your LXD daemon starts fine (and only the containers have issues), could you please start a separate forum post? Thanks.

Ok, I will