@Nick_Knutov you can perform daily backups by plainly copying the files if you wish, but you should shut down the LXD daemon during the copy. Alternatively, you can perform a live dump, as described in the docs. In general you should not need backups, since LXD takes a backup itself when it upgrades (which is the point where issues are most likely to occur).
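For example, under the snap a plain-copy backup might look something like this (just a sketch; the destination path is an example, adjust to taste):
sudo systemctl stop snap.lxd.daemon
sudo cp -a /var/snap/lxd/common/lxd /var/backups/lxd-$(date +%F)
sudo systemctl start snap.lxd.daemon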
@hobiga fwiw I could fix your database by running:
rm <your-lxd-dir>/database/global/2133593-2133840
where I guess <your-lxd-dir> is /var/snap/lxd/common/lxd.
Anyway, good to know that you sorted that out.
@Nick_Knutov, I could fix your db by running:
rm database/global/4066-4245
as in the case of @hobiga.
It's not clear what situation leads to this, but if we see more cases like this occurring even after a successful upgrade to 3.15, we'll need to at least put some code in place that attempts automatic recovery.
I got the same issue on LXD 3.15 (snap stable channel, updated yesterday, but LXD would not start today after a system boot). Here is the error message:
Failed to start the daemon: Failed to start dqlite server: run failed with 13
I got a backup of /var/snap/lxd/common/lxd/database for posterity. In there are global/ and global.bak/; the global.bak has the timestamp of the boot-up.
I removed that subdirectory with the two numbers in its name,
rm /var/snap/lxd/common/lxd/database/global/40577-40640/
and then I ran
systemctl start snap.lxd.daemon
And LXD started normally.
Thanks for the report, simos. Do you still have your global.bak around? If so, please send it to me by mail; I might try to reproduce it.
LXD 3.15: same issue, same error message, and same solution as simos.
Sidenote: The number-thingy was a file on my system, not a dir like it was for simos:
rm /var/snap/lxd/common/lxd/database/global/7133-7217
@freeekanayaka, do you want a copy of database/? If so, I'll send it to the Canonical email.
Thanks all!
Thanks for reporting @pianoJ. I don't think the copy of database/ would be useful, essentially because it won't tell me how the system got to this state. We added some code to automatically recover from this situation, so at least people won't even notice this anymore and LXD will start normally.
Can someone explain, for a dummy, how that is practically done?
So would that be the suggestion for any situation like this? (In my case it's /database/global/32824-32868.)
I'd go out on a limb and suggest making a backup of the whole database directory (verified), then deleting the global dir and renaming global.bak to global.
I don't know exactly what the role of these files with UID-like names is, but I'd say you can attempt it (back up, then delete the files that look like 32824-32868). A quick test on my test config has shown that it's not deadly for a working config (and your config doesn't work anyway).
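For example (a rough sketch; paths assume the snap, and the segment file name will differ on your system):
sudo systemctl stop snap.lxd.daemon
sudo cp -a /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.backup
sudo rm /var/snap/lxd/common/lxd/database/global/32824-32868
sudo systemctl start snap.lxd.daemon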
This plus snap refresh lxd --stable actually worked. Thanks a mil, everybody!
Hello @stgraber and @freeekanayaka,
I appear to have hit this yesterday, which seems odd if you have added code to automatically recover.
$ lxc list
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused
$ sudo lxd --debug --group lxd
[sudo] password for aaron:
DBUG[08-12|22:21:00] Connecting to a local LXD over a Unix socket
DBUG[08-12|22:21:00] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[08-12|22:21:00] LXD 3.15 is starting in normal mode path=/var/snap/lxd/common/lxd
INFO[08-12|22:21:00] Kernel uid/gid map:
INFO[08-12|22:21:00] - u 0 0 4294967295
INFO[08-12|22:21:00] - g 0 0 4294967295
INFO[08-12|22:21:00] Configured LXD uid/gid map:
INFO[08-12|22:21:00] - u 0 1000000 1000000000
INFO[08-12|22:21:00] - g 0 1000000 1000000000
WARN[08-12|22:21:00] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[08-12|22:21:00] Kernel features:
INFO[08-12|22:21:00] - netnsid-based network retrieval: yes
INFO[08-12|22:21:00] - uevent injection: yes
INFO[08-12|22:21:00] - seccomp listener: yes
INFO[08-12|22:21:00] - unprivileged file capabilities: yes
INFO[08-12|22:21:00] - shiftfs support: yes
INFO[08-12|22:21:00] Initializing local database
DBUG[08-12|22:21:00] Initializing database gateway
DBUG[08-12|22:21:00] Start database node id=1 address=
EROR[08-12|22:21:00] Failed to start the daemon: Failed to start dqlite server: run failed with 13
INFO[08-12|22:21:00] Starting shutdown sequence
DBUG[08-12|22:21:00] Not unmounting temporary filesystems (containers are still running)
Error: Failed to start dqlite server: run failed with 13
I can follow the steps above to try to recover, but thought I would check whether it would be helpful for me to send anything over before I do.
$ lxd --version
3.15
There are several scenarios in which Failed to start dqlite server: run failed with 13 might be returned. One of them now has logic in place to just emit a warning instead of bailing out. Either the snap version you're running does not have that fix (I guess that's unlikely), or you are hitting a different problem. Please try to upgrade to 3.16 first, which has further error-handling logic. If it's what I think, then upgrading to 3.16 won't fix it, but it's worth a try (in addition, 3.16 will also output some more debugging information). If the problem persists, please either paste here the output of ls -l /var/snap/lxd/common/lxd/database/global, or send me an email with a tarball of /var/snap/lxd/common/lxd/database so I can see what's wrong.
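For reference, such a tarball can be created with something like this (the output filename is arbitrary):
sudo tar -czf lxd-database.tar.gz -C /var/snap/lxd/common/lxd database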
Thanks Free,
Done. Didn't fix it.
$ snap refresh lxd --stable
lxd 3.16 from Canonical✓ refreshed
$ lxc list
Error: Get http://unix.socket/1.0: read unix @->/var/snap/lxd/common/lxd/unix.socket: read: connection reset by peer
$ sudo lxd --debug --group lxd
DBUG[08-13|22:16:19] Connecting to a local LXD over a Unix socket
DBUG[08-13|22:16:19] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[08-13|22:16:19] LXD 3.16 is starting in normal mode path=/var/snap/lxd/common/lxd
INFO[08-13|22:16:19] Kernel uid/gid map:
INFO[08-13|22:16:19] - u 0 0 4294967295
INFO[08-13|22:16:19] - g 0 0 4294967295
INFO[08-13|22:16:19] Configured LXD uid/gid map:
INFO[08-13|22:16:19] - u 0 1000000 1000000000
INFO[08-13|22:16:19] - g 0 1000000 1000000000
WARN[08-13|22:16:19] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[08-13|22:16:19] Kernel features:
INFO[08-13|22:16:19] - netnsid-based network retrieval: yes
INFO[08-13|22:16:19] - uevent injection: yes
INFO[08-13|22:16:19] - seccomp listener: yes
INFO[08-13|22:16:19] - unprivileged file capabilities: yes
INFO[08-13|22:16:19] - shiftfs support: yes
INFO[08-13|22:16:19] Initializing local database
DBUG[08-13|22:16:19] Initializing database gateway
DBUG[08-13|22:16:19] Start database node id=1 address=
00:07:28.645 [DEBUG]: data dir: /var/snap/lxd/common/lxd/database/global
00:07:28.645 [DEBUG]: metadata1: version 57, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata2: version 58, term 7, voted for 1
00:07:28.645 [DEBUG]: metadata: version 60, term 7, voted for 1
00:07:28.645 [DEBUG]: I/O: direct 1, block 4096
00:07:28.645 [INFO ]: starting
00:07:28.645 [DEBUG]: ignore .
00:07:28.645 [DEBUG]: ignore ..
00:07:28.645 [DEBUG]: segment 2666-2843
00:07:28.645 [DEBUG]: ignore db.bin
00:07:28.645 [DEBUG]: ignore metadata1
00:07:28.645 [DEBUG]: ignore metadata2
00:07:28.645 [DEBUG]: ignore snapshot-1-1793-1
00:07:28.645 [DEBUG]: snapshot snapshot-1-1793-1.meta
00:07:28.645 [DEBUG]: most recent snapshot at 1793
00:07:28.645 [DEBUG]: most recent closed segment is 2666-2843
00:07:28.645 [ERROR]: found closed segment past last snapshot: 2666-2843
EROR[08-13|22:16:19] Failed to start the daemon: Failed to start dqlite server: run failed with 12
INFO[08-13|22:16:19] Starting shutdown sequence
DBUG[08-13|22:16:19] Not unmounting temporary filesystems (containers are still running)
Error: Failed to start dqlite server: run failed with 12
$ sudo ls -l /var/snap/lxd/common/lxd/database/global
total 2872
-rw------- 1 root root 2265864 Aug 7 15:28 2666-2843
-rw------- 1 root root 327680 Aug 13 22:15 db.bin
-rw------- 1 root root 32 Aug 13 22:16 metadata1
-rw------- 1 root root 32 Aug 13 22:16 metadata2
-rw------- 1 root root 327720 Jul 25 23:19 snapshot-1-1793-1
-rw------- 1 root root 52 Jul 25 23:19 snapshot-1-1793-1.meta
I will also email you the database.
Many thanks for your help.
@aaron unfortunately the logs indicate that, like the other users, you have been hit by:
There's not much that can be done; your best bet is to run:
sudo rm /var/snap/lxd/common/lxd/database/global/2666-2843
and restart LXD. That means any database change committed between the 25th of July and now will be lost, but that's the most we can do to limit the data loss. From this point on you should be good, since you'll be running 3.16, which has improvements in this area.
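For completeness, restarting under the snap after the rm would be something like this, followed by a quick check that the daemon answers again:
sudo systemctl restart snap.lxd.daemon
lxc list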
Thanks Free. Appreciate your work.
Unfortunately, much of what I wanted happened after that date, and after running that rm command, lxc list gave me no results. The containers are all still there in /var/snap/lxd/common/lxd/containers, but doing an lxc exec into them didn't work.
Interestingly, doing an lxc launch with the correct image and name seems to recreate the container, and my files are there when I exec into it. I should be able to remember the details well enough to do that for nearly all of them, but is there a more recommended way to recover from something like this?
I wouldn't recommend the launch approach, as there's a good chance it may overwrite your data or at least mess with it.
The best way to recover is to use lxd import NAME, which is designed specifically for cases where you have the container data but no database record. It will read the backup.yaml file that's part of the on-disk storage of all containers and re-create the majority of database records needed for the container.
That's covered in our backup documentation here: https://lxd.readthedocs.io/en/latest/backup/#disaster-recovery
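For example, assuming a container whose data is still under /var/snap/lxd/common/lxd/containers (the container name here is just a placeholder):
ls /var/snap/lxd/common/lxd/containers
sudo lxd import mycontainer
lxc list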
Thanks @stgraber. That sounds great.
I created a new topic here, "Mounting container storage volume", with a request for help on that process (to avoid distracting this thread further).
Many thanks for all of your help getting me back up and running properly with that import tip!
If this has a good chance of overwriting or messing with data, would it be worth adding a confirmation prompt? From memory, the creation process just looked like a new container until I started poking around and found my old files.
I'm not sure why launch worked at all; I would have expected it to complain that the container storage volume already existed when creating it. We're currently doing a big rework of our storage logic, partly to improve error handling, so I would expect what you did to fail in the near future as we fix some of those code paths.