LXD stopped working all of a sudden

Hi Team,

The LXD API seems to have suddenly stopped working on one of my systems.

Running lxc list shows the following error:

Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

Running sudo systemctl status snap.lxd.daemon gave the following output:

● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-11-09 06:39:29 CET; 38s ago
Process: 26102 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
Main PID: 26102 (code=exited, status=1/FAILURE)
nov. 09 06:39:29 Node22 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
nov. 09 06:39:29 Node22 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 5.
nov. 09 06:39:29 Node22 systemd[1]: Stopped Service for snap application lxd.daemon.
nov. 09 06:39:29 Node22 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
nov. 09 06:39:29 Node22 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
nov. 09 06:39:29 Node22 systemd[1]: Failed to start Service for snap application lxd.daemon.

I then started the service and checked the status again, which showed the following:

● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: active (running) since Tue 2021-11-09 06:15:55 CET; 1s ago
Main PID: 21544 (daemon.start)
Tasks: 0 (limit: 5529)
CGroup: /system.slice/snap.lxd.daemon.service
‣ 21544 /bin/sh /snap/lxd/21858/commands/daemon.start
nov. 09 06:15:55 Node22 lxd.daemon[21544]: ==> Escaping the systemd cgroups
nov. 09 06:15:55 Node22 lxd.daemon[21544]: ====> Detected cgroup V1
nov. 09 06:15:55 Node22 lxd.daemon[21544]: ==> Escaping the systemd process resource limits
nov. 09 06:15:55 Node22 lxd.daemon[21544]: ==> Disabling shiftfs on this kernel (auto)
nov. 09 06:15:55 Node22 lxd.daemon[21544]: => Re-using existing LXCFS
nov. 09 06:15:55 Node22 lxd.daemon[21544]: => Starting LXD
nov. 09 06:15:56 Node22 lxd.daemon[21544]: t=2021-11-09T06:15:56+0100 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
nov. 09 06:15:56 Node22 lxd.daemon[21544]: t=2021-11-09T06:15:56+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
nov. 09 06:15:56 Node22 lxd.daemon[21544]: t=2021-11-09T06:15:56+0100 lvl=eror msg="Failed to start the daemon" err="Failed to start dqlite server: raft_start(): io: load closed segment 0000000000011674-0000000000012487: entries batch 755 starting at byte 7757424: data checksum mismatch"
nov. 09 06:15:56 Node22 lxd.daemon[21544]: Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000011674-0000000000012487: entries batch 755 starting at byte 7757424: data checksum mismatch

System configuration:
OS: Ubuntu 18.04 LTS
LXD version: 4.20

A quick response would really help, as this is a production system and the service is currently down because of this. Thanks.

Did your system recently run out of disk space or suffer a power loss?
The error indicates a corrupted LXD database.

I’d recommend you do cp -R /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.broken to make a quick backup of the database.

Then run ls -lh /var/snap/lxd/common/lxd/database/global/ so we can identify the likely broken segment that would need to be deleted.
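Something like this should do it (the paths assume the snap install, and you’ll need root since the files are owned by root):

sudo cp -R /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.broken   # quick backup of the whole database
sudo ls -lh /var/snap/lxd/common/lxd/database/global/                                   # list the raft segments and snapshots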

I made a backup of the database and ran ls -lh /var/snap/lxd/common/lxd/database/global/:

total 96M
-rw------- 1 root root 8,0M aug. 6 11:13 0000000000007683-0000000000008349
-rw------- 1 root root 6,3M aug. 10 05:55 0000000000008350-0000000000008883
-rw------- 1 root root 8,0M aug. 19 09:57 0000000000008884-0000000000009602
-rw------- 1 root root 3,8M aug. 30 08:31 0000000000009603-0000000000009980
-rw------- 1 root root 5,0M sep. 2 07:12 0000000000009981-0000000000010381
-rw------- 1 root root 263K sep. 2 07:16 0000000000010382-0000000000010408
-rw------- 1 root root 4,9M sep. 6 20:52 0000000000010409-0000000000010902
-rw------- 1 root root 7,0M sep. 13 09:47 0000000000010903-0000000000011597
-rw------- 1 root root 243K sep. 13 09:50 0000000000011598-0000000000011622
-rw------- 1 root root 501K sep. 13 16:18 0000000000011623-0000000000011673
-rw------- 1 root root 8,0M sep. 21 08:18 0000000000011674-0000000000012487
-rw------- 1 root root 8,0M sep. 28 21:18 0000000000012488-0000000000013292
-rw------- 1 root root 7,7M okt. 6 05:45 0000000000013293-0000000000014063
-rw------- 1 root root 8,0M okt. 13 23:45 0000000000014064-0000000000014878
-rw------- 1 root root 1,5M okt. 15 09:30 0000000000014879-0000000000015031
-rw------- 1 root root 8,0M okt. 22 16:31 0000000000015032-0000000000015832
-rw------- 1 root root 1,3M okt. 23 19:45 0000000000015833-0000000000015954
-rw------- 1 root root 8,0M nov. 8 10:56 0000000000015955-0000000000016768
-rw------- 1 root root 606K nov. 8 23:15 0000000000016769-0000000000016828
-rw------- 1 root root 452K nov. 8 23:15 db.bin
-rw------- 1 root root 32 mai 12 10:33 metadata1
-rw------- 1 root root 27K okt. 18 09:30 snapshot-1-15360-3022981382
-rw------- 1 root root 56 okt. 18 09:30 snapshot-1-15360-3022981382.meta
-rw------- 1 root root 63K nov. 4 16:56 snapshot-1-16384-4522128417
-rw------- 1 root root 56 nov. 4 16:56 snapshot-1-16384-4522128417.meta

I don’t think there’s been any power loss, as the node has been up for 56 days and is still running. There is also 894 GB of free disk space left in the root volume.

Should I perform a system reboot as a last resort? Would that help? @stgraber
I don’t want the configuration or the LXD containers to be damaged by a reboot.

Any possible solutions on this issue?

Any ideas @mbordere ?

The good news is that you have 2 snapshots that contain data after the corrupt segment (15360 > 12487 and 16384 > 12487), so you will likely not incur data loss.

Let me double-check why it’s actually trying to load that segment and get back to you.

Well, I am not sure I ever made any snapshots. I am hoping there is no data loss from this issue, as it contains important user data.

I’m talking about database snapshots that are transparently taken behind your back, sorry for the confusion.

  1. Make sure you have made a backup of /var/snap/lxd/common/lxd/database
  2. Delete the following files in /var/snap/lxd/common/lxd/database/global/
-rw------- 1 root root 8,0M aug. 6 11:13 0000000000007683-0000000000008349
-rw------- 1 root root 6,3M aug. 10 05:55 0000000000008350-0000000000008883
-rw------- 1 root root 8,0M aug. 19 09:57 0000000000008884-0000000000009602
-rw------- 1 root root 3,8M aug. 30 08:31 0000000000009603-0000000000009980
-rw------- 1 root root 5,0M sep. 2 07:12 0000000000009981-0000000000010381
-rw------- 1 root root 263K sep. 2 07:16 0000000000010382-0000000000010408
-rw------- 1 root root 4,9M sep. 6 20:52 0000000000010409-0000000000010902
-rw------- 1 root root 7,0M sep. 13 09:47 0000000000010903-0000000000011597
-rw------- 1 root root 243K sep. 13 09:50 0000000000011598-0000000000011622
-rw------- 1 root root 501K sep. 13 16:18 0000000000011623-0000000000011673
-rw------- 1 root root 8,0M sep. 21 08:18 0000000000011674-0000000000012487
  3. Try to start LXD again (see the consolidated sketch below).
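Put together, the procedure looks roughly like this as shell commands (run as root; the paths assume the snap install, and the segment names are the ones from your listing, so double-check them against your own output before deleting anything):

systemctl stop snap.lxd.daemon                    # make sure LXD is not running
cp -R /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.broken   # step 1: backup (skip if already done)
cd /var/snap/lxd/common/lxd/database/global       # step 2: delete everything up to and including the corrupt segment
rm 0000000000007683-0000000000008349 0000000000008350-0000000000008883 \
   0000000000008884-0000000000009602 0000000000009603-0000000000009980 \
   0000000000009981-0000000000010381 0000000000010382-0000000000010408 \
   0000000000010409-0000000000010902 0000000000010903-0000000000011597 \
   0000000000011598-0000000000011622 0000000000011623-0000000000011673 \
   0000000000011674-0000000000012487
systemctl start snap.lxd.daemon                   # step 3: start LXD again
lxc list                                          # confirm the API responds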

Oh wow. It just did the trick. Thanks a lot.

Could you help me understand what the issue was, and how you figured out that only these files needed to be deleted?

We store checksums when writing the 000000xxx-000000xxx segment files, then recalculate and compare them when loading those files. When the checksums don’t match, we report the error you saw. Possible causes are an issue with your disk or a bug in the implementation; because this operation is carried out very frequently, I suspect a problem with your disk is the more likely one.
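Conceptually it is the same as storing a checksum next to a file when you write it and verifying it when you read it back; raft just keeps the checksums inside the segment itself, per batch of entries, which is why the error names a specific batch. As a plain-shell illustration only:

sha256sum segment.dat > segment.dat.sum   # at write time: record a checksum of the data
sha256sum -c segment.dat.sum              # at load time: reports OK, or FAILED if the data changed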

In /var/snap/lxd/common/lxd/database/global/ there are basically 2 types of database files.

  • segment files, e.g. 0000000000007683-0000000000008349. These contain a limited number of database entries, numbered from 7683 to 8349 in this example.
  • snapshot files e.g. snapshot-1-15360-3022981382. These contain all of the database entries up until a certain index, in this case 15360. The snapshot files are generally small because they are compressed.

When the database starts up, it will load the latest snapshot file, in this case snapshot-1-16384-4522128417, and then load all entries in the segment files that overlap with or come after the snapshot. In your case it will load the entries in these segment files:

0000000000015955-0000000000016768 /* Because 15955 < 16384 < 16768*/
0000000000016769-0000000000016828 /* Because 16384 < 16769 */

The entries in the other segment files are already contained in the snapshot, so their information is not needed; that is why they could be deleted.
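As a rough illustration of that rule (not an official tool), you could list the segment files the latest snapshot still depends on like this, where 16384 is taken from the name of snapshot-1-16384-4522128417:

# bash, run as root; 10# forces base 10 so the leading zeros are not read as octal
cd /var/snap/lxd/common/lxd/database/global
for f in 0*-0*; do
    last=${f#*-}                               # end index of the segment, e.g. 0000000000016768
    if [ "$((10#$last))" -ge 16384 ]; then     # overlaps with or comes after the snapshot
        echo "still needed: $f"
    fi
done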

I decided to have you delete all segment files up to and including the problematic one, because that is the most conservative approach. In theory you could have deleted all segment files except 0000000000015955-0000000000016768 and 0000000000016769-0000000000016828, and you could also have deleted snapshot-1-15360-3022981382 and snapshot-1-15360-3022981382.meta.

The startup logic of the database could be improved, because the problematic segment you encountered was technically not needed to start up the database and could have been ignored.


You can send me the problematic segment at mathieu.bordere@canonical.com if you want; when I have some time I could investigate it to try to find out what could have gone wrong when loading it, in case it wasn’t a disk failure.


Thanks @mbordere for the explanation, I understand the concept now.

Regarding the problematic segment, I included in my post above all the information I could collect at the time. Let me know if I can share anything more for your investigation.

@genesis96839 You can send me the file itself if you want (if it doesn’t contain sensitive information for you).
