LXD won't start. Looks like a corrupted DB

Pucky_wins · October 21, 2020, 8:01am

Hi

I can’t start LXD. There was a hard drive failure and, despite my drives being mirrored, LXD died and won’t restart. Here are the debug logs. Please advise.

DBUG[10-21|09:57:52] Connecting to a local LXD over a Unix socket 
DBUG[10-21|09:57:52] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[10-21|09:57:52] LXD 4.0.3 is starting in normal mode     path=/var/snap/lxd/common/lxd
INFO[10-21|09:57:52] Kernel uid/gid map: 
INFO[10-21|09:57:52]  - u 0 0 4294967295 
INFO[10-21|09:57:52]  - g 0 0 4294967295 
INFO[10-21|09:57:52] Configured LXD uid/gid map: 
INFO[10-21|09:57:52]  - u 0 1000000 1000000000 
INFO[10-21|09:57:52]  - g 0 1000000 1000000000 
INFO[10-21|09:57:52] Kernel features: 
INFO[10-21|09:57:52]  - closing multiple file descriptros efficiently: no 
INFO[10-21|09:57:52]  - netnsid-based network retrieval: yes 
INFO[10-21|09:57:52]  - pidfds: yes 
INFO[10-21|09:57:52]  - uevent injection: yes 
INFO[10-21|09:57:52]  - seccomp listener: yes 
INFO[10-21|09:57:52]  - seccomp listener continue syscalls: yes 
INFO[10-21|09:57:52]  - seccomp listener add file descriptors: no 
INFO[10-21|09:57:52]  - safe native terminal allocation : no 
INFO[10-21|09:57:52]  - unprivileged file capabilities: yes 
INFO[10-21|09:57:52]  - cgroup layout: hybrid 
WARN[10-21|09:57:52]  - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored 
WARN[10-21|09:57:52]  - Couldn't find the CGroup memory swap accounting, swap limits will be ignored 
INFO[10-21|09:57:52]  - shiftfs support: yes 
INFO[10-21|09:57:52] Initializing local database 
DBUG[10-21|09:57:52] Initializing database gateway 
DBUG[10-21|09:57:52] Start database node                      id=6 address=10.3.0.60:8443 role=spare
EROR[10-21|09:57:52] Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000077825-0000000000079365: entries batch 58 starting at byte 3071496: entries count in preamble is zero 
INFO[10-21|09:57:52] Starting shutdown sequence 
DBUG[10-21|09:57:52] Not unmounting temporary filesystems (containers are still running) 
Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000077825-0000000000079365: entries batch 58 starting at byte 3071496: entries count in preamble is zero
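
For anyone reading the error above: the segment name in the message appears to encode a range of raft log indexes (first and last entry stored in that file), and the rest of the message points at the batch and byte offset where loading stopped. The following is a minimal sketch that pulls those fields out of the message. The function name `parse_segment_error` and the interpretation of the filename as an inclusive index range are assumptions made for illustration, not something confirmed in this thread.

```python
import re

# Pattern matching the dqlite/raft error text quoted in the log above.
# The field meanings (first/last log index) are an assumption.
SEGMENT_RE = re.compile(
    r"load closed segment (?P<first>\d{16})-(?P<last>\d{16}): "
    r"entries batch (?P<batch>\d+) starting at byte (?P<offset>\d+)"
)


def parse_segment_error(message):
    """Extract the segment file name and failure position from an error line.

    Returns None when the message does not match the expected shape.
    """
    match = SEGMENT_RE.search(message)
    if match is None:
        return None
    first = int(match.group("first"))
    last = int(match.group("last"))
    return {
        "filename": f"{match.group('first')}-{match.group('last')}",
        "first_index": first,
        "last_index": last,
        # Assumes the range is inclusive on both ends.
        "entry_count": last - first + 1,
        "batch": int(match.group("batch")),
        "offset": int(match.group("offset")),
    }


if __name__ == "__main__":
    line = (
        "Error: Failed to start dqlite server: raft_start(): io: load closed "
        "segment 0000000000077825-0000000000079365: entries batch 58 starting "
        "at byte 3071496: entries count in preamble is zero"
    )
    print(parse_segment_error(line))
```

Knowing the file name and offset makes it easier to find the affected file on disk and to tell whether the same segment is reported on every start attempt.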

Pucky_wins · October 21, 2020, 11:24am

Rebooting the server fixed the problem. I’m still curious what you would do if the database on a node got corrupted. Is there a way to rebuild it?

freeekanayaka · October 21, 2020, 11:52am

It’s quite weird that just rebooting fixed the problem. The error you reported should occur every time you try to restart LXD from a corrupted on-disk state. Are you sure that the data on disk didn’t change?
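
One way to check whether the on-disk data changed is to record a checksum of every file in the database directory before and after the reboot and compare the two. Here is a rough sketch of that idea. The helper names `hash_directory` and `diff_manifests` are made up for illustration, and the directory to scan is an assumption: the log shows `path=/var/snap/lxd/common/lxd`, and the database files are presumably under a `database` subdirectory of it.

```python
import hashlib
from pathlib import Path


def hash_directory(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest


def diff_manifests(before, after):
    """Report which files were added, removed, or changed between snapshots."""
    common = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }
```

Run `hash_directory` on the database directory while LXD is stopped, keep the result, and run it again after the reboot. A segment file showing up under `changed` or `removed` would suggest the storage layer served different data the second time.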

Pucky_wins · October 22, 2020, 7:01am

It’s possible that after the reboot the system came back up on the other drive in the RAID mirror, which rolled back the change.