Snap.lxd.daemon fails to reload

The LXD daemon became unresponsive while containers were still running, after a server uptime of 15+ days.

I'm unsure what troubleshooting steps to take next. LXD on this server hosts firewall/router/DHCP functions along with a number of other production services, all of which have been down since the server reboot.

The exact failure date/time is unknown because containers and services continued to function after the daemon/socket failure. I logged in to handle an administration task within a container and found:
root@r910n01:~# lxc list
LXD socket not found; is LXD installed and running?

OS: Ubuntu 16.04

  • lxc --version: 3.0.0.beta6

root@r910n01:~# snap list --all core
Name  Version    Rev   Developer  Notes
core  16-2.31    4017  canonical  core,disabled
core  16-2.31.1  4110  canonical  core,disabled
core  16-2.31.2  4206  canonical  core

Gist of troubleshooting to date

systemctl reload snap.lxd.daemon
systemctl status snap.lxd.daemon
journalctl -u snap.lxd.daemon
journalctl -xe
lxc list
cat /var/snap/lxd/common/lxd/logs/lxd.log
snap revert lxd
reboot

Logs and output here:

[EDIT: Additional info]
lxc_info_--debug ==
snap info lxd ==
.setup_mode file check ==
zpool check ==

This got resolved on IRC. For anyone with similar issues, the short version of the fix was:

  • systemctl stop snap.lxd.daemon
  • cp /var/snap/lxd/common/lxd/lxd.db.bak /var/snap/lxd/common/lxd/lxd.db.good
  • cp /var/snap/lxd/common/lxd/lxd.db.bak /var/snap/lxd/common/lxd/lxd.db
  • rm -rf /var/snap/lxd/common/lxd/raft
  • systemctl start snap.lxd.daemon

This effectively reverts the database to its last backup and then causes LXD to re-upgrade to the new database format (raft). This should only ever be needed if your system broke when upgrading to one of the first LXD betas.
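For anyone who would rather not type those commands one by one, the steps above could be wrapped in a small guarded function. This is only a sketch: the directory argument and the MANAGE_DAEMON variable are hypothetical additions (not part of LXD or the original fix) so that the function can be tried on a scratch directory before being run for real, and the check for lxd.db.bak is an extra safety guard.

```shell
#!/bin/sh
# Sketch of the recovery steps above as one guarded function.
# The directory argument and MANAGE_DAEMON are hypothetical knobs,
# added so the function can be tried on a scratch directory first.
restore_lxd_db() {
    lxd_dir="${1:-/var/snap/lxd/common/lxd}"

    # Refuse to touch anything unless the backup database is present.
    if [ ! -f "$lxd_dir/lxd.db.bak" ]; then
        echo "no backup found at $lxd_dir/lxd.db.bak" >&2
        return 1
    fi

    # Stop the daemon unless MANAGE_DAEMON=0 was set for a dry run.
    if [ "${MANAGE_DAEMON:-1}" = "1" ]; then
        systemctl stop snap.lxd.daemon || return 1
    fi

    # Same two copies as in the list above: a spare copy, then the restore.
    cp "$lxd_dir/lxd.db.bak" "$lxd_dir/lxd.db.good" || return 1
    cp "$lxd_dir/lxd.db.bak" "$lxd_dir/lxd.db" || return 1

    # With the raft directory gone, LXD redoes the database upgrade on start.
    rm -rf "$lxd_dir/raft"

    if [ "${MANAGE_DAEMON:-1}" = "1" ]; then
        systemctl start snap.lxd.daemon || return 1
    fi
}
```

Run as root, restore_lxd_db with no arguments acts on the snap path used above. Setting MANAGE_DAEMON=0 and passing a copy of the directory lets you check the file handling without touching the running system.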

Should those steps not restore all your containers for some reason, you should be able to lxd import whatever containers are missing.
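To work out which containers are missing, one option is to compare the container names found on disk with the names LXD currently knows about. The helper below is a rough sketch: missing_containers is a hypothetical name, and the storage pool path and the lxc list flags in the commented usage are assumptions about a typical snap install that you should check against your own setup.

```shell
#!/bin/sh
# Print names that are in the first list but not in the second.
# Both arguments are newline-separated lists of container names.
missing_containers() {
    on_disk_file=$(mktemp)
    known_file=$(mktemp)

    # comm needs sorted input; drop empty lines so an empty list stays empty.
    printf '%s\n' "$1" | sed '/^$/d' | sort -u > "$on_disk_file"
    printf '%s\n' "$2" | sed '/^$/d' | sort -u > "$known_file"

    # -23 keeps only the lines unique to the first file.
    comm -23 "$on_disk_file" "$known_file"

    rm -f "$on_disk_file" "$known_file"
}

# Possible usage (assumed paths and flags, POOL is a placeholder):
#   on_disk=$(ls /var/snap/lxd/common/lxd/storage-pools/POOL/containers)
#   known=$(lxc list --format csv -c n)
#   for name in $(missing_containers "$on_disk" "$known"); do
#       lxd import "$name"
#   done
```

Printing the list first and running lxd import by hand for each name is the safer route, since an import may need the container's storage to be mounted beforehand.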