I have broken LXD, but my zpool is OK. How do I reinstall LXD or get it working again?

My LXD broke after a reboot. My zpool is fine, but LXD doesn't get past "lxd waiting".
Can I reinstall LXD without wiping out the zpool?
And how do I make a new server on snap 3.13 talk to my others running the old-school apt install of 3.0.3?

OK, so this is an LXD deb install of 3.0.3 then?

Can you do:

  • systemctl stop lxd.socket lxd.service
  • lxd --debug --group lxd

This may tell us a bit more about what’s making it fail to start.

Yes, we can reset the database and then re-import the containers, but that’s a disaster recovery procedure, so if we can just have it come back to life, that’d be preferable.
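
For the record, that procedure (a rough sketch only, assuming the default deb paths under /var/lib/lxd and an intact storage pool; the database.broken name below is just an example) looks roughly like:

  • systemctl stop lxd.socket lxd.service
  • mv /var/lib/lxd/database /var/lib/lxd/database.broken
  • systemctl start lxd.socket lxd.service
  • lxd init to re-create an empty database, pointing storage at the existing zpool
  • lxd import <container-name> for each container still present on the pool

That throws away the node's configuration and cluster state, which is why it's the last resort.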

lxd --debug --group lxd
DBUG[05-13|18:59:05] Connecting to a local LXD over a Unix socket
DBUG[05-13|18:59:05] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[05-13|18:59:05] LXD 3.0.3 is starting in normal mode path=/var/lib/lxd
INFO[05-13|18:59:05] Kernel uid/gid map:
INFO[05-13|18:59:05] - u 0 0 4294967295
INFO[05-13|18:59:05] - g 0 0 4294967295
INFO[05-13|18:59:05] Configured LXD uid/gid map:
INFO[05-13|18:59:05] - u 0 100000 65536
INFO[05-13|18:59:05] - g 0 100000 65536
WARN[05-13|18:59:05] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[05-13|18:59:05] Kernel features:
INFO[05-13|18:59:05] - netnsid-based network retrieval: no
INFO[05-13|18:59:05] - unprivileged file capabilities: yes
INFO[05-13|18:59:05] Initializing local database
DBUG[05-13|18:59:05] Initializing database gateway
DBUG[05-13|18:59:05] Start database node id=2 address=64.71.77.32:8443
DBUG[05-13|18:59:05] Raft: Restored from snapshot 1432-5065700-1549596208911
panic: log not found

goroutine 1 [running]:
github.com/hashicorp/raft.NewRaft(0xc420204480, 0x112daa0, 0xc42035a260, 0x1135e60, 0xc42035a120, 0x1132ca0, 0xc42035a120, 0x112df60, 0xc42035a220, 0x1139220, …)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/api.go:491 +0x11ba
github.com/lxc/lxd/lxd/cluster.raftInstanceInit(0xc42032c800, 0xc42032d880, 0xc4203291f0, 0x4008000000000000, 0x1, 0x70f7aa, 0xc4202c84a0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/raft.go:191 +0x5c9
github.com/lxc/lxd/lxd/cluster.newRaft(0xc42032c800, 0xc4203291f0, 0x4008000000000000, 0x0, 0x0, 0xc42032d580)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/raft.go:71 +0x24d
github.com/lxc/lxd/lxd/cluster.(*Gateway).init(0xc420317380, 0xc42032d580, 0xc4200392c0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/gateway.go:448 +0x84
github.com/lxc/lxd/lxd/cluster.NewGateway(0xc42032c800, 0xc4203291f0, 0xc4202dd718, 0x2, 0x2, 0x0, 0x24, 0xf7efc0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/gateway.go:57 +0x1f2
main.(*Daemon).init(0xc4201e1c20, 0xc4202dd8d0, 0x40e936)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/daemon.go:502 +0x645
main.(*Daemon).Init(0xc4201e1c20, 0xc4201e1c20, 0xc420038c60)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/daemon.go:390 +0x2f
main.(*cmdDaemon).Run(0xc4202fc040, 0xc4202fa280, 0xc42026df80, 0x0, 0x3, 0x0, 0x0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/main_daemon.go:82 +0x37a
main.(*cmdDaemon).Run-fm(0xc4202fa280, 0xc42026df80, 0x0, 0x3, 0x0, 0x0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/main_daemon.go:42 +0x52
github.com/spf13/cobra.(*Command).execute(0xc4202fa280, 0xc4200ee050, 0x3, 0x3, 0xc4202fa280, 0xc4200ee050)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/spf13/cobra/command.go:762 +0x468
github.com/spf13/cobra.(*Command).ExecuteC(0xc4202fa280, 0x0, 0xc420302780, 0xc420302780)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/spf13/cobra/command.go:852 +0x30a
github.com/spf13/cobra.(*Command).Execute(0xc4202fa280, 0xc4202dde08, 0x1)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/main.go:160 +0xe26
root@MOE:/home/ic2000# lxc list
Error: Get http://unix.socket/1.0: dial unix /var/lib/lxd/unix.socket: connect: connection refused

OK, your database does look a bit sad. Can you show the output of: find /var/lib/lxd/database ?
This is a clustered environment; how many other nodes do you have and what's their state?

/var/lib/lxd/database
/var/lib/lxd/database/local.db
/var/lib/lxd/database/global
/var/lib/lxd/database/global/db.bin
/var/lib/lxd/database/global/db.bin-wal
/var/lib/lxd/database/global/logs.db
/var/lib/lxd/database/global/snapshots
/var/lib/lxd/database/global/snapshots/1432-5064633-1549591932659
/var/lib/lxd/database/global/snapshots/1432-5064633-1549591932659/state.bin
/var/lib/lxd/database/global/snapshots/1432-5064633-1549591932659/meta.json
/var/lib/lxd/database/global/snapshots/1432-5065700-1549596208911
/var/lib/lxd/database/global/snapshots/1432-5065700-1549596208911/state.bin
/var/lib/lxd/database/global/snapshots/1432-5065700-1549596208911/meta.json
/var/lib/lxd/database/global/db.bin-shm
/var/lib/lxd/database/local.d

lxc cluster list
+----------+--------------------------+----------+---------+----------------------------------------+
|   NAME   |           URL            | DATABASE |  STATE  |                MESSAGE                 |
+----------+--------------------------+----------+---------+----------------------------------------+
| chemp    | https://64.71.77.18:8443 | NO       | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| curlyjoe | https://64.71.77.29:8443 | YES      | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| joe      | https://64.71.77.13:8443 | NO       | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| larry    | https://64.71.77.80:8443 | YES      | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| moe      | https://64.71.77.32:8443 | YES      | OFFLINE | no heartbeat since 74h44m21.555716969s |
+----------+--------------------------+----------+---------+----------------------------------------+

So what can be done? It's been down for close to 90 hours. I have to pick a solution or erase everything and reload.

Some of this might help you.

systemctl status lxd
● lxd.service - LXD - main daemon
Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2019-05-13 19:05:42 EDT; 15h ago
Docs: man:lxd(1)
Main PID: 5470 (code=exited, status=2)

May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/sp
May 13 19:04:17 MOE lxd[5470]: github.com/spf13/cobra.(*Command).ExecuteC(0xc4202c2c80, 0x0, 0xc4202d7180
May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/sp
May 13 19:04:17 MOE lxd[5470]: github.com/spf13/cobra.(*Command).Execute(0xc4202c2c80, 0xc4202bde08, 0x1)
May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/sp
May 13 19:04:17 MOE lxd[5470]: main.main()
May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lx
May 13 19:04:17 MOE systemd[1]: lxd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 13 19:05:42 MOE systemd[1]: lxd.service: Failed with result 'exit-code'.
May 13 19:05:42 MOE systemd[1]: Stopped LXD - main daemon.

ls -alh /var/lib/lxd/
total 2.8M
drwxr-xr-x 14 root root 4.0K May 13 19:04 .
drwxr-xr-x 38 root root 4.0K May 11 13:18 ..
-rw-r--r--  1 root root 1.9K Jun 19  2018 cluster.crt
-rw-------  1 root root 3.2K Jun 19  2018 cluster.key
drwx--x--x  2 root root 4.0K May 10 18:42 containers
drwx------  3 root root 4.0K May 10 18:02 database
drwx------  3 root root 4.0K Feb  7 23:55 database.after-upgrade
drwx--x--x 28 root root 4.0K May  9 23:16 devices
drwxr-xr-x  2 root root 4.0K Jun 19  2018 devlxd
drwx------  2 root root 4.0K Jun 19  2018 disks
drwx------  2 root root 4.0K Nov  6  2018 images
-rw-r--r--  1 root root 2.7M Feb  8 10:46 moe.database.tar.gz
drwx--x--x  3 root root 4.0K Jun 19  2018 networks
drwx------  4 root root 4.0K Jun 19  2018 security
-rw-r--r--  1 root root 1.9K Jun 19  2018 server.crt
-rw-------  1 root root 3.2K Jun 19  2018 server.key
drwx--x--x  2 root root 4.0K Jun 19  2018 shmounts
drwx------  2 root root 4.0K May 10 12:36 snapshots
drwx--x--x  3 root root 4.0K Jun 19  2018 storage-pools
srw-rw----  1 root lxd     0 May 13 19:04 unix.socket

OK, with your cluster functional (2 DB nodes active), it should be possible to re-connect that broken node without its database and then promote it back to a DB node again.
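
Roughly speaking (this is a sketch of one way to read that, not a confirmed procedure; the member name moe comes from your cluster list above, and the --force flag is assumed necessary because the dead member can't be cleanly removed while offline), that would mean:

  • on a healthy node: lxc cluster remove --force moe
  • on moe: systemctl stop lxd.socket lxd.service, move /var/lib/lxd/database out of the way, then lxd init and join the cluster again as a fresh member
  • lxd import <container-name> for each container still on moe's zpool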

@freeekanayaka might have pointers on how to do that.

Though my understanding is that you have since completely wiped that 5th node, so that's probably not an option anymore.

Yeah, this was taking too long. When you have 150 people screaming at you for days, you have to do something. In the meantime I am moving the simple websites over manually. I still have a complete backup of everything, the whole /var and the zpool too.