I have broken LXD, but my zpool is OK. How do I reinstall LXD or get it working again?

My LXD broke after a reboot. My zpool is fine, but LXD doesn't get past "lxd waiting".
Can I reinstall LXD without wiping out the zpool?
And how do I make a new server on snap 3.13 talk to my others running the old-school apt install of 3.0.3?

OK, so this is an LXD deb install of 3.0.3 then?

Can you do:

  • systemctl stop lxd.socket lxd.service
  • lxd --debug --group lxd

This may tell us a bit more about what’s making it fail to start.

Yes, we can reset the database and then re-import the containers, but that’s a disaster recovery procedure, so if we can just have it come back to life, that’d be preferable.
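
For the record, that procedure (a rough sketch only, assuming the default deb paths under /var/lib/lxd and an intact storage pool; the database.broken name below is just an example) looks roughly like:

  • systemctl stop lxd.socket lxd.service
  • mv /var/lib/lxd/database /var/lib/lxd/database.broken
  • systemctl start lxd.socket lxd.service
  • lxd init to re-create an empty database, pointing storage at the existing zpool
  • lxd import <container-name> for each container still present on the pool

That throws away the node's configuration and cluster state, which is why it's the last resort.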

lxd --debug --group lxd
DBUG[05-13|18:59:05] Connecting to a local LXD over a Unix socket
DBUG[05-13|18:59:05] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[05-13|18:59:05] LXD 3.0.3 is starting in normal mode path=/var/lib/lxd
INFO[05-13|18:59:05] Kernel uid/gid map:
INFO[05-13|18:59:05] - u 0 0 4294967295
INFO[05-13|18:59:05] - g 0 0 4294967295
INFO[05-13|18:59:05] Configured LXD uid/gid map:
INFO[05-13|18:59:05] - u 0 100000 65536
INFO[05-13|18:59:05] - g 0 100000 65536
WARN[05-13|18:59:05] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[05-13|18:59:05] Kernel features:
INFO[05-13|18:59:05] - netnsid-based network retrieval: no
INFO[05-13|18:59:05] - unprivileged file capabilities: yes
INFO[05-13|18:59:05] Initializing local database
DBUG[05-13|18:59:05] Initializing database gateway
DBUG[05-13|18:59:05] Start database node id=2 address=64.71.77.32:8443
DBUG[05-13|18:59:05] Raft: Restored from snapshot 1432-5065700-1549596208911
panic: log not found

goroutine 1 [running]:
github.com/hashicorp/raft.NewRaft(0xc420204480, 0x112daa0, 0xc42035a260, 0x1135e60, 0xc42035a120, 0x1132ca0, 0xc42035a120, 0x112df60, 0xc42035a220, 0x1139220, …)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/hashicorp/raft/api.go:491 +0x11ba
github.com/lxc/lxd/lxd/cluster.raftInstanceInit(0xc42032c800, 0xc42032d880, 0xc4203291f0, 0x4008000000000000, 0x1, 0x70f7aa, 0xc4202c84a0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/raft.go:191 +0x5c9
github.com/lxc/lxd/lxd/cluster.newRaft(0xc42032c800, 0xc4203291f0, 0x4008000000000000, 0x0, 0x0, 0xc42032d580)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/raft.go:71 +0x24d
github.com/lxc/lxd/lxd/cluster.(*Gateway).init(0xc420317380, 0xc42032d580, 0xc4200392c0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/gateway.go:448 +0x84
github.com/lxc/lxd/lxd/cluster.NewGateway(0xc42032c800, 0xc4203291f0, 0xc4202dd718, 0x2, 0x2, 0x0, 0x24, 0xf7efc0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/cluster/gateway.go:57 +0x1f2
main.(*Daemon).init(0xc4201e1c20, 0xc4202dd8d0, 0x40e936)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/daemon.go:502 +0x645
main.(*Daemon).Init(0xc4201e1c20, 0xc4201e1c20, 0xc420038c60)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/daemon.go:390 +0x2f
main.(*cmdDaemon).Run(0xc4202fc040, 0xc4202fa280, 0xc42026df80, 0x0, 0x3, 0x0, 0x0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/main_daemon.go:82 +0x37a
main.(*cmdDaemon).Run-fm(0xc4202fa280, 0xc42026df80, 0x0, 0x3, 0x0, 0x0)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/main_daemon.go:42 +0x52
github.com/spf13/cobra.(*Command).execute(0xc4202fa280, 0xc4200ee050, 0x3, 0x3, 0xc4202fa280, 0xc4200ee050)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/spf13/cobra/command.go:762 +0x468
github.com/spf13/cobra.(*Command).ExecuteC(0xc4202fa280, 0x0, 0xc420302780, 0xc420302780)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/spf13/cobra/command.go:852 +0x30a
github.com/spf13/cobra.(*Command).Execute(0xc4202fa280, 0xc4202dde08, 0x1)
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
/build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lxc/lxd/lxd/main.go:160 +0xe26
root@MOE:/home/ic2000# lxc list
Error: Get http://unix.socket/1.0: dial unix /var/lib/lxd/unix.socket: connect: connection refused

OK, your database does look a bit sad. Can you show the output of: find /var/lib/lxd/database ?
This is a clustered environment; how many other nodes do you have and what's their state?

/var/lib/lxd/database
/var/lib/lxd/database/local.db
/var/lib/lxd/database/global
/var/lib/lxd/database/global/db.bin
/var/lib/lxd/database/global/db.bin-wal
/var/lib/lxd/database/global/logs.db
/var/lib/lxd/database/global/snapshots
/var/lib/lxd/database/global/snapshots/1432-5064633-1549591932659
/var/lib/lxd/database/global/snapshots/1432-5064633-1549591932659/state.bin
/var/lib/lxd/database/global/snapshots/1432-5064633-1549591932659/meta.json
/var/lib/lxd/database/global/snapshots/1432-5065700-1549596208911
/var/lib/lxd/database/global/snapshots/1432-5065700-1549596208911/state.bin
/var/lib/lxd/database/global/snapshots/1432-5065700-1549596208911/meta.json
/var/lib/lxd/database/global/db.bin-shm
/var/lib/lxd/database/local.d

lxc cluster list
+----------+--------------------------+----------+---------+----------------------------------------+
|   NAME   |           URL            | DATABASE |  STATE  |                MESSAGE                 |
+----------+--------------------------+----------+---------+----------------------------------------+
| chemp    | https://64.71.77.18:8443 | NO       | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| curlyjoe | https://64.71.77.29:8443 | YES      | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| joe      | https://64.71.77.13:8443 | NO       | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| larry    | https://64.71.77.80:8443 | YES      | ONLINE  | fully operational                      |
+----------+--------------------------+----------+---------+----------------------------------------+
| moe      | https://64.71.77.32:8443 | YES      | OFFLINE | no heartbeat since 74h44m21.555716969s |
+----------+--------------------------+----------+---------+----------------------------------------+

So what can be done? It's been down for close to 90 hours. I have to pick a solution or erase everything and reload.

Some of this might help you.

systemctl status lxd
● lxd.service - LXD - main daemon
Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2019-05-13 19:05:42 EDT; 15h ago
Docs: man:lxd(1)
Main PID: 5470 (code=exited, status=2)

May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/sp
May 13 19:04:17 MOE lxd[5470]: github.com/spf13/cobra.(*Command).ExecuteC(0xc4202c2c80, 0x0, 0xc4202d7180
May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/sp
May 13 19:04:17 MOE lxd[5470]: github.com/spf13/cobra.(*Command).Execute(0xc4202c2c80, 0xc4202bde08, 0x1)
May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/sp
May 13 19:04:17 MOE lxd[5470]: main.main()
May 13 19:04:17 MOE lxd[5470]: /build/lxd-j7VLB_/lxd-3.0.3/obj-x86_64-linux-gnu/src/github.com/lx
May 13 19:04:17 MOE systemd[1]: lxd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 13 19:05:42 MOE systemd[1]: lxd.service: Failed with result 'exit-code'.
May 13 19:05:42 MOE systemd[1]: Stopped LXD - main daemon.

ls -alh /var/lib/lxd/
total 2.8M
drwxr-xr-x 14 root root 4.0K May 13 19:04 .
drwxr-xr-x 38 root root 4.0K May 11 13:18 ..
-rw-r--r--  1 root root 1.9K Jun 19  2018 cluster.crt
-rw-------  1 root root 3.2K Jun 19  2018 cluster.key
drwx--x--x  2 root root 4.0K May 10 18:42 containers
drwx------  3 root root 4.0K May 10 18:02 database
drwx------  3 root root 4.0K Feb  7 23:55 database.after-upgrade
drwx--x--x 28 root root 4.0K May  9 23:16 devices
drwxr-xr-x  2 root root 4.0K Jun 19  2018 devlxd
drwx------  2 root root 4.0K Jun 19  2018 disks
drwx------  2 root root 4.0K Nov  6  2018 images
-rw-r--r--  1 root root 2.7M Feb  8 10:46 moe.database.tar.gz
drwx--x--x  3 root root 4.0K Jun 19  2018 networks
drwx------  4 root root 4.0K Jun 19  2018 security
-rw-r--r--  1 root root 1.9K Jun 19  2018 server.crt
-rw-------  1 root root 3.2K Jun 19  2018 server.key
drwx--x--x  2 root root 4.0K Jun 19  2018 shmounts
drwx------  2 root root 4.0K May 10 12:36 snapshots
drwx--x--x  3 root root 4.0K Jun 19  2018 storage-pools
srw-rw----  1 root lxd     0 May 13 19:04 unix.socket

OK, with your cluster functional (2 DB nodes active), it should be possible to re-connect that broken node without its database and then promote it back to a DB node again.
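
Roughly speaking (this is a sketch of one way to read that, not a confirmed procedure; the member name moe comes from your cluster list above, and the --force flag is assumed necessary because the dead member can't be cleanly removed while offline), that would mean:

  • on a healthy node: lxc cluster remove --force moe
  • on moe: systemctl stop lxd.socket lxd.service, move /var/lib/lxd/database out of the way, then lxd init and join the cluster again as a fresh member
  • lxd import <container-name> for each container still on moe's zpool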

@freeekanayaka might have pointers on how to do that.

Though my understanding is that you have since completely wiped that 5th node, so that's probably not an option anymore.

Yeah, this was taking too long. When you have 150 people screaming at you for days, you have to do something. In the meantime I am moving the simple websites over manually. I still have a complete backup of everything, the whole /var and the zpool too.