How do I detach the cluster node of a dead, non-recoverable server from my main LXD/LXC 3.0.3 .deb server, when I can't use the lxc command-line tools to do it?


Hi, I was hoping somebody could help me, as I'm at a loss. I've not been able to use LXC on my main server, I think because the second server is dead: 3 drives failed on its RAID 6 array and it's non-recoverable. My main server now fails to link to the cluster, as there is no node to connect to. All the ZFS container data is on the main server, which is still live and accessible through the ZFS mount tools, but I can't get LXC to run the containers, or for that matter even run any command, due to the connection issue (this is only my assumption). Since I don't have access to the command-line tools to disconnect the cluster, I've removed LXD and re-installed the .deb again with apt.

#Here are the outputs I am getting

sudo lxc list
Error: Get http://unix.socket/1.0: EOF

sudo lxc profile show default
Error: Get http://unix.socket/1.0: EOF

sudo lxd cluster list-database --debug
DBUG[05-18|13:17:13] Connecting to a local LXD over a Unix socket
DBUG[05-18|13:17:13] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=

journalctl -u lxd -n 300
May 18 14:34:56 zenicmain1 lxd[28758]: t=2020-05-18T14:34:56+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 14:34:57 zenicmain1 lxd[28758]: t=2020-05-18T14:34:57+0100 lvl=warn msg="Failed connecting to global database (attempt 10): failed to create dqlite

sudo systemctl status lxd.socket
● lxd.socket - LXD - unix socket
Loaded: loaded (/lib/systemd/system/lxd.socket; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-05-18 12:19:37 BST; 7min ago
Docs: man:lxd(1)
Listen: /var/lib/lxd/unix.socket (Stream)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/lxd.socket

sudo systemctl status lxd
● lxd.service - LXD - main daemon
Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
Active: activating (start-post) since Mon 2020-05-18 15:02:39 BST; 6min ago
Docs: man:lxd(1)
Process: 29067 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
Main PID: 29083 (lxd); Control PID: 29084 (lxd)
Tasks: 40
CGroup: /system.slice/lxd.service
├─29083 /usr/lib/lxd/lxd --group lxd --logfile=/var/log/lxd/lxd.log
└─29084 /usr/lib/lxd/lxd waitready --timeout=600

May 18 15:08:33 zenicmain1 lxd[29083]: t=2020-05-18T15:08:33+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:08:38 zenicmain1 lxd[29083]: t=2020-05-18T15:08:38+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:08:44 zenicmain1 lxd[29083]: t=2020-05-18T15:08:44+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:08:50 zenicmain1 lxd[29083]: t=2020-05-18T15:08:50+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:08:54 zenicmain1 lxd[29083]: t=2020-05-18T15:08:54+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:08:59 zenicmain1 lxd[29083]: t=2020-05-18T15:08:59+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:09:02 zenicmain1 lxd[29083]: t=2020-05-18T15:09:02+0100 lvl=warn msg="Failed connecting to global database (attempt 30): failed to create dqlite
May 18 15:09:02 zenicmain1 lxd[29083]: t=2020-05-18T15:09:02+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:09:06 zenicmain1 lxd[29083]: t=2020-05-18T15:09:06+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"
May 18 15:09:10 zenicmain1 lxd[29083]: t=2020-05-18T15:09:10+0100 lvl=warn msg="Raft: Election timeout reached, restarting election"

sudo systemctl status lxc
● lxc.service - LXC Container Initialization and Autoboot Code
Loaded: loaded (/lib/systemd/system/lxc.service; enabled; vendor preset: enabled)
Active: active (exited) since Sat 2020-05-16 23:51:54 BST; 1 day 15h ago
Docs: man:lxc-autostart
man:lxc
Process: 2736 ExecStart=/usr/lib/x86_64-linux-gnu/lxc/lxc-containers start (code=exited, status=0/SUCCESS)
Process: 2719 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
Main PID: 2736 (code=exited, status=0/SUCCESS)

May 16 23:51:53 zenicmain1 systemd[1]: Starting LXC Container Initialization and Autoboot Code…
May 16 23:51:54 zenicmain1 systemd[1]: Started LXC Container Initialization and Autoboot Code.

ps aux | grep -i lxd
zenic 27328 0.0 0.0 14428 1104 pts/0 S+ 12:22 0:00 grep --color=auto -i lxd

#Here are the things I've tried

ps aux | grep -i lxc
root 1974 0.0 0.0 235764 2148 ? Ssl May16 0:01 /usr/bin/lxcfs /var/lib/lxcfs/
lxc-dns+ 2714 0.0 0.0 52880 376 ? S May16 0:00 dnsmasq -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --listen-address 10.0.2.1 --dhcp-range 10.0.2.2,10.0.2.254 --dhcp-lease-max=253 --dhcp-no-override --except-interface=lo --interface=lxcbr0 --dhcp-leasefile=/var/lib/misc/dnsmasq.lxcbr0.leases --dhcp-authoritative
root 28097 0.0 0.0 65612 4344 pts/0 T 13:24 0:00 sudo lxc config set core.https_address [::]:8443
root 28098 0.0 0.0 396516 12848 pts/0 Tl 13:24 0:00 lxc config set core.https_address [::]:8443
root 28121 0.0 0.0 65612 4216 pts/0 T 13:25 0:00 sudo lxc config show
root 28122 0.0 0.0 394852 12268 pts/0 Tl 13:25 0:00 lxc config show
zenic 29347 0.0 0.0 14428 1136 pts/0 S+ 15:27 0:00 grep --color=auto -i lxc

sudo systemctl reload lxc
Failed to reload lxc.service: Job type reload is not applicable for unit lxc.service.
See system logs and 'systemctl status lxc.service' for details.

systemctl status lxc.service
● lxc.service - LXC Container Initialization and Autoboot Code
Loaded: loaded (/lib/systemd/system/lxc.service; enabled; vendor preset: enabled)
Active: active (exited) since Sat 2020-05-16 23:51:54 BST; 1 day 16h ago
Docs: man:lxc-autostart
man:lxc
Process: 2736 ExecStart=/usr/lib/x86_64-linux-gnu/lxc/lxc-containers start (code=exited, status=0/SUCCESS)
Process: 2719 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
Main PID: 2736 (code=exited, status=0/SUCCESS)

May 16 23:51:53 zenicmain1 systemd[1]: Starting LXC Container Initialization and Autoboot Code…
May 16 23:51:54 zenicmain1 systemd[1]: Started LXC Container Initialization and Autoboot Code.

sudo systemctl status lxd-containers.service
● lxd-containers.service - LXD - container startup/shutdown
Loaded: loaded (/lib/systemd/system/lxd-containers.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Sun 2020-05-17 00:01:46 BST; 1 day 17h ago
Docs: man:lxd(1)
Process: 1919 ExecStart=/usr/bin/lxd activateifneeded (code=killed, signal=TERM)
Main PID: 1919 (code=killed, signal=TERM)

May 16 23:51:46 zenicmain1 systemd[1]: Starting LXD - container startup/shutdown…
May 17 00:01:46 zenicmain1 systemd[1]: lxd-containers.service: Start operation timed out. Terminating.
May 17 00:01:46 zenicmain1 systemd[1]: lxd-containers.service: Main process exited, code=killed, status=15/TERM
May 17 00:01:46 zenicmain1 systemd[1]: lxd-containers.service: Failed with result 'timeout'.
May 17 00:01:46 zenicmain1 systemd[1]: Failed to start LXD - container startup/shutdown.

systemctl stop lxc.service
systemctl stop lxd.service
systemctl stop lxd.socket

LXD hangs on restart:
systemctl start lxd.service

The LXD socket did not need to be started separately, as systemctl start lxd.service started it, even though the service itself hangs:
systemctl status lxd.socket
● lxd.socket - LXD - unix socket
Loaded: loaded (/lib/systemd/system/lxd.socket; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-05-18 16:57:38 BST; 5min ago
Docs: man:lxd(1)
Listen: /var/lib/lxd/unix.socket (Stream)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/lxd.socket

May 18 16:57:38 zenicmain1 systemd[1]: Starting LXD - unix socket.
May 18 16:57:38 zenicmain1 systemd[1]: Listening on LXD - unix socket.

LXC starts fine:
systemctl status lxc.service
● lxc.service - LXC Container Initialization and Autoboot Code
Loaded: loaded (/lib/systemd/system/lxc.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2020-05-18 17:01:23 BST; 19s ago
Docs: man:lxc-autostart
man:lxc
Process: 30223 ExecStop=/usr/lib/x86_64-linux-gnu/lxc/lxc-containers stop (code=exited, status=1/FAILURE)
Process: 30436 ExecStart=/usr/lib/x86_64-linux-gnu/lxc/lxc-containers start (code=exited, status=0/SUCCESS)
Process: 30419 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
Main PID: 30436 (code=exited, status=0/SUCCESS)

May 18 17:01:23 zenicmain1 systemd[1]: Starting LXC Container Initialization and Autoboot Code…
May 18 17:01:23 zenicmain1 systemd[1]: Started LXC Container Initialization and Autoboot Code.

sudo lxc config set core.https_address [::]:8443 --debug
DBUG[05-18|17:19:21] Connecting to a local LXD over a Unix socket
DBUG[05-18|17:19:21] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
Error: Get http://unix.socket/1.0: EOF

It just hangs until I get the error message.

Still can’t get LXC to work :cry:

#Here are the things I am looking at next, but I could really do with some advice/help

https://github.com/lxc/lxd/issues/5802
https://lxd.readthedocs.io/en/latest/database/

I've also restarted all the services for LXD and LXC, and I've tried killing the LXD processes, but they don't seem to stop even when LXD is stopped. I think I'm looking at editing a .yaml file or the SQLite database to remove the cluster connection, as I can't do it from the LXC command line. Any help would be greatly appreciated. Many thanks, Ben
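For reference, the database document linked above describes a patch.global.sql / patch.local.sql mechanism: raw SQL placed in LXD's database directory is applied on the next daemon start. A sketch of what removing the dead node might look like with that mechanism (the node name 'zenicnode2' is a hypothetical placeholder, and with quorum already lost the daemon may never get far enough to apply it, so this is exploratory only):

```shell
# Sketch only: LXD applies this file on its next startup, per the
# database docs. The node name below is a hypothetical placeholder.
DBDIR=/var/lib/lxd/database
sudo tee "$DBDIR/patch.global.sql" <<'EOF'
DELETE FROM nodes WHERE name = 'zenicnode2';
EOF
sudo systemctl restart lxd
```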

@freeekanayaka

I assume your cluster was of 2 nodes, is that correct?

With the LXD 4.0.x series we now prevent the internal dqlite database from being clustered when you have only 2 nodes; it will be clustered starting at 3. That's because with 2 nodes, if you lose one then you can't get the other working, since quorum is lost. With LXD 4.0 we also have recovery tooling to fix situations like the one you're reporting.

However, the LXD 3.0.x series is quite old in that regard, and its clustering implementation is much more limited than it is today.

If it's a viable option for you, I'd suggest stopping LXD, moving your database directory away (e.g. mv /var/lib/lxd/database/ /some/back-up/dir/), then restarting LXD (which will start empty) and importing your existing containers with lxd import.


If you build a new cluster, please consider having at least 3 nodes, see also https://lxd.readthedocs.io/en/latest/clustering/#clustering.

It is strongly recommended that the number of nodes in the cluster be at least three, so the cluster can survive the loss of at least one node and still be able to establish quorum for its distributed state (which is kept in a SQLite database replicated using the Raft algorithm). If the number of nodes is less than three, then only one node in the cluster will store the SQLite database. When the third node joins the cluster, both the second and third nodes will receive a replica of the database.


@freeekanayaka Thanks so much for your pointer and recommendation. I am doing a full system update to Ubuntu Server 20.04 and an update to LXD 4.0.x, and will have 2x Pi cluster nodes to maintain cluster stability. Thanks again, you saved the day.

#Steps taken to recover:

sudo su

root@hostname:/home/zenic# mkdir /backups

root@hostname:/# mv /var/lib/lxd/database/ /backups/

sudo apt remove lxd

sudo apt install lxd

sudo zfs mount zenlxd/containers/wgnextcloud

sudo lxd import wgnextcloud
This errors out because the profile is not available.

#So to build the profile I did
sudo lxd init

#Be careful not to write over the ZFS storage you're trying to recover; make a new setup and change the .yaml to sync with the old layout later. I might be wrong here, but I was taking the cautious path.

Set it up with new names for the storage (newname), the storage location (newname), and all the networking set up how you want it, etc.
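As an aside, lxd init can also take these answers non-interactively via a preseed file. A minimal sketch of the equivalent setup; all names and values below are assumptions to adapt, and "newpool" is a placeholder for the fresh pool, not the zenlxd pool being recovered:

```shell
# Sketch: non-interactive equivalent of the interactive lxd init answers.
# "newpool" is a placeholder pool name, not from this post.
cat > preseed.yaml <<'EOF'
storage_pools:
- name: newpool
  driver: zfs
profiles:
- name: default
  devices:
    root:
      path: /
      pool: newpool
      type: disk
EOF
sudo lxd init --preseed < preseed.yaml
```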

sudo su

#Output the file for editing. This next part can be done more easily, but it's what I did at the time.

sudo lxc profile show default > lxd-profile-default.yaml

nano lxd-profile-default.yaml

Save as:

lxd-profile-clusterlxd.yaml

sudo lxc profile edit clusterlxd < lxd-profile-clusterlxd.yaml

Add the storage name needed and the location of the storage into:
sudo lxc profile edit clusterlxd

Set it up with the clusterlxd names for the storage (zenlxd), the storage location (…), and all the networking set up how you want it, etc.
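The export/edit/import round-trip above can be shortened once the CLI is answering again. A sketch using lxc profile copy (the profile names come from this thread):

```shell
# Sketch: copy the default profile, then adjust only the storage device
# on the copy, leaving default untouched.
lxc profile copy default clusterlxd
lxc profile edit clusterlxd   # point the root device at the zenlxd pool
```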

Exit root

sudo lxc profile edit clusterlxd

sudo lxd import wgnextcloud

sudo lxc start wgnextcloud
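With one container back, any remaining containers can be recovered the same way. A sketch of the per-container loop; only wgnextcloud is named in this thread, so the list is a placeholder to extend:

```shell
# Sketch: mount, import, and start each container dataset on the pool.
# Extend the placeholder list with your other container names.
for c in wgnextcloud; do
  sudo zfs mount "zenlxd/containers/$c" || true   # may already be mounted
  sudo lxd import "$c"
  sudo lxc start "$c"
done
```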

Screenshot from 2020-05-20 00-26-00
