@stgraber Do you know how LXD should have behaved in this case? Is the node still considered clustered but it’s just the only node in the cluster? Or is it non-clustered after removal of the second node in the cluster?
LXD doesn’t support turning clustering off; the closest you can get is to operate it as a one-node cluster. This does require the remaining node to have a cluster.https_address which is valid and that it can connect to.
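For example (a sketch, assuming the node’s address is the 192.168.1.10:8443 mentioned elsewhere in this thread), on the remaining node that would be:
% lxc config set cluster.https_address 192.168.1.10:8443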
I think that replacing the address column of the raft_nodes entry with 192.168.1.10:8443 could solve this issue, based on the startup logic in https://github.com/lxc/lxd/blob/8e6a5ea574ab1c89a6886478f3f94e7438406c5f/lxd/cluster/info.go#L18 that is called by https://github.com/lxc/lxd/blob/8e6a5ea574ab1c89a6886478f3f94e7438406c5f/lxd/cluster/gateway.go#L789.
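As a first check (a sketch; I believe lxd sql local runs queries against the node-level database, which is where raft_nodes lives), the current contents can be inspected with:
% sudo lxd sql local "SELECT * FROM raft_nodes"
That should show the id and address currently recorded for the node.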
The question remains why its address is suddenly 1 in the raft_nodes table. If the configuration in raft is also id:1 address:1, then the issue will recur and we probably need a cluster edit step.
@tomp What do you think?
Yeah, sounds like lxd cluster edit (after shutting down LXD and taking a backup of the database) is the way forward here. As for the special “1” value, I don’t know why that is used or how it got in there. I suspect some problem caused by the cluster going back to a single member (something that, AFAIK, isn’t really supported ATM).
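One way to take that backup (a sketch, using the snap database path shown later in this thread) is to copy the whole database directory while LXD is stopped:
% sudo cp -a /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.bak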
I’m not so sure how I should edit it with lxd cluster edit. Here’s the full content:
# Latest dqlite segment ID: 5971997
members:
- id: 1
  address: "1"
  role: voter
Why is the address 1? According to the Clustering - LXD documentation, it looks like it should be 192.168.1.10:8443.
Try replacing the "1" with "192.168.1.10:8443" and see if that helps somehow.
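In other words, after the edit the members section should read something like:
members:
- id: 1
  address: "192.168.1.10:8443"
  role: voter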
I feel I’m missing something with lxd cluster edit. The editor is nano. I edit the file, save it with Ctrl+O, then exit with Ctrl+X, but if I edit again I can see that my modifications were discarded; I still have the "1".
% sudo lxd cluster edit -d
DBUG[12-15|21:56:18] Connecting to a local LXD over a Unix socket
DBUG[12-15|21:56:18] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
I assume I need snap.lxd.daemon.unix.socket started and snap.lxd.daemon.service stopped. However, starting the former starts the latter, and I cannot stop the latter; it just hangs forever.
I’d do:
- systemctl stop snap.lxd.daemon &
- kill -9 $(cat /var/snap/lxd/common/lxd.pid)
- systemctl stop snap.lxd.daemon.unix.socket
Then make sure it’s all stopped with “systemctl -a | grep snap.lxd”
If it is, then lxd cluster edit should behave. The reason you see lxd cluster edit trying to talk to LXD is that it’s checking whether LXD is running, so you actually want that connection to fail.
% systemctl -a | grep snap.lxd
snap-lxd-21902.mount loaded active mounted Mount unit for lxd, revision 21902
snap-lxd-22114.mount loaded active mounted Mount unit for lxd, revision 22114
● snap.lxd.activate.service loaded failed failed Service for snap application lxd.activate
● snap.lxd.daemon.service loaded failed failed Service for snap application lxd.daemon
snap.lxd.daemon.unix.socket loaded inactive dead Socket unix for snap application lxd.daemon
My edits still seem to be ignored.
@mbordere any idea why editing the address isn’t working?
@Clem Can you show the output of ls -al /var/snap/lxd/common/lxd/database/global/?
You will normally see, among other things, a list of files named like
0000000000000001-0000000000000001
0000000000000002-0000000000000007
0000000000000008-0000000000000008
The cluster edit command should have created such a file where the part before the - equals the part after the dash; in this case that would be 0000000000000008-0000000000000008. It will be the newest such file in the directory.
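If in doubt which file is newest (a sketch):
% sudo ls -lt /var/snap/lxd/common/lxd/database/global/ | head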
Can you also perform a hexdump -C on that file (if one is there) and paste the output here?
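For example, using the hypothetical 0000000000000008-0000000000000008 name from above (yours will differ):
% sudo hexdump -C /var/snap/lxd/common/lxd/database/global/0000000000000008-0000000000000008 | head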
I cannot see a file with a matching name before and after the dash.
% sudo ls -al /var/snap/lxd/common/lxd/database/global/
total 41288
drwxr-x--- 1 root root 604 Dec 10 21:21 .
drwx------ 1 root root 70 Dec 9 10:11 ..
-rw------- 1 root root 8386448 Nov 26 10:49 0000000005962303-0000000005964268
-rw------- 1 root root 8385800 Nov 26 16:10 0000000005964269-0000000005966225
-rw------- 1 root root 8385584 Nov 26 21:29 0000000005966226-0000000005968179
-rw------- 1 root root 8386448 Nov 27 02:52 0000000005968180-0000000005970145
-rw------- 1 root root 7947320 Dec 7 13:49 0000000005970146-0000000005971997
-rw------- 1 root root 577536 Dec 10 21:21 db.bin
-rw------- 1 root root 32 Jan 26 2020 metadata1
-rw------- 1 root root 86309 Nov 27 03:14 snapshot-1-5970288-116438914
-rw------- 1 root root 56 Nov 27 03:14 snapshot-1-5970288-116438914.meta
-rw------- 1 root root 90874 Nov 27 06:02 snapshot-1-5971312-126502986
-rw------- 1 root root 56 Nov 27 06:02 snapshot-1-5971312-126502986.meta
Are we missing error output from the command?
cluster edit will call https://github.com/lxc/lxd/blob/a0d0d4e965e865b9182861661417e338f0719f2f/lxd/cluster/recover.go#L133, which in turn calls https://github.com/lxc/lxd/blob/a0d0d4e965e865b9182861661417e338f0719f2f/lxd/node/raft.go#L30, and based on the input Clem has given us, DetermineRaftNode will return nil, nil because it will not find a node whose address equals the cluster address, resulting in an early exit without running dqlite.ReconfigureMembershipExt.
There is certainly a problem there, as errors.Wrapf will return nil if a nil error is passed in. So if err == nil but info == nil, nil will be returned rather than an error, and cluster edit exits silently without reconfiguring anything.
@masnax can you take a look at this please? Thanks