@stgraber Do you know how LXD should have behaved in this case? Is the node still considered clustered but it’s just the only node in the cluster? Or is it non-clustered after removal of the second node in the cluster?
LXD doesn’t support turning clustering off; the closest you can get is to operate it as a one-node cluster. This does require the remaining node to have a `cluster.https_address` which is valid and that it can connect to.
I think that replacing the address column of the `raft_nodes` entry with `192.168.1.10:8443` could solve this issue, based on the startup logic in https://github.com/lxc/lxd/blob/8e6a5ea574ab1c89a6886478f3f94e7438406c5f/lxd/cluster/info.go#L18 that is called by https://github.com/lxc/lxd/blob/8e6a5ea574ab1c89a6886478f3f94e7438406c5f/lxd/cluster/gateway.go#L789.
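As a sketch of what that column change amounts to, here it is demonstrated against a throwaway in-memory database. The real `raft_nodes` table lives in LXD's local database and its exact schema may differ; this is illustrative only, not a recommended procedure (see the `lxd cluster edit` discussion below).

```python
import sqlite3

# Throwaway database mimicking an assumed raft_nodes schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raft_nodes (id INTEGER PRIMARY KEY, address TEXT)")

# The broken state from this thread: the address column holds "1".
db.execute("INSERT INTO raft_nodes (id, address) VALUES (1, '1')")

# Replace the bogus "1" with the node's real cluster address.
db.execute("UPDATE raft_nodes SET address = ? WHERE id = ?",
           ("192.168.1.10:8443", 1))

print(db.execute("SELECT id, address FROM raft_nodes").fetchall())
# [(1, '192.168.1.10:8443')]
```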
The question remains why its address suddenly became `1` in the `raft_nodes` table.
If the configuration in raft is also `id: 1 address: 1` then the issue will reoccur, and we probably need a `cluster edit` step.
@tomp What do you think?
Yeah, sounds like `lxd cluster edit` (after shutting down LXD and taking a backup of the database) is the way forward here. As for the special “1” value, I don’t know why that is used or how it got in there. I suspect some problem caused by the cluster going back to a single member (something that AFAIK isn’t really supported ATM).
I’m not so sure how I should edit it with `lxd cluster edit`.
Here is the full content:

```yaml
# Latest dqlite segment ID: 5971997
members:
  - id: 1
    address: "1"
    role: voter
```
Why is the address `"1"`? According to the Clustering - LXD documentation, it looks like it should be the node’s address and port.
Try replacing the `"1"` with `"192.168.1.10:8443"` and see if that helps somehow.
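For clarity, after that replacement the saved file should presumably look like this (assuming the member entry pasted earlier in the thread, with only the address changed):

```yaml
# Latest dqlite segment ID: 5971997
members:
  - id: 1
    address: "192.168.1.10:8443"
    role: voter
```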
I feel I’m missing something with `lxd cluster edit`.
The editor is nano. I edit the file, save it with Ctrl+O, then Ctrl+X to exit, but if I edit again I can see that my modifications are discarded. I still have the `address: "1"` entry.

```
% sudo lxd cluster edit -d
DBUG[12-15|21:56:18] Connecting to a local LXD over a Unix socket
DBUG[12-15|21:56:18] Sending request to LXD  method=GET url=http://unix.socket/1.0 etag=
```
I assume I need `snap.lxd.daemon.unix.socket` started and `snap.lxd.daemon.service` stopped. However, starting the former starts the latter, and I cannot stop the latter: `systemctl stop snap.lxd.daemon.service` just hangs forever.
- `systemctl stop snap.lxd.daemon &`
- `kill -9 $(cat /var/snap/lxd/common/lxd.pid)`
- `systemctl stop snap.lxd.daemon.unix.socket`

Then make sure it’s all stopped with `systemctl -a | grep snap.lxd`.
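As a rough pre-flight check before retrying the edit, you can also verify that nothing is left behind. This is a sketch: the pid-file and unix-socket paths are assumed from the snap layout mentioned in this thread, and the helper name is made up.

```python
import os

# Assumed paths for a snap-installed LXD; adjust for other installs.
PID_FILE = "/var/snap/lxd/common/lxd.pid"
UNIX_SOCKET = "/var/snap/lxd/common/lxd/unix.socket"

def lxd_fully_stopped():
    """True only when both the daemon pid file and its unix socket are gone,
    i.e. `lxd cluster edit` should no longer find a running daemon."""
    return not os.path.exists(PID_FILE) and not os.path.exists(UNIX_SOCKET)

print(lxd_fully_stopped())
```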
If it is, then `lxd cluster edit` should behave. The reason you see `lxd cluster edit` trying to talk to LXD is that it’s checking whether LXD is running, so you actually want that connection to fail.
```
% systemctl -a | grep snap.lxd
snap-lxd-21902.mount          loaded active   mounted Mount unit for lxd, revision 21902
snap-lxd-22114.mount          loaded active   mounted Mount unit for lxd, revision 22114
● snap.lxd.activate.service   loaded failed   failed  Service for snap application lxd.activate
● snap.lxd.daemon.service     loaded failed   failed  Service for snap application lxd.daemon
snap.lxd.daemon.unix.socket   loaded inactive dead    Socket unix for snap application lxd.daemon
```
My edits still seem to be ignored.
@mbordere any idea why editing the address isn’t working?
@Clem Can you show the output of `ls -al /var/snap/lxd/common/lxd/database/global/`?
You will normally see, among others, a list of files named like:

```
0000000000000001-0000000000000001
0000000000000002-0000000000000007
0000000000000008-0000000000000008
```
The `cluster edit` command should have created such a file where the part before the dash equals the part after it; in this case that would be `0000000000000008-0000000000000008`. It will be the newest such file in the directory.
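To illustrate, here is a small sketch that picks out such a file from a directory listing. The helper name is made up; since the indices are zero-padded to a fixed width, lexicographic order matches numeric order, so `max` finds the highest-numbered match.

```python
def newest_closed_segment(filenames):
    """Return the highest-numbered segment file whose start index equals
    its end index (the kind `lxd cluster edit` is expected to create)."""
    matches = []
    for name in filenames:
        first, dash, last = name.partition("-")
        if dash and first == last and first.isdigit():
            matches.append(name)
    return max(matches) if matches else None

names = [
    "0000000000000001-0000000000000001",
    "0000000000000002-0000000000000007",
    "0000000000000008-0000000000000008",
]
print(newest_closed_segment(names))  # 0000000000000008-0000000000000008
```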
Can you also run `hexdump -C` on that file (if one is there) and paste the output here?
I cannot see a file with a matching name before and after the dash:
```
% sudo ls -al /var/snap/lxd/common/lxd/database/global/
total 41288
drwxr-x--- 1 root root     604 Dec 10 21:21 .
drwx------ 1 root root      70 Dec  9 10:11 ..
-rw------- 1 root root 8386448 Nov 26 10:49 0000000005962303-0000000005964268
-rw------- 1 root root 8385800 Nov 26 16:10 0000000005964269-0000000005966225
-rw------- 1 root root 8385584 Nov 26 21:29 0000000005966226-0000000005968179
-rw------- 1 root root 8386448 Nov 27 02:52 0000000005968180-0000000005970145
-rw------- 1 root root 7947320 Dec  7 13:49 0000000005970146-0000000005971997
-rw------- 1 root root  577536 Dec 10 21:21 db.bin
-rw------- 1 root root      32 Jan 26  2020 metadata1
-rw------- 1 root root   86309 Nov 27 03:14 snapshot-1-5970288-116438914
-rw------- 1 root root      56 Nov 27 03:14 snapshot-1-5970288-116438914.meta
-rw------- 1 root root   90874 Nov 27 06:02 snapshot-1-5971312-126502986
-rw------- 1 root root      56 Nov 27 06:02 snapshot-1-5971312-126502986.meta
```
Are we missing error output from the command?
`cluster edit` will call https://github.com/lxc/lxd/blob/a0d0d4e965e865b9182861661417e338f0719f2f/lxd/cluster/recover.go#L133, which in turn calls https://github.com/lxc/lxd/blob/a0d0d4e965e865b9182861661417e338f0719f2f/lxd/node/raft.go#L30, and based on the input Clem has given us, `DetermineRaftNode` will return `nil, nil` because it will not find a node whose address equals the cluster address, resulting in an early exit without the rest of the recovery running.
There is certainly a problem there, as `errors.Wrapf` will return nil if a nil error is passed in. So if `err == nil` but `info == nil`, nil will be returned rather than an error.
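To illustrate the pitfall in Python terms, here is a sketch mimicking the Go pattern. The `wrapf` helper imitates `github.com/pkg/errors.Wrapf`, which documents that wrapping a nil error returns nil; the function and variable names are made up for illustration and are not the actual LXD code.

```python
def wrapf(err, msg):
    """Mimics errors.Wrapf: wrapping a nil error yields nil."""
    return None if err is None else RuntimeError(f"{msg}: {err}")

def determine_raft_node(nodes, cluster_address):
    """Sketch of DetermineRaftNode: returns (node, error). A missing
    match is not treated as an error, hence the (None, None) case."""
    for node in nodes:
        if node["address"] == cluster_address:
            return node, None
    return None, None

def recover(nodes, cluster_address):
    info, err = determine_raft_node(nodes, cluster_address)
    err = wrapf(err, "failed to determine raft node")
    if err is not None:
        return err
    if info is None:
        return None  # silent early exit: no error surfaced to the caller
    # ... the actual recovery would run here ...
    return None

# With address "1" in raft_nodes but cluster address 192.168.1.10:8443,
# recover() returns None: nothing is edited and no error is reported.
print(recover([{"id": 1, "address": "1"}], "192.168.1.10:8443"))  # None
```

This is consistent with the symptom in the thread: the edit appears to save, but nothing changes and no error is printed.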
@masnax can you take a look at this please? Thanks