@stgraber Do you know how LXD should have behaved in this case? Is the node still considered clustered but it’s just the only node in the cluster? Or is it non-clustered after removal of the second node in the cluster?
LXD doesn’t support turning clustering off; the closest you can get is to operate it as a one-node cluster. This does require the remaining node to have a cluster.https_address which is valid and that it can connect to.
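For example (a sketch, assuming the node’s address is the 192.168.1.10:8443 mentioned elsewhere in this thread), on the remaining node that would be:
% lxc config set cluster.https_address 192.168.1.10:8443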
I think that replacing the address column of the raft_nodes entry with 192.168.1.10:8443 could solve this issue, based on the startup logic in https://github.com/lxc/lxd/blob/8e6a5ea574ab1c89a6886478f3f94e7438406c5f/lxd/cluster/info.go#L18 that is called by https://github.com/lxc/lxd/blob/8e6a5ea574ab1c89a6886478f3f94e7438406c5f/lxd/cluster/gateway.go#L789.
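As a first check (a sketch; I believe lxd sql local runs queries against the node-level database, which is where raft_nodes lives), the current contents can be inspected with:
% sudo lxd sql local "SELECT * FROM raft_nodes"
That should show the id and address currently recorded for the node.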
The question remains why its address is suddenly 1 in the raft_nodes table. If the configuration in raft is also id:1 address:1, then the issue will recur and we probably need a cluster edit step.
@tomp What do you think?
Yeah, sounds like lxd cluster edit (after shutting down LXD and taking a backup of the database) is the way forward here. As for the special “1” value, I don’t know why that is used or how it got in there. I suspect some problem caused by the cluster going back to a single member (something that, AFAIK, isn’t really supported ATM).
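One way to take that backup (a sketch, using the snap database path shown later in this thread) is to copy the whole database directory while LXD is stopped:
% sudo cp -a /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.bak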
I’m not so sure how I should edit it with lxd cluster edit. Here’s the full content:
# Latest dqlite segment ID: 5971997
members:
- id: 1
  address: "1"
  role: voter
Why is the address 1? According to the Clustering - LXD documentation, it looks like it should be 192.168.1.10:8443.
Try replacing the "1" with "192.168.1.10:8443" and see if that helps somehow.
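In other words, after the edit the members section should read something like:
members:
- id: 1
  address: "192.168.1.10:8443"
  role: voter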
I feel I’m missing something with lxd cluster edit. The editor is nano. I edit the file, save it with Ctrl+O, then exit with Ctrl+X, but if I edit again I can see that my modifications were discarded; I still have the "1".
% sudo lxd cluster edit -d
DBUG[12-15|21:56:18] Connecting to a local LXD over a Unix socket
DBUG[12-15|21:56:18] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
I assume I need snap.lxd.daemon.unix.socket started and snap.lxd.daemon.service stopped. However, starting the former starts the latter, and I cannot stop the latter; it just hangs forever.
I’d do:
- systemctl stop snap.lxd.daemon &
- kill -9 $(cat /var/snap/lxd/common/lxd.pid)
- systemctl stop snap.lxd.daemon.unix.socket
Then make sure it’s all stopped with “systemctl -a | grep snap.lxd”
If it is, then lxd cluster edit should behave. The reason you see lxd cluster edit trying to talk to LXD is that it’s checking whether LXD is running, so you actually want that connection to fail.
% systemctl -a | grep snap.lxd
snap-lxd-21902.mount loaded active mounted Mount unit for lxd, revision 21902
snap-lxd-22114.mount loaded active mounted Mount unit for lxd, revision 22114
● snap.lxd.activate.service loaded failed failed Service for snap application lxd.activate
● snap.lxd.daemon.service loaded failed failed Service for snap application lxd.daemon
snap.lxd.daemon.unix.socket loaded inactive dead Socket unix for snap application lxd.daemon
My edits still seem to be ignored.
@mbordere any idea why editing the address isn’t working?
@Clem Can you show the output of ls -al /var/snap/lxd/common/lxd/database/global/?
You will normally see, among other things, a list of files named like
0000000000000001-0000000000000001
0000000000000002-0000000000000007
0000000000000008-0000000000000008
The cluster edit command should have created such a file where the part before the - equals the part after the dash; in this case that would be 0000000000000008-0000000000000008. It will be the newest such file in the directory.
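If in doubt which file is newest (a sketch):
% sudo ls -lt /var/snap/lxd/common/lxd/database/global/ | head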
Can you also perform a hexdump -C on that file (if one is there) and paste the output here?
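For example, using the hypothetical 0000000000000008-0000000000000008 name from above (yours will differ):
% sudo hexdump -C /var/snap/lxd/common/lxd/database/global/0000000000000008-0000000000000008 | head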
I cannot see a file with a matching name before and after the dash.
% sudo ls -al /var/snap/lxd/common/lxd/database/global/
total 41288
drwxr-x--- 1 root root 604 Dec 10 21:21 .
drwx------ 1 root root 70 Dec 9 10:11 ..
-rw------- 1 root root 8386448 Nov 26 10:49 0000000005962303-0000000005964268
-rw------- 1 root root 8385800 Nov 26 16:10 0000000005964269-0000000005966225
-rw------- 1 root root 8385584 Nov 26 21:29 0000000005966226-0000000005968179
-rw------- 1 root root 8386448 Nov 27 02:52 0000000005968180-0000000005970145
-rw------- 1 root root 7947320 Dec 7 13:49 0000000005970146-0000000005971997
-rw------- 1 root root 577536 Dec 10 21:21 db.bin
-rw------- 1 root root 32 Jan 26 2020 metadata1
-rw------- 1 root root 86309 Nov 27 03:14 snapshot-1-5970288-116438914
-rw------- 1 root root 56 Nov 27 03:14 snapshot-1-5970288-116438914.meta
-rw------- 1 root root 90874 Nov 27 06:02 snapshot-1-5971312-126502986
-rw------- 1 root root 56 Nov 27 06:02 snapshot-1-5971312-126502986.meta
Are we missing error output from the command?
cluster edit will call https://github.com/lxc/lxd/blob/a0d0d4e965e865b9182861661417e338f0719f2f/lxd/cluster/recover.go#L133, which in turn calls https://github.com/lxc/lxd/blob/a0d0d4e965e865b9182861661417e338f0719f2f/lxd/node/raft.go#L30, and based on the input Clem has given us, DetermineRaftNode will return nil, nil because it will not find a node whose address equals the cluster address, resulting in an early exit without running dqlite.ReconfigureMembershipExt.
There is certainly a problem there, as errors.Wrapf will return nil if a nil error is passed in. So if err == nil but info == nil, nil will be returned rather than an error, and cluster edit exits silently without reconfiguring anything.
@masnax can you take a look at this please? Thanks