Host rejected when adding to cluster

Pucky_wins · October 14, 2020, 9:11am

Hi

I removed my host from the cluster to do maintenance and now I can’t get it back on. The problem doesn’t seem to be the trust password as when I paste something random I get the following in the logs.

t=2020-10-14T11:05:46+0200 lvl=warn msg="Bad trust password" ip=10.3.0.58:55838 url=/1.0/certificates

However, when I use the correct password I get the following.

t=2020-10-14T11:06:23+0200 lvl=warn msg="Rejecting request from untrusted client" ip=10.3.0.58:55848

So it seems that the password is correct, but maybe there is some sort of blacklist at play here, or a registry that this is conflicting with?

I’m using the following snap version
lxd 4.6 17738 latest/stable

Any ideas?

freeekanayaka · October 14, 2020, 9:43am

What do you mean by “removed my host from the cluster”?

To do maintenance of a LXD node in a cluster you can just shut it down, no need to remove it. You should remove it only when you don’t want it to belong to the cluster anymore.

There isn’t any blacklist in place.

Pucky_wins · October 14, 2020, 9:45am

I ran

lxc cluster remove container6

I was waiting for new hardware and was tired of new containers being added there. In retrospect I should have just shut the server down.

freeekanayaka · October 14, 2020, 10:39am

In that case you need to join the node again as a brand new node. If you get a password error, I’d suggest changing your cluster password (to be really sure you are not entering the wrong password because perhaps you forgot it), then removing and reinstalling the snap on the node to ensure you start from a fresh state, and then joining the node again with lxd init.
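A rough sketch of those steps as shell functions, in case it helps. It only prints the commands unless APPLY=1 is set. The password value and the snap channel are placeholders you would pass in yourself, not values from this thread, and the run helper is a made-up name for illustration. Note that removing the snap wipes the local LXD state on that node.

```shell
# Sketch only: prints each step, and executes it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Run on an existing cluster member: set a fresh trust password.
# $1 is the new password (placeholder, choose your own).
reset_trust_password() {
    run lxc config set core.trust_password "$1"
}

# Run on the node being re-added: remove LXD and its data, then reinstall.
# $1 is the snap channel (placeholder; it must match the rest of the cluster).
reinstall_lxd() {
    run snap remove --purge lxd
    run snap install lxd --channel="$1"
}
```

After that, lxd init on the fresh node walks through joining the existing cluster interactively.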

Pucky_wins · October 14, 2020, 10:46am

That’s what I’ve done though. I managed to connect once and then had to downgrade to 4.0.3 because of incompatibility problems. Now when I try I get connection refused as per my initial post.
When I use the incorrect password it’s quite explicit.

freeekanayaka · October 14, 2020, 11:55am

All nodes in the cluster must be at the very same snap version. Does that hold?
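One way to check this quickly is sketched below, assuming you can ssh to each node and that the second column of snap list lxd is the version (as in the output in the first post). The function name same_version is just an illustration.

```shell
# Reads "host version" lines on stdin.
# Succeeds only if there is at least one line and every version is identical.
same_version() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (v in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Example use (host names are whatever your nodes are called):
#   for h in container5 container6; do
#       printf '%s %s\n' "$h" "$(ssh "$h" snap list lxd | awk 'NR == 2 { print $2 }')"
#   done | same_version && echo "versions match"
```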

Pucky_wins · October 14, 2020, 1:23pm

Yes, they are all on 4.0.3.

Pucky_wins · October 14, 2020, 8:19pm

I got some debug info, should I just upgrade to 4.6?

DBUG[10-14|22:18:17]
        {
                "name": "lxd.cluster.d6da027133b21fbb9c53aec915cfebeb7fde2ff538004c6d53d0f5f88dc66ada",
                "type": "client",
                "certificate": "MIICTjCCAdWgAwIBAgIRAOdhb1F3JupF3/oVAA8qrCIwCgYIKoZIzj0EAwMwTTEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEtMCsGA1UEAwwkcm9vdEBjb250YWluZXI2LmNvbm5lY3QtbW9iaWxlLmNvLnphMB4XDTIwMTAxNTA0MDYyOFoXDTMwMTAxMzA0MDYyOFowTTEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEtMCsGA1UEAwwkcm9vdEBjb250YWluZXI2LmNvbm5lY3QtbW9iaWxlLmNvLnphMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAETOQcxbvAYMqvxYd0v3QhMrpdkB/Z0WTgK6OqVzmV8j4znJ2IPe9Uhh0LmfV8rWJeMTXruEw4pg4FrWpaMg3iccTZUhCSDI9jUf0gqrdOWRq3P3+VGfkI+QvnTarH7Zufo3kwdzAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADBCBgNVHREEOzA5gh9jb250YWluZXI2LmNvbm5lY3QtbW9iaWxlLmNvLnphhwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2cAMGQCMExkfTV2X9WvWgVXckwt0Bim3HxlKb+j4oHjJcrb8iVJGNAtYyw/2Qzs0dYgaNvLawIwL8jwG/6j6ukdRvbtBNAy5tNCcnIKGII9XU7vEx4W7g1L8vm1zXge5/N1DbBh9aBh",
                "password": "Thou5rea6eiteZiXeeTh3shiidoo5e"
        }
DBUG[10-14|22:18:17] Database error: &errors.errorString{s:"No such object"}

Pucky_wins · October 15, 2020, 8:45am

Sorry to bump this.
Any ideas? Is it possible to change channel on snap without restarting my containers? Would an upgrade to 4.6 help?

stgraber · October 15, 2020, 3:33pm

In a cluster you’d need to refresh all systems to 4.6. Containers do not get restarted during refreshes so that’d be fine.

But at the same time, there is no difference in the clustering logic in 4.0.3 and 4.6 so I don’t see how this would solve anything.

Can you show:

lxc cluster list
lxd sql global "SELECT * FROM nodes;"
lxd sql global "SELECT * FROM config;"

Also, did you try to run systemctl reload snap.lxd.daemon on all cluster nodes prior to attempting another join?
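If there are several nodes, something like the loop below could do the reload in one go. This is only a sketch: it assumes ssh access and sudo on each node, and reload_lxd_on is an illustrative name. Setting SSH=echo previews the commands instead of running them.

```shell
# Reload the LXD daemon on every host given as an argument.
# Stops at the first host where the command fails.
reload_lxd_on() {
    for host in "$@"; do
        ${SSH:-ssh} "$host" sudo systemctl reload snap.lxd.daemon || return 1
    done
}

# Example use:
#   reload_lxd_on container5 container6
```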

Pucky_wins · October 15, 2020, 6:46pm

Reloading lxd on all the containers worked. Thanks! They are all in the cluster. Just container6 doesn’t have a database. Is that correct?

stgraber · October 15, 2020, 6:49pm

That’s normal, you’ll normally only ever see 3 database servers no matter the number of servers. Which one is a database server may change over time as the role is transferred during restarts.
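To see how many database members there are at a given moment, the table printed by lxc cluster list can be counted with a small filter. This sketch assumes the column layout shown in this thread (DATABASE as the third column); count_db_members is an illustrative name.

```shell
# Reads the table output of "lxc cluster list" on stdin and prints
# how many rows have YES in the DATABASE column.
count_db_members() {
    awk -F'|' '
        /^\|/ {
            gsub(/ /, "", $4)        # strip the cell padding
            if ($4 == "YES") n++
        }
        END { print n + 0 }
    '
}

# Example use:
#   lxc cluster list | count_db_members
```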

Pucky_wins · October 15, 2020, 7:19pm

Good to know, thanks.

Unfortunately, now I can’t do anything with that container6 node.

lxc mv cluster-new:kibana-logging cluster-new: --target=container6
Error: Migration API failure: Failed to get address of instance's node: No such object

lxc list on container6 just shows errors

lxd sql global "SELECT * FROM nodes;"
+----+---------------------------------+-------------+----------------+--------+----------------+-------------------------------------+---------+------+
| id |              name               | description |    address     | schema | api_extensions |              heartbeat              | pending | arch |
+----+---------------------------------+-------------+----------------+--------+----------------+-------------------------------------+---------+------+
| 1  | container5                      |             | 10.3.0.57:8443 | 30     | 189            | 2020-10-15T21:18:10.031843111+02:00 | 0       | 2    |
| 3  | container7.connect-mobile.co.za |             | 10.3.0.59:8443 | 30     | 189            | 2020-10-15T21:18:10.031936519+02:00 | 0       | 2    |
| 4  | container8.connect-mobile.co.za |             | 10.3.0.60:8443 | 30     | 189            | 2020-10-15T21:18:10.032024057+02:00 | 0       | 2    |
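| 5  | container6                      |             | 10.3.0.58:8443 | 30     | 189            | 2020-10-15T21:18:10.031715786+02:00 | 0       | 2    |
+----+---------------------------------+-------------+----------------+--------+----------------+-------------------------------------+---------+------+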

lxc cluster list
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
|              NAME               |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container5                      | https://10.3.0.57:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container6                      | https://10.3.0.58:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container7.connect-mobile.co.za | https://10.3.0.59:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container8.connect-mobile.co.za | https://10.3.0.60:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+

stgraber · October 15, 2020, 7:21pm

Can you try just lxc mv cluster-new:kibana-logging --target=container6 and see if that does the same thing?

Pucky_wins · October 15, 2020, 7:24pm

Yes, same thing. I also can’t deploy to that node. All the rest work fine. This is really bizarre.

stgraber · October 15, 2020, 9:16pm

Very weird… Maybe do another round of systemctl reload snap.lxd.daemon on the various nodes? I’m not sure why that would help though. We send heartbeats every 10s or so, and your output above shows that those have been sent and received properly, so I don’t see why you’d have cluster nodes that are unaware of others.

Pucky_wins · October 15, 2020, 9:39pm

I tried that, and restarted the container6 box. It can’t see anything at all, despite no naming issues. I went in and added all the hosts to each other’s hosts files just in case. This is what lxc list looks like on container6.

Pucky_wins · October 15, 2020, 9:52pm

The time was WAAAAY off. I fixed that. I can now launch, just still can’t mv containers but that is now a general problem with the cluster, not associated with that node. #winning
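For anyone hitting the same thing, a rough way to spot clock drift between nodes is to collect date +%s from each one and compare. The sketch below assumes ssh access to every node; max_skew is an illustrative name, and because the hosts are polled one after another the result is only approximate (a few seconds of difference can just be ssh latency).

```shell
# Reads "host epoch_seconds" lines on stdin and prints the difference
# between the largest and smallest timestamp, in seconds.
max_skew() {
    awk 'NR == 1 { min = $2 + 0; max = $2 + 0 }
         {
             t = $2 + 0
             if (t < min) min = t
             if (t > max) max = t
         }
         END { print max - min }'
}

# Example use:
#   for h in container5 container6; do
#       printf '%s %s\n' "$h" "$(ssh "$h" date +%s)"
#   done | max_skew
```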

Pucky_wins · October 15, 2020, 9:58pm

It looks like that container broke. Thanks for all the help with this. It looks like it’s sorted.