Host rejected when adding to cluster

Hi

I removed my host from the cluster to do maintenance and now I can’t get it back on. The problem doesn’t seem to be the trust password as when I paste something random I get the following in the logs.

t=2020-10-14T11:05:46+0200 lvl=warn msg="Bad trust password" ip=10.3.0.58:55838 url=/1.0/certificates

However, when I use the correct password I get the following.

t=2020-10-14T11:06:23+0200 lvl=warn msg="Rejecting request from untrusted client" ip=10.3.0.58:55848

So it seems that the password is correct, but maybe there is some sort of blacklist at play here, or a registry that this is conflicting with?

I’m using the following snap version
lxd 4.6 17738 latest/stable

Any ideas?

What do you mean by “removed my host from the cluster”?

To do maintenance of a LXD node in a cluster you can just shut it down, no need to remove it. You should remove it only when you don’t want it to belong to the cluster anymore.

There isn’t any blacklist in place.

I ran

lxc cluster remove container6

I was waiting for new hardware and was tired of new containers being added there. In retrospect I should have just shut the server down.

In that case you need to join the node again as a brand new node. If you get a password error, I’d suggest changing your cluster password (to be really sure you are not entering the wrong one because perhaps you forgot it), then removing and reinstalling the snap on the node to ensure you start from a fresh state, and then joining the node again with lxd init.
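A rough sketch of those steps as shell functions, in case it helps. It assumes the snap package and that core.trust_password is the cluster password in question; the password and channel arguments are placeholders, not values from this thread. DRY_RUN defaults to 1, so nothing is executed unless you opt in.

```shell
#!/bin/sh
# Sketch of the "rejoin as a brand new node" steps described above.
# With DRY_RUN=1 (the default) commands are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on an existing cluster member: set a fresh trust password.
# $1 is a placeholder for the new password.
reset_trust_password() {
    run lxc config set core.trust_password "$1"
}

# Run on the node being re-added: remove and reinstall the snap so the
# node starts from a fresh state, then start the (interactive) join.
# $1 is the snap channel, which should match the rest of the cluster.
rejoin_node() {
    run sudo snap remove lxd
    run sudo snap install lxd --channel="$1"
    run sudo lxd init
}
```

For example, reset_trust_password 'new-secret' on a working member, then rejoin_node 4.0/stable on the node being re-added (assuming the cluster tracks the 4.0 channel). lxd init then asks interactively whether to join an existing cluster.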

That’s what I’ve done though. I managed to connect once and then had to downgrade to 4.0.3 because of incompatibility problems. Now when I try I get connection refused as per my initial post.
When I use the incorrect password it’s quite explicit.

All nodes in the cluster must be at the very same snap version. Does that hold?
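One way to check this quickly, as a sketch only: collect the version column of snap list lxd from each node and confirm there is exactly one distinct value. It assumes SSH access to every node; the helper names are made up for illustration.

```shell
#!/bin/sh
# Prints "<host> <lxd snap version>" for each host given as an argument.
# Assumes SSH access to each node; the version is the second column of
# the second line of `snap list lxd`.
collect_versions() {
    for host in "$@"; do
        printf '%s %s\n' "$host" \
            "$(ssh -n "$host" snap list lxd | awk 'NR==2 {print $2}')"
    done
}

# Reads "<host> <version>" lines on stdin and succeeds only if every
# version field is identical (empty input counts as a failure).
same_version() {
    [ "$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')" = "1" ]
}
```

Usage would look something like collect_versions container5 container6 | same_version && echo "versions match", with your own host names.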

Yes, they are all on 4.0.3.

I got some debug info, should I just upgrade to 4.6?

DBUG[10-14|22:18:17]                                                                                                                                                                                       
        {                                                                                                                                                                                                  
                "name": "lxd.cluster.d6da027133b21fbb9c53aec915cfebeb7fde2ff538004c6d53d0f5f88dc66ada",                                                                                                    
                "type": "client",                                                                                                                                                                          
                "certificate": "MIICTjCCAdWgAwIBAgIRAOdhb1F3JupF3/oVAA8qrCIwCgYIKoZIzj0EAwMwTTEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEtMCsGA1UEAwwkcm9vdEBjb250YWluZXI2LmNvbm5lY3QtbW9iaWxlLmNvLnphMB4XDTI
wMTAxNTA0MDYyOFoXDTMwMTAxMzA0MDYyOFowTTEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEtMCsGA1UEAwwkcm9vdEBjb250YWluZXI2LmNvbm5lY3QtbW9iaWxlLmNvLnphMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAETOQcxbvAYMqvxYd0v3QhMrpdkB/Z0W
TgK6OqVzmV8j4znJ2IPe9Uhh0LmfV8rWJeMTXruEw4pg4FrWpaMg3iccTZUhCSDI9jUf0gqrdOWRq3P3+VGfkI+QvnTarH7Zufo3kwdzAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADBCBgNVHREEOzA5gh9jb250YWluZXI2L
mNvbm5lY3QtbW9iaWxlLmNvLnphhwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2cAMGQCMExkfTV2X9WvWgVXckwt0Bim3HxlKb+j4oHjJcrb8iVJGNAtYyw/2Qzs0dYgaNvLawIwL8jwG/6j6ukdRvbtBNAy5tNCcnIKGII9XU7vEx4W7g1L8vm1zXge
5/N1DbBh9aBh",                                                                                                                                                                                             
                "password": "Thou5rea6eiteZiXeeTh3shiidoo5e"                                                                                                                                               
        }                                                                                                                                                                                                  
DBUG[10-14|22:18:17] Database error: &errors.errorString{s:"No such object"}

Sorry to bump this.
Any ideas? Is it possible to change channel on snap without restarting my containers? Would an upgrade to 4.6 help?

In a cluster you’d need to refresh all systems to 4.6. Containers do not get restarted during refreshes so that’d be fine.

But at the same time, there is no difference in the clustering logic in 4.0.3 and 4.6 so I don’t see how this would solve anything.

Can you show:

  • lxc cluster list
  • lxd sql global "SELECT * FROM nodes;"
  • lxd sql global "SELECT * FROM config;"

Also, did you try to run systemctl reload snap.lxd.daemon on all cluster nodes prior to attempting another join?
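If there are several nodes, the reload can be looped over SSH. This is only a sketch: it assumes SSH and sudo access on each node, and the host names are whatever you pass in. DRY_RUN defaults to 1 so that it prints the commands instead of running them.

```shell
#!/bin/sh
# Reload the LXD daemon on every host given as an argument.
# With DRY_RUN=1 (the default) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

reload_all() {
    for host in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "+ ssh $host sudo systemctl reload snap.lxd.daemon"
        else
            ssh -n "$host" sudo systemctl reload snap.lxd.daemon
        fi
    done
}
```

For example: DRY_RUN=0 reload_all container5 container7 container8, before retrying the join on the new node.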

Reloading lxd on all the containers worked. Thanks! They are all in the cluster. Just container6 doesn’t have a database. Is that correct?

That’s normal, you’ll normally only ever see 3 database servers no matter the number of servers. Which one is a database server may change over time as the role is transferred during restarts.

Good to know, thanks.

Unfortunately, now I can’t do anything with that container6 node.

lxc mv cluster-new:kibana-logging cluster-new: --target=container6
Error: Migration API failure: Failed to get address of instance's node: No such object

lxc list on container6 just shows errors

lxd sql global "SELECT * FROM nodes;"
+----+---------------------------------+-------------+----------------+--------+----------------+-------------------------------------+---------+------+
| id |              name               | description |    address     | schema | api_extensions |              heartbeat              | pending | arch |
+----+---------------------------------+-------------+----------------+--------+----------------+-------------------------------------+---------+------+
| 1  | container5                      |             | 10.3.0.57:8443 | 30     | 189            | 2020-10-15T21:18:10.031843111+02:00 | 0       | 2    |
| 3  | container7.connect-mobile.co.za |             | 10.3.0.59:8443 | 30     | 189            | 2020-10-15T21:18:10.031936519+02:00 | 0       | 2    |
| 4  | container8.connect-mobile.co.za |             | 10.3.0.60:8443 | 30     | 189            | 2020-10-15T21:18:10.032024057+02:00 | 0       | 2    |
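| 5  | container6                      |             | 10.3.0.58:8443 | 30     | 189            | 2020-10-15T21:18:10.031715786+02:00 | 0       | 2    |
+----+---------------------------------+-------------+----------------+--------+----------------+-------------------------------------+---------+------+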

lxc cluster list
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
|              NAME               |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container5                      | https://10.3.0.57:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container6                      | https://10.3.0.58:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container7.connect-mobile.co.za | https://10.3.0.59:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+
| container8.connect-mobile.co.za | https://10.3.0.60:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------------------------------+------------------------+----------+--------+-------------------+--------------+

Can you try just lxc mv cluster-new:kibana-logging --target=container6 to see if that does the same thing?

Yes, same thing. I also can’t deploy to that node. All the rest work fine. This is really bizarre.

Very weird… Maybe do another round of systemctl reload snap.lxd.daemon on the various nodes? I’m not sure why that would help though: we send heartbeats every 10s or so, and your output above shows that those have been sent and received properly, so I’m not sure why you’d have cluster nodes that are unaware of others.

I tried that, and restarted the container6 box. It can’t see anything at all, despite no naming issues. I went in and added all the hosts to each other’s hosts files just in case. This is what lxc list looks like on container6.
[image: lxc list output on container6]

The time was WAAAAY off. I fixed that. I can now launch, just still can’t mv containers but that is now a general problem with the cluster, not associated with that node. #winning
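For anyone hitting the same thing, a quick way to spot a clock that is far off is to compare date +%s between the local machine and each node. This is a sketch that assumes SSH access; the 5 second default threshold is an arbitrary choice for illustration, not a documented LXD limit.

```shell
#!/bin/sh
# Absolute difference between two integer epoch timestamps.
abs_diff() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

# Compare the local clock with a remote node's clock over SSH.
# $1 = host, $2 = maximum tolerated skew in seconds (default 5, an
# arbitrary value). Succeeds if the skew is within the threshold.
check_skew() {
    remote="$(ssh -n "$1" date +%s)"
    skew="$(abs_diff "$(date +%s)" "$remote")"
    echo "$1 skew=${skew}s"
    [ "$skew" -le "${2:-5}" ]
}
```

For example: check_skew container6 || echo "clock on container6 looks off". The result includes SSH round-trip time, so treat small differences as noise.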

It looks like that container broke :frowning: Thanks for all the help with this. It looks like it’s sorted.