I have a site with a single cluster-enabled LXD node at the moment; more nodes will join the cluster as we migrate. I needed to rename the server a month or two ago due to a change.
I cannot get the update to the record in the raft_nodes table to stick… it keeps reverting to the original name, although I'm not seeing any issues at the moment. I've looked in the global database but can't find it, so I'm missing something; if you could point me in the right direction please.
Jul 28 21:16:59 newservername lxd.daemon[3149354]: time="2022-07-28T21:16:59Z" level=warning msg="Cluster member info not found" address="oldservername.domain.tld:8443"
Jul 28 21:16:59 newservername lxd.daemon[3149354]: time="2022-07-28T21:16:59Z" level=error msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: {NodeInfo:{ID:1 Address:oldservername.domain.tld:8443 Role:voter} Name:}"
Had run these:
/var/snap/lxd/common/lxd/database/patch.local.sql
UPDATE config SET value='newservername.domain.tld:8443' WHERE key='cluster.https_address';
UPDATE config SET value='newservername.domain.tld:8443' WHERE key='core.https_address';
UPDATE raft_nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
/var/snap/lxd/common/lxd/database/patch.global.sql
UPDATE nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
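For completeness, this is roughly how I staged and applied those patch files — a sketch only, with snap paths assumed; LXD applies patch.local.sql and patch.global.sql on the next daemon start:

```shell
#!/bin/sh
# Stage the two patch files, then copy into place and restart LXD.
set -eu
STAGE=$(mktemp -d)

cat > "$STAGE/patch.local.sql" <<'SQL'
UPDATE config SET value='newservername.domain.tld:8443' WHERE key='cluster.https_address';
UPDATE config SET value='newservername.domain.tld:8443' WHERE key='core.https_address';
UPDATE raft_nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
SQL

cat > "$STAGE/patch.global.sql" <<'SQL'
UPDATE nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
SQL

# Then, on the LXD host (needs root; snap paths assumed):
#   cp "$STAGE"/patch.*.sql /var/snap/lxd/common/lxd/database/
#   systemctl restart snap.lxd.daemon
echo "patches staged in $STAGE"
```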
Current info:
lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------------------+------+------+
| id | address | role | name |
+----+-------------------------------+------+------+
| 1 | oldservername.domain.tld:8443 | 0 | |
+----+-------------------------------+------+------+
lxd sql local "SELECT * FROM config;"
+----+-----------------------+-------------------------------+
| id | key | value |
+----+-----------------------+-------------------------------+
| 2 | cluster.https_address | newservername.domain.tld:8443 |
| 3 | core.https_address | newservername.domain.tld:8443 |
+----+-----------------------+-------------------------------+
lxd sql global "SELECT * FROM nodes;"
+----+--------------------------+-------------+-------------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| id | name | description | address | schema | api_extensions | heartbeat | state | arch | failure_domain_id |
+----+--------------------------+-------------+-------------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| 1 | newservername.domain.tld | | newservername.domain.tld:8443 | 62 | 317 | 2022-07-28T21:58:39.788766423Z | 0 | 2 | <nil> |
+----+--------------------------+-------------+-------------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
There is one side effect: when saving changes to a profile, the command hangs; however, after Ctrl+C, lxc profile show default shows the changes have committed… phew. I suspect it's waiting for confirmation from the missing cluster member (the old server name).
Will do more digging to see whether I can find where it’s referenced other than the aforementioned tables.
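For the digging, I'll start by grepping the raw database directory — presumably the dqlite raft log segments (binary files) could hold the member address outside the SQL tables. Snap path assumed:

```shell
# Search the database directory for the old name; -a treats the binary
# dqlite segment files as text, -s silences missing/unreadable-file errors.
DB="${DB:-/var/snap/lxd/common/lxd/database}"
grep -ars "oldservername" "$DB" || echo "no matches under $DB"
```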
No matter what I try, the old server name gets put back into the raft_nodes table as described. Should I raise a bug, or battle it out here because it's more likely something I'm missing?:
level=warning msg="Cluster member info not found" address="oldservername.domain.tld:8443"
level=error msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: {NodeInfo:{ID:1 Address:oldservername.domain.tld:8443 Role:voter} Name:}"
I thought it might be the server.{crt,key} pair or the cluster pair, so I generated new pairs and deployed them:
cluster pair via lxc cluster update-certificate
server pair by replacing the files at /var/snap/lxd/common/lxd/server.{crt,key}
both pairs include updated SAN & Subject info to suit the new name
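For reference, the new pairs were generated along these lines — a sketch with a stand-in FQDN; the exact openssl flags I ran may have differed:

```shell
#!/bin/sh
set -eu
DIR=$(mktemp -d)
CN=newservername.domain.tld   # substitute your real FQDN

# Self-signed EC cert with the new name in both the Subject CN and the SAN.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
  -nodes -days 3650 \
  -keyout "$DIR/server.key" -out "$DIR/server.crt" \
  -subj "/CN=$CN" \
  -addext "subjectAltName=DNS:$CN"

# Deploy the server pair with LXD stopped (snap paths assumed):
#   cp "$DIR"/server.crt "$DIR"/server.key /var/snap/lxd/common/lxd/
# And push the same pair as the cluster certificate:
#   lxc cluster update-certificate "$DIR/server.crt" "$DIR/server.key"
```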
Shut down snap.lxd.daemon.service & snap.lxd.daemon.unix.socket and recreated the db patch files:
/var/snap/lxd/common/lxd/database/patch.local.sql
UPDATE config SET value='newservername.domain.tld:8443' WHERE key='cluster.https_address';
UPDATE config SET value='newservername.domain.tld:8443' WHERE key='core.https_address';
UPDATE raft_nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
/var/snap/lxd/common/lxd/database/patch.global.sql
UPDATE nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
Started the service & socket again but the error returned.
There is only one physical server in this cluster at the moment; there are no free servers to rebuild the others until enough instances have been migrated over. I can't trash this host's 26 LXD instances, their custom storage volumes, or its 2 custom images.
Happy to re-init LXD if there is a safe way to re-add all the instances, their configs and their pools/volumes.
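If re-init is the answer, something like this is what I had in mind for getting everything out first — a sketch only; "default" is a stand-in for the real pool name, and the column parsing would need checking against this LXD version's CSV output:

```shell
#!/bin/sh
# Export all instances, custom storage volumes and images before a re-init.
set -eu
BACKUP="${BACKUP:-$PWD/lxd-backup}"
mkdir -p "$BACKUP"

if command -v lxc >/dev/null 2>&1; then
  # Instances, each with its config included in the tarball:
  for inst in $(lxc list -c n -f csv); do
    lxc export "$inst" "$BACKUP/$inst.tar.gz"
  done

  # Custom storage volumes (pool name "default" assumed):
  for vol in $(lxc storage volume list default -f csv | awk -F, '$1=="custom"{print $2}'); do
    lxc storage volume export default "$vol" "$BACKUP/vol-$vol.tar.gz"
  done

  # Images, by fingerprint:
  for fp in $(lxc image list -c f -f csv); do
    lxc image export "$fp" "$BACKUP/img-$fp"
  done
else
  echo "lxc client not found; run this on the LXD host" >&2
fi
```

Restoring on the re-initialised host would then be lxc import, lxc storage volume import and lxc image import of the same files.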