Unable to complete LXD cluster node rename

Hi

I have a site with a single LXD cluster-enabled node at the moment, and more nodes will join the cluster as we migrate. I needed to rename the server a month or two ago due to a change.

I cannot get the update to the record in the raft_nodes table to stick… it keeps reverting to the original name, although I'm not experiencing issues at the moment. I've looked in the global database but cannot find the old name there, so I'm missing something; if you could point me in the right direction please.

Jul 28 21:16:59 newservername lxd.daemon[3149354]: time="2022-07-28T21:16:59Z" level=warning msg="Cluster member info not found" address="oldservername.domain.tld:8443"
Jul 28 21:16:59 newservername lxd.daemon[3149354]: time="2022-07-28T21:16:59Z" level=error msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: {NodeInfo:{ID:1 Address:oldservername.domain.tld:8443 Role:voter} Name:}"
Jul 28 21:16:59 newservername multipathd[1362]: zd0: unusable path (wild) - checker failed
Jul 28 21:17:00 newservername multipathd[1362]: zd128: unusable path (wild) - checker failed

I had run these patch files:

/var/snap/lxd/common/lxd/database/patch.local.sql
    UPDATE config SET value='newservername.domain.tld:8443' WHERE key='cluster.https_address';
    UPDATE config SET value='newservername.domain.tld:8443' WHERE key='core.https_address';
    UPDATE raft_nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;

/var/snap/lxd/common/lxd/database/patch.global.sql
    UPDATE nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;
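For context, my understanding is that LXD reads these patch.local.sql / patch.global.sql files on the next daemon startup and deletes them once applied, so a quick way to check whether a patch was actually consumed (a sketch; paths as on this snap install) is:

```shell
# If a patch file is still present after a daemon restart, it was not applied.
for f in /var/snap/lxd/common/lxd/database/patch.local.sql \
         /var/snap/lxd/common/lxd/database/patch.global.sql; do
    if [ -e "$f" ]; then
        echo "still pending: $f"
    else
        echo "consumed (or never created): $f"
    fi
done
```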

Current info:

lxd sql local "SELECT * FROM raft_nodes;"
    +----+-------------------------------+------+------+
    | id |            address            | role | name |
    +----+-------------------------------+------+------+
    | 1  | oldservername.domain.tld:8443 | 0    |      |
    +----+-------------------------------+------+------+


lxd sql local "SELECT * FROM config;"
    +----+-----------------------+-------------------------------+
    | id |          key          |              value            |
    +----+-----------------------+-------------------------------+
    | 2  | cluster.https_address | newservername.domain.tld:8443 |
    | 3  | core.https_address    | newservername.domain.tld:8443 |
    +----+-----------------------+-------------------------------+

lxd sql global "SELECT * FROM nodes;"
    +----+--------------------------+-------------+-------------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
    | id |            name          | description |             address           | schema | api_extensions |           heartbeat            | state | arch | failure_domain_id |
    +----+--------------------------+-------------+-------------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
    | 1  | newservername.domain.tld |             | newservername.domain.tld:8443 | 62     | 317            | 2022-07-28T21:58:39.788766423Z | 0     | 2    | <nil>             |
    +----+--------------------------+-------------+-------------------------------+--------+----------------+--------------------------------+-------+------+-------------------+

snap info lxd:
installed: 5.4-82d05d6

Thanks

There is one side effect: when saving changes to a profile the command hangs. However, after Ctrl+C, lxc profile show default shows the changes have committed… phew. I suspect it's waiting for confirmation from the missing cluster member (the old server name).

Will do more digging to see whether I can find where it’s referenced other than the aforementioned tables.

I ran the following two commands to iterate over every table in each database and dump all records to a file:

for i in $(sqlite3 -batch /var/snap/lxd/common/lxd/database/global/db.bin ".tables") ; do printf "\n\nTable Name: $i \n" && sqlite3 -header -column /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM $i" ; done > global.txt

for i in $(sqlite3 -batch /var/snap/lxd/common/lxd/database/local.db ".tables") ; do printf "\n\nTable Name: $i \n" && sqlite3 -header -column /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM $i" ; done > local.txt
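To double-check the dumps, I also swept both output files for any remaining mention of the old name (filenames as produced by the loops above; OLD is the short form of the old hostname, adjust to match):

```shell
# Grep both table dumps for lingering references to the old hostname.
OLD="oldservername"
for f in global.txt local.txt; do
    if [ -f "$f" ]; then
        grep -Hn "$OLD" "$f" || echo "no matches in $f"
    fi
done
```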

Across all the output, the only hits for the old name are:
global.txt:

  • certificates table

local.txt:

  • certificates table
  • raft_nodes table (this is the one that keeps reverting to the old name)

Table Name: raft_nodes
id          address                         role        name
----------  ------------------------------  ----------  ----------
1           oldservername.domain.tld:8443   0

I’ve put the patch.local.sql back in and rebooted the physical server, but the error returns:

"Unaccounted raft node(s) not found in 'nodes' table for heartbeat: {NodeInfo:{ID:1 Address:oldservername.domain.tld:8443 Role:voter}...

The good news is that changes to a profile no longer hang, so that must have been cleared up by the reboot.

Just need help to find out what’s putting that old name back into the raft_nodes table to finish off the rename process.
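Since the local raft_nodes table keeps being rewritten, I suspect the daemon repopulates it from dqlite's own raft metadata/segment files rather than from db.bin. A crude check (a sketch; the directory is the one this snap install uses, and a raw binary match isn't conclusive) is to grep those files for the old address:

```shell
# List any files under the dqlite data directory that still contain the old
# address; "no raw matches" means grep found nothing (or the dir is absent).
grep -rl "oldservername.domain.tld" /var/snap/lxd/common/lxd/database/global/ 2>/dev/null \
    || echo "no raw matches"
```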

No matter what I try, the old server name gets put back into the raft_nodes table as described. Should I raise a bug, or keep battling it out here because it’s something I’m missing (more likely)?

level=warning msg="Cluster member info not found" address="oldservername.domain.tld:8443"
level=error msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: {NodeInfo:{ID:1 Address:oldservername.domain.tld:8443 Role:voter} Name:}"

I thought it might be the server.{crt,key} pair or the cluster pair, so I generated new pairs and deployed them:

  • cluster pair via lxc cluster update-certificate
  • server pair via physical file replacements in /var/snap/lxd/common/lxd/server.{crt,key}
  • in both pairs, included updated SAN & Subject info to suit the new name
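For the server pair, the generation step was along these lines (a plain openssl sketch with hypothetical file names; LXD normally generates an EC certificate itself, so this only illustrates getting the new name into both Subject and SAN):

```shell
# Generate a self-signed EC key/cert with the new hostname in CN and SAN.
openssl req -x509 -nodes -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 \
    -keyout server.key -out server.crt -days 3650 \
    -subj "/CN=newservername.domain.tld" \
    -addext "subjectAltName=DNS:newservername.domain.tld"

# Confirm the SAN made it in.
openssl x509 -in server.crt -noout -ext subjectAltName
```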

I shut down snap.lxd.daemon.service & snap.lxd.daemon.unix.socket and recreated the DB patch files:

/var/snap/lxd/common/lxd/database/patch.local.sql
    UPDATE config SET value='newservername.domain.tld:8443' WHERE key='cluster.https_address';
    UPDATE config SET value='newservername.domain.tld:8443' WHERE key='core.https_address';
    UPDATE raft_nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;

/var/snap/lxd/common/lxd/database/patch.global.sql
    UPDATE nodes SET address = 'newservername.domain.tld:8443' WHERE id = 1;

I started the socket & service again, but the error returned.

There is only one physical server in this cluster at the moment, as there are no free servers until enough instances have been migrated over to rebuild the others. I can’t trash the 26 LXD instances with custom storage volumes, nor the 2 custom images, on this host.

I'm happy to re-init LXD if there is a safe way to re-add all the instances, their configs, and their pools/volumes.

Thanks