Adding a node to my cluster, and it gives me cert errors

dlb · January 6, 2025, 3:57am

I spent all weekend migrating off TrueNAS SCALE to Incus. I’m now trying to join my original box to the cluster. It originally worked, but when I tried to migrate VMs from node0 to node1, it somehow ended up ghosting instances on both nodes after it failed to migrate the ‘custom’ data disks. I couldn’t get rid of the instances on node1, so I tried to get rid of node1 and, well, broke stuff.

On node1 I stopped incus and the socket. I rm -rf’d /var/lib/incus. I reran admin init with all the same stuff, it completed. But now my log is full of these messages:

Invalid client certificate CN=root@node0,O=Linux Containers (big long string of numbers).

Node0 says the node is OFFLINE and No Heartbeat.

Commands run from either node0 or node1 (like list) work. I’m not sure what or where the ‘bad’ client cert is, and I haven’t been able to figure out how to fix it. If I try to ‘remove’ node1, I get ‘Error: not authorized’ from either node.

Any idea on how I can fix the client cert and get node1 to join back?

dlb · January 6, 2025, 3:48pm

So I’m still pretty stuck. I’ve created a CA and made some certs/keys, but I have no idea how to tell incus to trust them. I can import them, but the error logs are still all:

Jan 06 07:45:06 storag incusd[390041]: time="2025-01-06T07:45:06-08:00" level=warning msg="Failed adding member event listener client" err="not authorized" local="[::]:8443" remote="192.168.1.168:8443"
Jan 06 07:45:16 storag incusd[390041]: time="2025-01-06T07:45:16-08:00" level=warning msg="Rejecting request from untrusted client" ip="192.168.1.168:50502"

Which is on node0 now, talking to itself, with no other members joined. If I init a cluster I see that it creates certs signed by something, and it trusts those, but I have no idea where that CA is, or how to create/sign a new cert using it.
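
To see at a glance which addresses are being turned away, something like this works as a rough sketch. It assumes the daemon logs under a systemd unit named incus and that the clients are IPv4 (the sed pattern does not handle bracketed IPv6 addresses). rejected_ips is just a name made up for this example.

```shell
# List the unique source IPs that incusd rejected as untrusted clients.
# Reads log lines on stdin; assumes the ip="ADDR:PORT" field seen in the logs above.
rejected_ips() {
    grep 'Rejecting request from untrusted client' \
        | sed -n 's/.* ip="\([^":]*\):[0-9]*".*/\1/p' \
        | sort -u
}

# Example (unit name is an assumption):
# journalctl -u incus --since today | rejected_ips
```

If the only address that comes back is the node's own, the daemon is rejecting itself, which matches what the logs above show.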

stgraber · January 6, 2025, 5:40pm

So where are your instances in all that?

Given a two-node cluster, we can certainly reset things so the one server you want to keep is back in working order, then reset the other and add it back.

dlb · January 6, 2025, 6:05pm

Hey. So I started by first migrating raw images over to the new server. Then I used incus-migrate to import the ‘boot’ disks and make the vms.

Then I used ‘incus storage volume create’ to make new volumes, and added them to the vms.

Then I used dd to write the ‘data’ disks from the raw images.

Booted up the VMs and blammo, it all worked (except for the Windows VM, where I had to do the nvme device thing. It also lost its activation (sad), and the time keeps drifting wildly on reboots).

Anyway I was excited! So then I rebuilt my original server as a new Incus server, added it to the cluster (totally worked), and then tried to migrate a vm.

It choked on the ‘custom volume’ and left the instance on both cluster nodes. The ‘boot’ volumes were on both zfs pools. I googled, saw some stuff about sql that looked shady, and just decided to remove the cluster node. But it error’d, saying I still have instances on it. I then used --force to remove the second node, blew away /var/lib/incus, re-ran the init steps, and got the cert errors.

I’ve since done terrible things to node0’s certs trying to get it to work, and failed. I’ve rerun init on the second node as a standalone new cluster just to see what the certs should look like.

dlb · January 6, 2025, 11:03pm

I deleted cluster.crt|key and server.crt|key on node0 (storag, 192.168.1.168) and restarted incus. It re-created server.crt|key for me, yet I still get:

Jan 06 14:56:10 storag incusd[586404]: time="2025-01-06T14:56:10-08:00" level=warning msg="Rejecting request from untrusted client" ip="192.168.1.168:37488"
Jan 06 14:56:10 storag incusd[586404]: time="2025-01-06T14:56:10-08:00" level=warning msg="Failed adding member event listener client" err="not authorized" local="[::]:8443" remote="192.168.1.168:8443"

All my VMs are still running on this node. It didn’t recreate cluster.crt|key.
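
A quick way to check which of those four files are actually on disk, as a minimal sketch. The function name cert_files_status is made up for this example, and the default directory is the /var/lib/incus path mentioned earlier in the thread.

```shell
# Report which of the four certificate/key files exist in an Incus state directory.
# Defaults to /var/lib/incus; pass another directory as the first argument.
cert_files_status() {
    dir=${1:-/var/lib/incus}
    for f in server.crt server.key cluster.crt cluster.key; do
        if [ -e "$dir/$f" ]; then
            echo "$f present"
        else
            echo "$f missing"
        fi
    done
}

# Example:
# cert_files_status /var/lib/incus
```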

root@storag ~incus # incus cluster list
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME  |            URL             |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS |      MESSAGE      |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node0 | https://192.168.1.168:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|       |                            | database        |              |                |             |        |                   |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+

dlb · January 7, 2025, 12:36am

Wooo i fixed it!

First I imported the server.crt from node0:

incus config trust add-certificate server.crt

That gave me the entry in the db. Then I updated its type to ‘server’.

incus admin sql global "SELECT * FROM certificates"
incus admin sql global "UPDATE certificates SET type = 2 WHERE id = 26;"

Then I restarted node1 and BLAM they both work meow!
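
For anyone landing here later, the steps above can be put together roughly like this. It is a sketch, not a blessed procedure: the id 26 is specific to this database, so look up your own row with the SELECT first, and editing the global database by hand is at your own risk. make_update_sql is a helper name invented for this example, and the incus calls are left commented out so the snippet is safe to source.

```shell
# Build the UPDATE statement that marks a trust-store row as a server certificate.
# Only accepts a plain integer id, so nothing unexpected ends up in the SQL.
make_update_sql() {
    case "$1" in
        ''|*[!0-9]*) echo "id must be a number" >&2; return 1 ;;
    esac
    printf 'UPDATE certificates SET type = 2 WHERE id = %s;\n' "$1"
}

# On node0 (commands taken from the post above; 26 was the id in that database):
# incus config trust add-certificate server.crt
# incus admin sql global "SELECT * FROM certificates"
# incus admin sql global "$(make_update_sql 26)"
# Then restart incus on node1.
```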