So I tried to upgrade and it went wrong. In the end I got everything running, though I had to migrate some machines, and the cluster still needs repairs and reassembly.
For archival purposes I'm writing down what went wrong and how I fixed it, even where it's no longer a current problem.
When you execute lxd.migrate, don't press yes until you have run it on every machine, or it will fail on the last few members because the cluster is already down. Obvious in hindsight, I guess, but I didn't think about it.
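As far as I understand it now, the safe order would have been something like this (a sketch; the exact prompts may differ between versions):

```
# On EVERY cluster member first:
snap install lxd
lxd.migrate        # answer the questions, but stop at the final confirmation

# Only once lxd.migrate is sitting at that prompt on all members,
# confirm on each of them, so the whole cluster moves together.
```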
Two cluster members migrated successfully and their containers were up and running.
One cluster member failed to migrate. It showed as up in lxc cluster list, but lxc commands gave errors. Since it had only one container, I tried to just reinstall it, but that failed too, with something like "dqlite: no leader"; what I noticed is that it seemed to be trying to contact itself. I specified the correct IP addresses and lxd init went as usual, but at the end I saw leader-related errors. Could it be that after a forced cluster member removal some entries were left in the database, so now I can't re-add the same member (with the same cluster name and IP address)?
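If it is leftover entries, the database can at least be inspected. A minimal sketch, assuming the dqlite tables I've seen referenced in the clustering docs:

```
# What the cluster database still thinks its members are:
lxd sql global "SELECT id, name, address FROM nodes"

# What the local raft layer on the broken member believes:
lxd sql local "SELECT * FROM raft_nodes"

# A stale row with the old name/address would explain why rejoining
# with the same identity fails.
```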
A second cluster member had a bizarre problem even before the migration, which I hadn't noticed.
This is part of a container's backup.yaml from an old backup image in LXD 3.0.3:
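(Reconstructed from memory, so the exact keys and values are an assumption; the important part is where source pointed:)

```
pool:
  name: ssd
  driver: btrfs
  config:
    source: /var/lib/lxd/disks/ssd.img   # a loop image...
# ...while the containers actually lived on the normal
# filesystem under /srv/lxd/ssd.
```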
All containers were properly stored where they should be on the normal filesystem, but the config had these values, and after the import all of this member's containers broke and didn't start. I tried to mount /var/lib/lxd/disks/ssd.img and it was almost empty: just container directories with a backup.yaml in each of them. No idea how this happened, or why everything even worked before.
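For reference, looking inside the image is just a plain loop mount (the mount point is arbitrary):

```
mkdir -p /mnt/ssdimg
mount -o loop /var/lib/lxd/disks/ssd.img /mnt/ssdimg
ls -R /mnt/ssdimg        # only container dirs, each with a backup.yaml
umount /mnt/ssdimg
```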
And then the suffering started: I tried various ways to get rid of this ssd.img (and hdd.img).
- Tried lxd sql to update/rename this pool's source to the proper /srv/lxd/ssd; it didn't work (see the sketch after this list).
- Tried deleting the container and then reimporting, but lxd.import didn't detect /srv/lxd/ssd as a pool.
- Then created a borkssd pool and tried to import from there, but got errors that the container exists on multiple pools.
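For the record, the lxd sql attempt was along these lines (table and column names are my reading of the 4.x schema, so treat them as an assumption):

```
# Find the pool and its config rows:
lxd sql global "SELECT * FROM storage_pools"
lxd sql global "SELECT * FROM storage_pools_config"

# Point the pool source at the real path instead of the loop image:
lxd sql global "UPDATE storage_pools_config SET value='/srv/lxd/ssd' WHERE key='source' AND value='/var/lib/lxd/disks/ssd.img'"
```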
After all that I gave up (it was 22h+ at that point), so I removed that completely borked instance, reinstalled it as a non-clustered one, and migrated all containers via btrfs send and lxd.import; at least that worked. After about three hours (thank god for fast SSDs, only a few containers on HDD, and recent resource upgrades so everything fit) I was finished.
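The per-container move was essentially this pattern (web1, newhost and the pool paths are placeholders; the snap keeps its pools under /var/snap/lxd/common/lxd):

```
# On the broken member, per container:
btrfs subvolume snapshot -r /srv/lxd/ssd/containers/web1 /srv/lxd/ssd/containers/web1-send
btrfs send /srv/lxd/ssd/containers/web1-send | \
    ssh newhost "btrfs receive /var/snap/lxd/common/lxd/storage-pools/ssd/containers/"

# On the receiving member: make it writable, rename, let LXD pick it up:
btrfs property set -ts /var/snap/lxd/common/lxd/storage-pools/ssd/containers/web1-send ro false
mv /var/snap/lxd/common/lxd/storage-pools/ssd/containers/web1-send \
   /var/snap/lxd/common/lxd/storage-pools/ssd/containers/web1
lxd import web1
```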
In the end, by morning nobody had noticed my suffering. And life goes on.
Now I will try to:
- reinstall the member with the borked ssd.img (I hope I'll get around the “no dqlite leader” problem)
- migrate VMs from the non-clustered member (roughly as sketched after this list)
- re-add that non-clustered member back to the cluster
- balance resources
- finally enjoy LXD QEMU support and start migrating Windows VMs
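For the middle two items the plan is roughly this (remote and member names are placeholders, and I still have to check how well VM copies work across remotes):

```
# From a cluster member, pull the instances off the standalone box:
lxc remote add oldbox <address-of-standalone-member>
lxc copy oldbox:somevm somevm --target node3
lxc start somevm

# Then wipe the standalone box and rejoin it to the cluster through
# the interactive "join an existing cluster" path of lxd init.
lxd init
```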
I hope the read was not too lengthy.