lxd-to-incus failed, LXD cluster lost

G’day!

Tried migrating LXD to Incus, however I haven’t had much luck.

Prior to running the tool, LXD was upgraded to 6.1, and I initially installed Incus 6.0.1 and later tried 6.3.

When first running lxd-to-incus --ignore-version-check, it complained about Source server is using incompatible configuration for a few of the profiles. After removing the problem key, the tool started but eventually got stuck.
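
For reference, the cleanup was something like this (placeholders only, since the exact profile and key were specific to my setup):

lxc profile show <profile>          # find the key that lxd-to-incus flagged
lxc profile unset <profile> <key>   # remove it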

The current state is:

  • The LXD cluster can no longer be accessed.
  • One of the nodes (node A) is complaining about a version mismatch and needing to be upgraded (even though its version is the same as the other nodes).
  • Another node (node B) shows nothing under lxd cluster ls, whilst the others have 3 or more IP addresses listed. lxd cluster show for this node returns a list of numbers, whereas the others return the expected config YAML.
  • lxc cluster ls returns LXD unix socket not accessible (except on node B above, which says LXD server isn't part of a cluster).
  • lxd sql global .dump > lxd.global.backup fails and returns Error: failed to request dump: Get "http://unix.socket/internal/sql?database=global": EOF, I guess because the socket is not accessible.

From what I can tell, the containers are still in /var/snap/lxd/common/lxd/storage-pools/local/containers

lxd recover on node B complains about some missing profiles, so I haven’t completed that.

I know this forum isn’t LXD focused anymore, but I would like to be able to migrate over to incus, so any tips would be appreciated :slight_smile:

Cheers!

Further update:

  • In /var/snap/lxd/common/lxd/database/global/db.bin, node B has api_extensions 406 while the rest have 387. All the schemas are 73.
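
(In case it helps anyone else, I checked those values by reading db.bin directly with sqlite3, roughly like this:)

sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT name, schema, api_extensions FROM nodes;"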

Hmm, so what did lxd-to-incus do exactly in this cluster environment?

Because normally it should have shown a message saying that this is a cluster and that it will need you to run the migration tool on all the servers, doing that after it has evacuated all the workloads, …

Just trying to figure out if any of the migration did occur already or what’s going on exactly here.

This is definitely recoverable but it’s hard to figure out how and how hard it will be without a lot more details :slight_smile:

Thanks @stgraber - I assumed more details would be needed but wasn’t sure where to start!

Also - there are 10 nodes in the cluster, which I left out of the OP.

Just found the logs for the failed lxd-to-incus attempt; the command was issued on node B:

Source server: snap package
Target server: systemd
Source server is a cluster
Source server paths: &{daemon:/var/snap/lxd/common/lxd logs:/var/snap/lxd/common/lxd/logs cache:/var/snap/lxd/common/lxd/cache}
Target server paths: &{daemon:/var/lib/incus logs:/var/log/incus cache:/var/cache/incus}
Rewrite SQL statements:
 - does a bunch of SQL rewrites
Migration started
Stopping instances on server "node B" # <-- this is where it got stuck
ERROR: websocket: close 1006 (abnormal closure): unexpected EOF # (I assume this is the Ctrl+C which eventually I issued)

And there’s another log a bit later with:

Running in cluster member mode
Source server: snap package
Target server: systemd
Source server is a cluster
Source server paths: &{daemon:/var/snap/lxd/common/lxd logs:/var/snap/lxd/common/lxd/logs cache:/var/snap/lxd/common/lxd/cache}
Target server paths: &{daemon:/var/lib/incus logs:/var/log/incus cache:/var/cache/incus}
Rewrite SQL statements:
Rewrite commands:
Migration started
Stopping the source server
Stopping the target server
Unmounting "/var/lib/incus/devlxd"
Unmounting "/var/lib/incus/shmounts"
Wiping the target server
Migrating the data
Moving data over
Migrating database files
Cleaning up target paths
Cleaning up path "/var/lib/incus/backups"
Cleaning up path "/var/lib/incus/images"
Cleaning up path "/var/lib/incus/devices"
Cleaning up path "/var/lib/incus/devlxd"
Cleaning up path "/var/lib/incus/security"
Cleaning up path "/var/lib/incus/shmounts"
Rewrite symlinks:
 - # container rewrites
Rewrite symlinks:
Rewrite symlinks:
Rewrite symlinks:
Starting the target server

The lxd-to-incus tool never got to a point where it prompted me to enter commands on the other nodes.

Okay, so for the first log, nothing would have actually happened to your cluster at this point, other than a stuck evacuation. Any data/database migration would only have occurred after all servers were properly evacuated.

The second log is where things went pretty bad as the error shows that the command was run with --cluster-member. That should NEVER be done directly by you, it should only be done when instructed by the tool to run it that way on the remaining servers.

Running with --cluster-member bypassed all the checks and went straight on to data migration on that server, moving everything when the cluster is not at all ready to be converted…

This then caused that server to go back online running Incus when the rest still runs LXD.
The mismatch in DB schema and API extensions made that server hang, because it wants all servers to have a matching version, and it will cause all remaining servers to eventually fail once they notice that they’re behind with no way to catch up.

As it’s only one server that’s been incorrectly converted, the easiest way to recover it is most likely:

  • Stop Incus on the server
  • Stop LXD on the server (if running for some reason)
  • Make sure LXD version matches the rest of the servers
  • Manually move all the data back to the LXD paths
  • Manually re-create all the symlinks to have them point to the correct LXD paths
  • Restore the DB backup that was made by lxd-to-incus
  • Alter the LXD global database from the remaining servers to roll back the API and DB version of that first server (see the sketch below)
  • Start LXD back up
  • Confirm the cluster is back to being fully functional

And then go back to running lxd-to-incus, let it deal with the evacuation and if the evacuation gets stuck, stop and figure out why your instances can’t be shut down.
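
For the database rollback step, a rough sketch run from one of the still-working LXD servers would be something like this (node name is a placeholder; 73/387 are the schema/api_extensions values the rest of your cluster reports):

lxd sql global "UPDATE nodes SET schema=73, api_extensions=387 WHERE name='<node-B-name>';"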

Right… no idea why, but I definitely did run this on node B:

lxd-to-incus --cluster-member node A

and node B is the one that has the issue (I was reading the logs wrong on node A - all the other servers actually say that “this version is behind”, which makes sense because Incus on node B is ahead now)

So far, on the node B server:

  • I’ve added back all the symlinks for the containers that were still in local and were referenced in the logs
  • there are two other storage pools that should exist but aren’t present at either /var/lib/incus/storage-pools or /var/snap/lxd/common/lxd/storage-pools
    • one was a “dir” pool and still exists in its source location (i.e. /mnt/data) - should this exist in the storage-pools directory as a bind mount?
    • the other was a “loop” pool that wasn’t actually used at all, but there’s no .img in /var/lib/incus/disks or /var/snap/lxd/common/lxd/disks
  • there doesn’t seem to be anything that has moved over to /var/lib/incus. The global DB and local DB don’t seem to reference any data, whereas the other servers’ do.
$ tree /var/lib/incus
/var/lib/incus
├── backups
│   ├── custom
│   └── instances
├── containers
├── containers-snapshots
├── database
│   ├── global
│   │   ├── 0000000000000001-0000000000000001
│   │   ├── 0000000000000002-0000000000000245
│   │   ├── db.bin
│   │   └── metadata1
│   └── local.db
├── devices
├── disks
├── guestapi
├── images
├── networks
├── security
│   ├── apparmor
│   │   ├── cache
│   │   └── profiles
│   └── seccomp
├── server.crt
├── server.key
├── shmounts
├── storage-pools
├── unix.socket
├── unix.socket.user
├── virtual-machines
└── virtual-machines-snapshots
  • I’m not sure where lxd-to-incus has created the global/local backups. Though I can see the following OVN backups, I haven’t been able to find the local.db.bak anywhere:
$ ls /var/backups -lah
-rw-------  1 root root  46K Jul 26 13:56 lxd-to-incus.ovn-nb.1718726.backup
-rw-------  1 root root  46K Jul 26 13:57 lxd-to-incus.ovn-nb.1720663.backup
-rw-------  1 root root  44K Jul 26 14:05 lxd-to-incus.ovn-nb.1724055.backup
-rw-------  1 root root  44K Jul 26 14:23 lxd-to-incus.ovn-nb.1725408.backup
-rw-------  1 root root  44K Jul 26 14:29 lxd-to-incus.ovn-nb.1748429.backup
-rw-------  1 root root 387K Jul 26 13:56 lxd-to-incus.ovn-sb.1718726.backup
-rw-------  1 root root 387K Jul 26 13:57 lxd-to-incus.ovn-sb.1720663.backup
-rw-------  1 root root 354K Jul 26 14:05 lxd-to-incus.ovn-sb.1724055.backup
-rw-------  1 root root 354K Jul 26 14:23 lxd-to-incus.ovn-sb.1725408.backup
-rw-------  1 root root 354K Jul 26 14:29 lxd-to-incus.ovn-sb.1748429.backup

And that’s kinda where I’m at for tonight. Worst case, can I copy the global DB from another server, and the local.db? (The local one would require some changes to the config table, but the rest of the tables look the same, with the exception of the timestamps in some tables.)

Look in /var/lib/incus/database, there will be pre-migrate directories/files in there for the database.

It doesn’t look like it’s created the pre-migrate directory

tree /var/lib/incus/database
/var/lib/incus/database
├── global
│   ├── 0000000000000001-0000000000000001
│   ├── 0000000000000002-0000000000000245
│   ├── db.bin
│   └── metadata1
└── local.db

1 directory, 5 files

That’s odd as the log you’ve shown previously had:

Migrating database files

Which is what we log when we get to this part of the code:

	// Migrate database format.
	fmt.Println("=> Migrating database")
	_, _ = logFile.WriteString("Migrating database files\n")

	_, err = subprocess.RunCommand("cp", "-R", filepath.Join(targetPaths.daemon, "database"), filepath.Join(targetPaths.daemon, "database.pre-migrate"))
	if err != nil {
		_, _ = logFile.WriteString(fmt.Sprintf("ERROR: %v\n", err))
		return fmt.Errorf("Failed to backup the database: %w", err)
	}

	err = migrateDatabase(filepath.Join(targetPaths.daemon, "database"))
	if err != nil {
		_, _ = logFile.WriteString(fmt.Sprintf("ERROR: %v\n", err))
		return fmt.Errorf("Failed to migrate database in %q: %w", filepath.Join(targetPaths.daemon, "database"), err)
	}

Actually, I think I had my path wrong earlier. The logic above shows it copying /var/lib/incus/database to /var/lib/incus/database.pre-migrate, so that’s where your full backup of global and local databases would be.

OK, so it was meant to move the logs, cache, and then the daemon directories from lxd to incus, then copy incus/database to incus/database.pre-migrate.

And if it failed, it should have logged that. But from what I can see, nothing got moved over (otherwise why would I still have the container instances in /var/snap/lxd/common/lxd/storage-pools/local/containers?).

If I run lxd-to-incus --cluster-member node A on node B… it’s still node B whose data it’s migrating over, right?

Edit: OK, --cluster-member is a bool flag, so I imagine the node name string just gets ignored.

Those two steps, on the server where you ran with --cluster-member, definitely indicate that the data was moved.

	// Migrate data.
	fmt.Println("=> Migrating the data")
	_, _ = logFile.WriteString("Migrating the data\n")

	_, err = subprocess.RunCommand("mv", sourcePaths.logs, targetPaths.logs)
	if err != nil {
		_, _ = logFile.WriteString(fmt.Sprintf("ERROR: %v\n", err))
		return fmt.Errorf("Failed to move %q to %q: %w", sourcePaths.logs, targetPaths.logs, err)
	}

	_, err = subprocess.RunCommand("mv", sourcePaths.cache, targetPaths.cache)
	if err != nil {
		_, _ = logFile.WriteString(fmt.Sprintf("ERROR: %v\n", err))
		return fmt.Errorf("Failed to move %q to %q: %w", sourcePaths.cache, targetPaths.cache, err)
	}

		_, _ = logFile.WriteString("Moving data over\n")

		_, err = subprocess.RunCommand("mv", sourcePaths.daemon, targetPaths.daemon)
		if err != nil {
			_, _ = logFile.WriteString(fmt.Sprintf("ERROR: %v\n", err))
			return fmt.Errorf("Failed to move %q to %q: %w", sourcePaths.daemon, targetPaths.daemon, err)
		}

So the “Migrating the data” step happens just before the logs and cache are moved over, and “Moving data over” happens just before the data is moved from /var/snap/lxd/common/lxd to /var/lib/incus.

After that we get to Migrating database files, so since you got that far, all those mv commands have been run and didn’t fail.

So whatever system ran the --cluster-member command should have had its data re-shuffled.

Well, that’s got me stumped: the system which ran the command still has the instances in the LXD daemon path, and the databases in both LXD and Incus appear to be empty when queried:

sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "select * from nodes;"
1|none||0.0.0.0|73|387|2024-07-26 14:54:52|0|2|
sqlite3 /var/lib/incus/database/global/db.bin "select * from nodes;"
1|none||0.0.0.0|73|406|2024-07-27 03:22:52|0|2|

and there’s no /var/lib/incus/database.pre-migrate

If it’s the case that there’s no usable global or local database, what would you advise is the best way forward from here?

Get a full bash history from the system to figure out what exactly was done on it, because it’s not possible to have reached the end of that migration stage without the pre-migrate backup directory having been created.

So something must have happened afterwards, whether it’s reinstalling incus with --purge or something which would have then wiped /var/lib/incus clean.

In any case, it doesn’t look like this system has any usable data anymore.

Depending on how the instances are stored (storage pool setup), it may be possible to forcefully remove this machine from the LXD cluster, add it back as an empty server and then use the disaster recovery procedure to get its instances back into LXD.
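
If you go that route, the rough sequence (a sketch only, member name as a placeholder) would be along these lines:

# From a healthy cluster member, forcefully remove the broken one
lxc cluster remove <member> --force

# On the broken member, reinstall/reset LXD, then join it back as an empty server
# (lxd init with a join token generated via `lxc cluster add <member>` on a healthy member)

# Once it's back in the cluster, run the disaster recovery tool on it
lxd recover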

So I managed to get the LXD cluster back by:

  • patching the global DB to roll back the version/API number
  • forcefully removing the problem node from the cluster
  • re-adding the node as a blank server
  • running lxd recover

I had to fix some database entries for some volumes and troubleshoot some OVN configuration, but from what I can see it was all working as it should.

Running lxd-to-incus, I’m still having problems with the same node:

$ lxd-to-incus --ignore-version-check
=> Looking for source server
==> Detected: snap package
=> Looking for target server
==> Detected: systemd
=> Connecting to source server
=> Connecting to the target server
=> Checking server versions
==> Source version: 6.1
==> Target version: 6.3
=> Validating version compatibility
==> WARNING: User asked to bypass version check
=> Checking that the source server isn't empty
=> Checking that the target server is empty
=> Validating source server configuration

The migration is now ready to proceed.

A cluster environment was detected.
Manual action will be needed on each of the server prior to Incus being functional.
The migration will begin by shutting down instances on all servers.

It will then convert the current server over to Incus and then wait for the other servers to be converted.

Do not attempt to manually run this tool on any of the other servers in the cluster.
Instead this tool will be providing specific commands for each of the servers.
Proceed with the migration? [default=no]: yes
=> Stopping all workloads on the cluster
==> Stopping all workloads on server "chribro-ed1"
==> Stopping all workloads on server "chribro-ed2"
==> Stopping all workloads on server "chribro-ed0"
==> Stopping all workloads on server "chribro-ed3"
==> Stopping all workloads on server "chribro-ed4"
==> Stopping all workloads on server "chribro-ed6"
==> Stopping all workloads on server "chribro-ed5"
==> Stopping all workloads on server "chribro-ed7"
==> Stopping all workloads on server "chribro-ed20"
==> Stopping all workloads on server "chribro-ws"
Error: Failed to stop workloads "chribro-ws": websocket: close 1006 (abnormal closure): unexpected EOF

The logs from the problem server:

$ journalctl -xe
Aug 12 16:18:14 chribro-ws ovsdb-client[118895]: ovs|00001|reconnect|INFO|tcp:192.168.1.96:6641: connecting...
Aug 12 16:18:14 chribro-ws ovsdb-client[118895]: ovs|00002|reconnect|INFO|tcp:192.168.1.96:6641: connected
Aug 12 16:18:14 chribro-ws ovsdb-client[118896]: ovs|00001|reconnect|INFO|tcp:192.168.1.96:6642: connecting...
Aug 12 16:18:14 chribro-ws ovsdb-client[118896]: ovs|00002|reconnect|INFO|tcp:192.168.1.96:6642: connected
Aug 12 16:18:14 chribro-ws ovsdb-client[118903]: ovs|00001|reconnect|INFO|tcp:192.168.1.96:6641: connecting...
Aug 12 16:18:14 chribro-ws ovsdb-client[118903]: ovs|00002|reconnect|INFO|tcp:192.168.1.96:6641: connected
Aug 12 16:18:14 chribro-ws ovsdb-client[118904]: ovs|00001|reconnect|INFO|tcp:192.168.1.96:6642: connecting...
Aug 12 16:18:14 chribro-ws ovsdb-client[118904]: ovs|00002|reconnect|INFO|tcp:192.168.1.96:6642: connected
Aug 12 16:18:47 chribro-ws ovsdb-server[2597]: ovs|00154|raft|INFO|Dropped 4 log messages in last 274 seconds (most recently, 274 seconds ago) due to excessive rate
Aug 12 16:18:47 chribro-ws ovsdb-server[2597]: ovs|00155|raft|INFO|current entry eid 16aa2585-a484-4d00-a0d9-70b0a0c8f88e does not match prerequisite 78577767-9b0a-49de-a096-644df7f8fed5 in execute_command_request
Aug 12 16:18:47 chribro-ws ovsdb-server[2597]: ovs|00156|raft|INFO|current entry eid 8840febd-da8f-4475-837c-b3c395e2f7ce does not match prerequisite 16aa2585-a484-4d00-a0d9-70b0a0c8f88e in execute_command_request
Aug 12 16:18:47 chribro-ws ovsdb-server[2597]: ovs|00157|raft|INFO|current entry eid 8840febd-da8f-4475-837c-b3c395e2f7ce does not match prerequisite 16aa2585-a484-4d00-a0d9-70b0a0c8f88e in execute_command_request
Aug 12 16:18:47 chribro-ws ovsdb-server[2597]: ovs|00158|raft|INFO|current entry eid 8840febd-da8f-4475-837c-b3c395e2f7ce does not match prerequisite 16aa2585-a484-4d00-a0d9-70b0a0c8f88e in execute_command_request
Aug 12 16:18:47 chribro-ws ovsdb-server[2597]: ovs|00159|raft|INFO|current entry eid 8840febd-da8f-4475-837c-b3c395e2f7ce does not match prerequisite 16aa2585-a484-4d00-a0d9-70b0a0c8f88e in execute_command_request
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: time="2024-08-12T16:18:53Z" level=warning msg="Instance will not be migrated because its device cannot be migrated" device=test_zvol instance=docker-development project=default
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: time="2024-08-12T16:18:53Z" level=warning msg="Instance will not be migrated because its device cannot be migrated" device=docker instance=docker-manager project=default
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: time="2024-08-12T16:18:53Z" level=warning msg="Instance will not be migrated because its device cannot be migrated" device=pixel instance=docker0 project=default
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x108 pc=0x156d60c]
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: goroutine 701779 [running]:
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: github.com/canonical/lxd/lxd/device.(*disk).CanMigrate(0xc0009df650)
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/device/disk.go:99 +0x8c
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: github.com/canonical/lxd/lxd/instance/drivers.(*common).canMigrate(0xc001adb800, {0x2678a00, 0xc001adb800})
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/instance/drivers/driver_common.go:1112 +0x234
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: github.com/canonical/lxd/lxd/instance/drivers.(*lxc).CanMigrate(0x1ea5da0?)
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/instance/drivers/driver_lxc.go:8043 +0x1f
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: main.evacuateInstances({0x2638108, 0x3cc6660}, {0xc001f16d20, 0xc0006b4e00, 0xc001cca360, {0xc0008f0b40, 0xa, 0xa}, {0xc001b6f778, 0x4}, ...})
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/api_cluster.go:3260 +0x297
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: main.evacuateClusterMember.func2(0xc001633900)
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/api_cluster.go:3231 +0x371
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: github.com/canonical/lxd/lxd/operations.(*Operation).Start.func1(0xc001633900)
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/operations/operations.go:287 +0x26
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]: created by github.com/canonical/lxd/lxd/operations.(*Operation).Start in goroutine 701574
Aug 12 16:18:53 chribro-ws lxd.daemon[2930]:         github.com/canonical/lxd/lxd/operations/operations.go:286 +0x105
Aug 12 16:18:53 chribro-ws lxd.daemon[2773]: => LXD failed with return code 2
Aug 12 16:18:53 chribro-ws systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE

Running it a second time (after restoring the evacuated members) got further and for the most part managed to get through, but not without showing some errors:

$ lxd-to-incus --ignore-version-check
=> Looking for source server
==> Detected: snap package
=> Looking for target server
==> Detected: systemd
=> Connecting to source server
=> Connecting to the target server
=> Checking server versions
==> Source version: 6.1
==> Target version: 6.3
=> Validating version compatibility
==> WARNING: User asked to bypass version check
=> Checking that the source server isn't empty
=> Checking that the target server is empty
=> Validating source server configuration

The migration is now ready to proceed.

A cluster environment was detected.
Manual action will be needed on each of the server prior to Incus being functional.
The migration will begin by shutting down instances on all servers.

It will then convert the current server over to Incus and then wait for the other servers to be converted.

Do not attempt to manually run this tool on any of the other servers in the cluster.
Instead this tool will be providing specific commands for each of the servers.
Proceed with the migration? [default=no]: yes
=> Stopping all workloads on the cluster
==> Stopping all workloads on server "chribro-ed1"
==> Stopping all workloads on server "chribro-ed2"
==> Stopping all workloads on server "chribro-ed0"
==> Stopping all workloads on server "chribro-ed3"
==> Stopping all workloads on server "chribro-ed4"
==> Stopping all workloads on server "chribro-ed6"
==> Stopping all workloads on server "chribro-ed5"
==> Stopping all workloads on server "chribro-ed7"
==> Stopping all workloads on server "chribro-ed20"
==> Stopping all workloads on server "chribro-ws"
=> Stopping the source server
=> Stopping the target server
=> Wiping the target server
=> Migrating the data
=> Migrating database
=> Writing database patch
=> Running data migration commands
==> WARNING: 313 commands out of 1416 succeeded (1103 failures)
    Please review the log file for details.
    Note that in OVN environments, it's normal to see some failures
    related to Flow Rules and Switch Ports as those often change during the migration.
=> Cleaning up target paths
=> Starting the target server
=> Waiting for other cluster servers

Please run `lxd-to-incus --cluster-member` on all other servers in the cluster

The command has been started on all other servers? [default=no]: yes

=> Waiting for cluster to be fully migrated
=> Checking the target server
=> Restoring the cluster
Error: Failed to retrieve the list of cluster members

The other members finished without that last error. However, all members are still evacuated. The tool also knocked out the network (Open vSwitch) on each member at this point:

=> Checking the target server

and I had to jump on each server and run netplan apply

The members which are the OVN controllers have the following error:

$ incus cluster ls
Error: context deadline exceeded

Update: a reboot fixed it :slight_smile:

Okay, it’s normal that they’d be in an evacuated state after that failure. Did you bring them all back up manually with incus cluster restore?
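
If not, something like this per member (using one of your member names as an example) should do it:

incus cluster list                  # check which members still show as evacuated
incus cluster restore chribro-ed0   # repeat for each evacuated member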

How are things looking once that’s done, is OVN still misbehaving?