If I try this when the container is running I get:
Error: Container is running
It seems it doesn’t even try to migrate the container. Do I have to enable live migration in LXD somehow?
BTW: I can take a stateful snapshot, move the container to the other machine and restore this stateful snapshot on the other machine, and it runs just fine.
So the main issue here is that LXD right now can’t move a container from one node to another without also renaming it, and renaming isn’t supported when using CRIU.
We do have planned work to get a new API specifically to move containers within the cluster, this could internally be made to trigger CRIU for running containers. Whether we’ll do that part or not, I’m not sure given the current state of CRIU (we pretty much can’t test it since it really can’t migrate a whole lot these days).
ah, ok, so live migration in an lxd cluster is just not supported at the moment. I was starting to feel quite stupid that I can’t get it to work.
Well, I’m quite happy that offline migration on an lxd cluster with ceph storage works so easily. Live migration would have been a nice bonus, but since I’m getting weird ext4 errors in the one container I moved via stateful snapshots it might be a good idea to stay away from live migration anyway.
No, CRIU is the only option and we’re not the ones developing that, it’s an external project.
It’s an extremely complex piece of work and not something we intend to work on ourselves. We do make sure that LXD properly interacts with CRIU, but we don’t work on adding missing features to CRIU.
Apologies for being a pain. I just wanted to understand what LXD needs from CRIU to do live migrations. I am trying to open an issue and start a discussion there.
Also, in theory, if LXD could move a container from one node to another without renaming it, would CRIU work in that case?
The moving from one node to another without renaming is indeed a LXD issue and is being worked on.
The CRIU issues will show up after that: you’ll more than likely notice a wide range of processes in your containers using kernel features that CRIU doesn’t know how to handle.
In another issue I noticed that your profile allows for containers to use /dev/kvm, that’d be an example of something that CRIU doesn’t know how to serialize, so as long as anything in the container uses /dev/kvm, it’ll fail live migration. Current systemd is also a problem and can’t be live migrated.
The easiest way for you to see what will happen once we fix the LXD move issue would be to have a node that is not in your cluster and doesn’t use Ceph, then attempt to move one of your containers over to it. That will have LXD use good old rsync for the data, which is slow but should work fine, and then call CRIU for you, which will most likely fail and give you some hints as to what’s missing in CRIU for your particular use case.
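As a rough sketch of that test: the container name c1 and the remote name standalone below are placeholders, and the remote is assumed to have been added beforehand with lxc remote add. The wrapper try_live_move is made up for illustration and only prints the command unless RUN=1 is set, so you can check what would be run first.

```shell
#!/bin/sh
# try_live_move is a hypothetical wrapper, not an lxc command.
# $1 = container name, $2 = name of a remote outside the cluster.
# By default it only prints the command; set RUN=1 to execute it.
try_live_move() {
    ct=$1
    remote=$2
    if [ "${RUN:-0}" = "1" ]; then
        # Moving a running container to another host is what makes
        # LXD hand the process state over to CRIU.
        lxc move "$ct" "$remote:$ct"
    else
        echo "lxc move $ct $remote:$ct"
    fi
}

# Placeholder names, dry run only.
try_live_move c1 standalone
```

If CRIU fails, the error returned by the move is where the hints about unsupported features should show up.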
I don’t do clustering, but I have 3 LXD nodes where two run containers and one functions as a backup.
What I do is create snapshots of the running containers and copy the snapshots to the backup without renaming.
This way I can spin up containers on the backup if there is a problem on the running hosts.
To streamline this process I made a script for the backup host that I can run manually or as a cron job.
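#!/bin/sh
# Run on the backup host. Refreshes the local copy of every container
# that is RUNNING on lxchost2 and lxchost3.

# Names of local STOPPED containers and of RUNNING containers on each remote
ct1=$(lxc list | grep STOPPED | awk '{print $2}')
ct2=$(lxc list lxchost2: | grep RUNNING | awk '{print $2}')
ct3=$(lxc list lxchost3: | grep RUNNING | awk '{print $2}')

# Delete local STOPPED containers (the copies from the previous run)
for lxc1 in $ct1; do
    lxc delete "$lxc1"
done

# Note: each snapshot is named after its container. Because the steps are
# chained with &&, the old snapshot must already exist, otherwise the
# delete fails and the snapshot and copy steps are skipped.

# lxchost2 - delete the old snapshot, create a new one and copy it here
for lxc2 in $ct2; do
    lxc delete "lxchost2:$lxc2/$lxc2" &&
    lxc snapshot "lxchost2:$lxc2" "$lxc2" &&
    lxc copy "lxchost2:$lxc2/$lxc2" "$lxc2"
done

# lxchost3 - delete the old snapshot, create a new one and copy it here
for lxc3 in $ct3; do
    lxc delete "lxchost3:$lxc3/$lxc3" &&
    lxc snapshot "lxchost3:$lxc3" "$lxc3" &&
    lxc copy "lxchost3:$lxc3/$lxc3" "$lxc3"
done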
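One fragile spot in the script above is that it greps the table output of lxc list, so a container whose name happens to contain RUNNING or STOPPED would match too. A possible alternative, assuming your lxc client supports --format csv with the -c ns column selection (check lxc list --help), is to match the state column exactly. The helper name names_with_state is made up for this sketch:

```shell
#!/bin/sh
# names_with_state is a hypothetical helper, not part of lxc.
# It reads "name,STATE" lines on stdin and prints the names whose
# state column equals the first argument exactly.
names_with_state() {
    awk -F, -v state="$1" '$2 == state { print $1 }'
}

# Self-contained demo on sample data (no lxc client needed).
# Only "web" is printed; "RUNNING-old" is skipped because its
# state column is STOPPED.
printf 'web,RUNNING\ndb,STOPPED\nRUNNING-old,STOPPED\n' | names_with_state RUNNING
```

With a real client this could be used as ct2=$(lxc list lxchost2: --format csv -c ns | names_with_state RUNNING), again assuming the CSV columns come out as name,state.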
What are the benefits of running lxd in a cluster?
Centralized CLI, shared clustered DNS, probably more things, but those are the ones that spring to mind. I don’t use clustering either, but I can see the point of it for simplifying the management of, say, more than 3 LXD hosts…