Enable live migration?

Hi,

I’d like to try out live migration, but somehow I can’t get it to work.

I’m running Ubuntu 18.04 and installed lxd 3.0.1 and criu 3.6 via apt.

I’ve set up my two Ubuntu hosts, infra1 and infra2, as an LXD cluster. The storage pool is based on Ceph.

I can create a container e.g. bionic and move it from one server to the other as long as it is stopped.

lxc stop bionic
lxc move bionic --target infra2
lxc start bionic

If I try this when the container is running I get:

Error: Container is running

It seems it doesn’t even try to migrate the container. Do I have to enable live migration in LXD somehow?

BTW: I can take a stateful snapshot, move the container to the other machine and restore that stateful snapshot there, and it runs just fine.
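A minimal sketch of that stateful-snapshot workaround as a POSIX sh function. The function name and the snapshot name `pre-move` are hypothetical. The `--stateful` snapshot and restore steps still rely on CRIU, and restoring a snapshot rolls the container back to the moment the snapshot was taken, so anything written after that point is lost.

```shell
#!/bin/sh
# Hypothetical helper for the stateful-snapshot workaround.
# Usage: move_via_stateful_snapshot <container> <cluster-member>
move_via_stateful_snapshot() {
    ct="$1"
    target="$2"
    snap="pre-move"   # hypothetical snapshot name

    # Save the running state (uses CRIU), stop, move, then restore the state.
    # Each step only runs if the previous one succeeded.
    lxc snapshot "$ct" "$snap" --stateful &&
        lxc stop "$ct" &&
        lxc move "$ct" --target "$target" &&
        lxc restore "$ct" "$snap" --stateful
}
```

For example, `move_via_stateful_snapshot bionic infra2`.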

Regards

Chris

So the main issue here is that LXD right now can’t move a container from one node to another without having it be renamed too and renaming isn’t supported when using CRIU.

We do have planned work to get a new API specifically to move containers within the cluster, this could internally be made to trigger CRIU for running containers. Whether we’ll do that part or not, I’m not sure given the current state of CRIU (we pretty much can’t test it since it really can’t migrate a whole lot these days).

Ah, OK, so live migration in an LXD cluster is just not supported at the moment. I was starting to feel quite stupid that I couldn’t get it to work.

Well, I’m quite happy that offline migration on an lxd cluster with ceph storage works so easily. Live migration would have been a nice bonus, but since I’m getting weird ext4 errors in the one container I moved via stateful snapshots it might be a good idea to stay away from live migration anyway.

Thanks for the clarification.

Regards

Chris

@stgraber: Is there any alternative to move live containers in ceph environment?

Nope, moving live processes can only be done with CRIU, there aren’t any other kernel features or userspace tools that can do that at this time.

Your only other alternative is to temporarily shut down the container and move it while stopped.
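A sketch of that stop/move/start sequence as a small sh function, assuming the target is the name of a cluster member. The name `offline_move` is hypothetical. If the move fails, the function starts the container again where it was instead of leaving it stopped.

```shell
#!/bin/sh
# Hypothetical helper: move a container to another cluster member while stopped.
# Usage: offline_move <container> <cluster-member>
offline_move() {
    ct="$1"
    target="$2"

    lxc stop "$ct" || return 1
    if ! lxc move "$ct" --target "$target"; then
        # The move failed: start the container again on its original member.
        lxc start "$ct"
        return 1
    fi
    # Start it again on the new member.
    lxc start "$ct"
}
```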

@stgraber Thanks.
I will wait for you guys to implement this support. Is it planned for any release right now?

No, CRIU is the only option and we’re not the ones developing that, it’s an external project.

It’s an extremely complex piece of work and not something we intend to work on ourselves, we do make sure that LXD properly interacts with CRIU but don’t work on adding missing features to CRIU ourselves.

Time to bug CRIU devs then :grinning:

Apologies for being a pain. I just wanted to understand what LXD needs from CRIU to do live migrations; I’m trying to open an issue and start a discussion there.
Also, in theory, if LXD could move a container from one node to another without renaming it, would CRIU work in that case?

The moving from one node to another without renaming is indeed a LXD issue and is being worked on.

The CRIU issues will happen after that where you’ll more than likely notice a wide range of processes in your containers using kernel features that CRIU doesn’t know how to handle.

In another issue I noticed that your profile allows for containers to use /dev/kvm, that’d be an example of something that CRIU doesn’t know how to serialize, so as long as anything in the container uses /dev/kvm, it’ll fail live migration. Current systemd is also a problem and can’t be live migrated.
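A rough way to spot containers that pass /dev/kvm through, sketched as an sh function (`uses_kvm` is a hypothetical name). It greps the expanded config, which includes devices inherited from profiles, so it only catches what is declared in LXD config; it says nothing about other kernel features (such as those systemd uses) that CRIU may fail on.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the container's expanded config (its own
# config plus everything inherited from profiles) mentions /dev/kvm.
# Usage: uses_kvm <container>
uses_kvm() {
    lxc config show "$1" --expanded | grep -q '/dev/kvm'
}
```

For example: `for ct in $(lxc list -c n --format csv); do uses_kvm "$ct" && echo "$ct uses /dev/kvm"; done`.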

The easiest way for you to see what will happen once we fix the LXD move issue would be to have a node that is not in your cluster and doesn’t use Ceph, then attempt to move one of your containers over to it. That will have LXD use good old rsync for the data, which is slow but should work fine, and then call CRIU for you, which will most likely fail and give you some hints as to what’s missing in CRIU for your particular use case.
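A sketch of that experiment, assuming the standalone (non-clustered, non-Ceph) host has already been added with `lxc remote add` under the hypothetical name `standalone`. Moving a running container to another remote makes LXD transfer the data and call CRIU; the function only reports the outcome, and CRIU's own error message is what points at the missing feature.

```shell
#!/bin/sh
# Hypothetical helper for the live-migration experiment. "standalone" is an
# assumed remote name for a non-clustered LXD host without Ceph.
# Usage: try_live_move <container> [remote]
try_live_move() {
    ct="$1"
    remote="${2:-standalone}"

    # Moving a running container to another remote triggers a live
    # migration attempt (data transfer plus CRIU checkpoint/restore).
    if lxc move "$ct" "$remote:$ct"; then
        echo "live migration of $ct to $remote succeeded"
    else
        echo "live migration of $ct to $remote failed; check the CRIU error above" >&2
        return 1
    fi
}
```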


Hi @stgraber any updates on this, is it available now? Thanks.

Not sure. I’m using CRIU with LXD (snap) and still have problems with it.
But Stephane is typing now… :slight_smile:

Moving containers between nodes within a cluster can be done with renames now.

CRIU integration for clustering hasn’t been touched though, and it isn’t a big priority for us given how unreliable CRIU generally is.


Yes, I realise that now. Hopefully the CRIU team will work on a solution together; I feel like this is a big missing part of clustering in LXD 3.0.


I don’t do clustering, but I have 3 LXD nodes: two run containers and one functions as a backup.
What I do is create snapshots of the running containers and copy the snapshots to the backup without renaming them.
This way I can spin up the containers on the backup if there is a problem on the running hosts.

To streamline this process I made a script to run on the backup host, either manually or as a cron job.

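#!/bin/sh
# Run on the backup host, manually or from cron.

# Container names by state, parsed from CSV output rather than the ASCII table
ct1=$(lxc list -c ns --format csv | awk -F, '$2 == "STOPPED" {print $1}')
ct2=$(lxc list lxchost2: -c ns --format csv | awk -F, '$2 == "RUNNING" {print $1}')
ct3=$(lxc list lxchost3: -c ns --format csv | awk -F, '$2 == "RUNNING" {print $1}')

# Delete local STOPPED containers (the copies made by the previous run)
for lxc1 in $ct1; do
        lxc delete "$lxc1"
done

# lxchost2 - Delete, create and copy snapshot.
# The delete is not chained with &&: on the first run there is no snapshot
# yet, and a failed delete would otherwise skip the snapshot and the copy.
for lxc2 in $ct2; do
        lxc delete "lxchost2:$lxc2/$lxc2" 2>/dev/null
        lxc snapshot "lxchost2:$lxc2" "$lxc2" &&
                lxc copy "lxchost2:$lxc2/$lxc2" "$lxc2"
done

# lxchost3 - Delete, create and copy snapshot (same pattern as above)
for lxc3 in $ct3; do
        lxc delete "lxchost3:$lxc3/$lxc3" 2>/dev/null
        lxc snapshot "lxchost3:$lxc3" "$lxc3" &&
                lxc copy "lxchost3:$lxc3/$lxc3" "$lxc3"
done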
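A sketch generalizing that backup script to any number of remotes, so adding a host means editing one list instead of copying a loop. `REMOTES` and the function names are hypothetical. It keeps the script's behaviour of deleting every local STOPPED container first, parses `lxc list` CSV output instead of the ASCII table, and does not chain the snapshot delete with `&&`, so the first run (when no snapshot exists yet) still creates one.

```shell
#!/bin/sh
# Hypothetical generalization of the backup script to any list of remotes.
# REMOTES is an assumed variable; override it in the environment if needed.
REMOTES="${REMOTES:-lxchost2 lxchost3}"

# Print the names of containers on remote $1 whose state is $2.
containers_in_state() {
    lxc list "$1:" -c ns --format csv | awk -F, -v want="$2" '$2 == want {print $1}'
}

# Replace the same-named snapshot of container $2 on remote $1, then copy it here.
refresh_backup() {
    src="$1"
    name="$2"
    # Not chained with &&: on the first run there is no snapshot to delete yet.
    lxc delete "$src:$name/$name" 2>/dev/null
    lxc snapshot "$src:$name" "$name" &&
        lxc copy "$src:$name/$name" "$name"
}

backup_all() {
    # Drop the local copies left by the previous run (they are STOPPED).
    for ct in $(containers_in_state local STOPPED); do
        lxc delete "$ct"
    done
    # Refresh a backup of every RUNNING container on each remote.
    for remote in $REMOTES; do
        for ct in $(containers_in_state "$remote" RUNNING); do
            refresh_backup "$remote" "$ct"
        done
    done
}
```

To run it from cron as before, append a final `backup_all` line to the file.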

What are the benefits of running lxd in a cluster?

Centralized CLI, shared clustered DNS, probably more, but those are the ones that spring to mind. I don’t use clustering either, but I can see the point of it for simplifying the management of, say, more than 3 LXD hosts…
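A small sketch of the "centralized CLI" point: from any cluster member (or a remote pointing at the cluster) one client sees every member and can place a container on a specific one with `--target`. The helper names, the container name and the `ubuntu:18.04` image alias are only examples; in a cluster, `lxc list` should show containers from all members along with their location.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the "centralized CLI" point.
# Usage: cluster_overview
cluster_overview() {
    lxc cluster list   # every member and its state
    lxc list           # containers across all members
}

# Usage: launch_on <container> <cluster-member>
launch_on() {
    # --target picks the member; the image alias is just an example.
    lxc launch ubuntu:18.04 "$1" --target "$2"
}
```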