Delta Live Migration between two LXD daemons on separate nodes

I have set up two separate AWS nodes, each with LXD installed, and a container running on one of them. I want to transfer snapshots of the container from one node to the other. The first copy would be the whole snapshot, but subsequent copies should only transfer the difference between the previous and current snapshots. Is this possible in LXD? Something like the sketch below is what I am after. This is just a precursor to the actual setup, where I plan to use Raspberry Pis. Thanks.
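
A sketch of the workflow I have in mind (the remote name standby and the snapshot names are placeholders):

    # first transfer: send the whole snapshot
    lxc snapshot mycontainer snap0
    lxc copy mycontainer/snap0 standby:mycontainer

    # later transfers: ideally send only what changed since snap0
    lxc snapshot mycontainer snap1
    # <incremental copy of mycontainer/snap1 goes here>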

lxc copy --refresh

I am trying the following command and getting an unknown flag error:

lxc copy mycontainer standby:container1 --refresh

standby is the remote AWS node that I am copying mycontainer to. A normal copy worked just fine and copied the container there.

Also, I have a script running in the background in mycontainer. When the snapshot is copied to the remote and the container is started there, can the script resume automatically and pick up where it left off? The script is a simple increment operation, roughly like the sketch below.
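
A toy version of the script (the file path is a placeholder):

    #!/bin/sh
    # increment a counter once per second and write it to a file
    i=0
    while true; do
        i=$((i+1))
        echo "$i" > /root/counter
        sleep 1
    done

The hope is that a stateful copy carries the in-memory value of i across, so the loop keeps counting on the remote rather than restarting from zero.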

I just faced this error while trying to perform live migration:

~$ lxc copy mycontainer standby:mycontainer
Error: Failed container creation: Error transferring container data: migration pre-dump failed
(00.059899) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 5074 with interrupted system call
(00.068908) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 5075 with interrupted system call
(00.083650) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 5092 with interrupted system call
(00.096142) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 5095 with interrupted system call
(00.112829) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 5113 with interrupted system call
(00.133126) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 5127 with interrupted system call
(00.144114) Warn  (compel/arch/x86/src/lib/infect.c:249): Will restore 6076 with interrupted system call
(00.169389) Error (criu/mount.c:1062): mnt: The file system 0x35 0x35 (0x3d) btrfs ./run/systemd/unit-root is inaccessible
(00.169393) Error (criu/fsnotify.c:209): fsnotify: Mount root for 0x000035 not found
(00.169395) Warn  (criu/fsnotify.c:283): fsnotify: 	Handle 0x35:0x65dd cannot be opened
(00.178722) Error (criu/irmap.c:86): irmap: Can't stat /no-such-path: No such file or directory
(00.178726) Error (criu/fsnotify.c:286): fsnotify: 	Can't dump that handle
(00.178728) Error (criu/irmap.c:360): irmap: Failed to resolve 35:65dd
(00.178749) Error (criu/cr-dump.c:1533): Pre-dumping FAILED.

I faced an error while taking a stateful snapshot on an arm64 AWS node. It is similar to the one described here. Turning off AppArmor didn’t help. Any help would be appreciated. Thanks.
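
For reference, stateful operations require CRIU support to be enabled on the daemon; if LXD is the snap package, that is:

    sudo snap set lxd criu.enable=true
    sudo systemctl reload snap.lxd.daemon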

Hi. I was able to fix the problem with taking stateful snapshots, thanks to Adrian Reber at CRIU. However, I am noticing that once the first copy of the snapshot has been sent, I cannot copy any further snapshots, as a copy already exists at the remote. Could you explain, or point me to documentation on, how live migration works in LXD?

Below are the steps I am following:

  1. I take a stateful snapshot:
    lxc snapshot mycontainer snap0 --stateful
  2. I copy this snapshot to the remote active-1:
    lxc copy mycontainer/snap0 active-1:mycontainer
  3. I take another stateful snapshot, live1.
  4. I try to copy it the same way, but get:
    Error: Copying stateful containers requires that source "mycontainer" and target "mycontainer/live1" name be identical
  5. Trying
    lxc copy mycontainer active-1:mycontainer
    gives the following error:
    Error: Failed container creation: Container 'mycontainer' already exists
  6. Trying
    lxc copy --refresh mycontainer active-1:mycontainer
    or
    lxc copy mycontainer active-1:mycontainer --refresh
    gives
    Error: unknown flag: --refresh

Any help will be appreciated. Thanks.

--refresh is what you want here, but you need a reasonably modern LXD for this; it sounds like you’re on 3.0.x rather than 4.0 or higher.
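
For example, after upgrading both nodes (assuming a snap-based install; the channel name is illustrative):

    # check the client and server versions
    lxc version

    # move the snap to a newer track
    sudo snap refresh lxd --channel=latest/stable

    # first run makes a full copy; later runs send only the differences
    lxc copy mycontainer active-1:mycontainer --refresh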


You were right. Thanks for your help.

I was going through the pylxd API documentation and could not find any information on container.freeze() and container.delete(). Assuming they are meant to stop and delete a container respectively: I noticed that after running the script, the container was in the stopped state but then started running again after a few seconds. Could you elaborate on this? Also, is there any way to use the --refresh option in the copy() function in the API? Thanks.
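
For comparison, my understanding of the CLI equivalents (freezing pauses the container's processes rather than stopping the container, so it shows as FROZEN, not STOPPED):

    lxc pause mycontainer     # freeze: all processes are paused
    lxc start mycontainer     # starting a frozen container resumes it
    lxc delete mycontainer    # remove the container (add --force if it is running)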