Images & Scenarios Known to work with CRIU snapshotting

I’ve been trying to test out stateful snapshotting via CRIU, but it seems that the common Ubuntu images aren’t working with CRIU out of the box. I was wondering what successes other people have had with this feature so I can find something that works for me.

I have tried and failed with:

  • ubuntu:18.04 : fail to dump

  • ubuntu:20.04 : fail to dump

  • images:voidlinux : fail to dump

  • images:archlinux : fail to dump

  • images:debian/buster : fail to dump

  • images:centos/8 : fail to dump

  • images:alpine/3.11 : fail to restore

  • images:devuan/ascii : fail to restore

Thanks!

I think there’s some issue with networking. I suspect the bare minimum that should work fine would be an Alpine container without a NIC.

lxc init images:alpine/edge a1
lxc config device add a1 eth0 none
lxc start a1
lxc snapshot a1 --stateful

(Note that I’m not saying that this is particularly useful, just that based on current known limitations and issues with CRIU, I suspect that this would be a working case.)

Your suggestion worked until I tried to actually run something in it. I’ve just been putting a sleep process into containers to convince myself this works. When I do this, it fails:

--> lxc stop a1 --stateful
Error: snapshot dump failed
(00.694770) Warn  (compel/arch/x86/src/lib/infect.c:281): Will restore 19918 with interrupted system call
(01.135229) Warn  (compel/arch/x86/src/lib/infect.c:281): Will restore 20255 with interrupted system call
(01.136364) Error (criu/files-reg.c:1372): Can't lookup mount=638 for fd=0 path=/dev/pts/3
(01.136408) Error (criu/cr-dump.c:1348): Dump files (pid: 20255) failed with -1
(01.144135) Error (criu/cr-dump.c:1764): Dumping FAILED.
Try `lxc info --show-log a1` for more info

Log:

lxc a1 20200508014735.621 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1143 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.a1"

How did you spawn that sleep process?

From what I remember, CRIU doesn’t like processes which were started in the container from an lxc exec session. Instead try starting your sleep from an init script, that should have a better chance of serializing.
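One way to see why an exec-started process might trip CRIU: the dump error above names `fd=0 path=/dev/pts/3`, which suggests the process still holds a file descriptor on the pseudo-terminal of the exec session. The sketch below lists such descriptors for a given PID by reading `/proc`. The helper name `pty_fds` is made up for illustration, and the idea that these descriptors are what blocks the dump is an assumption drawn from that one log line, not something CRIU or LXD documents here.

```shell
#!/bin/sh
# pty_fds PID
# Print every file descriptor of PID whose target is a pseudo-terminal
# or tty device. Hypothetical helper, not part of LXD or CRIU.
pty_fds() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        # readlink fails for descriptors that vanish mid-scan; skip those.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            /dev/pts/*|/dev/tty*) echo "${fd##*/} -> $target" ;;
        esac
    done
}
```

Run inside the container, `pty_fds "$(pidof sleep)"` would be expected to print lines for fds 0, 1 and 2 when the sleep was started from an interactive shell, and nothing when its standard streams were redirected away from the terminal.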

Ya I did:

lxc exec a1 -- /bin/sh

then

sleep 10000 &

My use case is to be able to migrate long-running things started by a user, so that solution wouldn’t get me that far, unfortunately.

I will try other ways (console and ssh) and see if that is any better.

I don’t know if Alpine comes with script but if it does, it should help detach from the terminal.

script /dev/null -c sh
(sleep 5m&)

Or something along those lines.
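If `script` is not available, a different way to get a similar effect is to start the command in its own session with all three standard streams pointed at `/dev/null`. This is a swapped-in technique rather than what was suggested above, it assumes `setsid` is present in the image (util-linux and BusyBox both ship one), and `start_detached` is a made-up helper name. Whether CRIU then dumps the process cleanly is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# start_detached CMD [ARGS...]
# Run CMD in a new session, with stdin, stdout and stderr redirected to
# /dev/null so the process keeps no descriptor on the calling terminal.
# Hypothetical helper, not part of LXD or CRIU.
start_detached() {
    setsid "$@" </dev/null >/dev/null 2>&1 &
}
```

For the test case from this thread that would be `start_detached sleep 10000`, typed in the `lxc exec` shell before attempting the stateful stop.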

SSH is indeed how we were doing those demos back when we were heavily investing in CRIU.

Seems like SSH is the best way to work in containers then.

Are you no longer investing in CRIU in terms of development? It seems like stateful snapshotting is a big advantage of LXD.

We’re still supporting and occasionally updating the glue code between LXD/LXC and CRIU but we no longer have a full-time engineer working on CRIU.

Our goal at the time was to get CRIU to work with most common workloads on modern distributions (at the time Ubuntu 16.04 LTS). This turned into a bit of a losing battle as every time we’d add support for some new kernel feature in CRIU, the upstream Linux kernel would grow support for a dozen more features that CRIU didn’t understand.

So CRIU is certainly viable for very specific environments where the user is in complete control of the workload and distribution they run it on, but that market isn’t sufficient for us to justify very costly engineering efforts.

So right now, we tend to redirect requests for missing CRIU features directly to upstream CRIU, where there is an active community of contributors that eventually tackle the most common limitations. There is a fair amount of investment in CRIU coming from academia and HPC, and some big organizations like Google also actively make use of it and contribute fixes to it.

So it’s certainly not a dead end but for us as a generic container tool that focuses on generic workloads on modern distros, it’s not a current focus.

Thanks for this response; this is kind of what I needed to hear, which is: “It’s there and it /can/ be useful, but it’s pretty fragile and you’re going to have to put a lot of energy in to make it work.”

It’s not a critical need for me, so I guess I’ll wait until CRIU works out of the box. Although at this point (and given what you said about the Linux kernel), an entire OS based on orthogonal persistence sounds more practical, unfortunately. Not something LXD can really fix.