Ceph Snapshots on Stopped Containers Freeze Hosts

lxc snapshot container

run while the container is stopped freezes the node the container is running on. Otherwise, container snapshots behave as expected.
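For reference, a minimal reproduction along these lines (the container name, image, and the ceph storage pool name are just placeholders, not the exact setup above):

lxc launch ubuntu:18.04 c1 -s ceph    # container backed by a Ceph RBD storage pool
lxc stop c1
lxc snapshot c1 snap0                 # snapshotting while stopped hangs the host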

I’ve reproduced this with the packaged Luminous version in the Bionic repos (apt as well as snap versions).

Additionally, testing on a Mimic cluster with upstream packages/repos exhibits the same behaviour (again, both apt and snap versions).

It should be noted these nodes are running stock Ceph and LXD settings and have the lxd, osd, and mon services running. I don’t have enough spare dev hardware to test without monitors/OSDs installed.

Does the machine eventually come back online?
Anything in the kernel log?

It requires manual intervention via IPMI/console, and the logs are empty after reboot.

It comes back cleanly and the new snapshot is created (as seen by both LXD and Ceph).

The terminal did mention an NBD device on shutdown; is the kernel module not used to snapshot when the container is stopped? When manually taking a snapshot with sudo rbd snap create container@manual-snap in Ceph, it’s instant and doesn’t freeze the host.

If LXD is using librados/NBD instead of the kernel module, that might explain it.

I think I found the issue that you’re running into.

A configurable option for the RBD kernel module vs librados in the storage volume settings would be much appreciated.

I’m not sure how much work that would take to implement, or whether lxc snapshot does anything beyond referencing a snapshot. The RBD kernel module is much faster at image manipulation (clones, snapshots, rollbacks, etc.) than firing up an NBD device.

Making a snapshot doesn’t cause any mount; we just fire an rbd subcommand to have it done on the cluster.
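Roughly speaking, the cluster-side operation boils down to something like this (pool and image names are illustrative, not necessarily the exact naming LXD uses):

rbd snap create ceph/container_c1@snap0    # create the snapshot on the cluster
rbd snap ls ceph/container_c1              # the new snapshot shows up immediately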

The problem is that, to avoid filesystem corruption in the snapshot, we sync all data and freeze the filesystem during that operation.

The issue, I think, is that if the container isn’t running, its filesystem isn’t mounted, so the freeze actually hits the parent mount, which in your case is your entire host filesystem, causing the hang you’re seeing.
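As a rough sketch of that failure mode and the guard against it (paths, names, and the fsfreeze calls are illustrative; LXD does the freeze internally rather than via these exact commands):

rootfs=/var/lib/lxd/storage-pools/ceph/containers/c1
if mountpoint -q "$rootfs"; then
    # Running container: the rootfs is mounted, so the freeze only affects that filesystem.
    sync
    fsfreeze --freeze "$rootfs"
    rbd snap create ceph/container_c1@snap0
    fsfreeze --unfreeze "$rootfs"
else
    # Stopped container: the path is just a directory on the host, so a freeze there would
    # hit the filesystem containing it (the host's own) and hang the node. Skip the freeze;
    # the RBD-level snapshot is taken on its own.
    rbd snap create ceph/container_c1@snap0
fi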

Will this hit 3.0.2?

3.0.2 has already been tagged, but I’ll include the fix in the package we upload to Ubuntu, as I’m still preparing it.

Incredible, thank you so much.