Snapshots on ZFS, restore/rollback doesn't seem to work as expected?

Earlier today I had a running virtual machine, with an old snap0 snapshot from when it was set up. I took a second snapshot and upgraded a whole load of packages on the VM. This broke some customisations and so I needed to roll back, but I wanted to keep a copy of the upgraded VM alongside so I could compare. I had at this point also taken a newer snapshot of the same VM.

I copied the VM to a new instance, changed its IP address (we’re using static IPs on our VMs) and it fired up fine. Then I attempted to roll back to my earlier (good) snapshot from today on the original VM. I encountered the “ZFS won’t let you do that” error widely reported here, so I deleted my most recent snapshot and tried again. This appears to work on the console, but gives me the most recent state when I start it up, and not the “last known good” state which was the whole point of the snapshot!

What am I doing wrong? Or does the ZFS issue in fact mean that even deleting subsequent snapshots won’t let me roll back? That can’t be right surely, or there’s no point in having multiple snapshot capability at all on ZFS?

Can I create a new VM using the correct snapshot as a starting point, and if so, what is the exact syntax to do this? Somewhere along the way I also seem to have created a -bak with no snapshots, so now I have 3 instances of this VM.

More on the above… I figured out how to copy to a new host using a snapshot as my starting point. But guess what? This doesn’t work either - I get the latest version not the required snapshot again.

lxc copy [container]/[snapshot] [host]:[new_container_name] --verbose --mode=push

…ignores the snapshot instruction completely and copies the latest version.

Since no-one is commenting let me ask a simple question…

Q: If you copy a VM with existing snapshots to a new host and then delete the most recent snapshot of the copy, will this result in different behaviour to doing the same thing on the original?

My tests suggest that this is the case: the copied VM is “pinned” at the latest snapshot and cannot be rolled back by deleting later snapshots, even though all the sequential snapshots appear to be present. Is this correct? If so, it would explain my issues perfectly.

I’d appreciate confirmation or denial from anyone who (better) understands the guts of snapshots with LXD on ZFS.

Cheers.

I test ZFS snapshots a lot with containers, but not VMs.

I can restore from any snapshot to that point in time; you just have to delete the later snapshots first.

When I work with LXD and ZFS, I always set these options like this:

volume.zfs.remove_snapshots -> true
zfs.clone_copy -> false
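
For reference, a minimal sketch of how those two keys could be applied with lxc storage set. The pool name "zfs" is only an example (it matches the -s zfs used later in this thread), and the run and set_zfs_pool_opts helpers are hypothetical names introduced here. As far as I know, a volume.* key on the pool is only a default for volumes created afterwards, so existing instances may not pick it up. The script prints the commands by default so they can be reviewed first.

```shell
# Print commands by default; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply the two settings to one storage pool.
# $1: storage pool name (check yours with `lxc storage list`).
set_zfs_pool_opts() {
    pool="$1"
    # Let restore remove later snapshots instead of refusing.
    run lxc storage set "$pool" volume.zfs.remove_snapshots true
    # Make instance copies full copies rather than ZFS clones.
    run lxc storage set "$pool" zfs.clone_copy false
}

set_zfs_pool_opts zfs
```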

Thank you for this, I will try those options and see if this improves things.

This all seems to be working as expected. One thing that may be causing unexpected content in your snapshots is not running sync inside the VM before taking the snapshot. Running sync ensures that changes still held in the OS buffers are flushed to the storage layer.
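
If it helps, the sync-then-snapshot habit can be wrapped in a small helper. This is only a sketch: run and snapshot_synced are hypothetical names, and vtest/snap2 are example instance and snapshot names. It prints the commands by default rather than running them.

```shell
# Print commands by default; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Flush the guest's buffers, then take the snapshot.
# $1: instance name, $2: snapshot name.
snapshot_synced() {
    # The snapshot is only attempted if sync inside the guest succeeded.
    run lxc exec "$1" -- sync && run lxc snapshot "$1" "$2"
}

snapshot_synced vtest snap2
```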

Here are some examples of what is expected:

# Create ZFS VM.
lxc launch images:ubuntu/focal vtest --vm -s zfs

# Create a file before any snapshot and sync it to storage.
lxc exec vtest -- touch /root/before-snap0.txt
lxc exec vtest -- sync
lxc exec vtest -- ls /root
before-snap0.txt

# Create 1st snapshot.
lxc snapshot vtest snap0

# Add a file after snap0 and sync it to storage.
lxc exec vtest -- touch /root/after-snap0.txt
lxc exec vtest -- sync
lxc exec vtest -- ls /root
after-snap0.txt before-snap0.txt

# Create 2nd snapshot.
lxc snapshot vtest snap1

# Add a file after snap1 and sync it to storage.
lxc exec vtest -- touch /root/after-snap1.txt
lxc exec vtest -- sync
lxc exec vtest -- ls /root
after-snap0.txt  after-snap1.txt  before-snap0.txt

# Try to restore snap0, but expect it to fail due to subsequent snapshot snap1.
lxc restore vtest snap0
Error: Snapshot "snap0" cannot be restored due to subsequent snapshot(s). Set zfs.remove_snapshots to override

# Copy snap1 to new instance and check content matches expected.
lxc copy vtest/snap1 vtest-snap1
lxc start vtest-snap1
lxc exec vtest-snap1 -- ls /root
after-snap0.txt  before-snap0.txt

# Copy snap0 to new instance and check content matches expected.
lxc copy vtest/snap0 vtest-snap0
lxc start vtest-snap0
lxc exec vtest-snap0 -- ls /root
before-snap0.txt

# Restore snap1 and check content matches expected.
lxc restore vtest snap1
lxc start vtest
lxc exec vtest -- ls /root
after-snap0.txt  before-snap0.txt

# Delete snap1 and try restoring snap0, but expect to fail due to subsequent internal snapshot for copy of snap1 to vtest-snap1.
lxc delete vtest/snap1
lxc restore vtest snap0
Error: Snapshot "snap0" cannot be restored due to subsequent internal snapshot(s) (from a copy)

# Delete vtest-snap1, then restore snap0 and check content matches expected.
lxc delete vtest-snap1 -f
lxc restore vtest snap0
lxc start vtest
lxc exec vtest -- ls /root
before-snap0.txt
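
To see which copies are holding a snapshot in place (the "internal snapshot(s) (from a copy)" case above), the ZFS side can be inspected directly on the LXD host. A rough sketch, with several assumptions: tank/lxd is a made-up name for the dataset backing the storage pool, the virtual-machines/<name> and <name>.block layout is what I believe LXD uses for VMs on ZFS, and run and list_snapshot_clones are hypothetical helpers. The commands are printed by default.

```shell
# Assumed names: adjust to your own zpool/dataset and instance.
ZPOOL="${ZPOOL:-tank/lxd}"
INSTANCE="${INSTANCE:-vtest}"

# Print commands by default; set DRY_RUN=0 to run them on the LXD host.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List each snapshot of the VM's datasets with the clones that depend on it.
list_snapshot_clones() {
    run zfs list -r -t snapshot -o name,clones "$ZPOOL/virtual-machines/$INSTANCE"
    run zfs list -r -t snapshot -o name,clones "$ZPOOL/virtual-machines/$INSTANCE.block"
}

list_snapshot_clones
```

A snapshot with a non-empty clones column cannot be destroyed or rolled past until the listed clone is deleted, which is consistent with the restore of snap0 only succeeding once vtest-snap1 was removed.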

This is really useful, thank you.

The “sync” command was indeed not being used and perhaps this finally explains why my snapshots weren’t behaving as expected - they simply didn’t contain what I was anticipating.

I also suspect that cloning a VM without the lxc/zfs options posted by Jimbo meant that any rollback on the clone (i.e. deleting later snapshots in order to access an earlier one, and/or a snapshot-specific copy command) did not behave as I had anticipated.

I’ve been “a little busy” rebuilding the broken VM from backups, but that’s all done now and it’s time to explore exactly what went wrong and replicate the issues as a first step in understanding how to definitely avoid anything like this in future :wink: