Should LXD check disk space on targets before evacuating a cluster node?

I have been testing “failure modes” this week and writing docs for rebuilding/replacing a cluster node. During this morning’s evacuation, the migration filled the target’s local pool (in this case a new system with an unexpanded ZFS pool). This not only halted the evacuation but wedged both the source and target in ways that required snapshotting and copying the VM manually, evacuating the remaining containers by hand, and then removing and wiping the source node. It also required cleanup on the target node.
Ubuntu Server LTS (22.04), LXD current/stable (5.6), ZFS
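
In case it helps anyone planning a similar exercise: a quick manual pre-flight check of each candidate target would have caught this. A sketch, assuming a pool named local (substitute your own pool and dataset names):

```
# How full is the LXD storage pool on this member?
lxc storage info local

# For a ZFS-backed pool, the dataset view is more precise
# (assumes the backing dataset shares the pool's name).
zpool list
zfs list -o name,used,avail local
```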

Yes, ZFS has a rather peculiar behaviour as it starts to run out of space: rather than reporting a disk-full error and ending the command LXD is calling (in this case most likely zfs recv or rsync), write operations slow to a crawl as the pool approaches maximum utilisation and eventually, effectively, block.

This means that LXD won’t know whether the disk has filled up or whether it’s just slow I/O.
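
One common mitigation on the ZFS side (nothing LXD-specific, and the pool name and size below are illustrative) is to hold back a slack reservation so the pool can never actually be driven to 100%:

```
# Reserve ~10G in an empty dataset so the pool cannot completely fill.
zfs create -o refreservation=10G tank/slack

# If the pool ever wedges near-full, releasing the reservation
# gives enough working room to delete data and recover.
zfs set refreservation=none tank/slack
```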


I think the request for checking target disk space has come up before, but it is non-trivial to do, especially on ZFS, due to the relationship between parent snapshots and the difference in copy modes depending on whether the source pool is ZFS or not.
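
For the ZFS-to-ZFS case, the stream size can at least be estimated by hand with a dry-run send. A sketch; the dataset and snapshot names are placeholders:

```
# -n: dry run (send nothing), -v: print the estimated stream size.
zfs send -n -v tank/lxd/virtual-machines/vm1@snap0

# Estimate an incremental stream between two snapshots.
zfs send -n -v -i tank/lxd/virtual-machines/vm1@snap0 \
    tank/lxd/virtual-machines/vm1@snap1
```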

Thanks tomp. Right now I am trying to weigh the pros and cons of clustering.

So, it wasn’t so much that it failed; it was the way it failed.

If the transfer fails, perhaps it should flag that target as full, clean up, and try the next host.

Evacuations should go cleanly or the advantages of clustering are significantly diminished.
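
In the meantime, since LXD won’t retry against another member on its own, the workaround seems to be steering instances by hand. A sketch with made-up instance and member names:

```
# Let the scheduler pick targets for everything on the member:
lxc cluster evacuate node1

# Or move instances individually to members known to have space:
lxc stop vm1
lxc move vm1 --target node2
lxc start vm1
```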

Yes, I agree in principle. But if the underlying storage subsystem (ZFS) doesn’t generate a failure and instead just hangs, there’s not a lot that can be done, as LXD won’t know it has failed and so cannot clean up.

Other storage pool drivers would not have this issue I suspect.

What you’ve not said so far is what steps you took to cancel the evacuation when things “wedged”?

The evacuation stopped itself. It prevented me from moving the VM (an active system, already stopped; I will see if this is still in my terminal history). So I snapshotted it, copied the snapshot to a server with plenty of room, and started it there; the downtime was significant but not enough to get me in trouble :). Then I expanded the filesystem on the failed target and tried to remove the problem so I could evacuate the remaining active containers. When I couldn’t do this quickly I manually stopped and moved them. The evacuation was to rebuild the host, so at this point I wiped the disks and purged the LXD data. There was still a partial transfer on the other host that I had to clean up before it would let me move the VM to its original target.
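
Roughly, from memory (the instance, remote, and member names below are reconstructions, not the exact commands I ran):

```
# Snapshot the stopped VM and copy it to a server with plenty of room
# ("backup" is a separate LXD remote here), then start it there.
lxc snapshot vm1 rescue
lxc copy vm1 backup:vm1
lxc start backup:vm1

# Stop and move the remaining containers by hand.
lxc stop c1
lxc move c1 --target node3
lxc start c1

# After wiping the source, remove the partial transfer left on the
# other host before moving the VM back to its original target.
lxc delete vm1
```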

I will see if I can find the terminal history if you need more detail.

Yes, if you have a reproducer with errors from the command itself, plus the contents of /var/snap/lxd/common/lxd/logs/lxd.log from the relevant servers, that would be useful to see if there is something we can do to clean up here.
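
To grab those before a reinstall, something like this works (the destination filename is just a suggestion):

```
# Copy the daemon log off the machine before any purge/reinstall.
cp /var/snap/lxd/common/lxd/logs/lxd.log ~/lxd-$(hostname)-$(date +%F).log

# Or watch daemon events live while reproducing the failure.
lxc monitor --type=logging --pretty
```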

What storage pool driver would you recommend?

ZFS has been the default and has served me really well (on Ubuntu/FreeBSD/Solaris). However, when LXD and ZFS fail, they tend to do so spectacularly.

Thank you. Unfortunately I did not copy the logs from the evacuated system before doing the snap remove --purge lxd and reinstalling. I will try to do that in the future.
