We have 40+ nodes across a number of LXD clusters and 200+ LXD VMs on those clusters. All LXD nodes were on 5.0 (tracking 5.0/stable).
Earlier this week, an automatic snap refresh (LXD 5.0.1 had just been released) broke host/guest communication on all of our clusters. We can no longer exec into any of our existing VMs, and “lxc ls” no longer shows the VMs' IPs.
Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.1-9dcf35b  23541  5.0/stable  canonical✓  -
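For anyone who wants to gauge how many VMs are affected on their own cluster, here is a minimal sketch that lists running VMs reporting no global-scope IP address, which is the symptom described above. It assumes the JSON from `lxc list --format json` has the usual `name`, `type`, `status` and `state.network.<interface>.addresses[].scope` fields. The function names `vms_without_addresses` and `list_instances` are hypothetical helpers for illustration, not part of LXD.

```python
import json
import subprocess


def vms_without_addresses(instances):
    """Return names of running VMs that report no global-scope IP address.

    `instances` is the parsed output of `lxc list --format json`
    (assumed layout: state.network.<iface>.addresses[].scope).
    """
    affected = []
    for inst in instances:
        # Only running virtual machines are of interest here.
        if inst.get("type") != "virtual-machine":
            continue
        if inst.get("status") != "Running":
            continue
        # state or state.network may be missing or null when the agent is unreachable.
        state = inst.get("state") or {}
        network = state.get("network") or {}
        has_global_ip = any(
            addr.get("scope") == "global"
            for nic in network.values()
            for addr in (nic.get("addresses") or [])
        )
        if not has_global_ip:
            affected.append(inst["name"])
    return affected


def list_instances():
    """Run `lxc list --format json` and return the parsed instance list."""
    result = subprocess.run(
        ["lxc", "list", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `vms_without_addresses(list_instances())` on a host with the `lxc` client configured would return the candidate names. A VM that legitimately has no network would also show up, so treat the result as a starting list rather than a diagnosis.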
These are production VMs. What can we do to fix this without restarting the hundreds of VMs we have? Manually replacing lxd-agent on each host will be very difficult given the number of VMs, and we don’t have remote access to all of our users’ VMs.
What can we do to prevent something like this from happening again in the future?
Blackholing api.snapcraft.io to block snap auto-refresh is a short-term hack: we would have to unblock it every time we need to upgrade LXD, so it doesn’t work at scale or in the long term. And there’s no way to disable auto-refresh.