Snap auto refresh broke LXD clusters (VMs not reachable in LXD 5.0.1)

We run 40+ nodes across several LXD clusters, hosting 200+ LXD VMs. All LXD nodes were on 5.0 (tracking 5.0/stable).

Earlier this week, an automatic snap refresh to the newly released LXD 5.0.1 broke host/guest communication on all of our clusters. We can no longer exec into any of our existing VMs, and “lxc ls” no longer shows the VMs’ IP addresses.

snap list lxd

Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.1-9dcf35b  23541  5.0/stable  canonical✓  -

These are production VMs. What can we do to fix this without restarting the hundreds of VMs we have? Manually replacing lxd-agent on each host would be very difficult given the number of VMs, and we don’t have remote access to all of our users’ VMs.

What can we do to prevent something like this from happening again in the future?
Blackholing api.snapcraft.io to block snap auto-refresh is a short-term hack: we would have to unblock it whenever we need to upgrade LXD, so it doesn’t work at scale or long term. And there’s no way to disable auto-refresh.


It sounds like you’ve been affected by the vsock TLS change. See the LXD 5.0.1 release notes for details; they also describe a workaround that avoids having to restart the VMs.

Also, as you’re running LXD in a production environment, it’s important to set up your snap environment to suit your update schedule. See Managing the LXD snap for more details.
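
As a rough illustration of what controlling the update schedule can look like, here is a minimal sketch using snapd's refresh options. The maintenance window, the 30-day hold and the helper names (run, hold_until) are assumptions chosen for the example, not recommendations from the release notes; the timestamp helper assumes GNU date, and per-snap holds need snapd 2.58 or newer. The script only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set APPLY=1 (and run as root) to actually apply the settings.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Print an RFC 3339 timestamp N days from now (GNU date syntax).
hold_until() {
    date --date="+$1 days" +%Y-%m-%dT%H:%M:%S%:z
}

# Restrict automatic refreshes to a weekly maintenance window
# (example window, adjust to your own schedule).
run snap set system refresh.timer=sat,03:00-05:00

# Postpone automatic refreshes of all snaps for 30 days
# (snapd limits how far ahead this can be set).
run snap set system refresh.hold="$(hold_until 30)"

# With snapd 2.58 or newer: hold only the lxd snap until released
# again with "snap refresh --unhold lxd".
run snap refresh --hold lxd

# When you do upgrade, a cohort keeps cluster members on the same
# revision. Note that this performs a refresh when applied.
run snap refresh lxd --cohort="+"
```

Running it without APPLY=1 first lets you review the exact commands before touching a production node. A hold only delays automatic refreshes; an explicit "snap refresh lxd" during your own maintenance window still upgrades the snap.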
