As an aside, I tried to have the above in the title to make it easier for people to find, but the system apparently didn’t like that.
On several of the nodes in my LXD cluster, I’m getting looping messages similar to, but not always identical to, this:
Error: Failed applying patch "dnsmasq_entries_include_device_name": Failed to load network "default" in project "lanbridge" for dnsmasq update: Network not found
The project named in the error sometimes shifts. In any case, this causes the node to fail to start, and it repeats the loop over and over. Is this related in some way to 4.22? There haven’t been any config changes to this cluster in weeks, probably longer. Restarting LXD on all of the nodes has had no effect.
Can you locate one of the servers with a recent version of /var/snap/lxd/common/lxd/global/db.bin? (In theory, 3 of the servers should have a recent copy of it.)
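A quick way to compare copies is to check the file’s modification time on each member. A sketch (the path is the standard snap layout; run it on each server and pick a recent one):

```shell
# Print the db.bin timestamp on this member, or a note if it's absent.
stat -c '%y %n' /var/snap/lxd/common/lxd/global/db.bin 2>/dev/null \
  || echo "no db.bin on this member"
```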
Then install sqlite3 and run:
sqlite3 /var/snap/lxd/common/lxd/global/db.bin "SELECT * FROM nodes"
sqlite3 /var/snap/lxd/common/lxd/global/db.bin "SELECT * FROM networks"
sqlite3 /var/snap/lxd/common/lxd/global/db.bin "SELECT * FROM networks_nodes"
This should give us enough information to sort out what’s going on.
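The thing those three result sets let you cross-check is whether some network row has no matching networks_nodes row for a given node. As a rough sketch of that cross-check, run here against a toy database with a simplified, hypothetical version of the schema (the real db.bin tables have more columns):

```shell
set -eu
DB=$(mktemp)
# Toy data: one network, present on server1 but missing its
# per-node row for server2.
sqlite3 "$DB" <<'EOF'
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE networks (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE networks_nodes (network_id INTEGER, node_id INTEGER);
INSERT INTO nodes VALUES (1, 'server1'), (2, 'server2');
INSERT INTO networks VALUES (1, 'default');
INSERT INTO networks_nodes VALUES (1, 1);
EOF
# List (node, network) pairs with no networks_nodes row -- the kind of
# mismatch that could produce a "Network not found" on that node.
sqlite3 "$DB" "SELECT nodes.name, networks.name
               FROM nodes CROSS JOIN networks
               LEFT JOIN networks_nodes
                 ON networks_nodes.node_id = nodes.id
                AND networks_nodes.network_id = networks.id
               WHERE networks_nodes.network_id IS NULL;"
# prints: server2|default
```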
I booped those directories out of where they were, storing them in /tmp in case they were needed, restarted LXD on the affected nodes, and everything came up correctly. That’s a damned weird failure.
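For anyone hitting the same loop, the recovery amounts to moving the stale directories aside rather than deleting them, then restarting LXD. A sketch, demonstrated against a throwaway sandbox instead of the real /var/snap/lxd/common/lxd tree (the directory names here are placeholders, not from the original report):

```shell
set -eu
# Sandbox stand-ins for the real LXD state tree and a /tmp backup spot.
LXD_DIR=$(mktemp -d)
BACKUP=$(mktemp -d)
mkdir -p "$LXD_DIR/networks/default" "$LXD_DIR/networks/lanbridge-br0"
# Move the leftover network directories aside, keeping them in case
# they're needed later.
mv "$LXD_DIR"/networks/* "$BACKUP"/
ls "$BACKUP"
# Then restart LXD on the affected member, e.g.:
#   sudo systemctl reload snap.lxd.daemon
```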
They seem to date back to when the cluster was first set up.
I wonder if those were failed attempts which errored and didn’t get cleaned up properly somehow?
Agreed, and that seems likely. That said, it seems odd that it would try to patch networks the cluster didn’t think existed, based on the contents of a directory. Why doesn’t it ask the cluster which ones exist and patch those instead?
Yeah, it’s a bit odd. Patches run pretty early in startup, before a lot of our internal DB functions are available, so I suspect that’s why it bases its targets on what it finds on disk at the time.