Error applying patch dnsmasq entries include

Error: Failed applying patch “dnsmasq_entries_include_device_name”

As an aside, I tried to have the above in the title to make it easier for people to find, but the system apparently didn’t like that.

On several of the nodes in my LXD cluster, I’m getting looping messages similar to, but not always identical to this:

Error: Failed applying patch "dnsmasq_entries_include_device_name": Failed to load network "default" in project "lanbridge" for dnsmasq update: Network not found

The “project” named in the message sometimes changes. In any case, this prevents the node from starting, and the loop repeats over and over. Is this related in some way to 4.22? There haven’t been any config changes to this cluster in weeks, probably longer. Restarting LXD on all the nodes has had no effect.

Rebooting one of the nodes also had no positive effect.

Can you locate one of the servers with a recent version of /var/snap/lxd/common/lxd/global/db.bin? In theory, 3 of the servers should have a recent version of it.

Then install sqlite3 and run:

  • sqlite3 /var/snap/lxd/common/lxd/global/db.bin "SELECT * FROM nodes"
  • sqlite3 /var/snap/lxd/common/lxd/global/db.bin "SELECT * FROM networks"
  • sqlite3 /var/snap/lxd/common/lxd/global/db.bin "SELECT * FROM networks_nodes"

This should give us enough information to sort out what’s going on.
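For anyone who would rather script this than type the three commands, here is a minimal sketch in Python using the standard sqlite3 module. The function name dump_tables is made up for illustration, and it assumes db.bin can be read as a plain SQLite file, as the commands above do. It opens the file read-only so nothing gets modified by accident.

```python
import sqlite3

# Table names taken from the three queries above.
TABLES = ("nodes", "networks", "networks_nodes")


def dump_tables(db_path, tables=TABLES):
    """Return {table_name: [row, ...]} for each requested table.

    Hypothetical helper: assumes db_path is a regular SQLite file.
    """
    # mode=ro opens the database read-only via an SQLite URI.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return {
            table: conn.execute(f"SELECT * FROM {table}").fetchall()
            for table in tables
        }
    finally:
        conn.close()
```

Calling dump_tables with the path to db.bin and printing each entry gives roughly the same information as the three sqlite3 commands.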

Here you go. The path wasn’t quite correct, though; it was actually /var/snap/lxd/common/lxd/database/global/db.bin

sqlite> select * from nodes;
1|illian|||56|285|2022-01-19 17:03:21.928530413+00:00|0|2|
2|lxdnode1|||57|285|2022-01-19 17:03:22.033254979+00:00|0|2|
3|lxdnode2|||57|285|2022-01-19 17:03:22.039370622+00:00|0|2|
4|lxdnode3|||57|285|2022-01-19 17:03:22.036791677+00:00|0|2|
5|lxdnode4|||56|285|2022-01-19 11:03:00.177025751-06:00|0|2|
7|lxdnode6|||56|285|2022-01-19 17:03:13.421220614+00:00|0|2|
9|lxdnode8|||56|285|2022-01-19 17:03:09.166138292+00:00|0|2|
10|lxdisonodecl|||56|285|2022-01-19 17:03:10.778439133+00:00|0|2|
11|lxdnode7|||56|285|2022-01-19 17:03:22.051348907+00:00|0|2|
12|lxdnode5|||56|285|2022-01-19 17:03:22.028254088+00:00|0|2|
sqlite> select * from networks;
sqlite> select * from networks_nodes;

For completeness, I checked on one of the failing nodes and the output appears identical.

Can you show ls -lh /var/snap/lxd/common/lxd/networks/ on all systems?

I wonder if you don’t have systems with leftover entries in there, so anything other than internal per the output above.
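A rough way to check for leftovers without eyeballing the listing is sketched below. It assumes each network LXD knows about has a subdirectory of the same name under the networks directory, and that you already have the list of network names from the database. The function name leftover_network_dirs is hypothetical, not anything LXD ships.

```python
import os


def leftover_network_dirs(networks_dir, known_networks):
    """Return sorted subdirectory names that match no known network.

    Assumption: one subdirectory per network, named after the network.
    """
    known = set(known_networks)
    return sorted(
        entry.name
        for entry in os.scandir(networks_dir)
        # Plain files are ignored; only directories are candidates.
        if entry.is_dir() and entry.name not in known
    )
```

With an empty networks table, as in the output above, every subdirectory found would be reported as a leftover.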

Well, this is interesting. I wonder where these extra ones came from on the failing systems. Failing on the left, working on the right.

I booped those directories out of where they were, storing them in /tmp in case they were needed, restarted LXD on the affected nodes, and everything came up correctly. That’s a damned weird failure.
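If you have to do the same on several nodes, the move can be scripted. This is only a sketch of what I did by hand: the function name move_aside and the backup location are placeholders, and it moves the directories rather than deleting them so they can be put back if needed.

```python
import os
import shutil


def move_aside(networks_dir, names, backup_dir):
    """Move the named subdirectories of networks_dir into backup_dir.

    Hypothetical helper. Returns the list of destination paths.
    """
    os.makedirs(backup_dir, exist_ok=True)
    moved = []
    for name in names:
        src = os.path.join(networks_dir, name)
        # Skip names with no directory so a rerun is harmless.
        if os.path.isdir(src):
            dest = os.path.join(backup_dir, name)
            moved.append(shutil.move(src, dest))
    return moved
```

LXD still needs a restart on the node afterwards, as described above.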

They seem to date back to when the cluster was first set up.
I wonder if those were failed attempts which errored and didn’t get cleaned up properly somehow?

Agreed, and that seems likely. That said, it seems odd that it would try to patch networks the cluster didn’t think existed, based only on the contents of a directory. Why doesn’t it ask the cluster which ones exist and patch those instead?

Yeah, it’s a bit odd. Patches run pretty early in startup, before a lot of our internal DB functions are functional, so I suspect that’s why it’s basing its targets on what was found on disk at the time.

@markylaing @stgraber maybe we should update the patch to load networks from the DB and check whether the associated directory exists.
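To illustrate the idea only (the real patch lives in LXD's Go code, and this is not it), the selection could look something like the sketch below. The name networks_to_patch is made up, and it assumes the network names have already been loaded from the DB.

```python
import os


def networks_to_patch(db_network_names, networks_dir):
    """Return DB-known networks that also have a directory on disk.

    Illustration only: the DB is the source of truth, the directory
    check is just a filter, so stray directories are never touched.
    """
    return [
        name
        for name in db_network_names
        if os.path.isdir(os.path.join(networks_dir, name))
    ]
```

Starting from the DB rather than the directory listing means a leftover directory like the ones in this thread would simply be ignored instead of failing the patch.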

Yeah, that’d be fine.

@markylaing can you send a PR?

Yay, that’s what I was thinking myself haha

Sure thing :+1: