LXC Failed to fetch field Config after upgrade to 3.15

I currently have 3 LXD servers in a cluster. One of the servers had an issue starting LXD that I was in the process of resolving, but the other two were running fine. The two running hosts upgraded from 3.14 to 3.15, which then caused the entire cluster to stop working and forced me to get the 3rd host back up and running. I got LXD running on that host, upgraded it to 3.15, and followed the advice in the thread "Cluster node appears offline after upgrade to 3.15" (thanks for that btw). I am now able to do lxc cluster list and have all 3 report being online. What is happening now is that when I attempt to start a container or list the containers, I get the following error: "Error: Failed to fetch field Config: Failed to fetch ref for containers: project"

Right now I’m lost as to what to try in order to fix this.

You probably did that already, but if you haven’t, try running systemctl reload snap.lxd.daemon on all your nodes.

If the problem still happens, can you run lxd sql global .schema so we can check whether there's a schema problem with the database (as the error suggests)?

I have indeed ran systemctl reload snap.lxd.daemon on all the nodes.

Here is the dump of global schema. https://pastebin.com/t5muxDmY

In my troubleshooting I've now successfully broken node 2 as well, but with 2 of the 3 nodes still running I'm still getting the same error.

Ok, your schema looks okay, it matches what I’m running locally, so there’s no problem there.

Can you run lxd sql global .dump so we have a view of the entirety of the database?

Also, can you try running:

  • lxd sql global "SELECT * FROM containers;"
  • lxd sql global "SELECT * FROM containers_config;"
  • lxd sql global "SELECT * FROM projects;"
  • lxd sql global "SELECT * FROM projects_config;"

Here are the data dumps.

global dump https://pastebin.com/nXvgLANd

containers https://pastebin.com/nXH9EHiF
containers_config https://pastebin.com/c0KHESa9
projects https://pastebin.com/rWZWJxBk
projects_config https://pastebin.com/UqBhPPMT

Ok, that looks pretty good. I would have expected the database to fail with those queries if things were in particularly bad shape in the database…

Can you try:

  • lxc query /1.0/containers
  • lxc query /1.0/containers?recursion=1
  • lxc query /1.0/containers?recursion=2

See which of those (if not all of them) is failing.

lxc query /1.0/containers worked and returned a list of containers.

lxc query /1.0/containers?recursion=1
lxc query /1.0/containers?recursion=2
Both return "Error: Failed to fetch field Config: Failed to fetch ref for containers: project"

Ok, interesting. Can you check if all containers cause the issue or if it can be tracked down to a particular one?

Run lxc query /1.0/containers/NAME for each of them and see if any (or all) get you an error.
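If the list is long, a small loop can check every container automatically. This is just a sketch (not a command from this thread): it extracts the URLs from the plain /1.0/containers JSON list with sed, assuming the usual one-quoted-URL-per-line output, and prints only the ones whose query fails.

```shell
# Sketch: query every container URL and report only the failures.
# Assumes the non-recursive /1.0/containers output, one quoted URL per line.
lxc query /1.0/containers \
  | sed -n 's/.*"\(\/1\.0\/containers\/[^"]*\)".*/\1/p' \
  | while IFS= read -r url; do
      lxc query "$url" > /dev/null 2>&1 || echo "FAILED: $url"
    done
```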

Ok. I've tried a few off the hop, and some work while some don't. It doesn't seem to be specific to a node, as I've queried successfully on all nodes and have had failures on 2 of the 3 nodes. It will take a bit to work through the entire list of machines; just thought I'd give a quick update.

EDIT: As a new user I've reached my post limit for the day. So to answer your question below: yes, I get the same error message as before.

EDIT2: Here are the containers that failed with "Error: Failed to fetch container "adlab01" in project "default": Failed to fetch Container: Failed to fetch field Config: Failed to fetch ref for containers: project"

    "/1.0/containers/adlab01",
    "/1.0/containers/adlab02",
    "/1.0/containers/adlab03",
    "/1.0/containers/adlab04",
    "/1.0/containers/adlab05",
    "/1.0/containers/edmcfg01",
    "/1.0/containers/edmgax01",
    "/1.0/containers/edmlic01",
    "/1.0/containers/edmmsg01",
    "/1.0/containers/edmurs01",

Ok, cool and I’m assuming that when it fails, they fail with the same error you got earlier?
Hopefully we can track down those failures to something those containers have in common.

Oops, the post edit doesn’t trigger a notification so I only saw this now.
I’ll take a look through the earlier dump to see if there’s anything unique about those containers that may explain why the database code is unhappy with them.

Ok, I found at least one data consistency issue in your database. Looking at adlab01 it has a duplicate volatile.last_state.power key which may be causing some problems.

Can you try:

  • lxd sql global "DELETE FROM containers_config WHERE container_id=78 AND key='volatile.last_state.power';"

This should take care of that duplicate entry, then try accessing:

  • lxc query /1.0/containers/adlab01

If that works fine, you can attempt at just fixing everything else (assuming the same issue) with:

  • lxd sql global "DELETE FROM containers_config WHERE key='volatile.last_state.power';"

The key isn’t really used during runtime and will get re-added as needed upon container shutdown/reboot.
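Before deleting anything, it can be worth seeing how widespread the duplication is. The following is a sketch (not from the thread): the grouped query lists every (container_id, key) pair that appears more than once in containers_config; the SQL is plain SQLite, run here through lxd sql global.

```shell
# Sketch: list duplicated config keys per container. Standard SQLite SQL,
# executed against the cluster database via "lxd sql global".
lxd sql global "SELECT container_id, key, COUNT(*) AS n \
  FROM containers_config GROUP BY container_id, key HAVING n > 1;"
```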

Hurray, I can finally post again!
Upon running the command I get "Error: Failed to exec query: database disk image is malformed". However, I'm still able to dump and query the database.

root@lxdlab01:/var/snap/lxd/common/lxd/database/global# lxd sql global "DELETE FROM containers_config WHERE container_id=78 AND key='volatile.last_state.power';"
Error: Failed to exec query: database disk image is malformed

Hmm, okay, that’s odd.

On one of the database nodes, can you create /var/snap/lxd/common/lxd/database/patch.global.sql containing:

DELETE FROM containers_config WHERE key='volatile.last_state.power';

Then do systemctl reload snap.lxd.daemon. This will use LXD’s early database query mechanism to try to execute that query before the database goes fully online.

I suspect it will fail in the same way, but it’s still worth a shot.
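For reference, creating that patch file and triggering it comes down to two commands; the heredoc below just writes the one-line SQL statement (using the snap's default database path), and the reload is the same one used earlier.

```shell
# Write the early-startup patch query into the snap's database directory,
# then reload the daemon so LXD applies it before the DB goes fully online.
cat > /var/snap/lxd/common/lxd/database/patch.global.sql <<'EOF'
DELETE FROM containers_config WHERE key='volatile.last_state.power';
EOF
systemctl reload snap.lxd.daemon
```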

Would you mind making a tarball of /var/snap/lxd/common/lxd/database from all your database nodes and sending it to me at stgraber at ubuntu dot com?

I’ll run it on one of our test clusters to replicate the issue and forward that to @freeekanayaka so he can figure out how that might have happened and how to make it consistent again.

I tried the patch file and same result.

I have emailed the tarballs from the 3 hosts.

I have the same problem.
In my case, the reason was the inconsistency of the database.


Here is what I've got for container "develop":


How I solved this problem:

Thank you! This worked for me as well.

root@lxdhome01:~# lxc list
Error: Failed to fetch field Config: Failed to fetch ref for containers: project
root@lxdhome01:~# lxd sql global "drop table containers_config;"
Rows affected: 1

root@lxdhome01:~# lxd sql global "CREATE TABLE containers_config (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, container_id INTEGER NOT NULL, key TEXT NOT NULL, value TEXT, FOREIGN KEY (container_id) REFERENCES containers (id) ON DELETE CASCADE, UNIQUE (container_id, key));"
Rows affected: 1

root@lxdhome01:~# lxd sql global "select containers.name, containers_config.id, containers_config.key FROM containers JOIN containers_config ON containers.id=containers_config.container_id where containers.name='test02';"
+------+----+-----+
| name | id | key |
+------+----+-----+
+------+----+-----+

And I can now list and start the container and it repopulates the containers_config table.

lxc list test02
+--------+---------+------+------+------------+-----------+----------+
|  NAME  |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+------+------+------------+-----------+----------+
| test02 | STOPPED |      |      | PERSISTENT | 0         | lxdlab02 |
+--------+---------+------+------+------------+-----------+----------+
root@lxdhome01:~# lxc start test02
root@lxdhome01:~# lxc list test02
+--------+---------+--------------------+------+------------+-----------+----------+
|  NAME  |  STATE  |        IPV4        | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+--------------------+------+------------+-----------+----------+
| test02 | RUNNING | 10.9.10.171 (eth0) |      | PERSISTENT | 0         | lxdlab02 |
+--------+---------+--------------------+------+------------+-----------+----------+

root@lxdhome01:~# lxd sql global "select containers.name, containers_config.id, containers_config.key FROM containers JOIN containers_config ON containers.id=containers_config.container_id where containers.name='test02';"
+--------+----+---------------------------+
|  name  | id |            key            |
+--------+----+---------------------------+
| test02 | 9  | volatile.eth0.host_name   |
| test02 | 6  | volatile.eth0.hwaddr      |
| test02 | 7  | volatile.idmap.current    |
| test02 | 8  | volatile.last_state.power |
+--------+----+---------------------------+

So I’m thinking once I start all the containers I should be back to a working state.

Thanks again.

Ok, not quite there yet. While that did get my containers to list and start, there must be some non-volatile data in that table, as I'm getting setuid errors within the containers. What I'm going to have to do is write a script to dump that table, recreate it, and import the data back in, possibly removing the 'volatile.last_state.power' keys upon reimport.
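A rough sketch of that script, assuming each INSERT statement in the .dump output fits on one line (worth checking against the actual dump before relying on it):

```shell
# Sketch (untested against a live cluster): keep the table's INSERT
# statements minus the volatile.last_state.power rows, then replay them
# one statement at a time after dropping and recreating the table.
lxd sql global .dump > /root/global.dump.sql
grep 'INSERT INTO containers_config' /root/global.dump.sql \
  | grep -v 'volatile.last_state.power' > /root/containers_config.sql

# ...drop and recreate containers_config as earlier in the thread, then:
while IFS= read -r stmt; do
  lxd sql global "$stmt"
done < /root/containers_config.sql
```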

Yes, that is what I do as the last step: