I currently have 3 LXD servers in a cluster. One of the servers had an issue starting LXD that I was in the process of resolving, but the other two were running fine. The two running hosts upgraded from 3.14 to 3.15, which caused the entire cluster to stop working and forced me to get the 3rd host back up and running. I got LXD running on that host, upgraded it to 3.15, and followed the advice in the thread "Cluster node appears offline after upgrade to 3.15" (thanks for that btw). I am now able to run lxc cluster list and have all 3 nodes report being online. What is happening now is that when I attempt to start a container or list the containers, I get the following error: "Error: Failed to fetch field Config: Failed to fetch ref for containers: project"
Right now I'm lost as to what to try in order to fix this.
You probably did that already, but if you haven't, try running systemctl reload snap.lxd.daemon on all your nodes.
If the problem still happens, can you try lxd sql global .schema so we can check whether there's maybe some schema problem with the database (as the error makes it sound like).
lxc query /1.0/containers worked and returned a list of containers.
lxc query /1.0/containers?recursion=1
lxc query /1.0/containers?recursion=2
Both return "Error: Failed to fetch field Config: Failed to fetch ref for containers: project"
Ok. I've tried a few off the hop and some are working and some are not. It doesn't seem to be specific to a node, as I've queried successfully on all nodes and have had failures on 2 of the 3 nodes. It will take a bit to work through the entire list of machines; just thought I'd give a quick update.
EDIT: As a new user I've reached my post limit for the day. So to answer your question below: yes, I get the same error message as before.
EDIT2: Here are the containers that failed with the error "Error: Failed to fetch container 'adlab01' in project 'default': Failed to fetch Container: Failed to fetch field Config: Failed to fetch ref for containers: project"
Ok, cool, and I'm assuming that when it fails, they fail with the same error you got earlier?
Hopefully we can track down those failures to something those containers have in common.
Oops, the post edit doesn't trigger a notification so I only saw this now.
I'll take a look through the earlier dump to see if there's anything unique about those containers that may explain why the database code is unhappy with them.
Ok, I found at least one data consistency issue in your database. Looking at adlab01 it has a duplicate volatile.last_state.power key which may be causing some problems.
Can you try:
lxd sql global "DELETE FROM containers_config WHERE container_id=78 AND key='volatile.last_state.power';"
This should take care of that duplicate entry, then try accessing:
lxc query /1.0/containers/adlab01
If that works fine, you can attempt at just fixing everything else (assuming the same issue) with:
lxd sql global "DELETE FROM containers_config WHERE key='volatile.last_state.power';"
The key isn't really used during runtime and will get re-added as needed upon container shutdown/reboot.
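To see why a duplicate key breaks things, here is a minimal local sketch using plain sqlite3 (the cluster database is dqlite, but it speaks the same SQL). The container_id of 78 and the key name come from the thread; the table is recreated here without its UNIQUE constraint so the inconsistent state can be reproduced.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Replica of containers_config WITHOUT the UNIQUE (container_id, key)
# constraint, so the duplicate row can exist at all.
db.execute("""CREATE TABLE containers_config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    container_id INTEGER NOT NULL,
    key TEXT NOT NULL,
    value TEXT)""")
db.execute("INSERT INTO containers_config (container_id, key, value) "
           "VALUES (78, 'volatile.last_state.power', 'RUNNING')")
db.execute("INSERT INTO containers_config (container_id, key, value) "
           "VALUES (78, 'volatile.last_state.power', 'STOPPED')")

# Find any (container_id, key) pairs that appear more than once.
dupes = db.execute("""SELECT container_id, key, COUNT(*) AS n
                      FROM containers_config
                      GROUP BY container_id, key
                      HAVING n > 1""").fetchall()
print(dupes)  # [(78, 'volatile.last_state.power', 2)]

# The fix suggested above: drop the key outright; LXD re-adds it as
# needed on container shutdown/reboot.
db.execute("DELETE FROM containers_config "
           "WHERE container_id=78 AND key='volatile.last_state.power'")
remaining = db.execute("SELECT COUNT(*) FROM containers_config "
                       "WHERE container_id=78").fetchone()[0]
print(remaining)  # 0
```

The GROUP BY/HAVING query is also a handy way to confirm whether any other containers carry the same kind of duplicate before running the blanket DELETE.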
Hurray, I can finally post again!
Upon running the command I get "Error: Failed to exec query: database disk image is malformed". However, I'm still able to dump and query the database.
root@lxdlab01:/var/snap/lxd/common/lxd/database/global# lxd sql global "DELETE FROM containers_config WHERE container_id=78 AND key='volatile.last_state.power';"
Error: Failed to exec query: database disk image is malformed
On one of the database nodes, can you create `/var/snap/lxd/common/lxd/database/patch.global.sql` containing:
DELETE FROM containers_config WHERE key='volatile.last_state.power';
Then do systemctl reload snap.lxd.daemon. This will use LXD's early database query mechanism to try to execute that query before the database goes fully online.
I suspect it will fail in the same way, but it's still worth a shot.
Would you mind making a tarball of /var/snap/lxd/common/lxd/database from all your database nodes and sending me that to stgraber at ubuntu dot com?
I'll run it on one of our test clusters to replicate the issue and forward that to @freeekanayaka so he can figure out how that might have happened and how to make it consistent again.
Deleting all records from the containers_config table (a dump via lxd sql global .dump was of course saved first) did not make the invalid records go away: https://pastebin.com/raw/CDHcJj2J
root@lxdhome01:~# lxc list
Error: Failed to fetch field Config: Failed to fetch ref for containers: project
root@lxdhome01:~# lxd sql global "drop table containers_config;"
Rows affected: 1
root@lxdhome01:~# lxd sql global "CREATE TABLE containers_config (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, container_id INTEGER NOT NULL, key TEXT NOT NULL, value TEXT, FOREIGN KEY (container_id) REFERENCES containers (id) ON DELETE CASCADE, UNIQUE (container_id, key));"
Rows affected: 1
root@lxdhome01:~# lxd sql global "select containers.name, containers_config.id, containers_config.key FROM containers JOIN containers_config ON containers.id=containers_config.container_id where containers.name='test02';"
+------+----+-----+
| name | id | key |
+------+----+-----+
+------+----+-----+
And I can now list and start the container and it repopulates the containers_config table.
lxc list test02
+--------+---------+------+------+------------+-----------+----------+
|  NAME  |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+------+------+------------+-----------+----------+
| test02 | STOPPED |      |      | PERSISTENT | 0         | lxdlab02 |
+--------+---------+------+------+------------+-----------+----------+
root@lxdhome01:~# lxc start test02
root@lxdhome01:~# lxc list test02
+--------+---------+--------------------+------+------------+-----------+----------+
|  NAME  |  STATE  |        IPV4        | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+--------------------+------+------------+-----------+----------+
| test02 | RUNNING | 10.9.10.171 (eth0) |      | PERSISTENT | 0         | lxdlab02 |
+--------+---------+--------------------+------+------------+-----------+----------+
root@lxdhome01:~# lxd sql global "select containers.name, containers_config.id, containers_config.key FROM containers JOIN containers_config ON containers.id=containers_config.container_id where containers.name='test02';"
+--------+----+---------------------------+
|  name  | id |            key            |
+--------+----+---------------------------+
| test02 | 9  | volatile.eth0.host_name   |
| test02 | 6  | volatile.eth0.hwaddr      |
| test02 | 7  | volatile.idmap.current    |
| test02 | 8  | volatile.last_state.power |
+--------+----+---------------------------+
So I'm thinking once I start all the containers I should be back to a working state.
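One reason the drop-and-recreate approach above helps going forward: the recreated table carries the UNIQUE (container_id, key) constraint, so the duplicate volatile.last_state.power rows seen earlier should no longer be insertable. A quick local check with plain sqlite3 (the FOREIGN KEY clause is omitted here since there is no containers table in this sketch):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Same shape as the recreated containers_config table, minus the
# foreign key, plus the important UNIQUE (container_id, key) constraint.
db.execute("""CREATE TABLE containers_config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    container_id INTEGER NOT NULL,
    key TEXT NOT NULL,
    value TEXT,
    UNIQUE (container_id, key))""")
db.execute("INSERT INTO containers_config (container_id, key, value) "
           "VALUES (78, 'volatile.last_state.power', 'RUNNING')")

# A second row with the same (container_id, key) pair must be rejected.
try:
    db.execute("INSERT INTO containers_config (container_id, key, value) "
               "VALUES (78, 'volatile.last_state.power', 'STOPPED')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```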
Ok, not quite there yet. While that did get my hosts to list and start, there must be some non-volatile data in that table, as I'm getting setuid errors within the containers. What I'm going to have to do is write a script to dump that table, recreate it, and import the data back in, possibly removing the "volatile.last_state.power" keys upon reimport.
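A hypothetical sketch of that repair script, again using plain sqlite3 as a stand-in for the dump/reimport: copy containers_config rows into a freshly created table, skipping the volatile.last_state.power keys on the way. The function name and the sample keys are illustrative only.

```python
import sqlite3

def rebuild_containers_config(old_db: sqlite3.Connection,
                              new_db: sqlite3.Connection,
                              skip_keys=("volatile.last_state.power",)):
    """Recreate containers_config in new_db from old_db, dropping skip_keys."""
    new_db.execute("""CREATE TABLE containers_config (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        container_id INTEGER NOT NULL,
        key TEXT NOT NULL,
        value TEXT,
        UNIQUE (container_id, key))""")
    rows = old_db.execute(
        "SELECT container_id, key, value FROM containers_config").fetchall()
    for container_id, key, value in rows:
        if key in skip_keys:
            continue
        # INSERT OR IGNORE silently drops any remaining duplicate
        # (container_id, key) pairs instead of aborting the reimport.
        new_db.execute("INSERT OR IGNORE INTO containers_config "
                       "(container_id, key, value) VALUES (?, ?, ?)",
                       (container_id, key, value))
    new_db.commit()

# Tiny demo with in-memory databases standing in for the dump/reimport.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE containers_config (id INTEGER PRIMARY KEY, "
            "container_id INTEGER, key TEXT, value TEXT)")
old.executemany("INSERT INTO containers_config (container_id, key, value) "
                "VALUES (?, ?, ?)", [
    (78, "security.privileged", "true"),
    (78, "volatile.last_state.power", "RUNNING"),
    (78, "volatile.last_state.power", "STOPPED"),  # the duplicate row
])
new = sqlite3.connect(":memory:")
rebuild_containers_config(old, new)
kept = [r[0] for r in new.execute("SELECT key FROM containers_config")]
print(kept)  # ['security.privileged']
```

Note this only illustrates the filtering logic; on the real cluster the equivalent statements would have to go through lxd sql global (or the patch.global.sql mechanism mentioned earlier), not a local SQLite file.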