I currently have 3 LXD servers in a cluster. One of the servers had an issue starting LXD that I was in the process of resolving, but the other two were running fine. The two running hosts upgraded from 3.14 to 3.15, which caused the entire cluster to stop working and forced me to get the 3rd host back up and running. I got LXD running on that host, upgraded it to 3.15, and followed the advice in the thread "Cluster node appears offline after upgrade to 3.15" (thanks for that btw). I am now able to run lxc cluster list and have all 3 nodes report being online. What is happening now is that when I attempt to start a container or list the containers, I get the following error: "Error: Failed to fetch field Config: Failed to fetch ref for containers: project"
Right now I'm lost as to what to try in order to fix this.
You probably did that already, but if you haven't, try running systemctl reload snap.lxd.daemon on all your nodes.
If the problem still happens, can you try lxd sql global .schema so we can check whether there's maybe some schema problem with the database (as the error makes it sound like).
lxc query /1.0/containers worked and returned a list of containers.
lxc query /1.0/containers?recursion=1
lxc query /1.0/containers?recursion=2
Both return "Error: Failed to fetch field Config: Failed to fetch ref for containers: project"
Ok. I've tried a few off the hop and some are working and some are not. It doesn't seem to be specific to a node, as I've queried successfully on all nodes and have had failures on 2 of the 3 nodes. It will take a bit to work through the entire list of machines; just thought I'd give a quick update.
EDIT: As a new user I've reached my post limit for the day. So to answer your question below: yes, I get the same error message as before.
EDIT2: Here are the containers that failed with the error "Error: Failed to fetch container 'adlab01' in project 'default': Failed to fetch Container: Failed to fetch field Config: Failed to fetch ref for containers: project"
Ok, cool, and I'm assuming that when it fails, they fail with the same error you got earlier?
Hopefully we can track down those failures to something those containers have in common.
Oops, the post edit doesn't trigger a notification so I only saw this now.
I'll take a look through the earlier dump to see if there's anything unique about those containers that may explain why the database code is unhappy with them.
Ok, I found at least one data consistency issue in your database. Looking at adlab01 it has a duplicate volatile.last_state.power key which may be causing some problems.
Can you try:
lxd sql global "DELETE FROM containers_config WHERE container_id=78 AND key='volatile.last_state.power';"
This should take care of that duplicate entry, then try accessing:
lxc query /1.0/containers/adlab01
If that works fine, you can attempt at just fixing everything else (assuming the same issue) with:
lxd sql global "DELETE FROM containers_config WHERE key='volatile.last_state.power';"
The key isn't really used during runtime and will get re-added as needed upon container shutdown/reboot.
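To see why a duplicate key breaks things, here is a minimal local sketch using plain sqlite3 (the cluster database is dqlite, but it speaks the same SQL). The container_id of 78 and the key name come from the thread; the table is recreated here without its UNIQUE constraint so the inconsistent state can be reproduced.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Replica of containers_config WITHOUT the UNIQUE (container_id, key)
# constraint, so the duplicate row can exist at all.
db.execute("""CREATE TABLE containers_config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    container_id INTEGER NOT NULL,
    key TEXT NOT NULL,
    value TEXT)""")
db.execute("INSERT INTO containers_config (container_id, key, value) "
           "VALUES (78, 'volatile.last_state.power', 'RUNNING')")
db.execute("INSERT INTO containers_config (container_id, key, value) "
           "VALUES (78, 'volatile.last_state.power', 'STOPPED')")

# Find any (container_id, key) pairs that appear more than once.
dupes = db.execute("""SELECT container_id, key, COUNT(*) AS n
                      FROM containers_config
                      GROUP BY container_id, key
                      HAVING n > 1""").fetchall()
print(dupes)  # [(78, 'volatile.last_state.power', 2)]

# The fix suggested above: drop the key outright; LXD re-adds it as
# needed on container shutdown/reboot.
db.execute("DELETE FROM containers_config "
           "WHERE container_id=78 AND key='volatile.last_state.power'")
remaining = db.execute("SELECT COUNT(*) FROM containers_config "
                       "WHERE container_id=78").fetchone()[0]
print(remaining)  # 0
```

The GROUP BY/HAVING query is also a handy way to confirm whether any other containers carry the same kind of duplicate before running the blanket DELETE.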
Hurray, I can finally post again!
Upon running the command I get "Error: Failed to exec query: database disk image is malformed". However, I'm still able to dump and query the database.
root@lxdlab01:/var/snap/lxd/common/lxd/database/global# lxd sql global "DELETE FROM containers_config WHERE container_id=78 AND key='volatile.last_state.power';"
Error: Failed to exec query: database disk image is malformed
On one of the database nodes, can you create `/var/snap/lxd/common/lxd/database/patch.global.sql` containing:
DELETE FROM containers_config WHERE key='volatile.last_state.power';
Then do systemctl reload snap.lxd.daemon. This will use LXD's early database query mechanism to try to execute that query before the database goes fully online.
I suspect it will fail in the same way, but it's still worth a shot.
Would you mind making a tarball of /var/snap/lxd/common/lxd/database from all your database nodes and sending me that to stgraber at ubuntu dot com?
I'll run it on one of our test clusters to replicate the issue and forward that to @freeekanayaka so he can figure out how that might have happened and how to make it consistent again.
Deleting all records from the containers_config table (a dump via lxd sql global .dump was of course saved first) did not make the invalid records go away: https://pastebin.com/raw/CDHcJj2J
root@lxdhome01:~# lxc list
Error: Failed to fetch field Config: Failed to fetch ref for containers: project
root@lxdhome01:~# lxd sql global "drop table containers_config;"
Rows affected: 1
root@lxdhome01:~# lxd sql global "CREATE TABLE containers_config (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, container_id INTEGER NOT NULL, key TEXT NOT NULL, value TEXT, FOREIGN KEY (container_id) REFERENCES containers (id) ON DELETE CASCADE, UNIQUE (container_id, key));"
Rows affected: 1
root@lxdhome01:~# lxd sql global "select containers.name, containers_config.id, containers_config.key FROM containers JOIN containers_config ON containers.id=containers_config.container_id where containers.name='test02';"
+------+----+-----+
| name | id | key |
+------+----+-----+
+------+----+-----+
And I can now list and start the container and it repopulates the containers_config table.
lxc list test02
+--------+---------+------+------+------------+-----------+----------+
|  NAME  |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+------+------+------------+-----------+----------+
| test02 | STOPPED |      |      | PERSISTENT | 0         | lxdlab02 |
+--------+---------+------+------+------------+-----------+----------+
root@lxdhome01:~# lxc start test02
root@lxdhome01:~# lxc list test02
+--------+---------+--------------------+------+------------+-----------+----------+
|  NAME  |  STATE  |        IPV4        | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+--------------------+------+------------+-----------+----------+
| test02 | RUNNING | 10.9.10.171 (eth0) |      | PERSISTENT | 0         | lxdlab02 |
+--------+---------+--------------------+------+------------+-----------+----------+
root@lxdhome01:~# lxd sql global "select containers.name, containers_config.id, containers_config.key FROM containers JOIN containers_config ON containers.id=containers_config.container_id where containers.name='test02';"
+--------+----+---------------------------+
|  name  | id |            key            |
+--------+----+---------------------------+
| test02 | 9  | volatile.eth0.host_name   |
| test02 | 6  | volatile.eth0.hwaddr      |
| test02 | 7  | volatile.idmap.current    |
| test02 | 8  | volatile.last_state.power |
+--------+----+---------------------------+
So I'm thinking once I start all the containers I should be back to a working state.
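One reason the drop-and-recreate approach above helps going forward: the recreated table carries the UNIQUE (container_id, key) constraint, so the duplicate volatile.last_state.power rows seen earlier should no longer be insertable. A quick local check with plain sqlite3 (the FOREIGN KEY clause is omitted here since there is no containers table in this sketch):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Same shape as the recreated containers_config table, minus the
# foreign key, plus the important UNIQUE (container_id, key) constraint.
db.execute("""CREATE TABLE containers_config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    container_id INTEGER NOT NULL,
    key TEXT NOT NULL,
    value TEXT,
    UNIQUE (container_id, key))""")
db.execute("INSERT INTO containers_config (container_id, key, value) "
           "VALUES (78, 'volatile.last_state.power', 'RUNNING')")

# A second row with the same (container_id, key) pair must be rejected.
try:
    db.execute("INSERT INTO containers_config (container_id, key, value) "
               "VALUES (78, 'volatile.last_state.power', 'STOPPED')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```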
Ok, not quite there yet. While that did get my hosts to list and start, there must be some non-volatile data in that table, as I'm getting setuid errors within the containers. What I'm going to have to do is write a script to dump that table, recreate it, and import the data back in, possibly removing the "volatile.last_state.power" keys upon reimport.
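A hypothetical sketch of that repair script, again using plain sqlite3 as a stand-in for the dump/reimport: copy containers_config rows into a freshly created table, skipping the volatile.last_state.power keys on the way. The function name and the sample keys are illustrative only.

```python
import sqlite3

def rebuild_containers_config(old_db: sqlite3.Connection,
                              new_db: sqlite3.Connection,
                              skip_keys=("volatile.last_state.power",)):
    """Recreate containers_config in new_db from old_db, dropping skip_keys."""
    new_db.execute("""CREATE TABLE containers_config (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        container_id INTEGER NOT NULL,
        key TEXT NOT NULL,
        value TEXT,
        UNIQUE (container_id, key))""")
    rows = old_db.execute(
        "SELECT container_id, key, value FROM containers_config").fetchall()
    for container_id, key, value in rows:
        if key in skip_keys:
            continue
        # INSERT OR IGNORE silently drops any remaining duplicate
        # (container_id, key) pairs instead of aborting the reimport.
        new_db.execute("INSERT OR IGNORE INTO containers_config "
                       "(container_id, key, value) VALUES (?, ?, ?)",
                       (container_id, key, value))
    new_db.commit()

# Tiny demo with in-memory databases standing in for the dump/reimport.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE containers_config (id INTEGER PRIMARY KEY, "
            "container_id INTEGER, key TEXT, value TEXT)")
old.executemany("INSERT INTO containers_config (container_id, key, value) "
                "VALUES (?, ?, ?)", [
    (78, "security.privileged", "true"),
    (78, "volatile.last_state.power", "RUNNING"),
    (78, "volatile.last_state.power", "STOPPED"),  # the duplicate row
])
new = sqlite3.connect(":memory:")
rebuild_containers_config(old, new)
kept = [r[0] for r in new.execute("SELECT key FROM containers_config")]
print(kept)  # ['security.privileged']
```

Note this only illustrates the filtering logic; on the real cluster the equivalent statements would have to go through lxd sql global (or the patch.global.sql mechanism mentioned earlier), not a local SQLite file.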