Cannot list instances: Failed to fetch instances: id

Hi folks,

I have a machine where I’m unable to start a couple of specific containers, I’m unable to list containers at all, and several internal LXD functions are broken (snapshot expiries, automated snapshots, etc.). I suspect database corruption, but I can’t pin it down:

t=2020-11-03T22:15:50+0000 lvl=eror msg="Problem loading instances list" err="Failed to fetch instances: id"
t=2020-11-03T22:17:50+0000 lvl=eror msg="Failed to load containers for scheduled snapshots" err="Failed to fetch instances: id"
t=2020-11-03T22:17:50+0000 lvl=eror msg="Failed to load instances for snapshot expiry" err="Failed to fetch instances: id"
t=2020-11-03T22:18:50+0000 lvl=eror msg="Failed to load instances for snapshot expiry" err="Failed to fetch instances: id"

When I try to list containers with --fast, the only response is Error: name. When I try to start one of the broken containers, I get:

Error: Common start logic: Failed to get snapshots: Failed to fetch instance_snapshots: id

I get a message about using --show-log, but there’s nothing useful in there (the output is the same as lxc info without --show-log).
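
For completeness, the commands I’m running here are roughly the following (the container name is a placeholder):

lxc list --fast
lxc start BROKEN_CONTAINER_HERE
lxc info BROKEN_CONTAINER_HERE --show-log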

I can dump the entire global database without issue. I’ve distilled things down to the following query, which fails when run with one of the broken container names but works (at least, it returns an empty set) for other containers:

lxd sql global "SELECT instances_snapshots.id, projects.name AS project, instances.name AS instance, instances_snapshots.name, instances_snapshots.creation_date, instances_snapshots.stateful, coalesce(instances_snapshots.description, ''), instances_snapshots.expiry_date 
FROM instances_snapshots JOIN projects ON instances.project_id = projects.id JOIN instances ON instances_snapshots.instance_id = instances.id 
WHERE instance = 'BROKEN_CONTAINER_HERE' ORDER BY projects.id, instances.id, instances_snapshots.name"
Error: Failed to execute query: id

Adding --debug to this command doesn’t do anything helpful except repeat the query back to me.
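
For reference, the dump itself was taken with the built-in dump subcommand, something along the lines of:

lxd sql global .dump > /root/lxd-global-dump.sql

(the output path is just where I happened to put it).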

I’ve stripped the query down further, in case it helps narrow down the problem:

lxd sql global "SELECT instances_snapshots.id, instances_snapshots.name, instances_snapshots.creation_date, instances_snapshots.stateful, instances_snapshots.expiry_date FROM instances_snapshots WHERE instances_snapshots.instance_id = 76"

Here 76 is the ID of one of the affected containers, and I get Error: Failed to execute query: id. If I change the ID to a different container, I get a result.

Does lxd sql global "SELECT id, instance_id, name FROM instances_snapshots" also fail?

No, but it does fail if I append creation_date to the column list. And actually that’s slightly more helpful!

lxd sql global "SELECT id, instance_id, name, creation_date FROM instances_snapshots"
Error: Got a row error: parsing time "12" as "2006-01-02": cannot parse "12" as "2006"

Interestingly, if I pull only the IDs and names I do get output, and I can see about 8 rows that are not present in the dump I generated earlier.

These snapshots aren’t present in zfs list -t snapshot, so I wonder if I can just delete them from the DB and then bring the container up?
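
Concretely, the comparison was roughly this, with 76 being the affected container’s ID as above:

lxd sql global "SELECT id, instance_id, name FROM instances_snapshots WHERE instance_id = 76"
zfs list -t snapshot | grep BROKEN_CONTAINER_HERE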

Ok, so it’s a bad snapshot time, which probably messed things up with a newer sqlite being less tolerant :slight_smile:

You should be able to do something like UPDATE instances_snapshots SET creation_date="2020-11-02" WHERE creation_date="12";
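
i.e. wrapped in the sql subcommand, something like:

lxd sql global "UPDATE instances_snapshots SET creation_date='2020-11-02' WHERE creation_date='12'"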

No luck on that one: querying for that condition returns four rows, but the UPDATE only changed two.

Now I get a different error, this time about parsing time “0”.
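
For reference, I’m trying to spot the bad rows with a rough pattern match on the column (the pattern is just my guess at what a valid value looks like):

lxd sql global "SELECT id, instance_id, name, creation_date FROM instances_snapshots WHERE creation_date NOT LIKE '____-__-__%'"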

That’s, hmm, interesting.
How much do you care about having accurate snapshot creation times?

If you don’t, you could probably just do an UPDATE instances_snapshots SET creation_date="2020-11-02"; and effectively replace the entire column with a valid value.

We definitely care. Helpfully, though, the devs who built the service that talks to LXD put the unix timestamp in the snapshot name, so I think I can reset the dates from that. Let me try.
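
The exact name format is specific to our tooling, but the idea would be roughly this (the timestamp and row id here are made up):

ts=1604440123   # unix timestamp pulled out of the snapshot name
lxd sql global "UPDATE instances_snapshots SET creation_date='$(date -u -d @$ts '+%Y-%m-%d %H:%M:%S')' WHERE id=123"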

Hmm, it looks like my database is very sick - I just tried a different query to grab a list of affected rows, and I’m getting part of the LXD uidmap config back as a date. :scream:

Is there a way to force LXD to somehow recreate the database from what it knows? Maybe via the local DB instead of the global one?

I could delete the database and then reimport all the containers from their backup.yaml files, but will that reimport snapshots etc.? And can I do that without restarting containers?

Ok, before we do any more damage, make a copy of /var/snap/lxd/common/lxd/database.
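
Something along these lines will do (the destination path is up to you):

cp -a /var/snap/lxd/common/lxd/database /root/lxd-database-backup-$(date +%Y%m%d)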

When was your database last in a properly working state?

I’m not sure - BTW, I have a copy of the dump from before I started messing around with things at all. It’s not complete, in that there are definitely rows missing from it, but those seem to be the corrupt ones.

Actually, that’s a point - can I delete the database and reimport from the dump? That’d probably get me most of the way to a valid state, starting from a clean slate.

If it’s something recent, all we may need to do is just remove a few of the recent segments to undo the corruption.

Looks like it’s been at least three hours since the problems started, going by the logs alone. :frowning:

BTW, when we first had issues I tried restarting just the LXD process, and it looks like that tried to restart dnsmasq as well. Since the latter cannot start, DHCP leases have started to expire, so downtime is inevitable and any fixes that require container restarts are 100% on the table.

Ok. I need to step away for a while.

Once you’ve made sure you have a good copy of the database directory, look inside database/global/

You’ll see something like this:

root@castiana:~# ls -l /var/snap/lxd/common/lxd/database/global
total 12683
-rw------- 1 root root 8388176 Oct 26 11:57 0000000000222817-0000000000224806
-rw------- 1 root root 8388176 Oct 26 12:01 0000000000224807-0000000000226796
-rw------- 1 root root 8387960 Oct 26 12:04 0000000000226797-0000000000228783
-rw------- 1 root root 3966104 Oct 26 12:07 0000000000228784-0000000000229718
-rw------- 1 root root  264288 Oct 26 12:22 0000000000229719-0000000000229741
-rw------- 1 root root  623064 Oct 26 14:04 0000000000229742-0000000000229788
-rw------- 1 root root  755688 Oct 26 22:32 0000000000229789-0000000000229853
-rw------- 1 root root  536880 Oct 27 08:42 0000000000229854-0000000000229900
-rw------- 1 root root 1440096 Oct 27 17:36 0000000000229901-0000000000230009
-rw------- 1 root root 7336248 Oct 27 18:13 0000000000230010-0000000000230385
-rw------- 1 root root 2598720 Oct 27 19:03 0000000000230386-0000000000230512
-rw------- 1 root root  227208 Oct 27 19:04 0000000000230513-0000000000230533
-rw------- 1 root root  157080 Oct 27 19:05 0000000000230534-0000000000230549
-rw------- 1 root root  272568 Oct 27 20:12 0000000000230550-0000000000230573
-rw------- 1 root root 1279920 Oct 27 22:58 0000000000230574-0000000000230680
-rw------- 1 root root  157080 Oct 27 23:00 0000000000230681-0000000000230696
-rw------- 1 root root  198456 Oct 28 01:32 0000000000230697-0000000000230717
-rw------- 1 root root  343056 Oct 28 10:18 0000000000230718-0000000000230751
-rw------- 1 root root  338664 Oct 28 12:32 0000000000230752-0000000000230781
-rw------- 1 root root  157080 Oct 28 13:54 0000000000230782-0000000000230797
-rw------- 1 root root  388200 Oct 28 17:31 0000000000230798-0000000000230831
-rw------- 1 root root  268392 Oct 28 17:32 0000000000230832-0000000000230854
-rw------- 1 root root  297336 Oct 28 17:42 0000000000230855-0000000000230880
-rw------- 1 root root  281064 Oct 28 22:47 0000000000230881-0000000000230908
-rw------- 1 root root  289344 Oct 29 05:07 0000000000230909-0000000000230937
-rw------- 1 root root  413040 Oct 29 11:55 0000000000230938-0000000000230974
-rw------- 1 root root  512328 Oct 29 21:42 0000000000230975-0000000000231022
-rw------- 1 root root  491232 Oct 29 21:44 0000000000231023-0000000000231062
-rw------- 1 root root  190128 Oct 29 21:48 0000000000231063-0000000000231081
-rw------- 1 root root  573792 Oct 29 21:57 0000000000231082-0000000000231128
-rw------- 1 root root  198408 Oct 29 22:06 0000000000231129-0000000000231148
-rw------- 1 root root  198408 Oct 29 22:08 0000000000231149-0000000000231168
-rw------- 1 root root  338808 Oct 29 22:12 0000000000231169-0000000000231200
-rw------- 1 root root 1085664 Oct 29 22:40 0000000000231201-0000000000231288
-rw------- 1 root root  371856 Oct 29 22:45 0000000000231289-0000000000231323
-rw------- 1 root root  206688 Oct 29 23:50 0000000000231324-0000000000231344
-rw------- 1 root root  405048 Oct 30 11:40 0000000000231345-0000000000231384
-rw------- 1 root root  854520 Oct 30 20:22 0000000000231385-0000000000231454
-rw------- 1 root root  326208 Oct 30 20:32 0000000000231455-0000000000231482
-rw------- 1 root root  375456 Oct 30 20:37 0000000000231483-0000000000231510
-rw------- 1 root root  383808 Oct 30 21:00 0000000000231511-0000000000231540
-rw------- 1 root root  907752 Oct 30 21:07 0000000000231541-0000000000231608
-rw------- 1 root root  486768 Oct 30 21:12 0000000000231609-0000000000231643
-rw------- 1 root root 1022736 Oct 30 21:14 0000000000231644-0000000000231712
-rw------- 1 root root 1810320 Oct 30 23:25 0000000000231713-0000000000231833
-rw------- 1 root root  239736 Oct 31 04:35 0000000000231834-0000000000231857
-rw------- 1 root root  425712 Oct 31 17:30 0000000000231858-0000000000231899
-rw------- 1 root root  734880 Nov  1 00:11 0000000000231900-0000000000231960
-rw------- 1 root root  412968 Nov  1 00:17 0000000000231961-0000000000231996
-rw------- 1 root root  495408 Nov  1 00:20 0000000000231997-0000000000232037
-rw------- 1 root root  454080 Nov  1 00:23 0000000000232038-0000000000232074
-rw------- 1 root root  454128 Nov  1 00:25 0000000000232075-0000000000232112
-rw------- 1 root root  908328 Nov  1 14:10 0000000000232113-0000000000232188
-rw------- 1 root root 1082208 Nov  2 15:12 0000000000232189-0000000000232285
-rw------- 1 root root  565608 Nov  2 15:17 0000000000232286-0000000000232332
-rw------- 1 root root  474744 Nov  2 15:21 0000000000232333-0000000000232371
-rw------- 1 root root  474744 Nov  2 15:29 0000000000232372-0000000000232410
-rw------- 1 root root 1320216 Nov  2 21:40 0000000000232411-0000000000232507
-rw------- 1 root root 1168392 Nov  3 19:00 0000000000232508-0000000000232604
-rw-r--r-- 1 root root 1146880 Nov  3 19:00 db.bin
-rw-r--r-- 1 root root      32 Dec 12  2019 metadata1
-rw-r--r-- 1 root root      32 Dec 12  2019 metadata2
-rw------- 1 root root 8388608 Nov  3 19:01 open-1
-rw------- 1 root root 8388608 Nov  3 19:01 open-2
-rw------- 1 root root 8388608 Nov  3 19:01 open-3
-rw------- 1 root root 4447072 Oct 27 22:41 snapshot-622-230635-2633346115
-rw------- 1 root root      56 Oct 27 22:41 snapshot-622-230635-2633346115.meta
-rw------- 1 root root 3899112 Oct 30 21:12 snapshot-622-231659-119842149
-rw------- 1 root root      56 Oct 30 21:12 snapshot-622-231659-119842149.meta
root@castiana:~# 

In this example, the most recent snapshot of my database is at transaction 231659.
I then have a number of segments (the files with two numbers in their name) which get me all the way up to 232604, my current transaction ID.

In this case, I can technically delete (starting from the bottom) all of those files:

-rw------- 1 root root 1810320 Oct 30 23:25 0000000000231713-0000000000231833
-rw------- 1 root root  239736 Oct 31 04:35 0000000000231834-0000000000231857
-rw------- 1 root root  425712 Oct 31 17:30 0000000000231858-0000000000231899
-rw------- 1 root root  734880 Nov  1 00:11 0000000000231900-0000000000231960
-rw------- 1 root root  412968 Nov  1 00:17 0000000000231961-0000000000231996
-rw------- 1 root root  495408 Nov  1 00:20 0000000000231997-0000000000232037
-rw------- 1 root root  454080 Nov  1 00:23 0000000000232038-0000000000232074
-rw------- 1 root root  454128 Nov  1 00:25 0000000000232075-0000000000232112
-rw------- 1 root root  908328 Nov  1 14:10 0000000000232113-0000000000232188
-rw------- 1 root root 1082208 Nov  2 15:12 0000000000232189-0000000000232285
-rw------- 1 root root  565608 Nov  2 15:17 0000000000232286-0000000000232332
-rw------- 1 root root  474744 Nov  2 15:21 0000000000232333-0000000000232371
-rw------- 1 root root  474744 Nov  2 15:29 0000000000232372-0000000000232410
-rw------- 1 root root 1320216 Nov  2 21:40 0000000000232411-0000000000232507
-rw------- 1 root root 1168392 Nov  3 19:00 0000000000232508-0000000000232604

That would progressively take my database back in time.

You can take the same approach on your side: identify your latest snapshot, then try removing the last segment file and see if the DB recovers; if it doesn’t, go back one more segment, and so on.


Based on when a segment was written, you can get an idea of how far back in time deleting it will take you.

If this works for you, it should let you go back just a tiny bit in time rather than having to re-create or manually re-import everything into a clean DB.

Thanks, I’ll give that a go. Do I need to restart LXD for it to take effect, or is it instant?

Ideally you should do this without LXD running. Doing kill PID on the LXD process should have it shut down and not come back up. You can then remove a segment and start it back up.

If this approach somehow doesn’t stop LXD, then the alternative is to do the segment change and then kill -9 the LXD process so it doesn’t have a chance to flush anything into the DB before it restarts.
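
Putting it together, the rough sequence would be something like this; the segment name is just the most recent one from my example above (yours will differ), and I’d move it aside rather than delete it so it can be put back:

cp -a /var/snap/lxd/common/lxd/database /root/lxd-database-backup
kill LXD_PID        # kill -9 if it won't stay down
cd /var/snap/lxd/common/lxd/database/global
mv 0000000000232508-0000000000232604 /root/
snap start lxd      # or however you normally start the daemon, then re-check the DB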