Corrupted database after automatic snap refresh from lxd 3.14 to lxd 3.15

geodb27 · July 19, 2019, 9:27am

Hi everyone!
I have two LXD clusters. One of them struggled with the upgrade from 3.14 to 3.15 but somehow managed to come back online with everything up and running.
Unfortunately, the other did not.
On every node of this cluster (there are 3 of them), the snap process responsible for upgrading LXD hung, and I had a lot of trouble getting them all back online.
I had 5 containers on this cluster, and now only 3 are shown when I run lxc list.
Here is what I tried in order to get my two missing containers back:

On one of the nodes of the cluster, I launched sqlite3 on global.bak/db.bin and issued .dump to have all the data displayed in front of me.
In another terminal, I ran all the queries that seemed necessary to inject the missing data. There were a lot of them, but if I remember correctly, I inserted the missing data into the tables containers, containers_config, containers_devices, containers_devices_config, containers_profiles, storage_volumes and storage_volumes_config.
After this quite long piece of work, the lost containers were displayed correctly when I ran “lxc list” or “lxc config show [container]”. However, lxc start [container] raised an error:

Error: Common start logic: Load go-lxc struct: invalid character ‘I’ looking for beginning of object key string
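For anyone attempting the same manual comparison, here is a minimal sketch of how the rows missing from the live database could be listed instead of reading a full .dump by eye. It assumes both files are plain SQLite databases (for example two copies of db.bin) and that the table has a name column, which I have not checked against every LXD schema version. The function names key_values and missing_from_current are made up for this example. It is best run against copies of the files, never the ones LXD is using.

```python
import sqlite3


def key_values(db_path, table, key):
    """Return the set of values found in column `key` of `table`."""
    # Table and column names cannot be bound as SQL parameters,
    # so only plain identifiers are accepted here.
    if not (table.isidentifier() and key.isidentifier()):
        raise ValueError("table and key must be plain identifiers")
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(f'SELECT "{key}" FROM "{table}"').fetchall()
    finally:
        conn.close()
    return {row[0] for row in rows}


def missing_from_current(backup_db, current_db, table, key="name"):
    """List key values present in the backup but absent from the current db."""
    backup = key_values(backup_db, table, key)
    current = key_values(current_db, table, key)
    return sorted(backup - current)
```

Calling missing_from_current("backup-copy.bin", "current-copy.bin", "containers") would then list the container names that exist only in the backup. The two file names are placeholders.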

So at this point there was nothing more I could do. As suggested on the IRC channel, I switched to another approach. Here is what I did on every node:

Stopped LXD.
Backed up the /var/snap/lxd/common/lxd/database/global folder.
Dropped the global folder.
Copied global.bak to global.
Restarted LXD.
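The folder operations in the steps above can be sketched as follows. This is only an illustration of the procedure, not an official recovery tool: the function name restore_global and the global.saved folder name are made up for this example, and it assumes LXD has already been stopped on the node before it runs and is restarted afterwards.

```python
import shutil
from pathlib import Path


def restore_global(database_dir, saved_name="global.saved"):
    """Replace database_dir/global with a copy of database_dir/global.bak.

    The existing global folder is kept under `saved_name` (a hypothetical
    name) rather than deleted, so the step can be undone by hand.
    """
    db = Path(database_dir)
    current = db / "global"
    backup = db / "global.bak"
    saved = db / saved_name

    if not backup.is_dir():
        raise FileNotFoundError(f"no backup folder at {backup}")
    if saved.exists():
        raise FileExistsError(f"{saved} already exists; refusing to overwrite")

    if current.exists():
        # Back up and drop the global folder in a single rename.
        current.rename(saved)
    # Copy global.bak to global; global.bak itself is left untouched.
    shutil.copytree(backup, current)
    return saved
```

On the nodes described here it would be called as restore_global("/var/snap/lxd/common/lxd/database"), once per node, with LXD stopped.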

“snap start lxd” took quite a long time (I guess it was upgrading the database to the new format). And… My two containers disappeared once again.

In case it helps: these three virtual machines running LXD use Ceph as their storage backend, so the RBD devices for my two missing containers are still there.
Of course, I could destroy these two RBD devices and re-create my two missing containers, but that would be my very last resort.

Any help getting things running as expected would be appreciated!

freeekanayaka · July 19, 2019, 11:58am

Hello,

Please send me a tarball of the database/ directory of all three nodes (with everything, including the global.bak folder). I’ll take a look and see what’s wrong. My email is free dot ekanayaka at canonical dot com

geodb27 · July 19, 2019, 12:24pm

Thanks a lot @freeekanayaka for your help and concern. I'll do it as soon as I can!