Corrupted database after automatic snap refresh from lxd 3.14 to lxd 3.15

Hi everyone!
I have two LXD clusters. One of them went through the upgrade from 3.14 to 3.15 and somehow managed to come back online with everything up and running.
Unfortunately, the other did not.
On every node of this cluster (there are 3 of them), the snap process responsible for upgrading LXD hung, and I had a lot of trouble getting them all back online.
I had 5 containers on this cluster, and now only 3 are shown when I run lxc list.
Here is what I tried in order to get my two missing containers back:

  • On one of the nodes of the cluster, I ran sqlite3 on global.bak/db.bin and issued .dump so as to have all the data in front of me.
  • In another terminal, I ran all the queries that seemed necessary to inject the missing data. There were a lot of them, but if I remember correctly, I inserted the missing rows into the tables containers, containers_config, containers_devices, containers_devices_config, containers_profiles, storage_volumes and storage_volumes_config.
  • After this rather long job, the lost containers were displayed correctly when I ran “lxc list” or “lxc config show [container]”. However, lxc start [container] raised this error:

Error: Common start logic: Load go-lxc struct: invalid character ‘I’ looking for beginning of object key string
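
For reference, the comparison described above (reading the rows out of the backup and working out which ones are missing) can be sketched in Python. This is only a rough illustration, assuming db.bin opens as a regular SQLite file, as it did with sqlite3 here. The function names dump_tables and missing_rows are made up for the example, and no column names are assumed. It only reads the backup and writes nothing back.

```python
import sqlite3
from pathlib import Path


def dump_tables(db_path, tables):
    """Return {table: (column_names, rows)} for each requested table."""
    # Open the file read-only so the backup cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        result = {}
        for table in tables:
            # Table names cannot be bound as parameters, so quote the identifier.
            quoted = '"' + table.replace('"', '""') + '"'
            cur = conn.execute(f"SELECT * FROM {quoted}")
            columns = [d[0] for d in cur.description]
            result[table] = (columns, cur.fetchall())
        return result
    finally:
        conn.close()


def missing_rows(backup_rows, live_rows):
    """Rows that exist in the backup but not in the live copy."""
    live = set(live_rows)
    return [row for row in backup_rows if row not in live]
```

Something like dump_tables("global.bak/db.bin", ["containers", "containers_config"]) would then give the backup rows per table. How the matching rows are read from the running cluster is left open here.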

At that point, there was nothing more I could do. As suggested on the IRC channel, I tried another approach. Here is what I did on every node:

  • stopped LXD.
  • backed up the /var/snap/lxd/common/lxd/database/global folder.
  • deleted the global folder.
  • copied global.bak to global.
  • restarted LXD.
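
The per-node steps above could be scripted roughly as follows. This is a sketch under assumptions, not an official procedure: restore_global and the global.broken directory name are hypothetical, the script is assumed to run as root, and it has to be repeated on each node.

```python
import shutil
import subprocess
from pathlib import Path


def restore_global(db_dir, manage_snap=True):
    """Replace <db_dir>/global with a copy of <db_dir>/global.bak."""
    db_dir = Path(db_dir)
    current = db_dir / "global"
    backup = db_dir / "global.bak"
    saved = db_dir / "global.broken"  # hypothetical name for the safety copy

    if not backup.is_dir():
        raise FileNotFoundError(f"no backup directory at {backup}")
    if saved.exists():
        raise FileExistsError(f"{saved} already exists, refusing to overwrite")

    if manage_snap:
        subprocess.run(["snap", "stop", "lxd"], check=True)

    # Renaming keeps the current state aside and frees the "global" name,
    # which covers both the "back up" and the "delete" step.
    current.rename(saved)
    # copytree keeps permissions and timestamps; ownership becomes that of
    # the user running the script (assumed to be root here).
    shutil.copytree(backup, current)

    if manage_snap:
        subprocess.run(["snap", "start", "lxd"], check=True)
    return saved
```

It would be called as restore_global("/var/snap/lxd/common/lxd/database"). If a step fails after LXD has been stopped, LXD stays stopped, so the state of the directories should be checked by hand before starting it again.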

“snap start lxd” took quite a long time (I guess it was upgrading the database to the new format). And… my two containers disappeared once again.

In case it helps, the three virtual machines running LXD use Ceph as their storage backend, so the RBD devices for my two missing containers are still there.
Of course, I could destroy these two RBD devices and re-create my two missing containers, but that would be a last resort.

Any help getting things running as expected would be appreciated!

Hello,

Please send me a tarball of the database/ directory of all three nodes (with everything, including the global.bak folder). I’ll take a look and see what’s wrong. Email is free dot ekanayaka at canonical dot com
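
One possible way to build such a tarball on each node is sketched below in Python. The function name archive_database and the output file name are placeholders chosen for the example. The database path is assumed to be the parent of the global folder mentioned earlier in the thread.

```python
import tarfile
from pathlib import Path


def archive_database(db_dir, out_path):
    """Write a gzip-compressed tarball containing the whole db_dir tree."""
    db_dir = Path(db_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps paths relative, e.g. "database/global.bak/...",
        # instead of embedding the full absolute path in the archive.
        tar.add(db_dir, arcname=db_dir.name)
    return Path(out_path)
```

For example, archive_database("/var/snap/lxd/common/lxd/database", "node1-database.tar.gz") run as root, once per node, with a different output name each time.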

Thanks a lot @freeekanayaka for your help and concern. I’ll do it as soon as I can!