Error: database disk image is malformed

stgraber · January 16, 2018, 10:01pm

What does ls -lah /var/lib/lxd show you?
I wonder if there are some lock files or something that sqlite would create…

I’m assuming you’ve tried rebooting the system to make sure that it wasn’t some lock being held by some running process?

serbik · January 16, 2018, 10:07pm

@stgraber Yeah, I tried a few reboots. Same results, and the same containers are showing as ERROR.

root@server:~# ls -lah /var/lib/lxd/
total 232K
drwxr-xr-x 13 root root 4.0K Jan 16 21:07 .
drwxr-xr-x 50 root root 4.0K Jan 16 21:09 ..
drwx--x--x  2 root root 4.0K Nov 19 02:52 containers
drwx--x--x 11 root root 4.0K Nov 19 02:52 devices
drwxr-xr-x  2 root root 4.0K Jul  5  2017 devlxd
drwx------  2 root root 4.0K Aug 17 02:27 disks
drwx------  2 root root 4.0K Nov 28 07:32 images
-rw-r--r--  1 root root 100K Jan 16 21:03 lxd.db
-rw-r--r--  1 root root  68K Aug 17 01:05 lxd.db.bak
drwx--x--x  2 root root 4.0K Aug 17 01:05 networks
drwx------  4 root root 4.0K Jul  5  2017 security
-rw-r--r--  1 root root 1.9K Jul  5  2017 server.crt
-rw-------  1 root root 3.2K Jul  5  2017 server.key
drwx--x--x  2 root root 4.0K Jul  5  2017 shmounts
drwx------  2 root root 4.0K Dec 12 23:52 snapshots
drwxr-xr-x  3 root root 4.0K Oct  3 19:16 storage-data
drwx--x--x  5 root root 4.0K Oct  3 19:18 storage-pools
srw-rw----  1 root lxd     0 Jan 16 20:25 unix.socket

dnoe · January 17, 2018, 3:08pm

I am also facing this issue. Everything looks ok, but over half my containers are stuck in that error state.

dnoe · January 17, 2018, 3:57pm

I’ve noticed some really odd behavior. I was updating the system and rebooting multiple times, and different containers moved in and out of the error state across reboots.

Meaning a container that was marked as ERROR showed as merely stopped after a reboot, while one that had been stopped was then marked as ERROR. It’s really got me puzzled.

dnoe · January 17, 2018, 5:57pm

I’ve downloaded the LXD database (/var/lib/lxd/lxd.db) and ran an integrity check on it. It came back with quite a few rows missing from the “sqlite_autoindex_containers_config_1” index. I wonder if just rebuilding the indexes would resolve this.
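For reference, a minimal sketch of that kind of integrity check using Python’s built-in sqlite3 module. The function names are made up for illustration, and it assumes you run it against a copy of lxd.db rather than the live file.

```python
import sqlite3


def integrity_report(db_path):
    """Return the lines produced by PRAGMA integrity_check as a list of strings.

    A healthy database yields exactly ["ok"]. Run this on a copy of the file.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Severe corruption can make the check itself fail.
        return [f"integrity_check failed: {exc}"]
    finally:
        conn.close()
    return [row[0] for row in rows]


def is_healthy(db_path):
    """True only when the integrity check reports a single "ok" line."""
    return integrity_report(db_path) == ["ok"]
```

Something like `print("\n".join(integrity_report("lxd.db.copy")))` would then list the same kind of “row N missing from index …” messages, where `lxd.db.copy` is just a placeholder file name.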

ADDED at a later time:
I just ran the same integrity check on a database file that isn’t corrupt, and it comes back OK. So I am pretty sure that’s the problem, just not sure if there’s a fix or anything.

I think I resolved this issue. I exported, then re-imported the data into a new database. After replacing the corrupted DB, all the containers were showing the “stopped” state. I am still having some issues unrelated to the corrupted database.

serbik · January 17, 2018, 9:13pm

@dnoe - were you getting the same error on lxc start?

I still have the original db, though my only backup is a few months old, if it’s any use for testing for the cause.

dnoe · January 17, 2018, 9:17pm

Yes. The exact same error. I have finally resolved all the issues, but that was a complete pain.

Take the corrupt database, and export the data to an .sql file. Create a new database, then import the data from the .sql file. Move the new database file into place (/var/lib/lxd/lxd.db) and all the containers showed the “stopped” state. I had to fix the configuration on a couple of the containers manually, but I am back in production with the machine.
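A rough sketch of that export and re-import, using Python’s built-in sqlite3 module instead of the sqlite3 shell. The function name and paths are placeholders, not anything LXD provides. It assumes the corrupt file can still be read far enough to be dumped, and that `new_path` does not exist yet.

```python
import sqlite3


def rebuild_database(corrupt_path, sql_path, new_path):
    """Dump corrupt_path to a .sql file, then load that file into a new database.

    Returns the integrity check result for the new database.
    new_path must not already exist, otherwise the CREATE statements will fail.
    """
    src = sqlite3.connect(corrupt_path)
    try:
        with open(sql_path, "w", encoding="utf-8") as fh:
            # iterdump() yields the schema and data as SQL statements.
            for statement in src.iterdump():
                fh.write(statement + "\n")
    finally:
        src.close()

    dst = sqlite3.connect(new_path)
    try:
        with open(sql_path, encoding="utf-8") as fh:
            dst.executescript(fh.read())
        dst.commit()
        # Indexes are rebuilt from the table data during the import.
        return [row[0] for row in dst.execute("PRAGMA integrity_check")]
    finally:
        dst.close()
```

Keep the original file around and only move the new one into place (with LXD stopped) once the returned check is `["ok"]`. Rows that could not be read from the corrupt file will simply be missing, which may be why some container configuration needs fixing by hand afterwards.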

serbik · January 17, 2018, 9:49pm

@dnoe

I’m seeing similar results…

sqlite> pragma integrity_check;

*** in database main ***
On tree page 11 cell 8: 2nd reference to page 82
On tree page 11 cell 5: Rowid 12475 out of order
On tree page 11 cell 5: 2nd reference to page 59
On tree page 55 cell 5: Rowid 12447 out of order
On tree page 52 cell 12: Rowid 12509 out of order
Page 85 is never used
row 5 missing from index sqlite_autoindex_storage_volumes_config_1
row 6 missing from index sqlite_autoindex_storage_volumes_config_1
row 7 missing from index sqlite_autoindex_storage_volumes_config_1
row 8 missing from index sqlite_autoindex_storage_volumes_config_1
row 9 missing from index sqlite_autoindex_storage_volumes_config_1
wrong # of entries in index sqlite_autoindex_storage_volumes_config_1
row 37 missing from index sqlite_autoindex_containers_config_1
row 43 missing from index sqlite_autoindex_containers_config_1
row 44 missing from index sqlite_autoindex_containers_config_1
row 51 missing from index sqlite_autoindex_containers_config_1
row 52 missing from index sqlite_autoindex_containers_config_1
row 53 missing from index sqlite_autoindex_containers_config_1
row 54 missing from index sqlite_autoindex_containers_config_1
Error: database disk image is malformed
sqlite>

dnoe · January 17, 2018, 9:51pm

That is pretty much what I was seeing. Now simply export the data into a .sql file. Create a new database, and import said .sql file into the new database. Save the database and close out, then move it into place on the LXD machine.

serbik · January 17, 2018, 10:03pm

Thanks - I’ve managed to recreate a few of the containers from new images, but I’ll try to get the original instance fixed with a reimport.

Did you suffer from power loss as well, or was it just a normal reboot?

dnoe · January 17, 2018, 10:13pm

I was attempting a normal reboot, but it ended up crashing and I had to manually kill the power.

After you reimport the data into a new database, run the integrity check to verify it passes with “OK”. I had a couple containers that had their configuration messed up after being restored, but those were easy fixes.

stgraber · January 17, 2018, 11:38pm

Ok, so looks like the on-disk database got corrupted due to a force reboot or power loss?

What underlying filesystem are you using on /var/lib/lxd? Not really sure what we can do about this, maybe there’s some sqlite3 option to force a hard sync of the DB more often to try and reduce the damage on disk in such cases?
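One SQLite setting that relates to this is `PRAGMA synchronous`, which controls how aggressively SQLite syncs to disk. Below is a small sketch in Python of setting and reading it on a connection. The helper names are made up, and this says nothing about how LXD itself opens its database. Many SQLite builds already default to FULL, so this may not change anything in practice.

```python
import sqlite3

# Numeric values reported by "PRAGMA synchronous".
SYNC_LEVELS = {0: "OFF", 1: "NORMAL", 2: "FULL", 3: "EXTRA"}


def open_with_full_sync(db_path):
    """Open db_path and request FULL synchronous mode for this connection."""
    conn = sqlite3.connect(db_path)
    # The setting is per connection and must be set outside a transaction.
    conn.execute("PRAGMA synchronous = FULL")
    return conn


def sync_level(conn):
    """Return the current synchronous mode of conn as a readable name."""
    (level,) = conn.execute("PRAGMA synchronous").fetchone()
    return SYNC_LEVELS.get(level, str(level))
```

Even with FULL, SQLite depends on the disk and filesystem actually honoring sync requests, so this can reduce the risk but not rule out damage after a hard power cut.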

dnoe · January 17, 2018, 11:42pm

I believe /var/lib/lxd is on an ext4 partition, on a RAID 10 setup if I remember right. I’ll update this tomorrow when I’m back in the office.

dnoe · January 18, 2018, 12:21am

I found a couple good reads related to sqlite and power loss. Not sure if forcing a sync more often would help or not.

http://www.sqlite.org/atomiccommit.html
https://www.sqlite.org/howtocorrupt.html

serbik · January 18, 2018, 6:58am

/var/lib/lxd is on an ext4 LV.

Those links suggest that sqlite is designed to withstand a power outage, but maybe the indexes are handled differently.

Lesson for me is to invest in a UPS and more frequent lxd.db backups
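A minimal sketch of what a scripted lxd.db backup could look like, using the online backup API exposed by Python’s sqlite3 module (Python 3.7+). The function name and paths are placeholders, and this is only one possible approach, not an LXD feature.

```python
import sqlite3


def backup_database(src_path, dest_path):
    """Copy src_path into dest_path with SQLite's backup API.

    Returns the integrity check result for the copy, so a bad backup
    is noticed when it is taken rather than when it is needed.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup() takes a consistent snapshot of the source.
        src.backup(dest)
        return [row[0] for row in dest.execute("PRAGMA integrity_check")]
    finally:
        dest.close()
        src.close()
```

Run from cron with a dated destination file name, this would avoid ending up with only a months-old lxd.db.bak.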

stgraber · January 28, 2018, 9:17pm

Sometimes a corrupted sqlite3 database can still be dumped with .dump. If that’s the case, you can then create a new database and load the text export of the old one.

stgraber · January 28, 2018, 9:18pm

@freeekanayaka something to keep in mind with the dqlite/clustering work, we need to document backup and disaster recovery for both a standalone non-clustered system and for a clustered setup.

Celina27 · March 28, 2019, 2:30pm

Hi,
I am having the same issue too.
Thanks dnoe for the links !

freeekanayaka · March 28, 2019, 3:57pm

Please provide more details: LXD version, clustered/not-clustered, etc

simos · March 28, 2019, 4:18pm

@freeekanayaka This Celina27 account is likely a spam account. The same username appears in http://stopforumspam.com/ipcheck/147.135.36.175