Error: database disk image is malformed

Hi guys,

After a power cut I’ve been trying to get my LXD containers up again, but I’m getting the following error when using ‘lxc start’:

root@server:/var/lib/lxd/containers# lxc start nc-backend
error: database disk image is malformed

What logs etc. can I provide to help look into this?

LXD version: 2.21-0ubuntu2~16.04.1~ppa amd64

root@server:/var/lib/lxd/containers# lxc list
+--------------+---------+------+------+------------+-----------+
|     NAME     |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+--------------+---------+------+------+------------+-----------+
| cmk          | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+
| gs-test      | RUNNING |      |      | PERSISTENT | 0         |
+--------------+---------+------+------+------------+-----------+
| mta          | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+
| mumble       | STOPPED |      |      | PERSISTENT | 0         |
+--------------+---------+------+------+------------+-----------+
| nc-backend   | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+
| nc-proxy     | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+
| pi-hole      | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+
| plex         | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+
| transmission | ERROR   |      |      | PERSISTENT |           |
+--------------+---------+------+------+------------+-----------+

root@server:/var/lib/lxd/containers# lxc storage list
+-------------+-------------+--------+----------------------------------------+---------+
|    NAME     | DESCRIPTION | DRIVER |                 SOURCE                 | USED BY |
+-------------+-------------+--------+----------------------------------------+---------+
| default     |             | zfs    | zmirror0/lxd                           | 5       |
+-------------+-------------+--------+----------------------------------------+---------+
| lxd-storage |             | zfs    | zmirror0/lxd-storage                   | 2       |
+-------------+-------------+--------+----------------------------------------+---------+
| ssd-storage |             | dir    | /var/lib/lxd/storage-pools/ssd-storage | 9       |
+-------------+-------------+--------+----------------------------------------+---------+

root@server:/var/lib/lxd/containers# sudo zfs list
NAME                                                                                           USED  AVAIL  REFER  MOUNTPOINT
zmirror0                                                                                       636G  2.01T    96K  none
zmirror0/kvm                                                                                  49.2G  2.01T    96K  none
zmirror0/kvm/images                                                                           49.2G  2.01T  49.2G  /var/lib/libvirt/images
zmirror0/lxd                                                                                  33.3G  2.01T    96K  none
zmirror0/lxd-storage                                                                           554G  2.01T    96K  none
zmirror0/lxd-storage/containers                                                                 96K  2.01T    96K  none
zmirror0/lxd-storage/custom                                                                    554G  2.01T    96K  none
zmirror0/lxd-storage/custom/media                                                              517G  2.01T   517G  /var/lib/lxd/storage-pools/lxd-storage/custom/media
zmirror0/lxd-storage/custom/nextcloud-data                                                    36.7G  2.01T  36.6G  /var/lib/lxd/storage-pools/lxd-storage/custom/nextcloud-data
zmirror0/lxd-storage/deleted                                                                    96K  2.01T    96K  none
zmirror0/lxd-storage/images                                                                     96K  2.01T    96K  none
zmirror0/lxd-storage/snapshots                                                                  96K  2.01T    96K  none
zmirror0/lxd/containers                                                                       32.6G  2.01T    96K  none
zmirror0/lxd/containers/nc-backend                                                            32.6G  2.01T  3.11G  /var/lib/lxd/storage-pools/default/containers/nc-backend
zmirror0/lxd/deleted                                                                           649M  2.01T    96K  none
zmirror0/lxd/deleted/images                                                                    649M  2.01T    96K  none
zmirror0/lxd/deleted/images/7a7ff654cbd8f5f09bec03aa19d8d7d92649127d18659036a963b1ea63f90d25   649M  2.01T   649M  none
zmirror0/lxd/images                                                                             96K  2.01T    96K  /var/lib/lxd/images
zmirror0/lxd/snapshots                                                                         288K  2.01T    96K  none
zmirror0/lxd/snapshots/nc-backend                                                               96K  2.01T    96K  none
zmirror0/lxd/snapshots/nextcloud01                                                              96K  2.01T    96K  none


root@server:/var/lib/lxd/containers# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
udev                                        7.8G     0  7.8G   0% /dev
tmpfs                                       1.6G  8.9M  1.6G   1% /run
/dev/mapper/host01--vg-root                  94G   21G   70G  23% /
tmpfs                                       7.9G     0  7.9G   0% /dev/shm
tmpfs                                       5.0M     0  5.0M   0% /run/lock
tmpfs                                       7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sda2                                   473M  341M  108M  77% /boot
/dev/sda1                                   511M  3.4M  508M   1% /boot/efi
zmirror0/kvm/images                         2.1T   50G  2.1T   3% /var/lib/libvirt/images
zmirror0/lxd/images                         2.1T  128K  2.1T   1% /var/lib/lxd/images
cgmfs                                       100K     0  100K   0% /run/cgmanager/fs
tmpfs                                       100K     0  100K   0% /var/lib/lxd/shmounts
tmpfs                                       100K     0  100K   0% /var/lib/lxd/devlxd
tmpfs                                       1.6G     0  1.6G   0% /run/user/1000
zmirror0/lxd/containers/nc-backend          2.1T  3.2G  2.1T   1% /var/lib/lxd/storage-pools/default/containers/nc-backend
zmirror0/lxd-storage/custom/media           2.6T  518G  2.1T  21% /var/lib/lxd/storage-pools/lxd-storage/custom/media
zmirror0/lxd-storage/custom/nextcloud-data  2.1T   37G  2.1T   2% /var/lib/lxd/storage-pools/lxd-storage/custom/nextcloud-data

Hmm, nothing very obvious. So /var/lib/lxd is stored on your vg-root?
Are you maybe running out of inodes? (df -i)

If not, then maybe it’s some sqlite locking issue. It’s a confusing error considering that read access appears fine; if the database were completely broken, the container list would have failed too.

Hi stgraber,

/var/lib/lxd is on my vg-root; here are the inodes:

root@server:~# df -i
Filesystem                      Inodes  IUsed      IFree IUse% Mounted on
udev                           2043898    524    2043374    1% /dev
tmpfs                          2048963    776    2048187    1% /run
/dev/mapper/host01--vg-root    6205584 744804    5460780   13% /
tmpfs                          2048963      1    2048962    1% /dev/shm
tmpfs                          2048963      5    2048958    1% /run/lock
tmpfs                          2048963     18    2048945    1% /sys/fs/cgroup
/dev/loop0                       13648  13648          0  100% /snap/core/3748
/dev/loop1                        1103   1103          0  100% /snap/lxd/5408
/dev/sda2                       124928    324     124604    1% /boot
/dev/sda1                            0      0          0     - /boot/efi
zmirror0/kvm/images         4321235018     10 4321235008    1% /var/lib/libvirt/images
cgmfs                          2048963     14    2048949    1% /run/cgmanager/fs
tmpfs                          2048963      4    2048959    1% /run/user/1000

As you can see from the above, I was about to try starting over with LXD (from the snap) and import my existing containers, if that’s possible - or even just the storage pools?

I’ve kept the old /var/lib/lxd, however, in case you’d like to continue troubleshooting what is possibly a sqlite locking issue.

Thanks very much,

What does ls -lah /var/lib/lxd show you?
I wonder if there are some lock files or something that sqlite would create…

I’m assuming you’ve tried rebooting the system to make sure that it wasn’t some lock being held by some running process?

@stgraber Yeah, I’ve tried a few reboots; same results and the same containers showing as ERROR.

root@server:~# ls -lah /var/lib/lxd/
total 232K
drwxr-xr-x 13 root root 4.0K Jan 16 21:07 .
drwxr-xr-x 50 root root 4.0K Jan 16 21:09 ..
drwx--x--x  2 root root 4.0K Nov 19 02:52 containers
drwx--x--x 11 root root 4.0K Nov 19 02:52 devices
drwxr-xr-x  2 root root 4.0K Jul  5  2017 devlxd
drwx------  2 root root 4.0K Aug 17 02:27 disks
drwx------  2 root root 4.0K Nov 28 07:32 images
-rw-r--r--  1 root root 100K Jan 16 21:03 lxd.db
-rw-r--r--  1 root root  68K Aug 17 01:05 lxd.db.bak
drwx--x--x  2 root root 4.0K Aug 17 01:05 networks
drwx------  4 root root 4.0K Jul  5  2017 security
-rw-r--r--  1 root root 1.9K Jul  5  2017 server.crt
-rw-------  1 root root 3.2K Jul  5  2017 server.key
drwx--x--x  2 root root 4.0K Jul  5  2017 shmounts
drwx------  2 root root 4.0K Dec 12 23:52 snapshots
drwxr-xr-x  3 root root 4.0K Oct  3 19:16 storage-data
drwx--x--x  5 root root 4.0K Oct  3 19:18 storage-pools
srw-rw----  1 root lxd     0 Jan 16 20:25 unix.socket

I am also facing this issue. Everything looks ok, but over half my containers are stuck in that error state.

I’ve noticed some really odd behavior. I was updating the system and rebooting multiple times, and I noticed different containers coming and going from the error state on different reboots.

Meaning a container that was marked as an error was simply stopped after a reboot, while one that was stopped was then marked as being in an error state. It’s really got me puzzled.

I’ve downloaded the LXD database (/var/lib/lxd/lxd.db) and ran an integrity check. It came back with quite a few rows missing from the “sqlite_autoindex_containers_config_1” index. I wonder if just rebuilding the indexes would resolve this.
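
For reference, this is roughly how I ran the check (assuming the sqlite3 command-line tool is installed; I worked on a copy of the file to be safe, and the paths are just examples):

cp /var/lib/lxd/lxd.db /root/lxd.db.copy
sqlite3 /root/lxd.db.copy "PRAGMA integrity_check;"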

ADDED at a later time:
I just compared the database file to one that isn’t corrupt, and that one’s integrity check comes back OK, so I’m pretty sure that’s the problem; I’m just not sure if there’s a fix or anything.

I think I resolved this issue. I exported the data and re-imported it into a new database. After replacing the corrupted DB, all the containers were showing the “stopped” state. I am still having some issues, but those are unrelated to the corrupted database.

@dnoe - were you getting the same error on lxc start?

I still have the original db, but my only backup is a few months old, if it’s of any use for testing to find the cause.

Yes. The exact same error. I have finally resolved all the issues, but that was a complete pain.

Take the corrupt database and export the data to an .sql file. Create a new database, then import the data from the .sql file. Once I moved the new database file into place (/var/lib/lxd/lxd.db), all the containers showed the “stopped” state. I had to fix the configuration on a couple of the containers manually, but I am back in production with the machine.
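
Roughly, the steps looked like this (paths are just examples, and the exact service/unit name for stopping LXD may differ on your setup):

# keep the corrupt original aside
cp /var/lib/lxd/lxd.db /root/lxd.db.corrupt
# dump whatever sqlite can still read to plain SQL
sqlite3 /root/lxd.db.corrupt ".dump" > /root/lxd.sql
# if the dump ends with "ROLLBACK; -- due to errors", change that last line to "COMMIT;" before importing
sqlite3 /root/lxd-new.db < /root/lxd.sql
# verify the rebuilt database; this should print "ok"
sqlite3 /root/lxd-new.db "PRAGMA integrity_check;"
# stop the LXD daemon before swapping the file in (unit names vary between installs)
systemctl stop lxd
cp /root/lxd-new.db /var/lib/lxd/lxd.db
systemctl start lxd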

@dnoe

I’m seeing similar results…

sqlite> pragma integrity_check;

*** in database main ***
On tree page 11 cell 8: 2nd reference to page 82
On tree page 11 cell 5: Rowid 12475 out of order
On tree page 11 cell 5: 2nd reference to page 59
On tree page 55 cell 5: Rowid 12447 out of order
On tree page 52 cell 12: Rowid 12509 out of order
Page 85 is never used
row 5 missing from index sqlite_autoindex_storage_volumes_config_1
row 6 missing from index sqlite_autoindex_storage_volumes_config_1
row 7 missing from index sqlite_autoindex_storage_volumes_config_1
row 8 missing from index sqlite_autoindex_storage_volumes_config_1
row 9 missing from index sqlite_autoindex_storage_volumes_config_1
wrong # of entries in index sqlite_autoindex_storage_volumes_config_1
row 37 missing from index sqlite_autoindex_containers_config_1
row 43 missing from index sqlite_autoindex_containers_config_1
row 44 missing from index sqlite_autoindex_containers_config_1
row 51 missing from index sqlite_autoindex_containers_config_1
row 52 missing from index sqlite_autoindex_containers_config_1
row 53 missing from index sqlite_autoindex_containers_config_1
row 54 missing from index sqlite_autoindex_containers_config_1
Error: database disk image is malformed
sqlite>

That is pretty much what I was seeing. Now simply export the data into a .sql file. Create a new database, and import said .sql file into the new database. Save the database and close out, then move it into place on the LXD machine.

Thanks - I’ve managed to recreate a few of the containers from new images, but I’ll try to get the original instance fixed with a re-import.

Did you suffer from power loss as well, or was it just a normal reboot?

I was attempting a normal reboot, but it ended up crashing and I had to manually kill the power.

After you reimport the data into a new database, run the integrity check to verify it passes with “OK”. I had a couple containers that had their configuration messed up after being restored, but those were easy fixes.

OK, so it looks like the on-disk database got corrupted due to a forced reboot or power loss?

What underlying filesystem are you using on /var/lib/lxd? I’m not really sure what we can do about this; maybe there’s some sqlite3 option to force a hard sync of the DB more often, to try to reduce the on-disk damage in such cases?
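
For context, the kind of knob I’m thinking of is sqlite’s synchronous pragma, which the application sets when it opens the database. Purely as an illustration (I haven’t checked what LXD sets today):

PRAGMA synchronous = FULL;   -- fsync at the critical points of each transaction (EXTRA is stricter still in newer sqlite releases)
PRAGMA journal_mode = WAL;   -- write-ahead logging, a different durability/performance trade-off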

I believe /var/lib/lxd is on an ext4 partition, on a RAID 10 setup if I remember right. I’ll update this tomorrow when I’m back in the office.

I found a couple of good reads related to sqlite and power loss. I’m not sure if forcing a sync more often would help or not.


http://www.sqlite.org/atomiccommit.html
https://www.sqlite.org/howtocorrupt.html

/var/lib/lxd is on an ext4 LV.

Those suggest that sqlite is designed to withstand a power outage, but maybe the indexes are a different story.

The lesson for me is to invest in a UPS and take more frequent lxd.db backups.
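
Even a simple nightly copy would probably have saved me here, e.g. (assuming the sqlite3 CLI is installed; .backup uses sqlite’s online backup API, so it should give a consistent copy even while LXD is running, and the target path is just an example):

sqlite3 /var/lib/lxd/lxd.db ".backup /root/lxd.db.$(date +%F)"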

Sometimes a corrupted sqlite3 database may still be dumped with .dump. If that’s the case, you can then create a new database and load the text export of the old one into it.
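
i.e. something along the lines of (the output path is just an example):

sqlite3 /var/lib/lxd/lxd.db ".dump" | sqlite3 /root/lxd-recovered.db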

@freeekanayaka something to keep in mind with the dqlite/clustering work: we need to document backup and disaster recovery for both a standalone non-clustered system and for a clustered setup.