SQLite3 virtiofs mmap

Hi all,

I am encountering a strange and unexpected issue, first triggered by a docker-compose file in a debian/12 VM, but below I provide a script that reproduces the error in a much lighter context.

Basically, an SQLite3 database called prowlarr.db (generated by the aforementioned docker-compose file) cannot be properly opened when it is located on a custom volume mounted via virtiofs. In particular, an attempt at dumping it fails, producing an ERROR: (10) disk I/O error message. However, there is no issue (including when dumping) with the very same prowlarr.db database when it is located (as a copy) on the root / filesystem of the VM (e.g. in ~root/).

According to strace, an mmap system call on the temporarily generated file prowlarr.db-shm fails, but I am unable to pin down the issue further. Hence my asking for help.
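
(For what it is worth, the prowlarr.db-shm file only exists because the database uses WAL journal mode, which SQLite accesses through a shared memory mapping. As a sketch, one way to confirm the journal mode on a local copy of the database is:)

# check the journal mode of the problematic database (on a local copy)
sqlite3 prowlarr.db "PRAGMA journal_mode;"
# presumably reports "wal", which is what creates the -shm file during .dump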

Strangely, this error is not triggered by every database (for instance, dumping works in all cases for the sample.db database downloaded below). Hence, I needed to export this specific prowlarr.db to a GitHub repository.

The script below shows how to reproduce the error.

Thanks in advance for considering this issue.

## inside the host running incusd

# create a custom volume on a pre-existing storage pool called "incus_raid5"
incus storage volume create incus_raid5 sqlite

# assuming the default profile gives internet access to the vm
# create a VM test-sqlite and attach "sqlite" custom storage volume
incus create images:debian/12 test-sqlite --vm
incus storage volume attach incus_raid5 sqlite test-sqlite /mnt

# start the VM
incus start test-sqlite

# wait until VM agent is listening
incus shell test-sqlite

## now inside the test-sqlite vm

# install required software
apt install -y sqlite3 nano wget strace

# download 2 example databases to ~root
cd
wget -c "https://github.com/codecrafters-io/sample-sqlite-databases/raw/refs/heads/master/sample.db"
wget -c "https://github.com/DmxLarchey/failing-db/raw/refs/heads/main/prowlarr.db"

# and copy them to /mnt/sqlite
mkdir /mnt/sqlite
cp sample.db prowlarr.db /mnt/sqlite

# dump sample.db and prowlarr.db in ~root, both succeed
cd 
echo .dump | sqlite3 sample.db
echo .dump | sqlite3 prowlarr.db

# dump sample.db on /mnt/sqlite (success)
cd /mnt/sqlite
echo .dump | sqlite3 sample.db

# dump prowlarr.db on /mnt/sqlite (failure)
cd /mnt/sqlite
echo .dump | sqlite3 prowlarr.db

# fails with a disk I/O error (related to the mmap failure, according to strace)
echo .dump | strace sqlite3 prowlarr.db

The usual checks would be to look at dmesg and make sure that the drive isn’t somehow full.
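
For example:

# any virtiofs / FUSE related errors on the guest side?
dmesg | tail -n 50
# is either the share or the root filesystem running out of space?
df -h / /mnt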

Then the next potential issues would be around:

  • What’s the filesystem backing your incus_raid5 pool?
  • What filesystem is the VM using to access the shared filesystem (virtiofs or 9p)?

That’s because it could be a locking/mapping related issue with virtiofs/9p in the VM, with virtiofsd/qemu on the host, or with the underlying filesystem itself.

So you can try some permutations here by using io.bus on the disk device to force it into 9p or virtiofs mode and trying both. You can also create a storage pool using a different storage driver and try a volume from that pool to see if it changes the behavior.
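
For example, something along these lines (device name sqlite taken from your attach command above; a restart is likely needed for the bus change to take effect):

# switch the share to 9p, test the dump, then switch back to virtiofs
incus config device set test-sqlite sqlite io.bus=9p
incus restart test-sqlite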

Thank you very much for already considering the issue I am raising.

Some partial answers:

  • dmesg does not show any information/error when running the .dump on the database;
  • the incus_raid5 pool is a zfs storage pool created by the following commands (FYI):
    • the evo970 LVM VG is composed of 4 partitions on 4 different NVMe devices;
    • lvcreate -n incus_raid5 --type raid5 -i 3 -L 384GiB evo970;
    • incus storage create incus_raid5 zfs source=/dev/evo970/incus_raid5;
  • the filesystem used is virtiofs, running the command mount in the VM gives:
    • ... incus_sqlite on /mnt type virtiofs (rw,relatime) ...

As you suggested, I tried the io.bus=9p option for the sqlite custom disk device, but it is no longer automounted. There is a message virtio-fs: tag <incus_sqlite> not found early in dmesg (I guess this is to be expected). I did check that the 9p tag incus_sqlite can indeed be found in /sys/bus/virtio/drivers/9pnet_virtio/virtio9/mount_tag.

However, the command mount -t 9p incus_sqlite /mnt fails this way:

root@test-sqlite:~# mount -t 9p incus_sqlite /mnt/ -o trans=virtio
mount: /mnt: bad option; for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program.
       dmesg(1) may have more information after failed mount system call.

and dmesg shows no new information. The same happens if I simply run mount -t 9p incus_sqlite /mnt/ instead, i.e. without the trans=virtio option.
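
For completeness, here is what I plan to try next on the 9p side (pure guesswork on my part; the options below are generic 9p mount options, nothing specific to Incus):

# is the 9p filesystem type known to the kernel, and is the virtio transport loaded?
grep 9p /proc/filesystems
lsmod | grep 9p
# retry with an explicit protocol version and message size
mount -t 9p -o trans=virtio,version=9p2000.L,msize=262144 incus_sqlite /mnt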

I will now try a different storage pool driver…

Thanks again

As a complement, I tried two other storage pool drivers, namely the dir and lvm storage drivers, in addition to the initial choice of zfs. I get the same error message.

[edit] Additionally, I also tried mounting a sub-directory of the host directly in the VM. It fails as well. So it seems to me the issue lies in virtiofs.

[edit2] I tried in a container (instead of a --vm) and this time, the SQL dump works! Unfortunately, I need a VM to properly run my docker-compose stack. So maybe the issue lies in the combination of VM & virtiofs?

[edit3] With the options io.cache=unsafe io.bus=virtiofsd, the error goes away, including in a VM. So I have a kind of workaround.
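
For the record, the workaround boils down to something like this (device name sqlite taken from the attach command at the top of this topic):

# apply the cache option to the existing disk device, then restart the VM
incus config device set test-sqlite sqlite io.cache=unsafe
incus restart test-sqlite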

I discovered this io.cache option by inspecting how virt-manager/virsh sets up virtiofsd. Under a virsh VM, the default cache policy is always (which means unsafe).

I was also able to mount using 9p in the virsh VM, but then the dump error gets triggered again. I do not know what the default cache policy for 9p is in virsh.

I am still trying to figure out how to mount the 9p tag in an incus VM.

Okay, you probably want to stick with virtiofs whenever possible for performance anyway.

The fact that io.cache=unsafe fixed it for you is good to know and likely the right path forward in this scenario. VirtioFS is a very odd filesystem, basically using shared memory with the host and using the kernel’s FUSE implementation to handle filesystem operations.

It can bypass the VFS cache to avoid double caching of files, but to do that safely, it needs to know when either side (guest or host) may be modifying the files. I suspect that the way sqlite3 maps the file makes it hard or impossible for virtiofsd to keep track of that, so it simply fails the operation. Using unsafe mode makes cache inconsistency possible if multiple writers alter the same file, but depending on the intended use, that may be perfectly acceptable.
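
If you want to confirm that it is indeed the shared mapping of the -shm file that gets refused, narrowing the strace output down to the relevant calls should show it, something like:

# only trace the open/mmap calls and keep the lines touching the -shm file
echo .dump | strace -f -e trace=openat,mmap sqlite3 prowlarr.db 2>&1 | grep -E 'db-shm|MAP_SHARED'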

(You can also run into this kind of situation when dealing with network filesystems like NFS/CIFS. File locking and direct mapping of files are always very tricky in those kinds of environments.)

I am going to move the database to local storage of the VM for exclusive use, and keep the less critical data on shared storage, which might not even need io.cache=unsafe.

SQLite3 seems to have a framework for concurrent access, but maybe it is not compatible with concurrency over a shared or network filesystem; see e.g. SQLite Over a Network, Caveats and Considerations.

Interesting.

Yesterday, I attempted to install MariaDB 11.4 in a VM with a filesystem-type volume mounted at /var/lib/mysql. The installation failed with the obscure error:

Can't init tc log

Debugging with strace led me to the following output:

openat(AT_FDCWD, "./tc.log", O_RDWR|O_CLOEXEC) = 18
lseek(18, 0, SEEK_END)                  = 24576
mmap(NULL, 24576, PROT_READ|PROT_WRITE, MAP_SHARED, 18, 0) = -1 ENODEV (No such device)
close(18)

From what I understand, this suggests that the underlying filesystem (VirtioFS) does not support shared writable memory mappings: the mmap call above uses MAP_SHARED with PROT_WRITE and fails with ENODEV.

To resolve this, any of the following workarounds helped:

  • Switching to a block-type volume
  • Moving the transaction coordinator log to the VM’s root filesystem using --log-tc (see the sketch after this list)
  • Activating binary logging with --log-bin
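
For the second workaround, a minimal sketch assuming the stock Debian packaging (the drop-in config path and the /var/log/mysql location are my assumptions; adjust to your layout):

# keep the memory-mapped tc.log off the virtiofs share
cat > /etc/mysql/mariadb.conf.d/60-tclog.cnf <<'EOF'
[mysqld]
log_tc = /var/log/mysql/tc.log
EOF
systemctl restart mariadb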

Out of curiosity, I’ve just tested installing MariaDB with io.cache=unsafe, and it worked too. Good to know that this provides another potential workaround.