Hi all,
I am using lxd 4.14 on Debian 11, installed via snap. When creating a VM and setting a readonly mount of a host path, then the VM fails to start:
root@debian:~# lxc init ubuntu:20.04 vm --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating vm
root@debian:~# lxc config device add vm srv disk source=/srv path=/srv readonly=true
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Device srv added to vm
root@debian:~# lxc start vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Error: Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/20450/bin/qemu-system-x86_64 -S -name vm -uuid baff5fe7-d700-49e2-9861-60f510506059 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-reboot -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/vm/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/vm/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/vm/qemu.pid -D /var/snap/lxd/common/lxd/logs/vm/qemu.log -chroot /var/snap/lxd/common/lxd/virtual-machines/vm -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/0 (label console)
: Process exited with a non-zero value
Try `lxc info --show-log vm` for more info
root@debian:~# lxc info --show-log vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Name: vm
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/05/24 18:43 UTC
Status: Stopped
Type: virtual-machine
Profiles: default
Log:
qemu-system-x86_64:/var/snap/lxd/common/lxd/logs/vm/qemu.conf:338: cannot initialize fsdev 'lxd_srv': failed to open '/var/lib/snapd/hostfs/srv': Permission denied
Is this a bug in lxd or am I doing something wrong? Any help is appreciated.
Thank you Thomas!
Just looked at the bug report and have a small correction (not sure whether it matters): the host where this error occurs is running Debian 11. The VM guest is Ubuntu 20.04.
Is it possible, as a workaround until the fixed version of lxd is released, to force in the mount configuration that the 9p driver (instead of virtio-fs) should be used? How would this be done?
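(For experimentation only: as the replies below explain, forcing 9p doesn't actually close the hole while the virtiofs share is still exported. Assuming the share tag is lxd_srv, matching the fsdev name in the qemu.conf error above, a manual 9p mount inside the guest might look like this sketch:)

```shell
# Inside the VM: replace whatever lxd-agent mounted with an explicit
# 9p mount of the share. The tag "lxd_srv" is taken from the error
# log above; adjust it if your device is named differently.
umount /srv 2>/dev/null
mount -t 9p -o trans=virtio,version=9p2000.L,ro lxd_srv /srv
```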
I'd need the host to have write access to the path, but the VM to have only read-only access. So I tried a bind mount of the host path /srv with option 'ro' to a new path /srv_ro, and then specified /srv_ro as the source argument in the lxd configuration. Now the VM starts up fine, but it still has write access to /srv_ro, ignoring the read-only mount option, so this workaround does not work for me.
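(For what it's worth, on Linux a read-only bind mount generally needs two steps: the ro option is silently ignored on the initial bind and only takes effect on a remount. In case that was the issue on the host side, a sketch:)

```shell
# Step 1: create the bind mount (the "ro" option would be ignored here).
mount --bind /srv /srv_ro
# Step 2: remount the bind mount read-only; this is where "ro" sticks.
mount -o remount,bind,ro /srv_ro
```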
@stgraber: I tried your suggestion but could not get it working. I issued:
printf '/var/lib/snapd/hostfs/srv rw,' | lxc config set vm raw.apparmor -
but I get the same error message when starting the VM, and the log complains about the very same path not being accessible. It looks like my raw.apparmor setting gets ignored.
I can confirm that this works in that it makes the VM start up without an error. (Though I don't understand why giving more AppArmor permissions breaks it.)
However, the read-only protection of the path is easy to circumvent in the VM: If I issue
mount -o remount,rw /srv
in the VM, the VM will still be able to write to /srv. It appears that all the readonly flag does is signal to the VM that it should mount the path read-only. That means the VM is in control, but for security reasons I need the host to be in control.
Short version:
I can't think of a workaround that will work properly, and you're going to need to wait until that patch lands in the snap channel you're using.
Long version:
When we originally added directory sharing support we only used 9p sharing, which supports a readonly property. However, because we run the QEMU process as non-root, this prevented sharing directories not accessible to the unprivileged user we run QEMU as. To work around this, QEMU provides the virtfs-proxy-helper process, which we can start as root; the QEMU process then uses it to access the directory on the host.
However, we discovered whilst adding support for this that there was a bug in QEMU: when using the virtfs-proxy-helper process, the readonly property was ignored. We filed a bug upstream (which has subsequently been fixed), and then added an exception to LXD that avoided using virtfs-proxy-helper when readonly=true. This meant that only globally accessible directories could be shared as readonly, but at least it wasn't a security issue.
The problem with 9p, though, is that it is not particularly performant, so later we added optional support for virtiofs. This uses a different proxy process called virtiofsd, which runs as root to allow access to directories not owned by the unprivileged user that the QEMU process runs as.
When virtiofsd is available we launch both it and the 9p virtfs-proxy-helper, so the guest can use either. The lxd-agent process running inside the VM prefers the virtiofs share, and if that doesn't work (either due to lack of support on the host or inside the guest OS) it falls back to the 9p share.
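That fallback order can be sketched roughly in shell (this is not the actual lxd-agent code, which is Go; the share tag lxd_srv is taken from the log above and is just an illustration):

```shell
# Prefer the virtiofs transport; if it is unavailable on the host or
# unsupported by the guest kernel, fall back to 9p over virtio.
mount -t virtiofs lxd_srv /srv || \
  mount -t 9p -o trans=virtio,version=9p2000.L lxd_srv /srv
```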
The lxd-agent process mounts the share as ro if readonly=true, but we have always been clear that this is not a security feature, just a nicety to help indicate to the guest that this is a read-only directory.
Later we also added AppArmor support to the QEMU process (but not the proxy processes). This is what prevented starting the VM when using a readonly=true share: because, as described above, we weren't using the proxy process for read-only shares, QEMU tried to access a directory that wasn't allowed by its AppArmor profile. As an aside, the AppArmor rules have to exactly match the type of file open operation that is occurring (i.e. allowing rw access doesn't allow r-only access).
Because the upstream bug in virtfs-proxy-helper that meant the readonly setting was not respected is now fixed, yesterday I switched our 9p share to always use virtfs-proxy-helper. In the process I discovered two more bugs. Firstly, the virtiofs share doesn't have support for readonly in QEMU, so although the 9p share was correctly being started in readonly mode, the virtiofs share was still writeable (this is what I think you're finding now when remounting the share as rw). Secondly, a race condition in lxd-agent when loading the vsock kernel module was causing lxd-agent to exit during the boot process (and be restarted by systemd), with the effect of initially mounting the share using virtiofs and then, on the subsequent restart, attempting to mount it again as virtiofs (and failing) and falling back to mounting it as 9p (causing a second mount over the same path). However, this behaviour was dependent on how quickly the vsock kernel module loaded.
Because in the released LXD versions, the virtiofs share is available (even if not mounted), when you modify the raw.apparmor setting to allow the QEMU process to start when directly accessing the 9p share, someone who is root inside the guest can then choose to mount the virtiofs share and would then be able to write to the readonly share.
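To illustrate why the raw.apparmor workaround reopens the hole: virtiofs shares are mounted by tag, so root in the guest can simply mount the still-exported virtiofs share somewhere else. A sketch (lxd_srv is the illustrative tag from the log above):

```shell
# As root inside the guest: mount the virtiofs share, which in the
# released versions does not honour readonly, at a fresh mountpoint.
mkdir -p /mnt/writable
mount -t virtiofs lxd_srv /mnt/writable
# /mnt/writable is now a writable view of the "read-only" share.
```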
As this is a somewhat complex dance between LXD, QEMU and the lxd-agent, I've added some automated tests above to try and catch any future regressions in all 3 of the subsystems.
Thank you also for the full explanation. I think I understand now: 9p supports read-only access, but virtiofs always provides read-write access, so lxd should use 9p for proper read-only access. Due to a bug in an underlying component, lxd uses virtiofs in this case anyway. By default this fails on AppArmor (so AppArmor enforces the read-only security), but when I set raw.apparmor, I override the AppArmor profile so it can use virtiofs and thus provides full read-write access to the VM. Finally, the lxd-agent sets the mount flag ro in the VM, but being a VM process it can't, of course, enforce security.
I take it then that there isn't really a workaround.
I am willing to help with testing if that helps. So, when there is a snap channel with the fixes included that I can try, please let me know.