Hi all,
I am using lxd 4.14 on Debian 11, installed via snap. When creating a VM and setting a readonly mount of a host path, then the VM fails to start:
root@debian:~# lxc init ubuntu:20.04 vm --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating vm
root@debian:~# lxc config device add vm srv disk source=/srv path=/srv readonly=true
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Device srv added to vm
root@debian:~# lxc start vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Error: Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/20450/bin/qemu-system-x86_64 -S -name vm -uuid baff5fe7-d700-49e2-9861-60f510506059 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-reboot -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/vm/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/vm/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/vm/qemu.pid -D /var/snap/lxd/common/lxd/logs/vm/qemu.log -chroot /var/snap/lxd/common/lxd/virtual-machines/vm -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: char device redirected to /dev/pts/0 (label console)
: Process exited with a non-zero value
Try `lxc info --show-log vm` for more info
root@debian:~# lxc info --show-log vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Name: vm
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/05/24 18:43 UTC
Status: Stopped
Type: virtual-machine
Profiles: default
Log:
qemu-system-x86_64:/var/snap/lxd/common/lxd/logs/vm/qemu.conf:338: cannot initialize fsdev 'lxd_srv': failed to open '/var/lib/snapd/hostfs/srv': Permission denied
Is this a bug in lxd or am I doing something wrong? Any help is appreciated.
Thank you Thomas!
Just looked at the bug report and have a small correction (not sure whether it matters): the host where this error occurs is running Debian 11. The VM guest is Ubuntu 20.04.
Is it possible, as a workaround until the fixed version of lxd is released, to force in the mount configuration that the 9p driver (instead of virtio-fs) should be used? How would this be done?
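(For experimentation only: as the replies below explain, forcing 9p doesn't actually close the hole while the virtiofs share is still exported. Assuming the share tag is lxd_srv, matching the fsdev name in the qemu.conf error above, a manual 9p mount inside the guest might look like this sketch:)

```shell
# Inside the VM: replace whatever lxd-agent mounted with an explicit
# 9p mount of the share. The tag "lxd_srv" is taken from the error
# log above; adjust it if your device is named differently.
umount /srv 2>/dev/null
mount -t 9p -o trans=virtio,version=9p2000.L,ro lxd_srv /srv
```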
I'd need the host to have write access to the path, but the VM to have only read-only access. So I tried a bind mount of the host path /srv with option 'ro' to a new path /srv_ro, and then specified /srv_ro as the source argument in the lxd configuration. Now the VM starts up fine, but it still has write access to /srv_ro, ignoring the read-only mount option, so this workaround does not work for me.
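(For what it's worth, on Linux a read-only bind mount generally needs two steps: the ro option is silently ignored on the initial bind and only takes effect on a remount. In case that was the issue on the host side, a sketch:)

```shell
# Step 1: create the bind mount (the "ro" option would be ignored here).
mount --bind /srv /srv_ro
# Step 2: remount the bind mount read-only; this is where "ro" sticks.
mount -o remount,bind,ro /srv_ro
```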
@stgraber: I tried your suggestion but could not get it working. I issued:
printf '/var/lib/snapd/hostfs/srv rw,' | lxc config set vm raw.apparmor -
but I get the same error message when starting the VM, and the log complains about the very same path not being accessible. It looks like my raw.apparmor setting gets ignored.
I can confirm that this works in that it makes the VM start up without an error. (Though I don't understand why giving more AppArmor permissions breaks it.)
However, the read-only protection of the path is easy to circumvent in the VM: If I issue
mount -o remount,rw /srv
in the VM, the VM will still be able to write to /srv. It appears that all the readonly flag does is signal to the VM that it should mount the path read-only. That means the VM is in control, but for security reasons I need the host to be in control.
Short version:
I can't think of a workaround that will work properly, and you're going to need to wait until that patch lands in the snap channel you're using.
Long version:
When we originally added directory sharing support we only used 9p sharing, which supports a readonly property. However, because we run the QEMU process as non-root, this prevented sharing directories not accessible to the unprivileged user we run QEMU as. To work around this, QEMU provides the virtfs-proxy-helper process, which we can start as root; the QEMU process then uses it to access the directory on the host.
However, we discovered whilst adding support for this that there was a bug in QEMU: when using the virtfs-proxy-helper process, the readonly property was ignored. We filed a bug upstream (which has subsequently been fixed), and then added an exception to LXD that avoided using virtfs-proxy-helper when readonly=true. This meant that only globally accessible directories could be shared as readonly, but at least it wasn't a security issue.
The problem with 9p, though, is that it is not particularly performant, so later we added optional support for virtiofs. This uses a different proxy process called virtiofsd, which runs as root to allow access to directories not owned by the unprivileged user that the QEMU process runs as.
When virtiofsd is available we launch both it and the 9p virtfs-proxy-helper, so the guest can use either. The lxd-agent process running inside the VM prefers the virtiofs share, and if that doesn't work (either due to lack of support on the host or inside the guest OS) it falls back to the 9p share.
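That fallback order can be sketched roughly in shell (this is not the actual lxd-agent code, which is Go; the share tag lxd_srv is taken from the log above and is just an illustration):

```shell
# Prefer the virtiofs transport; if it is unavailable on the host or
# unsupported by the guest kernel, fall back to 9p over virtio.
mount -t virtiofs lxd_srv /srv || \
  mount -t 9p -o trans=virtio,version=9p2000.L lxd_srv /srv
```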
The lxd-agent process mounts the share as ro if readonly=true, but we have always been clear that this is not a security feature, just a nicety to help indicate to the guest that this is a read-only directory.
Later we also added AppArmor support to the QEMU process (but not the proxy processes). This is what prevented starting the VM when using a readonly=true share: because, as described above, we weren't using the proxy process for read-only shares, QEMU tried to access a directory that wasn't allowed by its AppArmor profile. As an aside, the AppArmor rules have to exactly match the type of file open operation that is occurring (i.e. allowing rw access doesn't allow r-only access).
Because the upstream bug in virtfs-proxy-helper that meant the readonly setting was not respected is now fixed, yesterday I switched our 9p share to always use virtfs-proxy-helper. In the process I discovered two more bugs. Firstly, the virtiofs share doesn't have support for readonly in QEMU, so although the 9p share was correctly being started in readonly mode, the virtiofs share was still writeable (this is what I think you're finding now when remounting the share as rw). Secondly, a race condition in lxd-agent when loading the vsock kernel module was causing lxd-agent to exit during the boot process (and be restarted by systemd), with the effect of initially mounting the share using virtiofs and then, on the subsequent restart, attempting to mount it again as virtiofs (and failing) and falling back to mounting it as 9p (causing a second mount over the same path). However, this behaviour was dependent on how quickly the vsock kernel module loaded.
Because in the released LXD versions, the virtiofs share is available (even if not mounted), when you modify the raw.apparmor setting to allow the QEMU process to start when directly accessing the 9p share, someone who is root inside the guest can then choose to mount the virtiofs share and would then be able to write to the readonly share.
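To illustrate why the raw.apparmor workaround reopens the hole: virtiofs shares are mounted by tag, so root in the guest can simply mount the still-exported virtiofs share somewhere else. A sketch (lxd_srv is the illustrative tag from the log above):

```shell
# As root inside the guest: mount the virtiofs share, which in the
# released versions does not honour readonly, at a fresh mountpoint.
mkdir -p /mnt/writable
mount -t virtiofs lxd_srv /mnt/writable
# /mnt/writable is now a writable view of the "read-only" share.
```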
As this is a somewhat complex dance between LXD, QEMU and the lxd-agent, I've added some automated tests above to try and catch any future regressions in all 3 of the subsystems.
Thank you also for the full explanation. I think I understand now: 9p supports read-only access, but virtiofs always provides read-write access, so lxd should use 9p for proper read-only access. Due to a bug in an underlying component, lxd uses virtiofs in this case anyway. By default this fails on AppArmor (so AppArmor enforces the read-only security), but when I set raw.apparmor, I override the AppArmor profile so it can use virtiofs and thus provides full read-write access to the VM. Finally, the lxd-agent sets the mount flag ro in the VM, but being a VM process it can't, of course, enforce security.
I take it then that there isn't really a workaround.
I am willing to help with testing if that helps. So, when there is a snap channel with the fixes included that I can try, please let me know.