NFS Ganesha in LXC

Shantur_Rathore · August 6, 2018, 10:43am

Hi,

I am trying to set up an nfs-ganesha server in an LXC container. It fails to run, and nfs-ganesha complains:

vfs_open_by_handle :FSAL :DEBUG :Failed with Function not implemented openflags 0x00200003

This translates to a missing open_by_handle_at function.
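For reference, "Function not implemented" is the standard error text for ENOSYS (38 on Linux). A small sketch that looks the symbolic name up from the message; the helper name errno_names_for is made up for illustration:

```python
import errno
import os


def errno_names_for(message):
    """Return the symbolic errno names whose strerror() text equals message."""
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


# The text nfs-ganesha logged for the failed vfs_open_by_handle call.
print(errno_names_for("Function not implemented"))
```

On a typical Linux system this should list ENOSYS, the code a kernel (or a seccomp filter) returns when a syscall is not available to the caller.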

https://github.com/nfs-ganesha/nfs-ganesha/blob/V2.6-stable/src/include/os/linux/fsal_handle_syscalls.h#L78


				    struct file_handle *handle, int *mnt_id,
				    int flags)
{
	return syscall(__NR_name_to_handle_at, mdirfd, name, handle, mnt_id,
		       flags);
}


static inline int open_by_handle_at(int mdirfd, struct file_handle *handle,
				    int flags)
{
	return syscall(__NR_open_by_handle_at, mdirfd, handle, flags);
}
#endif				/* MAX_HANDLE_SZ */


#ifndef O_PATH
#define O_PATH 010000000
#endif


#ifndef AT_EACCESS
#define AT_EACCESS 0x200
#endif
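A minimal sketch for checking, from inside the container, whether the two syscalls wrapped above are actually usable. It assumes a Linux system whose C library exports name_to_handle_at and open_by_handle_at (glibc does). The FileHandle layout mirrors struct file_handle with a fixed 128-byte buffer, and the names classify and probe are illustrative, not part of nfs-ganesha. The errno-to-cause mapping is a heuristic: LXD's policy in this thread returns ENOSYS, but other seccomp policies may return EPERM instead.

```python
import ctypes
import errno
import os

MAX_HANDLE_SZ = 128   # upper bound for the opaque handle payload
AT_FDCWD = -100       # "relative to the current directory" dirfd


class FileHandle(ctypes.Structure):
    """Mirror of struct file_handle with a fixed-size payload buffer."""
    _fields_ = [
        ("handle_bytes", ctypes.c_uint),
        ("handle_type", ctypes.c_int),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]


def classify(err):
    """Map an errno value to a likely cause (heuristic, see the note above)."""
    if err == 0:
        return "allowed"
    if err == errno.ENOSYS:
        return "blocked by seccomp (or not implemented)"
    if err == errno.EPERM:
        return "missing CAP_DAC_READ_SEARCH in the initial user namespace"
    return "other failure: " + os.strerror(err)


def probe(path="/"):
    """Encode a handle for path, then try to reopen it by handle."""
    libc = ctypes.CDLL(None, use_errno=True)
    handle = FileHandle()
    handle.handle_bytes = MAX_HANDLE_SZ
    mount_id = ctypes.c_int()
    rc = libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                                ctypes.byref(handle),
                                ctypes.byref(mount_id), 0)
    if rc != 0:
        return classify(ctypes.get_errno())
    mount_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fd = libc.open_by_handle_at(mount_fd, ctypes.byref(handle),
                                    os.O_RDONLY)
        if fd < 0:
            return classify(ctypes.get_errno())
        os.close(fd)
        return classify(0)
    finally:
        os.close(mount_fd)
```

Running print(probe()) as root inside the container gives a quick hint as to whether the failure comes from the seccomp filter or from the capability check.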

I read something about LXD restricting open_by_handle_at for security reasons, so I made the container privileged:

    ➜  ~ lxc config show template-nfs-server
    architecture: x86_64
    config:
      image.architecture: amd64
      image.description: ubuntu 18.04 LTS amd64 (release) (20180724)
      image.label: release
      image.os: ubuntu
      image.release: bionic
      image.serial: "20180724"
      image.version: "18.04"
      raw.apparmor: |-
        mount fstype=rpc_pipefs,
        mount fstype=nfsd,
      security.privileged: "true"
      volatile.base_image: 38219778c2cf02521f34f950580ce3af0e4b61fbaf2b4411a7a6c4f0736071f9
      volatile.eth0.hwaddr: 00:16:3e:f5:d6:ea
      volatile.eth0.name: eth0
      volatile.idmap.base: "0"
      volatile.idmap.next: '[]'
      volatile.last_state.idmap: '[]'
      volatile.last_state.power: RUNNING
    devices:
      Vsphere-Templates:
        path: /mnt/templates
        pool: lxd-ceph
        source: Vsphere-Templates
        type: disk
    ephemeral: false
    profiles:
    - infrastructure
    stateful: false
    description: ""

Still, the error persists. Is there something else I need to do?

Thanks

stgraber · August 6, 2018, 6:48pm

open_by_handle_at is directly banned by the default Seccomp policy as it provides for a very easy escape of confinement.

You should be able to override that with lxc config set template-nfs-server raw.lxc lxc.seccomp= but note that a privileged container with that disabled effectively means that root in the container can somewhat trivially escape to the host.

This syscall being allowed in containers was the reason behind the shocker exploit a few years back.

Shantur_Rathore · August 7, 2018, 12:47pm

Thanks @stgraber.
That did the trick, but as I am using the latest snap (3.3), the command needed was

lxc config set template-nfs-server raw.lxc lxc.seccomp.profile=

I really appreciate you taking the time to answer all the questions on the forum.
I will be asking loads.

stgraber · August 7, 2018, 2:40pm

Oh oops, forgot we renamed that one too with 3.0
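The two spellings of the command can be sketched as a small helper. The cutoff at major version 3 is an assumption taken from the remark above, and the function names are hypothetical. Clearing the profile removes the seccomp protection entirely, with the escape risk already described.

```python
def seccomp_key(lxc_major):
    """Pick the raw.lxc key for the seccomp profile by LXC major version.

    Assumption: the key is lxc.seccomp before 3.0 and
    lxc.seccomp.profile from 3.0 on, as described in this thread.
    """
    return "lxc.seccomp.profile" if lxc_major >= 3 else "lxc.seccomp"


def clear_seccomp_command(container, lxc_major):
    """Build the argv for 'lxc config set' that clears the profile."""
    # An empty value after '=' means "no seccomp profile".
    return ["lxc", "config", "set", container, "raw.lxc",
            seccomp_key(lxc_major) + "="]


print(" ".join(clear_seccomp_command("template-nfs-server", 3)))
```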

oleg · July 21, 2023, 1:41pm

Hi. I am also stuck with this problem.
For now, the configuration of my container nfs-test looks like this:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230719)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230719"
  image.type: squashfs
  image.version: "22.04"
  raw.lxc: lxc.seccomp.profile = /usr/share/lxc/config/nfs-enabled.seccomp
  volatile.base_image: a0a9b9976255e7235afe495e920e6c0f40f55ae22852a5d5c31139aa9408f2e5
  volatile.cloud-init.instance-id: 542ed0e3-a70d-490b-872e-fcf5cfd38ecf
  volatile.host-local.hwaddr: 00:16:3e:fe:ca:d5
  volatile.host-nat.hwaddr: 00:16:3e:fe:f3:16
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 362294c6-50f9-4f4f-a9d3-2ea730fde0b8
  volatile.uuid.generation: 362294c6-50f9-4f4f-a9d3-2ea730fde0b8

The key line in config is:

raw.lxc: lxc.seccomp.profile = /usr/share/lxc/config/nfs-enabled.seccomp

nfs-enabled.seccomp

2
denylist
reject_force_umount  # comment this to allow umount -f;  not recommended
[all]
kexec_load errno 1
#open_by_handle_at errno 1
init_module errno 1
finit_module errno 1
delete_module errno 1

So, as you can see, the line with open_by_handle_at is commented out.
But after starting the container, it is still unable to export a directory with nfs-ganesha.

Effective seccomp config for the started container:
$ vim /var/lib/lxd/security/seccomp/nfs-test

2
denylist
[all]
reject_force_umount  # comment this to allow umount -f;  not recommended
[all]
kexec_load errno 38
open_by_handle_at errno 38
init_module errno 38
finit_module errno 38
delete_module errno 38
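The difference between the two profiles can be checked mechanically. The following is a rough sketch, not LXC's real parser: it only handles the subset of the version-2 format that appears in this thread (a version line, a policy line, [arch] sections and "name action" rules), and parse_seccomp_profile is an illustrative name.

```python
def parse_seccomp_profile(text):
    """Return {rule_name: action} for the profile subset seen in this thread."""
    # Drop comments and blank lines first.
    lines = [line.split("#", 1)[0].strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    rules = {}
    for line in lines[2:]:            # skip the version and policy lines
        if line.startswith("["):      # architecture section such as [all]
            continue
        name, _, action = line.partition(" ")
        rules[name] = action.strip()  # directives without action map to ""
    return rules


REQUESTED = """2
denylist
reject_force_umount  # comment this to allow umount -f;  not recommended
[all]
kexec_load errno 1
#open_by_handle_at errno 1
init_module errno 1
finit_module errno 1
delete_module errno 1
"""

EFFECTIVE = """2
denylist
[all]
reject_force_umount  # comment this to allow umount -f;  not recommended
[all]
kexec_load errno 38
open_by_handle_at errno 38
init_module errno 38
finit_module errno 38
delete_module errno 38
"""

requested = parse_seccomp_profile(REQUESTED)
effective = parse_seccomp_profile(EFFECTIVE)
# Rules present in the effective profile but absent from the requested one.
print(sorted(set(effective) - set(requested)))  # ['open_by_handle_at']
print(effective["open_by_handle_at"])           # errno 38
```

So the profile applied to the running container still denies open_by_handle_at, and errno 38 is ENOSYS, which matches the "Function not implemented" message.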

My OS is Linux Mint 20.2 based on Ubuntu 22.04.
LXD is compiled according to the official documentation on GitHub.

Did I miss something in the configuration?

oleg · August 9, 2023, 12:43pm

Could you please give me at least a hint on where to dig?

amikhalitsyn · August 11, 2023, 12:16pm

Hi!

Are you sure that the problem is about open_by_handle_at in your case? Have you seen errors about that in the logs?

I can say that your container is not privileged, and the open_by_handle_at syscall checks capable(CAP_DAC_READ_SEARCH). This means that the calling process must have the CAP_DAC_READ_SEARCH capability in the initial user namespace. So if you are absolutely sure that the problem is about open_by_handle_at, then make your container privileged.
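The two conditions can be approximated from /proc. This is a heuristic sketch, not the kernel's actual check: it reads the effective capability mask from the text of /proc/self/status and treats an identity mapping of the full 32-bit range in /proc/self/uid_map as a sign of the initial user namespace. All function names are illustrative.

```python
CAP_DAC_READ_SEARCH = 2  # bit number from linux/capability.h


def has_cap_dac_read_search(status_text):
    """Check the CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool((mask >> CAP_DAC_READ_SEARCH) & 1)
    return False


def looks_like_initial_userns(uid_map_text):
    """True if uid_map is the identity mapping of the whole 32-bit range."""
    return uid_map_text.split() == ["0", "0", "4294967295"]


def may_open_by_handle(status_text, uid_map_text):
    """Both conditions must hold: capability set and initial namespace."""
    return (has_cap_dac_read_search(status_text)
            and looks_like_initial_userns(uid_map_text))


# Root in an unprivileged container: all capability bits are set, but
# only inside a mapped namespace (host IDs 1000000..1065535 as in the
# idmap shown earlier in this thread).
print(may_open_by_handle("CapEff:\t000001ffffffffff\n", "0 1000000 65536\n"))
```

With real data, pass in the contents of /proc/self/status and /proc/self/uid_map. A full capability mask inside the container is not sufficient, because the kernel evaluates the capability against the initial user namespace.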

oleg · August 16, 2023, 1:43pm

Hi, amikhalitsyn

With a privileged container everything works.
But my goal is to run nfs-ganesha (which runs in user space, not in kernel space) without full privileges, with only the ones that are directly required.

Are you sure that the problem is about open_by_handle_at in your case? Have you seen errors about that in the logs?

I saw error messages in the nfs-ganesha logs complaining about the open_by_handle_at function.

oleg · August 16, 2023, 1:55pm

So the problem may not be related to that function, but what I see is that LXD doesn’t take lxc.seccomp.profile into account. Otherwise I would be able to check whether it helps or not.
By the way, I also specified a custom AppArmor profile.

amikhalitsyn · August 17, 2023, 12:22pm

@oleg

I’ve read through the ganesha code briefly, and I can say that open_by_handle_at is an essential part of the process of “exporting a filesystem” with NFS. It doesn’t matter whether it’s nfsd (the kernel server) or user space (nfs-ganesha). [A similar discussion: exporting aufs file system via nfs-ganesha · Issue #252 · nfs-ganesha/nfs-ganesha · GitHub]

As I’ve written above, open_by_handle_at requires CAP_DAC_READ_SEARCH in the initial user namespace. You can’t allow it by using a different AppArmor/seccomp profile. This is a strict limitation. The only way to make it work is to use a privileged container.

Who knows, maybe someday we will find a way to make open_by_handle_at secure enough to be allowed in user namespaces, but this syscall is controversial (for containers) from a security standpoint, which is why it’s not allowed inside a user namespace.

I can say that even LXC privileged containers are safe enough. We use extra LSM (AppArmor) to provide a good level of isolation from the host.

oleg · August 18, 2023, 5:48pm

amikhalitsyn, thanks for your time.
Now the whole problem has become much clearer to me.
I’m not familiar with system/Linux/kernel programming specifically, and my question may seem incorrect, but I saw that e.g. for Docker containers I can choose kernel capabilities. So, can they be chosen as a set of capabilities before container creation? After all, as I understand it, a privileged container is just a container with an extended set of capabilities.

We use extra LSM (AppArmor) to provide a good level of isolation from the host.

Could you please share which AppArmor directives you use? I’m interested in using a network (replicated/distributed) file system and am now actively researching what will fit our needs in terms of performance, convenience and maintainability.

amikhalitsyn · August 20, 2023, 10:57am

but I saw that e.g. for Docker containers I can choose kernel capabilities.

Yes, you can. And in this case Docker does not use a user namespace, which is the same as an LXC privileged (!) container from the security perspective.

Could you please share which AppArmor directives you use?

You don’t need to configure anything by hand. LXC does everything with AppArmor for privileged containers automatically.

oleg · August 22, 2023, 12:02pm

OK. I solved my main problem; in other words, I came to understand that the solution is a privileged container. But now I understand why.

A secondary question: why does lxc/lxd not take into account the parameter
raw.lxc: lxc.seccomp.profile

Is it by design?