Udev inside privileged container

Hi,

I am running LXD version 4.21 on Ubuntu 20.04.

I would like to run udev inside my privileged container and let it manage the device nodes for loopback devices. My use-case is the following: I would like to prepare hard-drive images inside the container and need to access the partitions after formatting the image file. The process is illustrated by the following simplified script:

#!/bin/bash

# Name of the image file.
IMAGE=hd.img

# Create an empty 2 GiB image.
dd if=/dev/zero of="${IMAGE}" bs=512 count=4194304

# Find a free loopback device and attach it to the image;
# --show prints the device that was allocated, which avoids
# parsing the output of `losetup -a` afterwards.
DEVICE=$(sudo losetup -f --show "${IMAGE}")

# Init image with GPT partition table.
sudo sgdisk -Z "${DEVICE}"
sudo sgdisk -o "${DEVICE}"

# Create a partition.
sudo sgdisk -n 1:2048:4096 -t 0:8300 -c 0:Linux "${DEVICE}"

After the image has been created and partitioned, I would like to reload the kernel's partition table by running

sudo partprobe /dev/loop20

so that the new partition appears as, e.g. /dev/loop20p1.
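Spelled out, that step looks roughly like this (the loop device name is just the one from my run; `udevadm settle` waits for udev to finish processing the resulting events):

```shell
# Re-read the partition table and wait until udev has processed the
# resulting events (device name is from my run; yours may differ).
sudo partprobe /dev/loop20
sudo udevadm settle          # block until the udev event queue is empty
ls -la /dev/loop20p1         # the new partition node should now exist
```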

Here is my container configuration, which already incorporates ideas from this thread: https://github.com/lxc/lxd/issues/1841

architecture: x86_64
config:
  image.description: Base image
  linux.kernel_modules: overlay
  raw.apparmor: mount,
  raw.lxc: |
    lxc.cgroup.devices.allow = c 4:* rwm
    lxc.cgroup.devices.allow = b 7:* rwm
    lxc.cgroup.devices.allow = b 8:* rwm
    lxc.cgroup.devices.allow = c 10:236 rwm
    lxc.cgroup.devices.allow = c 10:237 rwm
    lxc.cgroup.devices.allow = c 116:* rwm
    lxc.cgroup.devices.allow = c 188:* rwm
    lxc.cgroup.devices.allow = b 252:* rwm
    lxc.cgroup.devices.allow = b 253:* rwm
    lxc.mount.auto=sys:rw proc:mixed cgroup:mixed
  security.nesting: "true"
  security.privileged: "true"
  volatile.base_image: f5da84d0ffe30fb083d87d7990c08a5ee89dc58e148b43f5a6344d62899c3a71
  volatile.idmap.base: "0"
  volatile.idmap.current: '[]'
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.net-bridge.host_name: veth-2320020504
  volatile.net-bridge.hwaddr: 00:16:3e:ac:17:c0
  volatile.uuid: 4ab7c28a-4420-43e9-8407-918d878eafd8

With the above configuration udev starts fine:

$ systemctl status udev.service
● systemd-udevd.service - udev Kernel Device Manager
   Loaded: loaded (/lib/systemd/system/systemd-udevd.service; static; vendor preset: enabled)
   Active: active (running) since Mon 2022-02-07 13:26:31 UTC; 16min ago
     Docs: man:systemd-udevd.service(8)
           man:udev(7)
 Main PID: 58 (systemd-udevd)
   Status: "Processing with 40 children at max"
    Tasks: 1
   CGroup: /system.slice/systemd-udevd.service
           └─58 /lib/systemd/systemd-udevd

It also receives the relevant events:

$ udevadm monitor
...
KERNEL[4903.633614] change   /devices/virtual/block/loop20 (block)
UDEV  [4903.720529] change   /devices/virtual/block/loop20 (block)
...

KERNEL[5041.497195] add      /devices/virtual/block/loop20/loop20p1 (block)
UDEV  [5041.507141] add      /devices/virtual/block/loop20/loop20p1 (block)

But the corresponding device node (/dev/loop20p1) does not appear inside my container, although it does on the host. I also noticed that partprobe takes very long to finish, a couple of minutes, which seems strange as well, although it does not report any error.

Do you have any ideas?

Best,
Holger

EDIT: My container image is based on Ubuntu 18.04.

Block devices aren’t namespaced, so using a privileged container to trigger the creation of block devices will cause them to show up on the host, not in the container.

You’ll need to manually add any resulting device to the container as part of its devices.
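For anyone finding this later, a sketch of what that could look like with a unix-block device (the container name "mycontainer" and the device path are examples from the thread, not a prescription):

```shell
# Pass the host's partition device into the container as a unix-block
# device (container name and device path are examples).
lxc config device add mycontainer loop20p1 unix-block \
    source=/dev/loop20p1 path=/dev/loop20p1
```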

That explains it. So I guess I have to stick to my current solution (explicitly allocating a loopback device for each partition, using offset and size).

Thanks for the quick answer!

This is an older topic, but I just wanted to clarify something as I understand it. @stgraber is correct that there’s no namespacing of block devices, but that doesn’t mean it’s not possible to do what you’re trying to do.

I believe @stgraber meant that major:minor device numbers are dynamically allocated, so it’s hard to know in advance what to allow through. There are almost no reserved major numbers for block devices, and there are no predefined major number ranges reserved for specific types of devices.

However, you can see the currently allocated device numbers on your host with cat /proc/devices, and you can see the major:minor number of an existing device file with ls -la or stat. Furthermore, the partitions of a (non-loopback) block device all share the disk’s major device number, and their minor device numbers correspond to their partition numbers. Loopback devices are odd, however: they all share a single major device number and have a minor device number corresponding to their loopback number (i.e. if all loopback devices have major number 7, then loop15 will be 7:15). Once you start partitioning loopback devices, though, the minor numbers no longer correspond to either the loopback number or the partition number.
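A minimal sketch of that inspection. /dev/null is used here only because it exists everywhere (it is always char 1:3); loop and disk devices can be inspected the same way:

```shell
# List the currently allocated major device numbers.
head /proc/devices

# Show major:minor (in hex) of a device file; prints 1:3 for /dev/null.
stat -c '%t:%T' /dev/null
```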

All this isn’t to say it’s not possible to do what you’re trying to do; it just means you’ll have a tougher time defining the cgroup.devices.allow rules that let newly created loopback partition devices be visible within the container.
For non-loopback devices it’s just a matter of dynamically determining the major number of the specific primary block device and allowing all minor numbers through (i.e. if /dev/sda is 65:0, then cgroup.devices.allow = b 65:* rwm).
For loopback devices, your only option is to allow all loopback devices through if you want access to newly created partitions. It’s not possible to allow only a single loopback device and its partitions through.
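The non-loopback case can be sketched like this. /dev/null stands in for the real disk (e.g. /dev/sda) so the example runs anywhere; its major number happens to be 1:

```shell
# Derive a device's major number at runtime and print the matching
# allow rule (/dev/null is a stand-in; substitute your disk).
DEV=/dev/null
MAJOR=$((16#$(stat -c '%t' "$DEV")))   # %t prints the major number in hex
echo "lxc.cgroup.devices.allow = b ${MAJOR}:* rwm"
```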


Just reiterating too that this whole solution requires running udev (or equivalent) inside the container, and there are also complexities in getting the device naming to match the host.
Allowing the devices through with cgroup.devices.allow opens a window to the host so the container can see those devices, but it doesn’t create the device files associated with them; udev inside the container is still necessary for that.
As for matching names to the host: you’re effectively running udev on the host and udev within the container and hoping they come to the same conclusion about naming. But their naming depends on which other devices are present at the time the device file is created, so the conclusions will normally differ. The lxc-device tool on the host handles this name-matching issue for existing devices being added to the container: it explicitly creates the device file inside the container with the same name as on the host. This could theoretically be leveraged to get partition device files named the same as on the host, by using lxc-device to add the primary block device into the container under its host-matched name, and then relying on udev inside the container to derive the same partition names as the host from that device file name and the minor device numbers.
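For illustration, the lxc-device step would look something like this (the container name is hypothetical):

```shell
# Add an existing host device into a running container, creating the
# device node inside it with the same name as on the host.
lxc-device -n mycontainer add /dev/sda
```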