Limiting disk IOPS with Docker inside an LXD container

Hello,
I have an LXD container with disk IOPS limits configured as below:

root:
limits.max: 500iops
limits.read: 500iops
limits.write: 500iops
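
For reference, limits like these can be applied with something along the following lines (container name "test" is assumed here; the exact CLI syntax can vary between LXD versions, and if the root device comes from a profile it has to be overridden first):

lxc config device override test root limits.max=500iops limits.read=500iops limits.write=500iops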

It works fine when doing disk-intensive operations from the container itself. The container has security.nesting=true so that it can run Docker inside. That works great, but I can see that I/O operations done inside Docker in this container are not throttled: I see ~1500 IOPS, compared to a stable ~500 IOPS when doing the same operations from the LXD container directly.

I suppose that Docker’s cgroups do not inherit the limits from the LXD container’s cgroup. Is there any way to make that happen?

I use a BTRFS storage pool backed by a loop device attached to the container. I think that prevents me from using the limits implemented in Docker itself (e.g. --device-read-bps).
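
For illustration, the loop file backing such a pool is usually visible on the host with something like this (paths are typical defaults and assumed here, not taken from the thread):

losetup -l
# BACK-FILE points at e.g. /var/snap/lxd/common/lxd/disks/default.img
# (or /var/lib/lxd/disks/default.img for a non-snap install)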

Thank you for any suggestions! :)

Can you show lxc config show --expanded NAME for your container?

Of course:

architecture: x86_64
config:
  environment.lxdMosaicPullMetrics: "y"
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20201201)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20201201"
  image.type: squashfs
  image.version: "20.04"
  limits.cpu: "4"
  limits.cpu.priority: "5"
  limits.memory: 10GB
  limits.memory.enforce: hard
  security.nesting: "true"
  snapshots.expiry: 1w
  snapshots.pattern: snapshot-{{creation_date.Format("20060102")}}-%d
  snapshots.schedule: 0 0 * * *
  snapshots.schedule.stopped: "false"
  volatile.base_image: 3e9403fe7645000fc49ec89bca056c7fd53e9a142a3a9054ee02c13a2f14b6d0
  volatile.eth0.host_name: veth070989f2
  volatile.eth0.hwaddr: 00:16:3e:9c:90:11
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: bd14ecf1-d949-4b64-b3d0-652ecd9c7778
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    limits.max: 500iops
    limits.read: 500iops
    limits.write: 500iops
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- lxdMosaicPullMetrics
- default
stateful: false
description: ""

I have created a test container with the same profile but with no CPU/memory limits.

FIO result from the LXD container:

   iops        : min=  470, max=  500, avg=499.44, stdev= 3.07, samples=184

FIO result from an Ubuntu Docker container inside the LXD container:

   iops        : min=17638, max=31696, avg=28444.51, stdev=1745.46, samples=55

(these are read IOPS)
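
For context, the numbers above come from a random-read fio run; the exact command isn't quoted in the thread, but a typical invocation looks like this (all parameters are assumptions):

fio --name=randread --rw=randread --ioengine=libaio --direct=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --group_reporting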

Test container config:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20210201)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20210201"
  image.type: squashfs
  image.version: "20.04"
  security.nesting: "true"
  volatile.base_image: d1df9c150a9fd265ba93a00fe062757bd34d9c0daa076063f59204f0e3bf2629
  volatile.eth0.hwaddr: 00:16:3e:e4:68:f7
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.uuid: 1e07851a-6c95-4a8b-8ce8-1c14c2a0dc6a
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    limits.max: 500iops
    limits.read: 500iops
    limits.write: 500iops
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

What’s the Docker backend in use here?

The blkio limits should be hierarchical, so they should apply to the nested containers too, but if Docker is using an overlay backend, that may explain it?

It’s Docker installed through Ubuntu’s snap.

Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version:
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-65-generic
 Operating System: Ubuntu Core 16
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 62.83GiB
 Name: test
 ID: 5GAO:BW3D:VTPL:QPAT:RQGT:5GST:HYE6:OSBD:7AET:LVYD:DIY6:EEB4
 Docker Root Dir: /var/snap/docker/common/var-lib-docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

I have installed Docker from the APT docker.io package and the results are the same - the IOPS limit is not applied. As far as I can see, the snap package used the overlay2 storage driver while the APT package uses btrfs:

Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 19.03.8
 Storage Driver: btrfs
  Build Version: Btrfs v5.4.1
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version:
 runc version:
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-65-generic
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 62.83GiB
 Name: test
 ID: N5N7:PZ4F:X2KJ:U2LU:3LMW:F4GI:TWXE:WYS6:YGRZ:SJAP:RK4L:VRVY
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

FIO read result for Docker installed through APT:

   iops        : min=50174, max=65604, avg=59655.88, stdev=4826.27, samples=26

I think this issue is not limited to disk limits only. When I run an Ubuntu container in APT-installed Docker inside a test LXD container that has its CPU limit set to 4 and memory limit to 4GB, I see all of the host’s CPUs and all of the host’s memory in htop.

In simple words:

  • with the limits applied I see 4 CPUs and 4GB RAM from inside the LXD container;
  • I see 12 CPUs and 64GB RAM from inside the Docker container running in that LXD container.

It’s normal that you can see all the CPUs and memory in the nested Docker container.
Docker doesn’t know how to use LXCFS to properly render the limits. You won’t be able to exceed the CPU or memory limits inside the container; it’s just the reporting that’s incorrect.
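
A quick way to see where the difference comes from (illustrative commands, not taken from the thread): inside the LXD container /proc is provided by LXCFS, while the nested Docker container mounts a fresh kernel /proc.

# inside the LXD container (/proc served by LXCFS, so limits are reflected):
grep -c ^processor /proc/cpuinfo   # shows the limited CPU count
free -m                            # shows the limits.memory value
# inside the nested Docker container (plain kernel /proc):
grep -c ^processor /proc/cpuinfo   # shows the host's CPU count
free -m                            # shows the host's total memory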

I think I might have an answer:

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/blkio-controller.html

Throttling implements hierarchy support; however, throttling’s hierarchy support is enabled iff “sane_behavior” is enabled from cgroup side, which currently is a development option and not publicly available.

As I understand it, sub-cgroups simply do not inherit the blkio parameters from the LXD container’s cgroup. I have checked /sys/fs/cgroup/blkio/docker/blkio.throttle.read_iops_device and /sys/fs/cgroup/blkio/docker/{container}/blkio.throttle.read_iops_device and the read IOPS throttling wasn’t there.
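
For completeness, a check like this shows whether a rule is present at each level - an empty file means no throttle rule (the Docker container ID is a placeholder):

cat /sys/fs/cgroup/blkio/docker/blkio.throttle.read_iops_device
cat /sys/fs/cgroup/blkio/docker/<container-id>/blkio.throttle.read_iops_device
# no output = no read IOPS rule at that level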

I have made a small test by assigning the IOPS limit directly to the Docker container’s cgroup (7:4 being the major:minor of the backing loop device):

root@test:~# echo "7:4 500" > /sys/fs/cgroup/blkio/docker/57ef9e662e2ab7c79bf63223d0f00e89e9911fe01f4c13904e03bba34de68aaf/blkio.throttle.write_iops_device
root@test:~# echo "7:4 500" > /sys/fs/cgroup/blkio/docker/57ef9e662e2ab7c79bf63223d0f00e89e9911fe01f4c13904e03bba34de68aaf/blkio.throttle.read_iops_device

And the container started to throttle.

I have also assigned the same limits to the “docker” cgroup (roughly as shown below), but again - they are not inherited by the container’s cgroup.
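
echo "7:4 500" > /sys/fs/cgroup/blkio/docker/blkio.throttle.read_iops_device
echo "7:4 500" > /sys/fs/cgroup/blkio/docker/blkio.throttle.write_iops_device
# with cgroup v1 throttling being non-hierarchical (without sane_behavior),
# a rule on this parent cgroup does not propagate to the per-container children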

Fun. Can you try writing 1 to the /sys/fs/cgroup/blkio/cgroup.sane_behavior file on your system and see if that fixes it (you’ll need to restart the container)?

I don’t know if there’s some kind of systemd option to have the host set it to true on startup.

I cannot set it to 1:

# echo 1 > /sys/fs/cgroup/blkio/cgroup.sane_behavior
-bash: echo: write error: Invalid argument

According to https://www.man7.org/linux/man-pages/man7/cgroups.7.html, sane_behavior is always set to 0 in cgroups v2 and cannot be changed. The same applies to clone_children.

I would use Docker’s --device-read-iops option, but it’s not possible to set it: the loop device backing the container is not visible under /dev/* inside it, while Docker requires device paths starting with /dev. :(
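
For reference, the invocation that would normally be used looks roughly like this (device path and values are illustrative); it cannot work here because the pool’s loop device has no node under /dev inside the container:

docker run --rm \
  --device-read-iops /dev/loop4:500 \
  --device-write-iops /dev/loop4:500 \
  ubuntu:20.04 bash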

I tried setting the disk IOPS limit through the Docker systemd service, but that doesn’t work either. It makes sense, as the cgroups created for the containers won’t inherit it from the parent.
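
For illustration only, a drop-in on the Docker service would look something like the following (path and directive names are assumptions, not from the thread); note that IOReadIOPSMax=/IOWriteIOPSMax= only take effect on the unified cgroup v2 hierarchy, and with the cgroupfs driver Docker places its container cgroups outside the service’s subtree anyway:

# /etc/systemd/system/snap.docker.dockerd.service.d/override.conf (path assumed)
[Service]
IOReadIOPSMax=/dev/loop4 500
IOWriteIOPSMax=/dev/loop4 500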

Well, you’re not on cgroup2 though.

You are right, but looking at the kernel source code, it’s now a read-only value even for cgroups v1.

Oh, lovely :)

@stgraber Thank you for your time. I think the only thing I can do is to continue this topic in the kernel’s issue tracker or Docker’s (to allow using major:minor numbers for limiting IOPS, to be compliant with cgroups). I really appreciate your help, have a nice day! :)

I have submitted this issue to the kernel bug tracker: https://bugzilla.kernel.org/show_bug.cgi?id=211689

I kinda doubt that you’ll see much traction on cgroup1 changes at this point.
cgroup2 should have the proper behavior, though whether your Docker container will work at all on it is another question ;)

LXD does support cgroup2 including configuring its blkio controller.

@stgraber Is there a way to start a single LXD container using cgroups v2, or must systemd.unified_cgroup_hierarchy=1 be applied to the host to test it? As far as I can tell, Docker supports cgroups v2 since 20.10, so it might be worth a try. :D

Using cgroup2 is unfortunately a system-wide thing…
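
For completeness, switching the host to cgroup2 is typically done via the kernel command line and a reboot (standard Ubuntu GRUB layout assumed):

# /etc/default/grub - append to the existing options:
GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=1"

sudo update-grub
sudo reboot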