ERROR cgfsng - cgroups/cgfsng.c:__cgroup_tree_create:747 - No space left on device

From time to time, I’m unable to start new containers (tried with newly copied containers):

# lxc start work01-2021-10-06-06-33-04           
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart work01-2021-10-06-06-33-04 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/work01-2021-10-06-06-33-04/lxc.conf:
Try `lxc info --show-log work01-2021-10-06-06-33-04` for more info

When I inspect the log, it’s full of “No space left on device” errors from cgroup creation:

Name: work01-2021-10-06-06-33-04
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2021/10/06 06:34 UTC
Last Used: 2021/10/06 11:33 UTC

Log:

lxc work01-2021-10-06-06-33-04 20211006113300.976 ERROR    cgfsng - cgroups/cgfsng.c:__cgroup_tree_create:747 - No space left on device - Failed to create 13(lxc.monitor.work01-2021-10-06-06-33-04)
lxc work01-2021-10-06-06-33-04 20211006113300.976 ERROR    cgfsng - cgroups/cgfsng.c:cgroup_tree_create:831 - No space left on device - Failed to create monitor cgroup 13(lxc.monitor.work01-2021-10-06-06-33-04)
lxc work01-2021-10-06-06-33-04 20211006113300.976 ERROR    cgfsng - cgroups/cgfsng.c:__cgroup_tree_create:747 - No space left on device - Failed to create 13(lxc.monitor.work01-2021-10-06-06-33-04-1)
lxc work01-2021-10-06-06-33-04 20211006113300.976 ERROR    cgfsng - cgroups/cgfsng.c:cgroup_tree_create:831 - No space left on device - Failed to create monitor cgroup 13(lxc.monitor.work01-2021-10-06-06-33-04-1)
lxc work01-2021-10-06-06-33-04 20211006113300.976 ERROR    cgfsng - cgroups/cgfsng.c:__cgroup_tree_create:747 - No space left on device - Failed to create 13(lxc.monitor.work01-2021-10-06-06-33-04-2)
lxc work01-2021-10-06-06-33-04 20211006113300.976 ERROR    cgfsng - cgroups/cgfsng.c:cgroup_tree_create:831 - No space left on device - Failed to create monitor cgroup 13(lxc.monitor.work01-2021-10-06-06-33-04-2)

(...repeated hundreds of times...)

lxc work01-2021-10-06-06-33-04 20211006113300.993 ERROR    cgfsng - cgroups/cgfsng.c:__cgroup_tree_create:747 - No space left on device - Failed to create 13(lxc.monitor.work01-2021-10-06-06-33-04-999)
lxc work01-2021-10-06-06-33-04 20211006113300.993 ERROR    cgfsng - cgroups/cgfsng.c:cgroup_tree_create:831 - No space left on device - Failed to create monitor cgroup 13(lxc.monitor.work01-2021-10-06-06-33-04-999)
lxc work01-2021-10-06-06-33-04 20211006113300.993 ERROR    cgfsng - cgroups/cgfsng.c:cgfsng_monitor_create:1067 - Numerical result out of range - Failed to create monitor cgroup
lxc work01-2021-10-06-06-33-04 20211006113300.993 ERROR    start - start.c:__lxc_start:2002 - Failed to create monitor cgroup
lxc work01-2021-10-06-06-33-04 20211006113300.993 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:868 - Received container state "ABORTING" instead of "RUNNING"
lxc work01-2021-10-06-06-33-04 20211006113306.452 WARN     cgfsng - cgroups/cgfsng.c:cgfsng_payload_destroy:538 - Uninitialized limit cgroup
lxc work01-2021-10-06-06-33-04 20211006113306.455 WARN     cgfsng - cgroups/cgfsng.c:cgfsng_monitor_destroy:910 - Uninitialized monitor cgroup
lxc 20211006113306.459 ERROR    af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:220 - Connection reset by peer - Failed to receive response
lxc 20211006113306.461 ERROR    commands - commands.c:lxc_cmd_rsp_recv_fds:129 - Failed to receive file descriptors

I’m running LXD 4.19 on Ubuntu 20.04 on AWS, with the 5.11.0-1017-aws kernel.

Any idea how to fix it?

Note: this is not a case of, e.g., the rootfs running out of space:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             32G     0   32G   0% /dev
tmpfs           6.3G  1.1M  6.3G   1% /run
/dev/nvme0n1p1   39G   17G   22G  44% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/loop0       34M   34M     0 100% /snap/amazon-ssm-agent/3552
/dev/loop1       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop3      100M  100M     0 100% /snap/core/11606
/dev/loop4       56M   56M     0 100% /snap/core18/2128
/dev/loop5       73M   73M     0 100% /snap/lxd/21497
/dev/loop6       56M   56M     0 100% /snap/core18/2074
/dev/loop7       62M   62M     0 100% /snap/core20/1026
/dev/loop8       62M   62M     0 100% /snap/core20/1081
tmpfs           1.0M     0  1.0M   0% /var/snap/lxd/common/ns
tmpfs           6.3G     0  6.3G   0% /run/user/0
/dev/loop10     100M  100M     0 100% /snap/core/11743
/dev/loop2       74M   74M     0 100% /snap/lxd/21624

I’ve noticed that it only seems to happen on LXD servers whose containers run snaps, e.g. the letsencrypt snap.

That usually points to /sys/fs/cgroup/cpuset not being initialized.
What do the following two files show?

  • /sys/fs/cgroup/cpuset/cpuset.cpus
  • /sys/fs/cgroup/cpuset/cgroup.clone_children
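
To answer that in one go, something like the sketch below prints both files with their paths. It assumes the cgroup v1 cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset, as in the paths above. The function name show_cpuset_state and its optional directory argument are only illustrative and are not part of LXD or LXC.

```shell
#!/bin/sh
# Sketch: print the two cpuset files asked about above.
# show_cpuset_state is a hypothetical helper name; the optional first
# argument overrides the directory (default: the cgroup v1 cpuset mount).
show_cpuset_state() {
    dir="${1:-/sys/fs/cgroup/cpuset}"
    for name in cpuset.cpus cgroup.clone_children; do
        path="$dir/$name"
        if [ -r "$path" ]; then
            # Print "path: contents" on a single line.
            printf '%s: %s\n' "$path" "$(cat "$path")"
        else
            # File is absent or not readable by the current user.
            printf '%s: (missing or unreadable)\n' "$path"
        fi
    done
}

show_cpuset_state
```

Run it as root on the LXD host and paste the output into the thread. A line reported as missing, or printed with nothing after the colon, is worth mentioning too.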