LXD hugepages config degrades VM startup

Hello and happy New Year! I have been using LXD to run system containers on my server for about a year, which has been great! I also run libvirt on the same server, and I am interested in migrating my KVM/QEMU VM builds over to LXD.

Problem: My initial VM build with LXD was successful, but I cannot connect to the VM after adding limits.memory.hugepages to its LXD configuration.

Note: I have previously configured hugepages on this system to support running a VM via libvirt, which appears to work fine, but I suspect that this configuration is problematic and/or insufficient for LXD’s needs. Or maybe it is fine and I have a different issue… I detail my hugepages configuration further below.

Basic information about the bare-metal host:

$ grep PRETTY_NAME /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"

$ uname -a
Linux mountain 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux

$ lscpu | grep "Model name"
Model name:                      AMD Ryzen 9 5950X 16-Core Processor

My LXD installation was performed via snap, and qemu was installed via apt.

$ snap list
Name    Version    Rev    Tracking       Publisher   Notes
core    16-2.52.1  11993  latest/stable  canonical✓  core
core20  20211129   1270   latest/stable  canonical✓  base
lxd     4.21       22147  latest/stable  canonical✓  -

$ qemu-system-x86_64 --version
QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-11)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers

Here is the LXD build of the VM, first without hugepages. I am following this guide, with changes as applicable. As indicated below, this works fine.

$ lxc launch images:ubuntu/focal/cloud/amd64 furnace --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating furnace
Starting furnace

$ lxc info furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Name: furnace
Status: RUNNING
Type: virtual-machine
Architecture: x86_64
PID: 1221899
Created: 2021/12/31 13:43 EST
Last Used: 2021/12/31 13:43 EST

Resources:
  Processes: 15
  Disk usage:
    root: 6.26MiB
  CPU usage:
    CPU usage (in seconds): 3
  Memory usage:
    Memory (current): 136.66MiB
    Memory (peak): 160.94MiB
  Network usage:
    enp5s0:
      Type: broadcast
      State: UP
      Host interface: tapef4100bc
      MAC address: 00:16:3e:0f:53:57
      MTU: 1500
      Bytes received: 1.66kB
      Bytes sent: 1.77kB
      Packets received: 14
      Packets sent: 14
      IP addresses:
        inet:  192.168.14.170/24 (global)
        inet6: fe80::216:3eff:fe0f:5357/64 (link)
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 5.92kB
      Bytes sent: 5.92kB
      Packets received: 80
      Packets sent: 80
      IP addresses:
        inet:  127.0.0.1/8 (local)
        inet6: ::1/128 (local)

$ lxc exec furnace -- su root
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
root@furnace:~# uname -a
Linux furnace 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

However, if I add hugepages to the configuration, I can no longer lxc exec into the VM because of an issue with the lxd-agent. lxc console seems to work, but I have no viable login. :frowning:

$ lxc init images:ubuntu/focal/cloud/amd64 furnace --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating furnace

$ lxc info furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Name: furnace
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Created: 2021/12/31 13:50 EST

$ lxc config set furnace limits.memory.hugepages=true
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc config show furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20211228_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20211228_07:42"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.memory.hugepages: "true"
  volatile.apply_template: create
  volatile.base_image: 3909be5d5c59409c001c40805c86bcb29ac787e10618a3c10ddfd425300d7adb
  volatile.eth0.hwaddr: 00:16:3e:cd:73:63
  volatile.uuid: 5985947b-2fff-4c91-ba0d-7ca3ad4aceb5
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""

$ lxc start furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc info furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Name: furnace
Status: RUNNING
Type: virtual-machine
Architecture: x86_64
PID: 1236047
Created: 2021/12/31 13:50 EST
Last Used: 2021/12/31 13:54 EST

Resources:
  Processes: -1
  Disk usage:
    root: 8.00KiB
  Network usage:
    eth0:
      Type: broadcast
      State: UP
      Host interface: tape76c6878
      MAC address: 00:16:3e:cd:73:63
      MTU: 1500
      Bytes received: 862B
      Bytes sent: 164B
      Packets received: 5
      Packets sent: 2
      IP addresses:

$ lxc exec furnace -- su root
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Error: Failed to connect to lxd-agent

$ lxc console furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
To detach from the console, press: <ctrl>+a q

furnace login:

Admittedly, my “configuration” of the VM to use hugepages is pretty naive, so my first thought is that it is incomplete. Maybe there is something obvious that I am missing in my LXD VM configuration. Any help here would be much appreciated.

Thank you!

The other side of this story is the host’s hugepages configuration. I haven’t been able to find anything in LXD’s documentation indicating how this is expected to be configured, which means either (1) none is required, or (2) it is entirely on the user to figure out. As much as I wish it were (1), I suspect that it is really (2). Hopefully this aspect isn’t actually the problem, but in case it is…

As stated above, hugepages + libvirt were previously set up on the server to run VMs. The libvirt installation and configuration were pretty basic, but I have been able to run VMs with this amateur hugepages configuration. I suspect that while my libvirt + hugepages setup “works”, it might only work in a specific way (i.e. root, privileged, other?).

The following is my hugepages configuration…

The default hugepage size is 2 MB. The host has 64 GB of memory, and I have configured an allocation of 16384 hugepages (32 GB) via /etc/sysctl.conf so that the hugepages are reserved at boot time.

$ grep "vm.nr_hugepages" /etc/sysctl.conf
vm.nr_hugepages=16384
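
Tangentially, in case it helps anyone testing different pool sizes: the same allocation can also be requested at runtime without a reboot, though the kernel may reserve fewer pages than asked for if memory is already fragmented, so it is worth re-checking /proc/meminfo afterwards. A minimal sketch:

#!/bin/sh
# Request the hugepage pool size immediately; the kernel may reserve fewer
# pages than asked for if contiguous memory is scarce.
sysctl -w vm.nr_hugepages=16384

# Confirm how many pages were actually reserved.
grep -e HugePages_Total -e HugePages_Free /proc/meminfo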

Also note that I have not declared a shared-memory group ID for hugepages (vm.hugetlb_shm_group), since libvirt does not need it. I don’t know if the same is true for LXD, so this might be the culprit…?

$ grep "vm.hugetlb_shm_group" /etc/sysctl.conf || echo "missing"
missing

I also disable Transparent Huge Pages (THP), which is done via a systemd service at boot.

$ grep ExecStart /etc/systemd/system/disable-thp.service
ExecStart=/bin/sh -c "echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled && echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag"

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]
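
For anyone wanting to replicate this, a minimal oneshot unit along these lines is enough (this is a generic sketch rather than a verbatim copy of my unit; only the ExecStart line above is exact):

# /etc/systemd/system/disable-thp.service (generic sketch)
[Unit]
Description=Disable Transparent Huge Pages
After=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c "echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled && echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag"

[Install]
WantedBy=multi-user.target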

Here is the salient state of the resulting system:

$ grep -e "Mem" -e "Huge" /proc/meminfo
MemTotal:       65764064 kB
MemFree:        13629120 kB
MemAvailable:   15254948 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   16384
HugePages_Free:    15872
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        33554432 kB

$ mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)

Note the discrepancy between HugePages_Total and HugePages_Free. The difference is due to the VM actively running (in its degraded state) under LXD; if I stop that VM, the numbers equal out. (The gap is 512 pages × 2 MiB = 1 GiB, which matches the VM’s default memory limit, if I’m reading things right.)

Hmm, this is odd. Getting a shell inside the VM through lxc console would hopefully provide some information as to why the agent isn’t feeling like starting.

In general for hugepages, all I’ve done is make sure that nr_hugepages * Hugepagesize is large enough to accommodate those VMs I want to run with limits.memory.hugepages=true, and the rest has generally worked just fine here on a combination of platforms (Intel Xeon, AMD EPYC, AMD Ryzen 5700G, Intel Kaby Lake).

LXD expects a single hugetlbfs mount to be present somewhere (it doesn’t matter where); if more than one is found and none of them is at /dev/hugepages, then hugepages support will be disabled.
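
A quick way to sanity-check both of those on the host before starting such a VM, assuming the default 2 MiB page size and (as an example) a 4 GiB limits.memory, would be something like:

#!/bin/sh
# Rough hugepages sizing check for a VM with limits.memory=4GiB.
vm_mem_mib=4096
page_kib=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)
free_pages=$(awk '/^HugePages_Free:/ {print $2}' /proc/meminfo)
needed=$((vm_mem_mib * 1024 / page_kib))
echo "need ${needed} hugepages, ${free_pages} currently free"

# There should also be a hugetlbfs mount (a single one, or one at /dev/hugepages).
mount -t hugetlbfs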

Thank you @stgraber for getting back to me. I appreciate your thoughts on this matter. :sunny:

I spent some time looking into lxd-agent. It looks like the service does try to start, but times out while mounting the lxd-agent source directory, whose contents it needs to copy into the VM before running the agent executable itself.

Here is what I found:

Build, Start & Enter Console

I scripted the following commands to run consecutively:

lxc init images:ubuntu/focal/cloud/amd64 furnace --vm
lxc start furnace
sleep 20
lxc exec furnace passwd
lxc stop furnace
lxc config set furnace boot.autostart=true
lxc config set furnace limits.cpu=5,20
lxc config set furnace limits.memory=4GiB
lxc config set furnace limits.memory.hugepages=true
lxc start furnace
lxc console furnace

This comes back immediately…

WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
To detach from the console, press: <ctrl>+a q

…but then the console stalls for several minutes until this comes out.

[FAILED] Failed to start LXD - agent.

with more (seemingly unrelated) startup kernel output, and then the login prompt:

furnace login: root
password:

If I don’t do anything for a while after logging in, the following kernel message eventually pops up repeatedly.

[  605.177594]       Not tainted 5.4.0-91-generic #102-Ubuntu
[  605.178430] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.179606] INFO: task mount:552 blocked for more than 120 seconds.

Okay, something is wrong with a mount somewhere. The lxd-agent service status confirms this.

root@furnace:~# systemctl status lxd-agent
● lxd-agent.service - LXD - agent
     Loaded: loaded (/lib/systemd/system/lxd-agent.service; enabled; vendor pre>
     Active: deactivating (final-sigterm) (Result: timeout)
       Docs: https://linuxcontainers.org/lxd
    Process: 592 ExecStartPre=/lib/systemd/lxd-agent-setup (code=killed, signal>
      Tasks: 6 (limit: 4687)
     Memory: 2.3M
     CGroup: /system.slice/lxd-agent.service
             ├─456 umount -l /run/lxd_agent
             ├─540 mount -t virtiofs config /run/lxd_agent/.mnt
             └─552 mount -t virtiofs config /run/lxd_agent/.mnt

             Jan 05 01:32:52 furnace systemd[1]: lxd-agent.service: Found left-over process 552 (mount) in control group while starting unit. Ignoring.
             Jan 05 01:32:52 furnace systemd[1]: This usually indicates unclean termination of a previous run, or service deficiencies.
             Jan 05 01:32:52 furnace systemd[1]: Starting LXD - agent...
             Jan 05 01:34:23 furnace systemd[1]: lxd-agent.service: start-pre operation time>
             Jan 05 01:34:23 furnace systemd[1]: lxd-agent.service: Control process exited, >

Numerous entries can be found in journalctl -xe looking like this:

Jan 05 01:16:48 furnace systemd-udevd[310]: vport5p1: Spawned process '/bin/systemctl start lxd-agent.service' [314] is taking longer than 59s to complete
Jan 05 01:16:48 furnace systemd-udevd[291]: vport5p1: Worker [310] processing SEQNUM=2898 is taking a long time
Jan 05 01:17:18 furnace systemd[1]: lxd-agent.service: start-pre operation timed out. Terminating.
Jan 05 01:17:18 furnace systemd[1]: lxd-agent.service: Control process exited, code=killed, status=15/TERM
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- An ExecStartPre= process belonging to unit lxd-agent.service has exited.
--
-- The process' exit code is 'killed' and its exit status is 15.
Jan 05 01:17:18 furnace systemd[1]: lxd-agent.service: Failed with result 'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit lxd-agent.service has entered the 'failed' state with result 'timeout'.
Jan 05 01:17:18 furnace systemd[1]: Failed to start LXD - agent.
-- Subject: A start job for unit lxd-agent.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit lxd-agent.service has finished with a failure.
--
-- The job identifier is 93 and the job result is failed.
Jan 05 01:17:18 furnace systemd-udevd[310]: vport5p1: Process '/bin/systemctl start lxd-agent.service' failed with exit code 1.

So systemctl and journalctl indicate that ExecStartPre=/lib/systemd/lxd-agent-setup, which is defined as part of lxd-agent.service, is timing out.

root@furnace:~# cat /lib/systemd/lxd-agent-setup
#!/bin/sh
set -eu
PREFIX="/run/lxd_agent"

# Functions.
mount_virtiofs() {
    mount -t virtiofs config "${PREFIX}/.mnt" >/dev/null 2>&1
}

mount_9p() {
    /sbin/modprobe 9pnet_virtio >/dev/null 2>&1 || true
    /bin/mount -t 9p config "${PREFIX}/.mnt" -o access=0,trans=virtio >/dev/null 2>&1
}

fail() {
    umount -l "${PREFIX}" >/dev/null 2>&1 || true
    rmdir "${PREFIX}" >/dev/null 2>&1 || true
    echo "${1}"
    exit 1
}

# Setup the mount target.
umount -l "${PREFIX}" >/dev/null 2>&1 || true
mkdir -p "${PREFIX}"
mount -t tmpfs tmpfs "${PREFIX}" -o mode=0700,size=50M
mkdir -p "${PREFIX}/.mnt"

# Try virtiofs first.
mount_virtiofs || mount_9p || fail "Couldn't mount virtiofs or 9p, failing."

# Copy the data.
cp -Ra "${PREFIX}/.mnt/"* "${PREFIX}"

# Unmount the temporary mount.
umount "${PREFIX}/.mnt"
rmdir "${PREFIX}/.mnt"

# Fix up permissions.
chown -R root:root "${PREFIX}"

So it looks like the virtiofs mount stalls instead of failing, which prevents the script from either falling back to 9p or failing outright, and causes the service to time out and be continuously retried by systemd.

I am not really sure how to troubleshoot the virtiofs mount issue, or what on the host it is actually trying to mount, so any further help would be appreciated.
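
For reference, the hang is easy to reproduce by hand from inside the guest. This is roughly what I run (the config tag name comes from the setup script above; timeout just keeps the shell usable if the mount stalls again):

#!/bin/sh
# Re-run the same virtiofs mount the agent setup script performs, but give
# up after 30 seconds instead of leaving the shell stuck.
mkdir -p /tmp/agent-test
if timeout 30 mount -t virtiofs config /tmp/agent-test; then
    ls /tmp/agent-test
    umount /tmp/agent-test
else
    echo "virtiofs mount of tag 'config' hung or failed"
fi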

Very odd that just switching hugepages on causes that difference… Anything else going on in dmesg that could explain this?

Also, what kernel is running in the guest?

Thanks again @stgraber for getting back to me. I appreciate your interest. :slight_smile:

The kernel in the guest is the following:

root@furnace:~# uname -a
Linux furnace 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

and on the host, it is the following:

$ uname -a
Linux mountain 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux

dmesg has some interesting information! The first 1.567553 seconds of logs don’t show anything abnormal. After that, we start to see output connected to the mount issue and to the kernel messages mentioned above that were printed to the console.

[    1.567551] kvm: Nested Virtualization enabled
[    1.567553] kvm: Nested Paging enabled
[  484.345461] INFO: task mount:526 blocked for more than 120 seconds.
[  484.346129]       Not tainted 5.4.0-91-generic #102-Ubuntu
[  484.346585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.347193] mount           D    0   526      1 0x00000004
[  484.347194] Call Trace:
[  484.347200]  __schedule+0x2e3/0x740
[  484.347201]  schedule+0x42/0xb0
[  484.347204]  rwsem_down_write_slowpath+0x244/0x4d0
[  484.347206]  ? _crng_backtrack_protect+0x56/0x70
[  484.347208]  down_write+0x41/0x50
[  484.347210]  grab_super+0x32/0x80
[  484.347213]  ? virtio_fs_restore+0x10/0x10 [virtiofs]
[  484.347213]  sget_fc+0xdf/0x220
[  484.347214]  ? virtio_fs_init_fs_context+0x20/0x20 [virtiofs]
[  484.347215]  virtio_fs_get_tree+0x11e/0x340 [virtiofs]
[  484.347218]  ? security_capable+0x3d/0x60
[  484.347219]  vfs_get_tree+0x2a/0xc0
[  484.347220]  ? capable+0x19/0x20
[  484.347222]  do_mount+0x7b1/0x9c0
[  484.347223]  ksys_mount+0x82/0xd0
[  484.347224]  __x64_sys_mount+0x25/0x30
[  484.347225]  do_syscall_64+0x57/0x190
[  484.347227]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  484.347228] RIP: 0033:0x7f16cd762dde
[  484.347232] Code: Bad RIP value.
[  484.347233] RSP: 002b:00007ffc9cdeaef8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  484.347234] RAX: ffffffffffffffda RBX: 00007f16cd891204 RCX: 00007f16cd762dde
[  484.347234] RDX: 0000560eea1a0c40 RSI: 0000560eea1a0c80 RDI: 0000560eea1a0c60
[  484.347235] RBP: 0000560eea1a0a30 R08: 0000000000000000 R09: 00007ffc9cde9c70
[  484.347235] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  484.347235] R13: 0000560eea1a0c60 R14: 0000560eea1a0c40 R15: 0000560eea1a0a30
[  605.177417] INFO: task mount:526 blocked for more than 241 seconds.
[  605.178108]       Not tainted 5.4.0-91-generic #102-Ubuntu
[  605.178596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.179251] mount           D    0   526      1 0x00000004
[  605.179252] Call Trace:
[  605.179257]  __schedule+0x2e3/0x740
[  605.179258]  schedule+0x42/0xb0
[  605.179260]  rwsem_down_write_slowpath+0x244/0x4d0
[  605.179262]  ? _crng_backtrack_protect+0x56/0x70
[  605.179263]  down_write+0x41/0x50
[  605.179265]  grab_super+0x32/0x80
[  605.179267]  ? virtio_fs_restore+0x10/0x10 [virtiofs]
[  605.179268]  sget_fc+0xdf/0x220
[  605.179268]  ? virtio_fs_init_fs_context+0x20/0x20 [virtiofs]
[  605.179269]  virtio_fs_get_tree+0x11e/0x340 [virtiofs]
[  605.179271]  ? security_capable+0x3d/0x60
[  605.179272]  vfs_get_tree+0x2a/0xc0
[  605.179273]  ? capable+0x19/0x20
[  605.179275]  do_mount+0x7b1/0x9c0
[  605.179276]  ksys_mount+0x82/0xd0
[  605.179276]  __x64_sys_mount+0x25/0x30
[  605.179278]  do_syscall_64+0x57/0x190
[  605.179279]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  605.179280] RIP: 0033:0x7f16cd762dde
[  605.179284] Code: Bad RIP value.
[  605.179285] RSP: 002b:00007ffc9cdeaef8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  605.179285] RAX: ffffffffffffffda RBX: 00007f16cd891204 RCX: 00007f16cd762dde
[  605.179286] RDX: 0000560eea1a0c40 RSI: 0000560eea1a0c80 RDI: 0000560eea1a0c60
[  605.179286] RBP: 0000560eea1a0a30 R08: 0000000000000000 R09: 00007ffc9cde9c70
[  605.179286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  605.179286] R13: 0000560eea1a0c60 R14: 0000560eea1a0c40 R15: 0000560eea1a0a30
[  605.179288] INFO: task mount:533 blocked for more than 120 seconds.
[  605.179797]       Not tainted 5.4.0-91-generic #102-Ubuntu
[  605.180213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.180859] mount           D    0   533      1 0x00000004
[  605.180859] Call Trace:
[  605.180861]  __schedule+0x2e3/0x740
[  605.180862]  schedule+0x42/0xb0
[  605.180863]  rwsem_down_write_slowpath+0x244/0x4d0
[  605.180864]  ? _crng_backtrack_protect+0x56/0x70
[  605.180865]  down_write+0x41/0x50
[  605.180866]  grab_super+0x32/0x80
[  605.180866]  ? virtio_fs_restore+0x10/0x10 [virtiofs]
[  605.180867]  sget_fc+0xdf/0x220
[  605.180868]  ? virtio_fs_init_fs_context+0x20/0x20 [virtiofs]
[  605.180868]  virtio_fs_get_tree+0x11e/0x340 [virtiofs]
[  605.180869]  ? security_capable+0x3d/0x60
[  605.180870]  vfs_get_tree+0x2a/0xc0
[  605.180870]  ? capable+0x19/0x20
[  605.180871]  do_mount+0x7b1/0x9c0
[  605.180871]  ksys_mount+0x82/0xd0
[  605.180872]  __x64_sys_mount+0x25/0x30
[  605.180873]  do_syscall_64+0x57/0x190
[  605.180874]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  605.180874] RIP: 0033:0x7fa5805f9dde
[  605.180875] Code: Bad RIP value.
[  605.180876] RSP: 002b:00007ffdf72eef78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  605.180876] RAX: ffffffffffffffda RBX: 00007fa580728204 RCX: 00007fa5805f9dde
[  605.180876] RDX: 0000560b3b240c40 RSI: 0000560b3b240c80 RDI: 0000560b3b240c60
[  605.180877] RBP: 0000560b3b240a30 R08: 0000000000000000 R09: 00007ffdf72edcf0
[  605.180877] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  605.180877] R13: 0000560b3b240c60 R14: 0000560b3b240c40 R15: 0000560b3b240a30
[  726.007108] INFO: task mount:526 blocked for more than 362 seconds.
[  726.008013]       Not tainted 5.4.0-91-generic #102-Ubuntu
[  726.008774] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.009525] mount           D    0   526      1 0x00000004
[  726.009527] Call Trace:
[  726.009532]  __schedule+0x2e3/0x740
[  726.009533]  schedule+0x42/0xb0
[  726.009536]  rwsem_down_write_slowpath+0x244/0x4d0
[  726.009538]  ? _crng_backtrack_protect+0x56/0x70
[  726.009539]  down_write+0x41/0x50
[  726.009541]  grab_super+0x32/0x80
[  726.009544]  ? virtio_fs_restore+0x10/0x10 [virtiofs]
[  726.009544]  sget_fc+0xdf/0x220
[  726.009545]  ? virtio_fs_init_fs_context+0x20/0x20 [virtiofs]
[  726.009546]  virtio_fs_get_tree+0x11e/0x340 [virtiofs]
[  726.009548]  ? security_capable+0x3d/0x60
[  726.009549]  vfs_get_tree+0x2a/0xc0
[  726.009550]  ? capable+0x19/0x20
[  726.009552]  do_mount+0x7b1/0x9c0
[  726.009552]  ksys_mount+0x82/0xd0
[  726.009553]  __x64_sys_mount+0x25/0x30
[  726.009555]  do_syscall_64+0x57/0x190
[  726.009556]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  726.009557] RIP: 0033:0x7f16cd762dde
[  726.009561] Code: Bad RIP value.
[  726.009562] RSP: 002b:00007ffc9cdeaef8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  726.009563] RAX: ffffffffffffffda RBX: 00007f16cd891204 RCX: 00007f16cd762dde
[  726.009563] RDX: 0000560eea1a0c40 RSI: 0000560eea1a0c80 RDI: 0000560eea1a0c60
[  726.009563] RBP: 0000560eea1a0a30 R08: 0000000000000000 R09: 00007ffc9cde9c70
[  726.009564] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  726.009564] R13: 0000560eea1a0c60 R14: 0000560eea1a0c40 R15: 0000560eea1a0a30
[  726.009565] INFO: task mount:533 blocked for more than 241 seconds.
[  726.010192]       Not tainted 5.4.0-91-generic #102-Ubuntu
[  726.010846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.011755] mount           D    0   533      1 0x00000004
[  726.011756] Call Trace:
[  726.011759]  __schedule+0x2e3/0x740
[  726.011760]  schedule+0x42/0xb0
[  726.011761]  rwsem_down_write_slowpath+0x244/0x4d0
[  726.011763]  ? _crng_backtrack_protect+0x56/0x70
[  726.011764]  down_write+0x41/0x50
[  726.011765]  grab_super+0x32/0x80
[  726.011767]  ? virtio_fs_restore+0x10/0x10 [virtiofs]
[  726.011768]  sget_fc+0xdf/0x220
[  726.011770]  ? virtio_fs_init_fs_context+0x20/0x20 [virtiofs]
[  726.011771]  virtio_fs_get_tree+0x11e/0x340 [virtiofs]
[  726.011772]  ? security_capable+0x3d/0x60
[  726.011773]  vfs_get_tree+0x2a/0xc0
[  726.011773]  ? capable+0x19/0x20
[  726.011774]  do_mount+0x7b1/0x9c0
[  726.011775]  ksys_mount+0x82/0xd0
[  726.011776]  __x64_sys_mount+0x25/0x30
[  726.011777]  do_syscall_64+0x57/0x190
[  726.011778]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  726.011783] RIP: 0033:0x7fa5805f9dde
[  726.011785] Code: Bad RIP value.
[  726.011786] RSP: 002b:00007ffdf72eef78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  726.011786] RAX: ffffffffffffffda RBX: 00007fa580728204 RCX: 00007fa5805f9dde
[  726.011787] RDX: 0000560b3b240c40 RSI: 0000560b3b240c80 RDI: 0000560b3b240c60
[  726.011787] RBP: 0000560b3b240a30 R08: 0000000000000000 R09: 00007ffdf72edcf0
[  726.011787] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  726.011788] R13: 0000560b3b240c60 R14: 0000560b3b240c40 R15: 0000560b3b240a30

I’m not sure how to interpret this… :frowning_face:

Looks like the CPU is getting stuck while performing memory operations for virtiofs.
virtiofs uses virtio to communicate with the host, which is effectively shared memory between host and guest.

I’m not sure where the problem is; it could be a kernel issue on the host causing problems with virtio on top of hugepages, maybe?
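
One thing that may be worth a look on the host is the QEMU configuration LXD generated for the instance, which shows how the memory backend and NUMA layout were set up (the path below assumes the snap package):

# qemu.conf is regenerated on each start; path assumes the LXD snap.
grep -i -e numa -e memory -e huge /var/snap/lxd/common/lxd/logs/furnace/qemu.conf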

I just checked, and in our production cluster we have VMs running with a working lxd-agent and virtiofs using the exact same guest kernel.

stgraber@dakara:~$ lxc config show s-mtl-cluster:ceph01 --project lxc --expanded
architecture: x86_64
config:
  boot.autostart: "true"
  cluster.evacuate: stop
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20211127_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20211127_07:42"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.cpu: "4"
  limits.memory: 48GiB
  limits.memory.hugepages: "true"
  limits.processes: "500"
  security.idmap.isolated: "true"
  security.protection.delete: "true"
  snapshots.expiry: 1w
  snapshots.pattern: snap-{{ creation_date|date:'2006-01-02_15-04-05' }}
  volatile.base_image: ec2cae5ce5d9da4aab25aa16878844c7789357fa78c9e0939fbc8cca7045fb9f
  volatile.eth0.host_name: tap259b89db
  volatile.eth0.hwaddr: 00:16:3e:2e:51:de
  volatile.last_state.power: RUNNING
  volatile.uuid: a6046532-fb21-4f62-b6d4-7e28412283ed
  volatile.vsock_id: "74"
devices:
  eth0:
    name: eth0
    network: default
    type: nic
  root:
    path: /
    pool: ssd
    size: 30GiB
    type: disk
ephemeral: false
profiles:
- default
- ceph
stateful: false
description: ""
stgraber@dakara:~$ lxc exec s-mtl-cluster:ceph01 bash --project lxc
root@ceph01:~# uname -a
Linux ceph01 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@ceph01:~# exit

The idea that it might be attributable to kernel differences would be concerning. For now, I have swapped out ubuntu/focal for debian/bullseye in an attempt to bring the guest and host kernel versions as close together as possible. I would need to reinstall the OS on the host to make the kernel versions completely identical, so I will reserve that effort for later.
Host:

root@mountain:~# uname -a
Linux mountain 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux

Guest:

root@furnace:~# uname -a
Linux furnace 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08) x86_64 GNU/Linux

But sadly, this made no difference. I still see the same things happen in the Debian Bullseye VM that I saw in the Ubuntu Focal VM. :frowning_face:

Before pursuing kernel differences any further, I felt it was important to revisit exactly which conditions cause the failure. Happily, I learned something new!

It isn’t solely turning on hugepages that prevents the lxd-agent service from coming up. The issue occurs when I enable hugepages at the same time as specifying a CPU set!

This works (hugepages only)

$ lxc init images:debian/bullseye/cloud/amd64 furnace --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating furnace

$ lxc config set furnace limits.memory.hugepages=true
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc start furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ sleep 20

$ lxc exec furnace -- su root
root@furnace:~#

This also works (cpuset only)

$ lxc init images:debian/bullseye/cloud/amd64 furnace --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating furnace

$ lxc config set furnace limits.cpu=5,20
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc start furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ sleep 20

$ lxc exec furnace -- su root
root@furnace:~#

But when putting them together, it does not work!

$ lxc init images:debian/bullseye/cloud/amd64 furnace --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating furnace

$ lxc config set furnace limits.cpu=5,20
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc config set furnace limits.memory.hugepages=true
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc start furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ sleep 20

$ lxc exec furnace -- su root
Error: Failed to connect to lxd-agent

Here is the interesting thing: if I specify only a single core for the cpuset, then it starts to work again!

$ lxc init images:debian/bullseye/cloud/amd64 furnace --vm
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
Creating furnace

$ lxc config set furnace limits.cpu=5
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc config set furnace limits.memory.hugepages=true
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ lxc start furnace
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

$ sleep 20

$ lxc exec furnace -- su root
root@furnace:~#

So if I specify anything other than a single core via limits.cpu while also enabling hugepages, the VM starts but the lxd-agent never comes up. I’ve tried this multiple ways. Maybe I am doing something obviously wrong here?
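
In case it helps with reproducing, here is the whole failing sequence collapsed into a single script:

#!/bin/sh
# Reproducer: multi-core CPU set combined with hugepages.
set -eu
lxc init images:debian/bullseye/cloud/amd64 furnace --vm
lxc config set furnace limits.cpu=5,20
lxc config set furnace limits.memory.hugepages=true
lxc start furnace
sleep 20
lxc exec furnace -- true || echo "lxd-agent did not come up"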

@stgraber Can you retry your prior test, but specify multiple CPUs via limits.cpu?

Thanks in advance!


Awesome! I assume this means you were able to reproduce the issue. I am really looking forward to the LXD release that the fix will be included in.

Yeah, you needed to have the combination of:

  • CPU pinning
  • Split CPU/memory across two or more NUMA nodes
  • Hugepages enabled
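
For anyone else running into this, a quick way to check how the host’s CPUs map onto NUMA nodes (and therefore whether a pinning such as 5,20 spans more than one node) is:

# Per-CPU NUMA node assignment.
lscpu -e=CPU,NODE

# Node and memory layout overview (requires numactl to be installed).
numactl --hardware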