Xorg hitting segmentation fault when starting a cluster

Noticed this while using the Cluster API provider, but I can reproduce without CAPN. Asking here instead of opening a ticket because I suspect this is somehow Nvidia drivers giving me yet another headache, but I’ll take any advice or bring this to any other platform if suggested.

When a new machine with a certain combination of settings starts, my X window session crashes and I need to log in again. My machine(s) are still running in the background the whole time.

Here’s the minimal machine config I’ve been able to come up with. To reproduce, just run:

incus launch images:debian/12 u1 < config.yaml

Where config.yaml is:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian bookworm amd64 (20260225_05:24)
  image.os: Debian
  image.release: bookworm
  image.serial: "20260225_05:24"
  image.type: squashfs
  image.variant: default
  raw.lxc: |
    lxc.apparmor.profile=unconfined
    lxc.mount.auto=proc:rw sys:rw cgroup:rw
  security.privileged: "true"
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""

The key part is what’s in config.raw.lxc. Those 2 values in conjunction (lxc.apparmor.profile=unconfined and lxc.mount.auto=proc:rw sys:rw cgroup:rw), and not either one by itself, are what make the difference for me. Other base images seem to hit this too.

The logs that seem closest to the issue that I’ve been able to gather so far are these from /var/log/Xorg.0.log (adding such a large block of logs because these lines seem to be different from the routine logs when compared to other sessions, sorry if they’re noise):

[ 17540.104] (**) Option "config_info" "udev:/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0E:00/input/input0/event0"
[ 17540.104] (II) XINPUT: Adding extended input device "Sleep Button" (type: KEYBOARD, id 11)
[ 17540.104] (**) Option "xkb_model" "pc105"
[ 17540.104] (**) Option "xkb_layout" "es"
[ 17540.104] (WW) Option "xkb_variant" requires a string value
[ 17540.104] (WW) Option "xkb_options" requires a string value
[ 17540.106] (II) event0  - Sleep Button: is tagged by udev as: Keyboard
[ 17540.106] (II) event0  - Sleep Button: device is a keyboard
[ 17540.109] (II) config/udev: removing GPU device /sys/devices/pci0000:00/0000:00:02.0/drm/card1 /dev/dri/card1
[ 17540.109] xf86: remove device 1 /sys/devices/pci0000:00/0000:00:02.0/drm/card1
[ 17540.109] failed to find screen to remove
[ 17540.110] (II) config/udev: removing device Power Button
[ 17540.110] (II) event1  - Power Button: device removed
[ 17540.120] (II) UnloadModule: "libinput"
[ 17540.121] (II) config/udev: Adding input device Power Button (/dev/input/event1)
[ 17540.121] (**) Power Button: Applying InputClass "libinput keyboard catchall"
[ 17540.121] (II) Using input driver 'libinput' for 'Power Button'
[ 17540.121] (**) Power Button: always reports core events
[ 17540.121] (**) Option "Device" "/dev/input/event1"
[ 17540.122] (II) event1  - Power Button: is tagged by udev as: Keyboard
[ 17540.122] (II) event1  - Power Button: device is a keyboard
[ 17540.122] (II) event1  - Power Button: device removed
[ 17540.130] (**) Option "config_info" "udev:/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input1/event1"
[ 17540.130] (II) XINPUT: Adding extended input device "Power Button" (type: KEYBOARD, id 10)
[ 17540.130] (**) Option "xkb_model" "pc105"
[ 17540.130] (**) Option "xkb_layout" "es"
[ 17540.130] (WW) Option "xkb_variant" requires a string value
[ 17540.130] (WW) Option "xkb_options" requires a string value
[ 17540.132] (II) event1  - Power Button: is tagged by udev as: Keyboard
[ 17540.132] (II) event1  - Power Button: device is a keyboard
[ 17540.135] (II) config/udev: Adding input device Lid Switch (/dev/input/event2)
[ 17540.135] (II) No input driver specified, ignoring this device.
[ 17540.135] (II) This device may have been added with another device file.
[ 17540.135] (II) config/udev: removing GPU device /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/card2 /dev/dri/card2
[ 17540.135] xf86: remove device 0 /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/card2
[ 17540.143] (II) UnloadModule: "nvidia"
[ 17540.143] (II) UnloadSubModule: "glxserver_nvidia"
[ 17540.143] (II) Unloading glxserver_nvidia
[ 17540.143] (II) UnloadSubModule: "wfb"
[ 17540.144] (EE) 
[ 17540.144] (EE) Backtrace:
[ 17540.148] (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x14c) [0x6501baee10cc]
[ 17540.149] (EE) 1: /lib/x86_64-linux-gnu/libc.so.6 (__sigaction+0x50) [0x724bd3045330]
[ 17540.149] (EE) 2: /usr/lib/xorg/Xorg (RRTellChanged+0x2a2) [0x6501bae22f12]
[ 17540.149] (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0x36b) [0x6501badc7adb]
[ 17540.149] (EE) 4: /usr/lib/xorg/Xorg (config_fini+0x9ce) [0x6501badc35be]
[ 17540.149] (EE) 5: /usr/lib/xorg/Xorg (config_fini+0x1648) [0x6501badc4238]
[ 17540.150] (EE) 6: /usr/lib/xorg/Xorg (OsCleanup+0x64a) [0x6501baee1b9a]
[ 17540.150] (EE) 7: /usr/lib/xorg/Xorg (WaitForSomething+0x185) [0x6501baeda4f5]
[ 17540.150] (EE) 8: /usr/lib/xorg/Xorg (SendErrorToClient+0x11a) [0x6501bad6186a]
[ 17540.150] (EE) 9: /usr/lib/xorg/Xorg (InitFonts+0x3ca) [0x6501bad65eda]
[ 17540.152] (EE) 10: /lib/x86_64-linux-gnu/libc.so.6 (__libc_init_first+0x8a) [0x724bd302a1ca]
[ 17540.153] (EE) 11: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0x8b) [0x724bd302a28b]
[ 17540.153] (EE) 12: /usr/lib/xorg/Xorg (_start+0x25) [0x6501bad4e395]
[ 17540.153] (EE) 
[ 17540.153] (EE) Segmentation fault at address 0x34
[ 17540.153] (EE) 
Fatal server error:
[ 17540.153] (EE) Caught signal 11 (Segmentation fault). Server aborting
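For anyone comparing against their own logs, here is a minimal sketch of how the relevant lines can be pulled out of a longer Xorg log. The function name xorg_crash_summary is made up for illustration, and the pattern only covers the kinds of lines seen in the excerpt above (GPU removal events, module unloads, and (EE) errors), so it may miss other relevant entries.

```shell
# Print only the lines of an Xorg log that relate to the crash:
# GPU hot-unplug events, module unloads, and (EE) error lines.
# Usage: xorg_crash_summary [path-to-log]
# (xorg_crash_summary is a hypothetical helper name, not an existing tool.)
xorg_crash_summary() {
    grep -E 'removing GPU device|UnloadModule|\(EE\)' "${1:-/var/log/Xorg.0.log}"
}
```

Called with no argument it reads /var/log/Xorg.0.log; pass another path to check a different log file.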

Anything I could do to try to stop this from happening?

Those options together remove every last bit of container security. Effectively anything running in the container can do whatever it wants as real root on your system.

I don’t know exactly what the container does on startup to mess with your X server, but it could be just about anything.
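One way to narrow it down, as a sketch rather than a confirmed diagnosis: the Xorg log above shows udev reporting GPU devices being removed right before the crash, so watching drm events on the host while the container starts may show what X is reacting to. The function name watch_drm_events and the 30 second default are just illustrative choices.

```shell
# Watch kernel and udev events for the drm subsystem on the host for a
# limited time (default 30 seconds). Start the container from another
# terminal while this is running.
# (watch_drm_events is a hypothetical helper name.)
watch_drm_events() {
    timeout "${1:-30}" udevadm monitor --kernel --udev --subsystem-match=drm
}
```

If remove/add events for the cards under /dev/dri show up at the moment the container boots, that would at least be consistent with the hot-unplug lines in the Xorg log.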

I took those values from the incus config of the containers created with the Cluster API provider. I take your point and will give this a go with VMs and with unprivileged containers to see what works best for me. But as this comes from the cluster templates in the docs, I might not be the only one to hit it.
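For reference, a rough sketch of the two alternatives I mean, using the same image as above. The instance names u1-unpriv and u1-vm and the function names are placeholders I picked; the first launch simply leaves out security.privileged and raw.lxc, and the second uses the --vm flag of incus launch.

```shell
# Unprivileged container: default profile only, without security.privileged
# and without any raw.lxc overrides.
# (Function and instance names here are hypothetical placeholders.)
launch_unprivileged() {
    incus launch images:debian/12 "${1:-u1-unpriv}"
}

# Virtual machine: the guest runs its own kernel instead of sharing the
# host's, so it does not get the host's /sys mounted read-write.
launch_vm() {
    incus launch images:debian/12 "${1:-u1-vm}" --vm
}
```

Whether either of these is enough for the Kubernetes workloads the cluster templates target is something I still have to check.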

Yeah, it’s not really new. We’ve had similar reports going back 7-8 years of such container configurations being used for Kubernetes, with effectively the exact same symptoms when run on a laptop :)