Lxc unable to connect to running container

Hello!

I have a cluster of 4 identical machines running as an LXD cluster. Everything works fine for a day or so, and then lxc can no longer connect to the containers. Restarting the containers resolves the issue for another day or so, until it happens again. When lxc is unable to connect, the containers are still running (I can ssh into them).

Machine configuration (same on all 4):
Dell PowerEdge R6525
AMD EPYC 7282 16-Core Processor
128GB RAM
Ubuntu 22.04 LTS
lxd/lxc (snap) 5.0.1

Sample output:
cmd@cluster01:~$ lxc list
+--------------+---------+---------------------+------+-----------+-----------+-----------+
|     NAME     |  STATE  |        IPV4         | IPV6 |   TYPE    | SNAPSHOTS | LOCATION  |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test  | RUNNING | 240.81.0.157 (eth0) |      | CONTAINER | 0         | cluster01 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test2 | RUNNING | 240.82.0.189 (eth0) |      | CONTAINER | 0         | cluster02 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test3 | RUNNING | 240.83.0.64 (eth0)  |      | CONTAINER | 0         | cluster03 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test4 | RUNNING | 240.84.0.229 (eth0) |      | CONTAINER | 0         | cluster04 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
cmd@cluster01:~$ lxc shell ubuntu-test
Error: Failed to retrieve PID of executing child process
cmd@cluster01:~$ lxc console ubuntu-test
To detach from the console, press: <ctrl>+a q
Error: Error opening config file: "loading config file for the container failed"
Error: write /dev/pts/ptmx: file already closed
cmd@cluster01:~$ ssh 240.81.0.157
Last login: Tue Jan 3 21:58:07 2023 from 240.81.0.1
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

cmd@ubuntu-test:~$ ping -c1 www.google.com
PING www.google.com (142.251.215.228) 56(84) bytes of data.
64 bytes from sea09s35-in-f4.1e100.net (142.251.215.228): icmp_seq=1 ttl=118 time=0.932 ms

--- www.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.932/0.932/0.932/0.000 ms
cmd@ubuntu-test:~$
logout
Connection to 240.81.0.157 closed.
cmd@cluster01:~$ lxc restart ubuntu-test
cmd@cluster01:~$ lxc shell ubuntu-test
root@ubuntu-test:~#
logout

I am not sure how to debug this further and would greatly appreciate any help. Thank you!

It could be that something is clearing up /tmp, see https://github.com/lxc/lxd/issues/10771#issuecomment-1212183389

It looks like this was indeed the case. Thanks!

Solution: At the top of /usr/lib/tmpfiles.d/snapd.conf I added: x /tmp/snap-private-tmp/snap.lxd

This excludes the snap.lxd subdirectory from being "cleaned". That cleanup is what breaks lxc's ability to connect to containers.
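
For anyone who wants to script the change above, here is a minimal sketch. The function name add_exclusion is hypothetical (it is not part of lxd or snapd); it simply puts the exclusion line at the top of the given file unless an identical line is already there. Note that /usr/lib/tmpfiles.d/snapd.conf belongs to the snapd package, so the edit may be overwritten when snapd is upgraded and might need to be reapplied.

```shell
# Sketch only: add_exclusion is a hypothetical helper, not an lxd/snapd tool.
# It prepends the exclusion rule from this thread to a tmpfiles.d config file.
add_exclusion() {
  conf="$1"
  rule='x /tmp/snap-private-tmp/snap.lxd'

  # Nothing to do if the exact line is already present (safe to re-run).
  grep -qxF "$rule" "$conf" && return 0

  # Build the new content in a temp file, then write it back in place so the
  # original file keeps its owner and permissions.
  tmp="$(mktemp)"
  { printf '%s\n' "$rule"; cat "$conf"; } > "$tmp" && cat "$tmp" > "$conf"
  rm -f "$tmp"
}
```

Usage, from a root shell where the function has been defined: add_exclusion /usr/lib/tmpfiles.d/snapd.conf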

Would you be able to go to https://forum.snapcraft.io and report the issue with /usr/lib/tmpfiles.d/snapd.conf there? It would be great if the snapd team could modify the default configuration to avoid issues with lxc exec.

Thanks