Hello!
I have four identical machines running as an LXD cluster. Everything works fine for a day or so, and then lxc is unable to connect to the containers. Restarting the containers resolves the issue for another day or so, until it happens again. While lxc is unable to connect, the containers are still running (I can ssh into them).
Machine configuration (x4):
Dell PowerEdge R6525
AMD EPYC 7282 16-Core Processor
128GB RAM
Ubuntu 22.04 LTS
lxd/lxc (snap) 5.0.1
Sample output:
cmd@cluster01:~$ lxc list
+--------------+---------+---------------------+------+-----------+-----------+-----------+
|     NAME     |  STATE  |        IPV4         | IPV6 |   TYPE    | SNAPSHOTS | LOCATION  |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test  | RUNNING | 240.81.0.157 (eth0) |      | CONTAINER | 0         | cluster01 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test2 | RUNNING | 240.82.0.189 (eth0) |      | CONTAINER | 0         | cluster02 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test3 | RUNNING | 240.83.0.64 (eth0)  |      | CONTAINER | 0         | cluster03 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
| ubuntu-test4 | RUNNING | 240.84.0.229 (eth0) |      | CONTAINER | 0         | cluster04 |
+--------------+---------+---------------------+------+-----------+-----------+-----------+
cmd@cluster01:~$ lxc shell ubuntu-test
Error: Failed to retrieve PID of executing child process
cmd@cluster01:~$ lxc console ubuntu-test
To detach from the console, press: <ctrl>+a q
Error: Error opening config file: "loading config file for the container failed"
Error: write /dev/pts/ptmx: file already closed
cmd@cluster01:~$ ssh 240.81.0.157
Last login: Tue Jan 3 21:58:07 2023 from 240.81.0.1
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
cmd@ubuntu-test:~$ ping -c1 www.google.com
PING www.google.com (142.251.215.228) 56(84) bytes of data.
64 bytes from sea09s35-in-f4.1e100.net (142.251.215.228): icmp_seq=1 ttl=118 time=0.932 ms
--- www.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.932/0.932/0.932/0.000 ms
cmd@ubuntu-test:~$
logout
Connection to 240.81.0.157 closed.
cmd@cluster01:~$ lxc restart ubuntu-test
cmd@cluster01:~$ lxc shell ubuntu-test
root@ubuntu-test:~#
logout
I am not sure how to debug this further and would greatly appreciate any help. Thank you!