I’ve had this issue across several containers, with different loads, across different hosts. They are all LXD 5.13, mostly on Ubuntu 22.04 and one on Debian 11. They all use the DIR driver, and the host is on LVM.
The container loses network connectivity with the outside, and after a while it loses its IP (presumably when the DHCP renewal comes along and fails). Operations from the outside (entering the CT or stopping it) also stop working. As soon as I force stop the CT and restart it, everything is fine.
I will keep adding details to this post, but here are a couple of instances where the container became unreachable:
Example one and two.
This issue has been so prevalent, and for so long, I am starting to question whether LXC is reliable enough for my needs.
What does `top` show when the CPU spikes? What is causing the load?
It suggests something inside the container is consuming a lot of CPU.
That’s possible. But why would that cause the CT to lose its IP?
Are you still seeing this issue? It sounds pretty unusual, as I have not heard of this sort of thing before.
Are you able to get a `ps auxnf` output from the host when it occurs?
Also can you show `ip r` and `sudo ss -ulpn` on the host before and after it occurs?
Finally, if you launch a separate container which doesn’t have an active workload inside it, does this also get affected?
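In case it helps, the three host-side captures requested above could be scripted roughly as follows. This is only a sketch: the function names `capture` and `snapshot` and the `./lxd-diag` output directory are made up for illustration, and it assumes `ps`, `ip` and `ss` are installed on the host and that `sudo` is usable there.

```shell
#!/bin/sh
# Hypothetical helpers: snapshot host state so "before" and "after" can be diffed.

# Run one command and save stdout+stderr to a file.
# A failing command is noted in the file instead of aborting the snapshot.
capture() {
    outfile=$1
    shift
    "$@" >"$outfile" 2>&1 || echo "command failed: $*" >>"$outfile"
}

# Take a labelled snapshot (e.g. "before" or "after") into a directory.
# The default directory name is an assumption, not an LXD convention.
snapshot() {
    label=$1
    dir=${2:-./lxd-diag}
    mkdir -p "$dir"
    capture "$dir/$label-ps.txt" ps auxnf          # full process tree
    capture "$dir/$label-routes.txt" ip r          # host routing table
    capture "$dir/$label-udp.txt" sudo ss -ulpn    # listening UDP sockets
}
```

Run `snapshot before` while everything is healthy and `snapshot after` once a CT has hung, then compare the pairs, for example with `diff -u lxd-diag/before-routes.txt lxd-diag/after-routes.txt`.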
Yes, it’s consistent across multiple hosts and guests. Containers with lower loads do not suffer from this.
I will provide the diagnostics next time it happens. Are the last 3 commands supposed to be run from the host or the guest?
To protect sensitive information, I’m sending you the logs via a DM.
The CT which hung (ct104-mon) does NOT have a heavy workload. Its load is steady and minor, but it is active. The CT with the high workload is ct103. pbs is idle and the other two have a medium workload.
When it happens next, please can you capture which process is causing the CPU load spikes?
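One possible way to grab that from the host is sketched below. The function name `top_cpu` is invented for illustration, and the `--sort` option assumes the procps version of `ps` found on Ubuntu and Debian.

```shell
#!/bin/sh
# Hypothetical helper: list the N busiest processes by CPU, header line included.
# Assumes procps ps (for --sort); N defaults to 10.
top_cpu() {
    n=${1:-10}
    ps -eo pid,ppid,user,pcpu,pmem,comm --sort=-pcpu | head -n "$((n + 1))"
}
```

For example, `top_cpu 5 > cpu-spike.txt` saves the five busiest processes. Note that the %CPU column from `ps` is averaged over each process's lifetime, so a short spike in a long-running process can be understated; a `top -b -n 1` capture taken during the spike may be worth adding alongside it.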
I can tell you the name of the process. Did you get the logs I PM’ed you?
Yes, but I couldn’t see anything useful in them, I’m afraid.
Sent you a new set of log files today. The difference between this set and the previous ones is that this time I caught it soon after the CT hung and before it lost its IP. There might be something helpful there.
Pls lmk if there are other logs I should collect or cmds I should run.