Container loses connectivity, Host CPU Spikes

gaia · April 28, 2023, 8:38pm

I’ve had this issue across several containers, with different loads, across different hosts. They are all LXD 5.13, mostly on Ubuntu 22.04 and one on Debian 11. They all use the DIR driver, and the host is on LVM.

The container loses network connectivity with the outside, and after a while it loses its IP (presumably when DHCP renewal comes along and it doesn’t work). Operations from the outside (entering the CT or stopping it) also stop working. As soon as I force stop the CT and restart it, everything is fine.

I will keep adding details to this post, but here are a couple of instances where the container became unreachable:

Example one and two.

This issue has been so prevalent, and for so long, I am starting to question whether LXC is reliable enough for my needs.

tomp · May 12, 2023, 12:53pm

What does top show when the CPU spikes, what is causing the load?

It suggests something inside the container is consuming a lot of CPU.

gaia · May 12, 2023, 4:26pm

That’s possible. But why would that cause the CT to lose its IP?

tomp · May 23, 2023, 7:25am

Are you still seeing this issue? It sounds pretty unusual, as I have not heard of this sort of thing before.

Are you able to get a ps auxnf output from the host when it occurs?

Also can you show ip a, ip r and sudo ss -ulpn on the host before and after it occurs?

Finally, if you launch a separate container which doesn’t have an active workload inside it, does this also get affected?
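
For reference, the host-side commands requested above (ps auxnf, ip a, ip r, ss -ulpn) could be wrapped in a small script so the "before" and "after" snapshots are captured the same way each time. This is only a sketch: the function name `snapshot_host`, the `lxd-diag` directory and the output file names are made up for illustration. Run it as root so that `ss` can show the owning processes.

```shell
#!/bin/sh
# Hypothetical helper: snapshot host state into a timestamped directory.
# Usage: snapshot_host <label> [base-dir]   (label is e.g. "before" or "after")
snapshot_host() {
    label="$1"
    outdir="${2:-./lxd-diag}/$(date +%Y%m%dT%H%M%S)-$label"
    mkdir -p "$outdir" || return 1

    # Each command writes stdout and stderr to its own file; a failing
    # command does not stop the others from being captured.
    ps auxnf > "$outdir/ps.txt"   2>&1 || true   # process tree on the host
    ip a     > "$outdir/ip-a.txt" 2>&1 || true   # interfaces and addresses
    ip r     > "$outdir/ip-r.txt" 2>&1 || true   # routing table
    ss -ulpn > "$outdir/ss.txt"   2>&1 || true   # listening UDP sockets

    # Print the directory so the caller knows where the snapshot went.
    echo "$outdir"
}
```

Calling `snapshot_host before` while things are healthy and `snapshot_host after` once the container hangs gives two directories that can be compared with `diff -r`.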

gaia · May 23, 2023, 3:01pm

yes, it’s constant across multiple hosts and guests. containers with lower loads do not suffer from this.

i will provide the diag next time it happens. are the last 3 cmds supposed to run from the host or guest?

thanks thomas.

tomp · May 23, 2023, 3:07pm

Host please.

gaia · May 28, 2023, 4:20pm

to protect sensitive information i’m sending you the logs via a DM.

the ct which hung (ct104-mon) does NOT have a heavy workload. it’s steady and minor, but it is active. the ct with the high workload is ct103. pbs is idle and the other two are medium workload.

tomp · June 1, 2023, 9:45am

When it happens next, please can you capture what is causing the CPU load spikes, which process?
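
One possible way to capture that on the host is sketched below. The function name `top_cpu` is hypothetical, and the `--sort` option assumes the procps version of `ps` that Ubuntu and Debian ship. Note that `ps` reports CPU usage averaged over each process's lifetime, so a short spike may be better seen in top itself.

```shell
#!/bin/sh
# Hypothetical helper: list the N processes using the most CPU on the host.
# Usage: top_cpu [N]   (defaults to 10)
top_cpu() {
    n="${1:-10}"
    # Sort by CPU usage, highest first; the +1 keeps the header line
    # in addition to the N process rows.
    ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n "$((n + 1))"
}
```

For a process of interest, `cat /proc/<PID>/cgroup` on the host shows which cgroup it belongs to, which should help tell whether it is running inside a container.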

gaia · June 1, 2023, 2:22pm

I can tell you the name of the process. did you get the logs I PM’ed you?

tomp · June 2, 2023, 8:30am

Yes, I couldn’t see anything useful in them I’m afraid.

gaia · June 13, 2023, 6:15pm

Hello Thomas

Sent you a new set of log files today. The difference between this one and the previous ones was that this time I caught it soon after the CT hung and before it lost its IP. There might be something helpful there.

Pls lmk if there are other logs I should collect or cmds I should run.

gaia · September 26, 2023, 3:11pm

sent you a new set of info. it happened today again, this time I caught it before it lost its IP