SSH stopped accepting packets inside LXC container

Hi,

Our host runs about 100 LXC containers without problems. However, we have run into a strange phenomenon: the sshd process inside some of the containers suddenly stopped accepting SSH connections. Connection attempts to these containers time out, and even restarting the container does not help.

Yet on the same host we can SSH into other containers without problems. Sometimes even HTTP on port 80 still works for the affected containers, while SSH does not.
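
To make the "HTTP works, SSH does not" observation reproducible, a plain TCP connect test per port can help. The following is only a minimal Python sketch: the function name `probe` and the 3-second timeout are illustrative choices, not part of any existing tooling.

```python
import socket

def probe(host, port, timeout=3.0):
    """Try a plain TCP connect and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No SYN-ACK arrived in time: this matches the SSH symptom described above.
        return "timeout"
    except ConnectionRefusedError:
        # The kernel answered with RST: nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and similar errors.
        return f"error: {exc}"

# Example (hypothetical usage against the affected container):
#   for port in (22, 80):
#       print(port, probe("mao", port))
```

A "timeout" on port 22 combined with "open" on port 80 would confirm that the container is reachable at the IP level and that only the SSH port goes unanswered.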

With full debug output enabled for sshd on one of the affected instances (“mao”), the log reads:

root@mao:~# journalctl -f -u ssh
Oct 08 12:35:47 mao sshd[1180]: debug3: oom_adjust_setup
Oct 08 12:35:47 mao sshd[1180]: debug1: Set /proc/self/oom_score_adj from 0 to -1000
Oct 08 12:35:47 mao sshd[1180]: debug2: fd 3 setting O_NONBLOCK
Oct 08 12:35:47 mao sshd[1180]: debug1: Bind to port 22 on 0.0.0.0.
Oct 08 12:35:47 mao sshd[1180]: Server listening on 0.0.0.0 port 22.
Oct 08 12:35:47 mao sshd[1180]: debug2: fd 4 setting O_NONBLOCK
Oct 08 12:35:47 mao sshd[1180]: debug3: sock_set_v6only: set socket 4 IPV6_V6ONLY
Oct 08 12:35:47 mao sshd[1180]: debug1: Bind to port 22 on ::.
Oct 08 12:35:47 mao sshd[1180]: Server listening on :: port 22.
Oct 08 12:35:47 mao systemd[1]: Started OpenBSD Secure Shell server.

That is all. When I try to connect to this machine, the sshd process does not receive anything, and no further lines are appended to the log above.

However, the kernel does get the packets:

root@mao:~# tcpdump -i any -v "port 22"
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:51:09.936091 eth0  In  IP (tos 0x48, ttl 61, id 47494, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.150.126.37374 > mao.ssh: Flags [S], cksum 0x1fda (correct), seq 3569343040, win 64240, options [mss 1289,sackOK,TS val 2667330399 ecr 0,nop,wscale 7], length 0
12:51:10.961012 eth0  In  IP (tos 0x48, ttl 61, id 47495, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.150.126.37374 > mao.ssh: Flags [S], cksum 0x1bd8 (correct), seq 3569343040, win 64240, options [mss 1289,sackOK,TS val 2667331425 ecr 0,nop,wscale 7], length 0
12:51:11.973967 eth0  In  IP (tos 0x48, ttl 61, id 47496, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.150.126.37374 > mao.ssh: Flags [S], cksum 0x17e3 (correct), seq 3569343040, win 64240, options [mss 1289,sackOK,TS val 2667332438 ecr 0,nop,wscale 7], length 0
12:51:12.987073 eth0  In  IP (tos 0x48, ttl 61, id 47497, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.150.126.37374 > mao.ssh: Flags [S], cksum 0x13ed (correct), seq 3569343040, win 64240, options [mss 1289,sackOK,TS val 2667333452 ecr 0,nop,wscale 7], length 0

Yet these packets never reach the sshd process: the capture shows only repeated inbound SYNs with the same sequence number and no SYN-ACK reply. It seems that the kernel just drops the packets or moves them elsewhere.
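
One way to test the "kernel drops the packets" suspicion is to look at the `ListenOverflows` and `ListenDrops` counters in the `TcpExt:` lines of `/proc/net/netstat` inside the container. The sketch below is a minimal Python example under that assumption; the helper names `parse_tcpext` and `listen_drop_counters` are made up for illustration. A full accept queue is only one possible cause, so counters that stay flat would point elsewhere (for example, to firewall rules).

```python
def parse_tcpext(text):
    """Turn the 'TcpExt:' header/value line pair of /proc/net/netstat into a dict."""
    rows = [line.split() for line in text.splitlines() if line.startswith("TcpExt:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}

def listen_drop_counters(text):
    """Return the two counters that grow when SYNs to a listening socket are dropped."""
    stats = parse_tcpext(text)
    return {key: stats.get(key) for key in ("ListenOverflows", "ListenDrops")}

# Example (run inside the affected container, before and after a failed connect):
#   with open("/proc/net/netstat") as fh:
#       print(listen_drop_counters(fh.read()))
```

If both values increase with every failed connection attempt, the SYNs are arriving but the listening socket's queue is not being drained.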

For comparison, this is the output on a second, working container (“mao2”) running on the same LXC host:

root@mao2:~# tcpdump -i any "port 22"
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

12:39:02.364639 eth0  In  IP 172.18.150.126.58068 > mao2.ssh: Flags [S], seq 1149888520, win 64240, options [mss 1289,sackOK,TS val 3814277121 ecr 0,nop,wscale 7], length 0
12:39:02.364694 eth0  Out IP mao2.ssh > 172.18.150.126.58068: Flags [S.], seq 363056269, ack 1149888521, win 65160, options [mss 1460,sackOK,TS val 696852693 ecr 3814277121,nop,wscale 7], length 0
12:39:02.383668 eth0  In  IP 172.18.150.126.58068 > mao2.ssh: Flags [.], ack 1, win 502, options [nop,nop,TS val 3814277140 ecr 696852693], length 0
12:39:02.383668 eth0  In  IP 172.18.150.126.58068 > mao2.ssh: Flags [P.], seq 1:22, ack 1, win 502, options [nop,nop,TS val 3814277140 ecr 696852693], length 21: SSH: SSH-2.0-OpenSSH_9.5
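
The difference between the two captures can be summarised mechanically: on mao there are only `[S]` segments, while on mao2 the `[S]` is answered with an `[S.]`. The Python sketch below counts both in pasted tcpdump output; the function name `count_handshake_flags` is hypothetical, and the regex only assumes the `Flags [...]` notation seen in the captures above.

```python
import re

# Matches the flag field that tcpdump prints for every TCP segment.
FLAGS = re.compile(r"Flags \[([^\]]+)\]")

def count_handshake_flags(capture):
    """Count SYN and SYN-ACK segments in tcpdump text output."""
    counts = {"syn": 0, "syn_ack": 0}
    for flags in FLAGS.findall(capture):
        if flags == "S":
            counts["syn"] += 1
        elif flags == "S.":
            counts["syn_ack"] += 1
    return counts
```

A capture with SYNs but zero SYN-ACKs means the handshake never gets past its first step, so sshd cannot see the connection at all.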

Both containers (“mao” and “mao2”) are copies of the same Ubuntu 20.04 LTS instance, so they are set up identically.

Even stranger: the “mao” instance used to work fine, and SSH was working flawlessly, but then it suddenly stopped. As I recall, I had made an sshfs connection from my desktop to mao:/, and when I started browsing the files, the system collapsed and never recovered.

I suspect this is some sort of LXC issue, but I cannot say that for sure.