It takes minutes to ssh to a new Ubuntu 20.04 container on an Ubuntu 18.04 host

From an Ubuntu 18.04 host with the latest LXD snap (4.0.1), I launch a container from the image ubuntu:20.04, set up a user with ~/.ssh/authorized_keys, and ssh to this user:
ssh connects and then seems to hang. Once it succeeds, I can ssh again with no delay.
If I then make a snapshot of this container, copy the snapshot to a new container, and start it, I have the same problem: It takes too long to ssh the first time.

Assuming you can ping the container OK, this sounds like it could be a DNS issue, either forward (i.e. your client taking time to resolve the hostname of your container, if you’re using a hostname in the ssh command) or reverse (your container’s sshd trying to perform a reverse DNS lookup on your client’s IP).

Please can you check whether a dig -x <client ip> resolves quickly inside your container?
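For example, something along these lines; the 10.0.0.1 address and mycontainer name are just placeholders, substitute your own:

# Inside the container: reverse lookup of the ssh client's IP
time dig -x 10.0.0.1

# On the client: forward lookup, if you ssh to the container by hostname
time dig mycontainer.lxd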

dig -x resolves quickly in the container, shortly after the container starts.
I tried ssh'ing to the container from another container on the same host, using either its .lxd name or its internal IPv4 address.

I’ve experienced slow ssh connections before due to hostname resolution problems, but those lasted about 10 seconds and happened every time. This feels different. It is as if the container is doing something that takes a long time when it is first created, and ssh is put on hold until it is done.

There is no such delay with ssh to an Ubuntu 20.04 container on an Ubuntu 20.04 host.

There were some previous threads in this forum about how to set up the host so that you can use the .lxd hostnames from the host. Have you tried any of those on this Ubuntu 18.04 host?
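For reference, the usual recipe on an Ubuntu 18.04 host with systemd-resolved looks roughly like the sketch below; it assumes the bridge is called lxdbr0 and does not persist across reboots:

# Point the ~lxd domain at the dnsmasq instance listening on the LXD bridge
BRIDGE_IP=$(lxc network get lxdbr0 ipv4.address | cut -d/ -f1)
systemd-resolve --interface lxdbr0 --set-domain '~lxd' --set-dns "$BRIDGE_IP"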

I tried it with a brand new Ubuntu 18.04 host and I could not reproduce it, so I need to do more testing to find out why my particular host has this problem.

The problem is not the .lxd hostname, because it also happens when using the IP address.

The openssh-server in the container tries to resolve the IP address of the client.
The container has been configured to resolve DNS queries at the host, and the DNS setup at the host can cause the container to take time to resolve queries. If you use the stock network (DNS-related) configuration at the host, then you should be OK. But have you made any changes there?
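One way to check is to look at which resolver the container actually uses and to time a query against it directly; 10.0.0.1 here stands in for the lxdbr0 address on the host:

# Inside the container: which DNS server is configured?
cat /etc/resolv.conf
systemd-resolve --status

# Time a query against the host's dnsmasq directly
time dig @10.0.0.1 ubuntu.com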

The whole setup is self-contained, so you can probably use tshark to identify whether the issue is restricted to the container or whether it is the host that is taking so long.
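A minimal capture on the host, again assuming the bridge is called lxdbr0, could look like this:

# Watch DNS traffic on the LXD bridge while the slow ssh is in progress
tshark -i lxdbr0 -f "udp port 53"

# Or capture everything for later inspection
tshark -i lxdbr0 -w /tmp/lxdbr0.pcap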

This was fixed by rebooting the host. It was not specific to 20.04 containers.

This is still a problem. The first time I ssh to a new ubuntu:18.04 container on an Ubuntu 18.04 host with the latest LXD snap, it takes about 20 minutes for the ssh to complete. Subsequent connections are normal. I timed an scp today: in one case it took 16 minutes, in another it took 24 minutes. This was from container to container. I am now waiting for this simple command to complete from the host to a new container: “time scp ubuntu@10.0.0.39:/etc/hosts .” It’s been several minutes already.

I assume it will be temporarily fixed by rebooting, but it comes back.

Containers created from images:alpine/3.11 work fine.

I have a standard network configuration + common iptables setup (shorewall) on the host.

From ubuntu 18.04 host to new ubuntu:18.04 container:

time scp ubuntu@10.0.0.39:/etc/hosts .
hosts                                         100%  221   501.1KB/s   00:00    

real	17m4.629s
user	0m0.027s
sys	0m0.001s

Using shorewall on Ubuntu is not so common.
Anyway, in the spirit of ‘teach a man to fish’ (and he will bug you forever asking for advice about the best hooks), strace is your friend. If strace leads you to gnome-keyring-daemon, it may even be the same problem I struggled with at one time.
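Something like the following inside the container should do; the output file name is arbitrary:

# Attach to the sshd listener and follow the forked session processes;
# -tt adds timestamps so the long pauses stand out
strace -f -tt -p "$(pidof -s sshd)" -o /tmp/sshd.strace

# Then make the slow ssh connection and look at the calls just before the gaps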

I assume you suggest using strace on sshd, because ssh or scp just get stuck on read().
I can’t run sshd manually on the Ubuntu container. This is what I get:
service ssh stop
/usr/sbin/sshd -d
debug1: sshd version OpenSSH_7.6, OpenSSL 1.0.2n 7 Dec 2017
debug1: private host key #0: ssh-rsa …
debug1: private host key #1: ecdsa-sha2-nistp256 …
debug1: private host key #2: ssh-ed25519 …
Missing privilege separation directory: /run/sshd

Exactly - the trick is to look at the lines just before the hang to find out what it’s trying to read.
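As an aside, the “Missing privilege separation directory” error above only means /run/sshd is missing; recreating it by hand should let the debug run work:

mkdir -p /run/sshd
/usr/sbin/sshd -d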

The problem seems to be this:

/etc/update-motd.d/50-landscape-sysinfo

It runs in the container at boot, and seems to get stuck for a long time.

If I remove this file, the container boots normally and I can ssh into it immediately.
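A less destructive variant, if you would rather not delete the file, is to clear its execute bit so run-parts skips it:

chmod -x /etc/update-motd.d/50-landscape-sysinfo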

Then it’s indeed a different issue than the one I had, since it’s the sshd service in the container that is blocked by some dependency. In this case lxc console --show-log could be informative. For the record, I have never seen this problem, since landscape-common is among the dozen packages that I purge immediately after creating a new Ubuntu container.
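For example (the container name is a placeholder):

# Boot log of the container, handy when a service hangs during startup
lxc console mycontainer --show-log

# Purging landscape-common removes 50-landscape-sysinfo along with it
apt-get purge -y landscape-common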

Could you share the list of packages that you purge? On 18.04, I remove lxd, lxd-client. I use the ubuntu: images, because I can use cloud-init to prevent the “ubuntu” user from being created.
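For what it’s worth, a minimal sketch of that cloud-init approach, with placeholder user name and key, could look like this; because the users list does not include “default”, the stock “ubuntu” user is never created:

cat > user-data.yml <<'EOF'
#cloud-config
users:
  - name: myuser
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... myuser@example
EOF

lxc launch ubuntu:20.04 c1 --config user.user-data="$(cat user-data.yml)"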

(Sorry for the delay, I was busy elsewhere and missed your post.)
Why not? But I don’t think it’s that interesting: it all depends on what is needed afterwards, and some of it could be installed again as dependencies when installing the packages you need.

lxd info libuv1 hdparm libbinutils telnet snapd xfsprogs ftp git git-man usbutils xdg-user-dirs ntfs-3g libntfs-3g88 lxd-client nano open-iscsi xauth gnupg krb5-locales popularity-contest plymouth snapd libx11-6 landscape-common ufw libplymouth4 libpng16-16 plymouth-theme-ubuntu-text libx11-data libxext6 irqbalance geoip-database libjpeg-dev gdisk mdadm curl dmidecode libnuma1 libxcb1 friendly-recovery libdrm-common

I think that with Ubuntu 20.04 two packages have to be removed from the list, since they no longer exist. Anyway, an Ubuntu 20.04 container is slimmer than an Ubuntu 18.04 one, so removing packages there is perhaps less useful.

The issue discussed here was tracked down to an lxcfs bug we fixed this morning.
It was causing reads from /proc/uptime, among other things, to be significantly slower.
Combined with landscape-sysinfo reading that file for every single process, the delays quickly added up.

The fix should hit stable over the next hour or so.
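For anyone hitting this, a rough way to check whether a container is affected is to time a batch of reads of /proc/uptime from inside it; with a buggy lxcfs this should be noticeably slower than the same loop on the host:

# Inside the container: each read of /proc/uptime goes through lxcfs
time (for i in $(seq 1 100); do cat /proc/uptime > /dev/null; done)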