lxc remote add <name> <fqdn> fails with "device or resource busy"

lepeuvedic · December 2, 2022, 5:29pm

My PostgreSQL database backup script suddenly stopped working. It relies on lxc exec kubb:[container] -- ls /var/lib/postgresql/backups to retrieve the list of database dumps, and it runs every night from my ordinary user crontab.

The remote server kubb and the local computer both run Ubuntu 20.04 LTS and LXD 5.8 from the snap. The database runs alone in a system container over Ubuntu 18.04.5 LTS and was deployed by Juju.

I quickly found out that the TLS certificate of the local computer stored on kubb was no longer the right one. This means that an update had somehow changed LXD’s certificate on the local computer. The local computer does not run any containers. The LXD snap is installed only to communicate with the remote LXD.

The LAN runs over WiFi and has complete, working name resolution: both computers have the same configuration in /etc/nsswitch.conf. Docker is installed on the local computer, but not in use. When they need to resolve each other’s name, both computers have to use libnss_resolve.so.2, which contacts the local DNS resolver proxy, which in turn relies on the LAN router to resolve local names from its database of DHCP leases.

The LAN router also resolves Internet DNS and acts as a proxy for all the LAN nodes, distributing its own IP via DHCP as the name resolution service.

All the nodes can be resolved properly as simple names (“kubb”), with the local domain (“kubb.lan”), or using mDNS (“kubb.local”). Using “ping”:

“kubb.lan” resolves to the global IPv6 address,
“kubb.local” resolves to an IPv6 temporary address,
“kubb” resolves to the IPv6 LAN (more permanent) address.

Of course, commands like resolvectl query cannot resolve mDNS .local addresses, because they skip the name service switch stage and jump directly to DNS resolution.
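For anyone who wants to compare the three name forms without ping, here is a minimal sketch that asks the C library resolver directly. It assumes the system Python on Linux, where socket.getaddrinfo calls glibc's getaddrinfo and therefore goes through the name service switch. The helper names resolve and report are made up for this illustration, and the host names are the ones from this post.

```python
import socket

# The three name forms from the post; adjust to your own LAN.
NAMES = ["kubb", "kubb.lan", "kubb.local"]


def resolve(name):
    """Return the sorted addresses getaddrinfo() yields for name, or an error string."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"error: {exc}"
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    return sorted({info[4][0] for info in infos})


def report(names=NAMES):
    """Print what the resolver returns for every name (hypothetical helper)."""
    for name in names:
        print(f"{name}: {resolve(name)}")
```

Calling report() from the same user account as the crontab job shows which addresses, if any, each form of the name resolves to in that environment.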

/etc/nsswitch.conf

passwd: compat systemd
group: compat systemd
shadow: compat

hosts: myhostname mymachines files docker [NOTFOUND=return] mdns_minimal [NOTFOUND=return] resolve
networks: files

protocols: db files
services: db files
ethers: db files
rpc: db files

netgroup: nis

I made backups work again by removing the client certificate from kubb and adding it again.

Even though both nodes have fixed IPv4 addresses in the current router configuration, I don’t like to rely on hardcoded IP addresses in local configuration files. I therefore tried to add the remote back using:

$ lxc remote add kubb kubb.lan --auth-type tls

on the local computer. This command also failed with the message

Error: Get "https://kubb:8443/1.0": lookup kubb: device or resource busy

The message clearly points to a name resolution failure, even though the error code looks strange.

wget runs fine with this URL, as long as the option --no-check-certificate is given, since LXD certificates live in a closed world of self-signed certificates, which are not accepted by default.

I used strace on the lxd command to see what was going on, and I noticed that all the attempts to load “libnss_resolve.so.2” failed with ENOENT, including a couple of locations where the file actually existed. I did not find any trace of a chroot call, which might have changed the meaning of file paths.

If the glibc name resolver cannot load the resolve plugin, it is unable to use DNS to resolve domain names. lxd fails to load all the NSS plugins except “/lib/x86_64-linux-gnu/libnss_files.so.2”, and that one is of no help because the /etc/hosts file is kept empty.
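To see which plugins the hosts line actually asks for, the following sketch parses an nsswitch.conf text and maps each source to the file name glibc looks for (libnss_<source>.so.2). The two directories in LIB_DIRS are an assumption for an x86_64 Ubuntu system, not the full glibc search path, and the function names are invented for this example.

```python
import os

# Assumed search directories; the real glibc search path depends on the system.
LIB_DIRS = ["/lib/x86_64-linux-gnu", "/usr/lib/x86_64-linux-gnu"]


def hosts_sources(nsswitch_text):
    """Return the source names on the 'hosts:' line, skipping [action] items."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line.startswith("hosts:"):
            continue
        sources = []
        in_action = False
        for token in line[len("hosts:"):].split():
            # Action specs such as [NOTFOUND=return] may span several tokens.
            if in_action or token.startswith("["):
                in_action = not token.endswith("]")
                continue
            sources.append(token)
        return sources
    return []


def module_filename(source):
    """File name of the NSS plugin for a given source name."""
    return f"libnss_{source}.so.2"


def find_modules(sources, lib_dirs=LIB_DIRS):
    """Map each source to the first existing plugin path, or None if not found."""
    found = {}
    for source in sources:
        name = module_filename(source)
        paths = [os.path.join(d, name) for d in lib_dirs]
        found[source] = next((p for p in paths if os.path.exists(p)), None)
    return found
```

Feeding the contents of /etc/nsswitch.conf to hosts_sources and the result to find_modules lists the plugin files as seen from the host. A None entry only means the file is not in the assumed directories. Running the same check from inside the environment where lxd runs would show whether the files are visible there.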

Of course, if an IPv4 address in dotted-quad notation or an IPv6 address is given, lxd does not need the resolver at all, and the command succeeds (after validation of the server fingerprint and input of the token).
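The distinction between the two cases can be sketched as follows. This is only an illustration of the idea, not how lxd decides internally; needs_resolver is a hypothetical helper, and the addresses in the usage note come from the reserved documentation ranges.

```python
import ipaddress


def needs_resolver(host):
    """Return False for an IPv4/IPv6 literal, True for anything that must be looked up."""
    candidate = host.strip("[]")  # allow a bracketed IPv6 literal such as [2001:db8::1]
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return True
    return False
```

With this helper, needs_resolver("kubb.lan") is True, while needs_resolver("192.0.2.10") and needs_resolver("[2001:db8::1]") are both False, which matches the behaviour described above: only the name forms depend on the NSS plugins.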

So my question is: why is lxd unable to use the system-level name resolution strategies described in /etc/nsswitch.conf?