Cluster upgraded automatically to 4.4 a few minutes ago, and now all my containers have no IPs

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:30:48:cf:7c:4c brd ff:ff:ff:ff:ff:ff
inet 84.17.40.19/26 brd 84.17.40.63 scope global noprefixroute enp1s0
valid_lft forever preferred_lft forever
inet6 fe80::85e2:7c79:defd:ade1/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 00:30:48:cf:7c:4d brd ff:ff:ff:ff:ff:ff
4: lxdfan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 02:f0:ac:00:7b:b8 brd ff:ff:ff:ff:ff:ff
inet 240.19.0.1/8 scope global lxdfan0
valid_lft forever preferred_lft forever
inet6 fe80::d0e9:96ff:fe67:4cba/64 scope link
valid_lft forever preferred_lft forever
16: vethf27985a3@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UP group default qlen 1000
link/ether 8e:f4:b9:f7:fd:0c brd ff:ff:ff:ff:ff:ff link-netnsid 1
17: lxdfan0-mtu: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UNKNOWN group default qlen 1000
link/ether ce:cc:5c:de:d7:94 brd ff:ff:ff:ff:ff:ff
inet6 fe80::cccc:5cff:fede:d794/64 scope link
valid_lft forever preferred_lft forever
18: lxdfan0-fan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UNKNOWN group default qlen 1000
link/ether 02:f0:ac:00:7b:b8 brd ff:ff:ff:ff:ff:ff
inet6 fe80::f0:acff:fe00:7bb8/64 scope link
valid_lft forever preferred_lft forever
20: vethf9c373d8@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UP group default qlen 1000
link/ether 3e:c0:87:3c:98:a2 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Please can you show the output of lxc config show <container> --expanded, where <container> is the name of one of your affected containers.

Then also, inside the affected container, please can you manually add an IP using the command:

ip a add 240.19.0.x/24 dev eth0

Where the .x part is a free IP in your subnet; you can pick any one in the range 240.19.0.2 to 240.19.0.254.
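
For example, assuming 240.19.0.10 is currently unused in your subnet:

ip a add 240.19.0.10/24 dev eth0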

Then inside the container show output of:

ip a
ip r

and

ping 240.19.0.1 -c 5

What I’m trying to ascertain is whether the bridge is functioning correctly, i.e. whether you can ping the gateway, and whether it’s specifically something blocking DHCP requests to dnsmasq.
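
If you want to exercise the DHCP path directly while you’re in the container, you could also trigger a lease request by hand (assuming the container image ships dhclient):

dhclient -v eth0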

Let me give the container a fixed IP… Give me a minute.

Great ok, so we’ve narrowed down the issue to something blocking DHCP packets from reaching dnsmasq (or responses returning). The bridge is working and dnsmasq is running.

So next up, let’s try this on the host:

ss -ulpn | grep dnsmasq
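
It may also help to confirm that the listener is the LXD-managed dnsmasq instance for the fan bridge (a quick check, assuming its command line mentions the network name as usual):

ps -ef | grep 'dnsmasq.*lxdfan0'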

And also, can you run tcpdump -pvnl -i lxdfan0 port 67 and port 68 on the host and then stop/start one of the affected containers? This will show whether DHCP requests are making it from the container to the host.
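
For example, with the tcpdump running in one terminal, restarting a container from another should generate fresh DHCP traffic (same placeholder as before):

lxc restart <container>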

Can you run the full command please:

 tcpdump -pvnl -i lxdfan0 port 67 and port 68

Oh maybe your copy/paste just cut a bit off.

Can you show the output of iptables-save please?

I’d expect a different output than that; have you truncated it at all?
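
On a host where LXD is managing the firewall itself, its rules are normally tagged with a comment, so a quick way to look for them is something like:

iptables-save | grep -i 'generated for LXD'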

Only a few bad IP denies on top.

Turning off UFW seems to solve the problem, which is weird since this has been working forever. It was only affected after 4.4. What ports would dnsmasq be using?

Thanks for your help looking at the problem. It seems to be related to UFW blocking dnsmasq; I will continue to figure out what is wrong. It is interesting that the problem started after the 4.4 upgrade across all servers.

That’s odd indeed as we’ve not changed anything in the firewalling code or related to network ports in 4.4.

It could be a race of some kind though, where the LXD rules were initially added ahead of the UFW rules. Restarting LXD during the upgrade would then have caused its rules to end up after those managed by UFW, causing the issue.
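
If you want to check the ordering on an affected host, listing the INPUT chain with rule numbers will show whether the UFW chains are hit before the LXD DHCP/DNS accept rules (a rough check rather than a definitive diagnosis):

iptables -L INPUT -nv --line-numbers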

I’m glad it’s not an issue in the piece of work we did do this cycle though (securing dnsmasq by using apparmor for it).

Who knows sometimes… Besides port 53, what other ports should be open for dnsmasq or AppArmor?

Can you show the output of lxc info? It would be interesting to see which firewall LXD detected (and used) when it started up, as I would expect to see some firewall rules that LXD adds to explicitly allow DHCP and DNS.

Perhaps they are being added to nftables rather than iptables, or perhaps UFW has replaced them with its own ruleset.
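
A couple of quick checks for that (the exact table and rule names may differ depending on your LXD version):

lxc info | grep firewall
nft list ruleset | grep -i lxd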

You may also get some benefit from some of the approaches to managing LXD snap upgrade times described here: Managing the LXD snap, to avoid LXD being upgraded at times that are not good for you.
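
For example, one of the approaches described there is pinning snap refreshes to a maintenance window so upgrades land at a predictable time (the window below is just an illustration):

snap set system refresh.timer=sat,02:00-04:00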

53 (DNS) on UDP/TCP and 67/68 (DHCP) on UDP
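
If you’d rather open just those through UFW on the fan bridge than disable the firewall entirely, something along these lines should do it (a sketch using the bridge name from your ip a output):

ufw allow in on lxdfan0 to any port 67,68 proto udp
ufw allow in on lxdfan0 to any port 53 proto udp
ufw allow in on lxdfan0 to any port 53 proto tcp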

ufw is just a wrapper for iptables

Looks like 67/68 might be the problem… but it was working before, so maybe the order changed.
Yes, it was definitely caused by the firewall blocking ports 67 and 68… But it worked fine before the update. Hope this helps someone.
Thanks everyone for the help.

Thanks again for your help… it needed ports 67 & 68 added to the firewall. So all of a sudden after the upgrade it needed them.

We have been running into the same problem (cluster with FAN, no IPs) since upgrading and have been frantically searching for a solution; we can confirm that unblocking UDP ports 67 and 68 fixes the issue for us as well. Thank you!

This is quite weird as DHCP has always been on 67/68 UDP, it’s not something that we could even change if we wanted to :slight_smile:

So I’m quite confused as to why things would only get blocked by firewalling now.