Cluster upgraded automatically to 4.4 a few minutes ago, and now all my containers have no IPs

Tony_Anytime · August 3, 2020, 5:36pm

Let me give the container a fixed IP… Give me a minute.

tomp · August 3, 2020, 5:43pm

Great ok, so we’ve narrowed down the issue to something blocking DHCP packets from reaching dnsmasq (or responses returning). The bridge is working and dnsmasq is running.

So next up, let’s try this on the host:

ss -ulpn | grep dnsmasq

Also, can you run tcpdump -pvnl -i lxdfan0 port 67 and port 68 on the host and then stop/start one of the affected containers? This will show whether DHCP requests are making it from the container to the host.
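
To make the capture easier to read, here is a minimal sketch that counts DHCP requests and replies in saved tcpdump text. The function name dhcp_summary is hypothetical, and it assumes tcpdump labels packets as "BOOTP/DHCP, Request" and "BOOTP/DHCP, Reply" in its verbose output.

```shell
# Reads saved `tcpdump -v` text on stdin and counts DHCP requests and replies.
# Assumption: packets are labelled "BOOTP/DHCP, Request" / "BOOTP/DHCP, Reply".
dhcp_summary() {
    awk '
        /BOOTP\/DHCP, Request/ { req++ }
        /BOOTP\/DHCP, Reply/   { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }'
}
```

For example, save the capture with tcpdump -pvnl -i lxdfan0 port 67 and port 68 | tee dhcp.txt, then run dhcp_summary < dhcp.txt. Requests with no replies would suggest the packets reach the bridge but are dropped before dnsmasq sees them, or that the replies are being blocked.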

tomp · August 3, 2020, 5:49pm

Can you run the full command please:

tcpdump -pvnl -i lxdfan0 port 67 and port 68

tomp · August 3, 2020, 5:50pm

Oh maybe your copy/paste just cut a bit off.

tomp · August 3, 2020, 5:50pm

Can you show the output of iptables-save, please?

tomp · August 3, 2020, 5:53pm

I’d expect different output from that. Have you truncated it at all?

Tony_Anytime · August 3, 2020, 5:54pm

Only a few bad-IP deny rules at the top.

Tony_Anytime · August 3, 2020, 6:04pm

Turning off UFW seems to solve the problem, which is weird since this has been working forever. It has only been affected since 4.4. What ports should dnsmasq be using?

Tony_Anytime · August 3, 2020, 6:13pm

Thanks for your help looking at the problem. It seems to be related to UFW blocking dnsmasq; I will continue to figure out what is wrong. It is interesting that the problem started after the 4.4 upgrade across all servers.

stgraber · August 3, 2020, 6:16pm

That’s odd indeed as we’ve not changed anything in the firewalling code or related to network ports in 4.4.

It could be a race of some kind, though, where the LXD rules were initially added ahead of the UFW rules. Restarting LXD during the upgrade would then have re-added its rules after those managed by UFW, causing the issue.
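
One way to check the ordering theory is to look at where the LXD rules sit relative to the UFW jumps in the INPUT chain. This is only a rough sketch: the function name lxd_vs_ufw_order is hypothetical, and it assumes LXD tags its rules with a "generated for LXD network" comment and that UFW chains are named ufw-*.

```shell
# Reads `iptables-save` output on stdin and reports whether the first LXD rule
# in the INPUT chain appears before or after the first jump to a ufw chain.
# Assumptions: LXD rules carry the comment "generated for LXD network ...",
# and UFW's chains are all named with a "ufw-" prefix.
lxd_vs_ufw_order() {
    awk '
        /^-A INPUT / && /generated for LXD network/ && !lxd { lxd = NR }
        /^-A INPUT / && /-j ufw-/ && !ufw { ufw = NR }
        END {
            if (!lxd)           print "no LXD rules found in INPUT"
            else if (!ufw)      print "no ufw jumps found in INPUT"
            else if (lxd < ufw) print "LXD rules come before ufw"
            else                print "LXD rules come after ufw"
        }'
}
```

Run it as iptables-save | lxd_vs_ufw_order on the host (as root). If the LXD rules come after the UFW jumps, a UFW drop or reject can match the DHCP packets first.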

I’m glad it’s not an issue in the piece of work we did do this cycle though (securing dnsmasq by using apparmor for it).

Tony_Anytime · August 3, 2020, 6:38pm

Who knows sometimes… Besides port 53… what other ports should be open for dnsmasq or AppArmor?

tomp · August 3, 2020, 6:41pm

Can you show the output of lxc info? It would be interesting to see which firewall LXD detected (and used) when it started up, as I would expect to see some firewall rules that LXD adds to explicitly allow DHCP and DNS.

Perhaps they are being added to nftables rather than iptables, or perhaps UFW has replaced them with its own ruleset.
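
As a quick way to pull that one value out, here is a small sketch. The function name detected_firewall is hypothetical, and it assumes lxc info prints the driver as a "firewall: <driver>" line in its environment section.

```shell
# Reads `lxc info` YAML on stdin and prints the value of the "firewall" key.
# Assumption: the key appears on its own line as "firewall: <driver>".
detected_firewall() {
    awk '$1 == "firewall:" { print $2; exit }'
}
```

Use it as lxc info | detected_firewall. If it reports nftables, the LXD rules would show up in nft list ruleset rather than in iptables-save.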

You may also benefit from some of the approaches to managing LXD snap upgrade times described in “Managing the LXD snap”, to avoid LXD being upgraded at times that are not good for you.

stgraber · August 3, 2020, 6:55pm

53 (DNS) on UDP/TCP and 67/68 (DHCP) on UDP
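
As a rough sketch of what that could look like in UFW, the helper below prints (but does not run) allow rules for those ports on the bridge. The function name lxd_ufw_rules is hypothetical, and the default bridge name lxdfan0 is taken from this thread; review the commands before running any of them.

```shell
# Prints (does not run) ufw commands that allow DNS and DHCP on a bridge.
# Assumption: the bridge is lxdfan0 unless another name is passed as $1.
lxd_ufw_rules() {
    bridge="${1:-lxdfan0}"
    # Port 53 without "proto" covers both UDP and TCP.
    printf 'ufw allow in on %s to any port 53\n' "$bridge"
    printf 'ufw allow in on %s to any port 67 proto udp\n' "$bridge"
    printf 'ufw allow in on %s to any port 68 proto udp\n' "$bridge"
}
```

Running lxd_ufw_rules lists the commands for lxdfan0; lxd_ufw_rules lxdbr0 does the same for a different bridge.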

stgraber · August 3, 2020, 6:56pm

ufw is just a wrapper for iptables

Tony_Anytime · August 3, 2020, 6:56pm

Looks like 67/68 might be the problem… but it was working before, so maybe the order changed.
Yes, it is definitely caused by the firewall blocking ports 67 and 68… But it worked fine before the update. Hope this helps someone.
Thanks, everyone, for the help.

Tony_Anytime · August 4, 2020, 5:30pm

Thanks again for your help… it needed ports 67 and 68 added to the firewall. So all of a sudden, after the upgrade, it needed them.

mt-caret · August 5, 2020, 9:15am

We have been running into the same problem (cluster with FAN, no IPs) since upgrading and have been frantically searching for a solution; we can confirm that unblocking udp ports 67 and 68 fixes the issue for us as well. Thank you!

stgraber · August 5, 2020, 2:58pm

This is quite weird, as DHCP has always been on 67/68 UDP; it’s not something that we could even change if we wanted to.

So I’m quite confused as to why things would only get blocked by firewalling now.