Woke up today to see all my lxd containers stuck in “RUNNING” state without an IPv4 address. How would I even start debugging this?
UPDATE: After looking into it more, this seems to be related to the upgrade to 4.5, since LXD was updated to 4.5 just yesterday (17-09-2020).
UPDATE2: I booted up a fresh Ubuntu 20.04 instance:
By default it runs lxd 4.0; I ran a container and it was issued an IP address fine.
I updated lxd to 4.5 and restarted the container; the IP address was gone.
UPDATE3: Since I cannot downgrade from 4.5 to 4.0, I’m creating a new instance running 4.0 and moving all my stuff over from the broken instances. Painful… If anyone knows a better way, I’m all ears.
I just forced the update to LXD 4.5 with snap refresh. I am now on LXD 4.5 and the containers get their IP address as expected.
If you do not need the new features in LXD 4.x, you can track the LXD 4.0.x line (channel: 4.0/stable).
snapd has features that let you avoid being the first to try a new version of a snap package; see https://snapcraft.io/docs/keeping-snaps-up-to-date
There is no explicit option to postpone updates of one specific snap package until X days after a new version was made available. But there is, for example, an option to postpone all updates until the last Friday of the month. This would give you time to notice any reports of issues from others who update on release day, and you can then perform the snap refresh strategically once you know any issues have been fixed.
Perhaps it’s something related to Google Cloud? Which service provider are you running on? I can confirm that all my nodes running 4.5 on Google Cloud are struggling to issue IP addresses.
OR
Another issue could be that I’m using fan networking? Or maybe it’s something related to my networking config.
The installation that I described was a local installation with default networking. The opening post did not mention Google Cloud and fan networking.
As you are describing a production setup, you can either stick to the 4.0/stable channel if it suffices for your needs (are there any fixes/updates to fan networking that do not exist in LXD 4.0.x?).
Or, track the stable channels of the development versions of LXD. That is, since LXD 4.4 was OK for you, you can switch to 4.4/stable for the foreseeable future. Then, switch to 4.5/stable after you have tested that it works for your setup.
Having said all that, you have an installation to fix ASAP. There are many things to try. I suggest trying to figure out whether LXD 4.5 has a fan networking issue on Google Cloud. You can make a minimal installation with LXD 4.0.3, then switch to 4.4/stable and check that everything works, then switch to latest/stable to verify that things stop working.
If you want to get everything working fast, move the containers to LXD 4.0.x or LXD 4.4 for now.
Right, I think I see the issue. Your underlay subnet setting is 10.0.1.0/24, which means the fan will derive its IP from the host’s underlay address. However, to find its underlay address it must look at the other network interfaces and find an address that is within the 10.0.1.0/24 subnet.
At first glance your ens4 interface fits the bill with an IP of 10.0.1.20/32. However, in LXD 4.5 we added a specific test to exclude IPs with a /32 subnet mask, as this was causing issues for people using floating IP aliases on external interfaces. As a result, LXD is not able to derive a fan address on your host.
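To make the behaviour described above concrete, here is a rough Python sketch of that kind of lookup. It is not LXD's actual code (LXD is written in Go); the function name `find_underlay_address`, the `ignore_slash32` flag, and the interface list are all made up for illustration, using only the addresses mentioned in this thread.

```python
import ipaddress


def find_underlay_address(interface_addrs, underlay_subnet, ignore_slash32=True):
    """Return (interface, ip) for the first address inside underlay_subnet, or None.

    Hypothetical illustration only: ignore_slash32=True mimics the /32
    exclusion described for LXD 4.5, it is not the real implementation.
    """
    subnet = ipaddress.ip_network(underlay_subnet)
    for name, cidr in interface_addrs:
        iface = ipaddress.ip_interface(cidr)
        # Skip addresses configured with a /32 mask (the 4.5 behaviour).
        if ignore_slash32 and iface.network.prefixlen == 32:
            continue
        if iface.ip in subnet:
            return name, str(iface.ip)
    return None


# Assumed host layout: GCP-style /32 address on ens4.
addrs = [("lo", "127.0.0.1/8"), ("ens4", "10.0.1.20/32")]

with_rule = find_underlay_address(addrs, "10.0.1.0/24")
without_rule = find_underlay_address(addrs, "10.0.1.0/24", ignore_slash32=False)
print(with_rule)     # no candidate left once /32 addresses are skipped
print(without_rule)  # ens4 is accepted when the /32 rule is off
```

With the /32 rule on, the only address inside 10.0.1.0/24 is skipped, so nothing is found; with it off, ens4 is picked.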
Well, I’m not familiar with GCP’s networking, but I’m guessing they use /32 addresses by default rather than a more traditional /24 (for example). I presume this is because their networking doesn’t provide L2 broadcast/multicast, so all traffic must go to the router first.
I’ll put up a fix that excludes the lo interface from fan address generation and removes the /32 ignore rule. This way it should fix the original issue that prompted the change while no longer breaking setups such as yours.
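As a rough illustration of what that proposed change would mean, here is a Python sketch of the selection logic with the lo interface skipped and /32 addresses allowed. This is only an assumption about the shape of the fix, not the actual patch; the function name `find_underlay_address_proposed` and the alias address on lo are hypothetical.

```python
import ipaddress


def find_underlay_address_proposed(interface_addrs, underlay_subnet):
    """Return (interface, ip) for the first non-loopback address in underlay_subnet.

    Hypothetical sketch of the proposed behaviour: addresses on "lo" are
    ignored, and /32 addresses on other interfaces are accepted.
    """
    subnet = ipaddress.ip_network(underlay_subnet)
    for name, cidr in interface_addrs:
        # Never derive the fan address from the loopback interface.
        if name == "lo":
            continue
        iface = ipaddress.ip_interface(cidr)
        if iface.ip in subnet:
            return name, str(iface.ip)
    return None


# Assumed layout: a made-up /32 alias on lo plus the GCP-style /32 on ens4.
addrs = [("lo", "127.0.0.1/8"), ("lo", "10.0.1.5/32"), ("ens4", "10.0.1.20/32")]

chosen = find_underlay_address_proposed(addrs, "10.0.1.0/24")
print(chosen)
```

In this sketch the alias on lo is never considered, while the /32 address on ens4 is still usable as the underlay address.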