LXD container stuck in "RUNNING" without IP address

zacksiri · September 18, 2020, 3:25am

Hi

I woke up today to find all my LXD containers stuck in the “RUNNING” state without an IPv4 address. How would I even start debugging this?

UPDATE: After looking into it further, this seems to be related to the upgrade to LXD 4.5, since LXD was updated to 4.5 just yesterday (17-09-2020).

UPDATE2: I booted up a fresh Ubuntu 20.04 instance:

By default it runs LXD 4.0. I ran a container and it was issued an IP address fine.
I updated LXD to 4.5 and restarted the container. The IP address was gone.

UPDATE3: Since I cannot downgrade from 4.5 to 4.0, I’m creating a new instance running 4.0 and moving everything over from the broken instances. Painful… If anyone knows a better way, I’m all ears.

simos · September 18, 2020, 5:57am

I just forced the update to LXD 4.5 with snap refresh. I am now on LXD 4.5 and the containers get their IP address as expected.

If you do not need the new features in LXD 4.x, you can track the LXD 4.0.x line (channel: 4.0/stable).

There are features in snapd so that you are not the first to try the new version of a snap package: https://snapcraft.io/docs/keeping-snaps-up-to-date
There is no explicit option to postpone the update of a specific snap package until X days after it was made available. But you can, for example, postpone all updates until the last Friday of the month. This helps you notice any reports of issues from others who update on the release day, and you can perform the snap refresh strategically once you know any issues have been fixed.

zacksiri · September 18, 2020, 6:01am

Perhaps it’s something related to Google Cloud? Which service provider are you running on? I can confirm that all my nodes running 4.5 on Google Cloud are struggling to issue IP addresses.

OR

Another cause could be that I’m using fan networking, or maybe something else in my networking config.

simos · September 18, 2020, 6:43am

The installation that I described was a local installation with default networking. The opening post did not mention Google Cloud and fan networking.

As you are describing a production setup, you can stick to the 4.0/stable channel if it suffices for your needs (are there any fixes/updates to fan networking that do not exist in LXD 4.0.x?).

Or you can track the stable channels of the development versions of LXD. That is, since LXD 4.4 was OK for you, you can switch to 4.4/stable for the foreseeable future. Then switch to 4.5/stable after you have tested that it works for your setup.

Having said all that, you have an installation to fix ASAP. There are many things to try. I suggest trying to figure out whether LXD 4.5 has a fan networking issue on Google Cloud. You can make a minimal installation with LXD 4.0.3, then switch to 4.4/stable and check that everything works, then switch to latest/stable to verify that things stop working.

If you want to get everything working fast, move the containers to LXD 4.0.x or LXD 4.4 for now.

tomp · September 18, 2020, 7:31am

Please can you provide the output of lxc config show <instance> --expanded for one of the instances without IPs.

Also please provide the output of ip a and ip r from both the LXD host and inside one of the containers.

zacksiri · September 18, 2020, 7:42am

  volatile.base_image: 572d670b29cf9d5983c9086687c1529157e569ed6b6c39417286c919f90c4028
  volatile.eth0.host_name: veth03b45f26
  volatile.eth0.hwaddr: 00:16:3e:24:3d:e7
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    mtu: "1410"
    name: eth0
    nictype: bridged
    parent: lxdfan0
    type: nic
  http:
    connect: tcp:127.0.0.1:4000
    listen: tcp:0.0.0.0:4001
    type: proxy
  root:
    path: /
    pool: local
    type: disk
ephemeral: false
profiles:
- ticket_booth-staging
stateful: false
description: ""

tomp · September 18, 2020, 7:43am

Are you running in a cluster?

Can you show the output of lxc network show lxdfan0 as well as the other items I requested in previous post?

Thanks

tomp · September 18, 2020, 7:44am

Also what version of lxd were you running before the upgrade?

zacksiri · September 18, 2020, 7:44am

config:
  bridge.mode: fan
  fan.underlay_subnet: 10.0.1.0/24
description: ""
name: lxdfan0
type: bridge
used_by:
- /1.0/instances/g-staging-1
- /1.0/instances/g-staging-2
- /1.0/instances/g-staging-3
- /1.0/instances/psql-workbench
- /1.0/instances/t-staging-1
- /1.0/instances/t-staging-2
- /1.0/instances/t-staging-3
- /1.0/profiles/default
- /1.0/profiles/g-staging
- /1.0/profiles/t-staging
managed: true
status: Created
locations:
- d-staging-a-01
- d-staging-b-01
- d-staging-c-01

tomp · September 18, 2020, 7:45am

And please can I see ip a and ip r on the LXD host and inside the container?

zacksiri · September 18, 2020, 7:46am

It was running on latest so 4.4 was the previous version.

ip a (host)

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc fq_codel state UP group default qlen 1000
    link/ether 42:01:0a:00:01:14 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.20/32 scope global dynamic ens4
       valid_lft 71510sec preferred_lft 71510sec
    inet6 fe80::4001:aff:fe00:114/64 scope link
       valid_lft forever preferred_lft forever
3: lxdfan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:84:3c:ad brd ff:ff:ff:ff:ff:ff
    inet6 fe80::216:3eff:fe84:3cad/64 scope link
       valid_lft forever preferred_lft forever
4: lxdfan0-mtu: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1450 qdisc noqueue master lxdfan0 state UNKNOWN group default qlen 1000
    link/ether 02:72:c2:c3:87:61 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::72:c2ff:fec3:8761/64 scope link
       valid_lft forever preferred_lft forever
12: veth03b45f26@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master lxdfan0 state UP group default qlen 1000
    link/ether 9a:c3:9b:b5:c2:40 brd ff:ff:ff:ff:ff:ff link-netnsid 0

ip r (host)

default via 10.0.1.1 dev ens4 proto dhcp src 10.0.1.20 metric 100
10.0.1.1 dev ens4 proto dhcp scope link src 10.0.1.20 metric 100

ip a (container)

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
11: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1410 qdisc noqueue state UP qlen 1000
    link/ether 00:16:3e:24:3d:e7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::216:3eff:fe24:3de7/64 scope link
       valid_lft forever preferred_lft forever

ip r (container)

returned nothing

tomp · September 18, 2020, 7:51am

Right, I think I see the issue. Your underlay subnet setting is 10.0.1.0/24, which means the fan will derive its address from the host’s underlay address. However, to find its underlay address it must look at the host’s other network interfaces and find an address that is within the 10.0.1.0/24 subnet.

At first glance your ens4 interface fits the bill with an IP of 10.0.1.20/32. However, in LXD 4.5 we added a specific check to exclude IPs with a /32 prefix, as these were causing issues for people using floating IP aliases on external interfaces. As a result, LXD is not able to derive a fan address.

See Delete a stopped container bring down the fan interface and

Is there a specific reason why you are using a /32 on ens4?
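
The selection logic described in this post can be sketched in a few lines of Python. This is only an illustration, not LXD's actual code (LXD is written in Go): the function name find_underlay_address and the skip_slash32 flag are hypothetical, and the interface data is copied from the "ip a (host)" output earlier in the thread.

```python
import ipaddress

def find_underlay_address(interfaces, underlay_subnet, skip_slash32=True):
    """Return (interface, address) for the first host address inside the
    underlay subnet, or None if no address qualifies.

    skip_slash32=True models the LXD 4.5 behaviour described above
    (an assumption based on the post, not on LXD's source code).
    """
    subnet = ipaddress.ip_network(underlay_subnet)
    for name, cidrs in interfaces.items():
        for cidr in cidrs:
            iface = ipaddress.ip_interface(cidr)
            if iface.version != subnet.version:
                continue  # ignore IPv6 addresses for an IPv4 underlay
            if skip_slash32 and iface.network.prefixlen == 32:
                continue  # the 4.5 rule: ignore /32 addresses
            if iface.ip in subnet:
                return name, iface.ip
    return None

# IPv4 addresses from the host in this thread.
host = {
    "lo": ["127.0.0.1/8"],
    "ens4": ["10.0.1.20/32"],
}

print(find_underlay_address(host, "10.0.1.0/24"))
# None
print(find_underlay_address(host, "10.0.1.0/24", skip_slash32=False))
# ('ens4', IPv4Address('10.0.1.20'))
```

With the /32 rule in place, no candidate address remains on this host, which matches lxdfan0 having no IPv4 address in the output above.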

zacksiri · September 18, 2020, 7:58am

Hey

Yeah, I think this was all set up by the default lxdfan bridge networking. I’m setting all this up using Google’s VPC, which runs on 10.0.1.0/24.

So 10.0.1.20/32 is the internal network IP assigned by Google’s VPC. How would you suggest I configure this?

tomp · September 18, 2020, 8:00am

Well, I’m not familiar with GCP’s networking, but I’m guessing they use a /32 address by default rather than a more traditional /24 (for example). I presume this is because their networking doesn’t provide L2 broadcast/multicast and so all traffic must go to the router first.
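
To illustrate what a /32 implies, here is a small sketch using Python's ipaddress module. The helper name on_link is hypothetical; the addresses are the ones from the host's "ip a" and "ip r" output earlier in the thread. It only shows the prefix arithmetic and says nothing about how GCP implements its network.

```python
import ipaddress

# Gateway taken from the host's "ip r" output in this thread.
gateway = ipaddress.ip_address("10.0.1.1")

def on_link(cidr, peer):
    """True if peer lies inside the network implied by the interface address."""
    return peer in ipaddress.ip_interface(cidr).network

print(on_link("10.0.1.20/32", gateway))  # False
print(on_link("10.0.1.20/24", gateway))  # True
```

With a /32, the interface's own network contains only its own address, so even the gateway is not on-link. That is consistent with the explicit "10.0.1.1 dev ens4 ... scope link" route shown in the host's routing table.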

I’ll put up a fix that excludes the lo interface from fan address generation and removes the /32 ignore rule. That way it should fix both the original issue that prompted the change and avoid breaking setups such as yours.
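
A minimal sketch of that proposed behaviour, again in Python for illustration only and not taken from the actual PR: the function name, the skip parameter, and the 10.0.1.99/32 alias on lo are all hypothetical. Only the ens4 address comes from this thread.

```python
import ipaddress

def find_underlay_address(interfaces, underlay_subnet, skip=("lo",)):
    """Return (interface, address) for the first address inside the underlay
    subnet, ignoring the interfaces named in skip. /32 addresses are allowed."""
    subnet = ipaddress.ip_network(underlay_subnet)
    for name, cidrs in interfaces.items():
        if name in skip:
            continue  # proposed rule: never derive the fan address from lo
        for cidr in cidrs:
            iface = ipaddress.ip_interface(cidr)
            if iface.version == subnet.version and iface.ip in subnet:
                return name, iface.ip
    return None

host = {
    # Hypothetical floating IP alias on lo, standing in for the kind of
    # address the /32 rule was originally meant to filter out.
    "lo": ["127.0.0.1/8", "10.0.1.99/32"],
    "ens4": ["10.0.1.20/32"],  # address from this thread
}

result = find_underlay_address(host, "10.0.1.0/24")
print(result)
# ('ens4', IPv4Address('10.0.1.20'))
```

Under these assumptions the alias on lo is skipped while the /32 on ens4 is accepted, so a host like the one in this thread would get a fan address again.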

zacksiri · September 18, 2020, 8:01am

@tomp thank you for the swift reply! Other than this minor issue, LXD has been a dream to work with.

Many thanks for your hard work.

tomp · September 18, 2020, 8:15am

The PR for the fix is here: