Almalinux/9/cloud container is created with broken networking

I managed to make a configuration that works with routed networking on Ubuntu (ubuntu/jammy/cloud), so I tried to use it with AlmaLinux (almalinux/9/cloud), but it went horribly wrong. After starting the container, the main interface is down and there is pretty much no networking. The networking details are written to the correct place and everything looks like it should work, but the guest’s main interface is messed up for some reason. Enabling it manually doesn’t help.

This is my lxd’s configuration:

And here is a detailed cloud-init.log:

Am I missing something? I would appreciate any help.

Able to take a look @monstermunchkin ? Thanks

Please can you show the output of lxc config show <instance> --expanded and lxc network show <network>, plus the output of ip a and ip r inside the instance?

@tomp

lxc config show <instance> --expanded:

lxc network show <network>:

ip a (inside the instance):

ip r yields no results.

What’s weird is that the same configuration works out of the box with Ubuntu. I’m using Ansible to set everything up, so I have pretty good reproducibility. I just switched the image of the created instance in my Ansible playbook.

Thank you for your patience. I really appreciate all the help I’m getting here. It’s one of the most helpful communities I’ve seen :slight_smile: (I experienced it also ~2 years ago when I was figuring out LXD the first time).

Your cloud-init config has an issue. It should not have a gateway4 entry, as the default gateway is supposed to be set to 169.254.0.1, and that is already covered by the routes section.

@tomp I removed the gateway entry from the profile and recreated the container. I can see that the gateway in the container is set to 169.254.0.1. Content of the /etc/sysconfig/network-scripts/ifcfg-eth0 file is:

AUTOCONNECT_PRIORITY=999
BOOTPROTO=none
DEFROUTE=yes
DEVICE=eth0
DNS1=8.8.8.8
DNS2=8.8.4.4
GATEWAY=169.254.0.1
IPADDR=188[redacted]
NETMASK=255.255.255.255
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

However, the problem is still the same. The container is created with a disabled interface. Bringing the interface up manually doesn’t help. The network is always unreachable. :frowning:

I improved my networking skills a bit and figured out how to make it work.

ip r on the guest produces no output, which means that LXD fails to create proper routes in the container.

Networking can be made functional after running in the container:

ip link set dev eth0 up
ip route add default via 169.254.0.1 dev eth0 proto static onlink
ip route add 169.254.0.1 dev eth0 scope link

(eth0 is the name of the container’s main interface; the firewall on the host also needs to be configured for inbound connections)
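
For anyone who wants to detect this broken state from a script (for example from Ansible), here is a minimal sketch. The function name has_default_route is made up for illustration; it reads ip route show output on stdin and only checks for a default route via the given gateway, nothing more.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the route table text on stdin contains
# a default route via the given gateway, fails otherwise.
has_default_route() {
    awk -v gw="$1" '
        $1 == "default" && $2 == "via" && $3 == gw { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage inside the container:
#   ip route show | has_default_route 169.254.0.1 || echo "default route missing"
```

Reading from stdin rather than calling ip directly keeps the check easy to try against saved output.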

Is it a bug? Any idea why LXD cannot set up these routes? Maybe some additional option is required in the container’s configuration for AlmaLinux? Or maybe it’s a bug in LXD?

Another interesting piece of information is the NetworkManager log:

Feb 20 16:37:36 alma systemd[1]: Starting Network Manager...
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8360] NetworkManager (version 1.40.0-1.el9) is starting... (boot:2245da45-f1df-4a4f-a511-4fcf1426ef1b)
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8363] Read config: /etc/NetworkManager/NetworkManager.conf (etc: 99-cloud-init.conf)
Feb 20 16:37:36 alma systemd[1]: Started Network Manager.
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8400] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8415] manager[0x55e310fe5090]: monitoring kernel firmware directory '/lib/firmware'.
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8429] hostname: hostname: using hostnamed
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8429] hostname: static hostname changed from (none) to "alma"
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8432] dns-mgr: init: dns=none,systemd-resolved rc-manager=unmanaged
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8444] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8445] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8445] manager: Networking is enabled by state file
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8459] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.40.0-1.el9/libnm-settings-plugin-ifcfg-rh.so")
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8478] settings: Loaded settings plugin: keyfile (internal)
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8490] dhcp: init: Using DHCP client 'internal'
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8491] device (lo): carrier: link connected
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8493] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8498] manager: (eth0): new Veth device (/org/freedesktop/NetworkManager/Devices/2)
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8506] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8509] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8513] device (eth0): Activation: starting connection 'eth0' (53f5332f-f73f-4d0b-92b3-14d8b9d2b392)
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8516] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8518] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8519] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.8521] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.9105] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.9106] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.9108] manager: NetworkManager state is now CONNECTED_LOCAL
Feb 20 16:37:36 alma NetworkManager[132]: <info>  [1676911056.9109] device (eth0): Activation: successful, device activated.
Feb 20 16:37:42 alma NetworkManager[132]: <info>  [1676911062.8574] manager: startup complete

Ok guys, you had your chance to stop me. I figured out the ugliest workaround ever for this issue, one that makes AlmaLinux 9 usable: a custom service that brings up the interface and sets the default route.

/etc/systemd/system/custom-network-fix.service

[Unit]
Description=Custom Network Fix
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=exec

ExecStart=/bin/bash -c "/usr/bin/nmcli c up eth0;/usr/bin/nmcli d mod eth0 ipv4.gateway 169.254.0.1"

SyslogIdentifier=custom-network-fix

[Install]
WantedBy=multi-user.target

This is the only way I found that ensures that the network is up after a reboot. :face_vomiting:

Something is very wrong when the container’s interface is not automatically brought up on boot, and I couldn’t find any better way of achieving that.
ONBOOT=yes in /etc/sysconfig/network-scripts/ifcfg-eth0 doesn’t work.
nmcli d mod eth0 autoconnect yes doesn’t work.
I even tried to replace /etc/sysconfig/network-scripts/ifcfg-eth0 with a configuration in /etc/NetworkManager/system-connections/ to no avail.

If anyone has a better idea, please share, before I go crazy. :smile:

I figured out the ACTUAL workaround that doesn’t make me cringe!

It turns out that when you add commands that enable the interface to the bootcmd section of your cloud-init user-data, the issue gets magically resolved!

My theory is that because the interface is initially down, the cloud-init network setup fails in some weird way, breaking NetworkManager somehow. But if we ensure that eth0 is up during the cloud-init run, everything completes fine, and NetworkManager works as expected. eth0 is correctly brought up automatically after each reboot and the network is available!

Edit: Actually bootcmd might be running after every reboot, so it might be just a bit more elegant version of the previous solution, but oh well, it works.

I have no idea what exactly is at fault here (AlmaLinux, cloud-init, NetworkManager or LXD), but at least we have a solution.

This is the example profile for AlmaLinux 9.1, where networking actually works:

      - name: alma
        description: "Almalinux testing profile."
        config:
          user.user-data: |
            #cloud-config
            bootcmd:
              - nmcli c up eth0
              - nmcli d mod eth0 ipv4.gateway 169.254.0.1
          user.network-config: |
            version: 2
            ethernets:
              eth0:
                dhcp4: no
                dhcp6: no
                addresses:
                - [something]/32
                nameservers:
                  addresses:
                  - 8.8.8.8
                  - 8.8.4.4
                  search: []
                routes:
                - to: 0.0.0.0/0
                  via: 169.254.0.1
                  on-link: true
        devices:
          eth0:
            type: nic
            ipv4.address: [something]
            nictype: routed
            parent: enp1s0f0
            host_name: veth-alma
          root:
            type: disk
            path: /
            pool: default
            size: 20GB

I hope it will help someone. I spent so much time on this that if I helped you, please at least like this comment, so I could feel that my obsession with this issue was not in vain. :joy:

LXD, by way of liblxc, will configure the interface (at boot time at least) with the specified IP addresses and default routes. However, the guest OS’s network configuration system often then takes over and alters or wipes it. With things like bridge or macvlan networking this isn’t an issue, because LXD provides a DHCP/IPv6 RA service which the guest OS’s network system can use to configure the guest’s networking as it wishes.

With routed this isn’t possible, so we rely on either A) the guest OS leaving the initial configuration alone (which, as you’ve seen, NetworkManager isn’t fond of) or B) using something like cloud-init or static configuration inside the guest to get the network interface configured correctly.

@tomp I think it doesn’t apply here. The main issue is that the static configuration doesn’t work – the interface is not brought up after reboot, no matter what. The open question is why this happens. Could it be an issue on the LXD’s side, with how veth is created? Or on the guest’s side, that doesn’t know how to correctly handle that veth on boot? I have no idea, but something doesn’t work here correctly.

The interface remains up when using an image like images:ubuntu/jammy (even when removing all the netplan config files), which suggests something inside the Almalinux image is bringing the interface down.

This means it wouldn’t be an LXD bug, but it could need a change in distrobuilder’s AlmaLinux template to somehow prevent this from happening.

I think I found the direct culprit!

[root@alma ~]# nmcli c show
NAME         UUID                                  TYPE      DEVICE 
eth0         94ee35c6-73f6-445b-bcd1-b8dfe265909f  ethernet  eth0   
System eth0  5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  ethernet  --

Notice the two connections. When I delete the first one, the network comes up immediately! Now I need to figure out how to prevent the automatic creation of the additional connection that messes things up. Any advice would be appreciated, as I have no idea what I’m doing and am figuring this stuff out on the fly. :joy:
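
To see which profile is holding the interface without reading the table by eye, something like the following might help. It is only a sketch: profiles_on_device is a hypothetical helper name, it expects the NAME:DEVICE lines that nmcli prints in terse mode, and it assumes profile names contain no colons (nmcli escapes those in terse output).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every connection profile that is
# bound to the given interface, reading "NAME:DEVICE" lines on stdin.
profiles_on_device() {
    awk -F: -v dev="$1" '
        $NF == dev {
            sub(/:[^:]*$/, "")   # strip the trailing ":DEVICE" part
            print
        }
    '
}

# Usage inside the container:
#   nmcli -t -f NAME,DEVICE connection show | profiles_on_device eth0
```

With the two connections listed above, this would print the plain eth0 profile rather than System eth0, which is the situation described in this post.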

I suggest removing your cloud-init config whilst testing to rule that out as a source.

Removing cloud-init config doesn’t help, but to be sure, I’ll test the non cloud-init image.

@tomp Weirdly enough, on the “default” image we still have two connections for some reason. Could that be caused by LXD attaching an IP to the veth?

So, we can reduce the workaround to adding just:

  bootcmd:
    - nmcli c delete eth0 || true

to the cloud-init user-data. That way, I can use cloud-init’s network configuration normally, just like with Ubuntu. The || true part is there so cloud-init won’t fail if this issue gets solved one day and there is no longer an additional eth0 connection (it won’t remove the correct one, because that one is named System eth0).
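
If the || true feels too broad (it also hides unrelated nmcli failures), a slightly more explicit variant is to delete the profile only when it is actually listed. This is a sketch, assuming the stray profile is always named exactly eth0 as in the output above; delete_profile_if_present is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: delete a NetworkManager connection profile only if a
# profile with exactly that name exists. "System eth0" does not match "eth0"
# because grep -x compares whole lines.
delete_profile_if_present() {
    name="$1"
    if nmcli -t -f NAME connection show | grep -Fxq -- "$name"; then
        nmcli connection delete "$name"
    fi
}

# Usage inside the container:
#   delete_profile_if_present eth0
```

The one-line bootcmd entry is still the simpler option; this only makes the intent more visible when it lives in a script.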

Now, I’m finally at peace. :smiling_face:

@tomp I dug even more and pinpointed the issue to the /etc/systemd/system-generators/lxc file. Specifically, at the very end there is:

# Workarounds for NetworkManager in containers
if [ "${nm_exists}" -eq 1 ]; then
       fix_nm_link_state eth0
fi

Just commenting it out and rebooting resolves the problem. I don’t know the story behind it, but I suspect that it was added for an older NetworkManager version and is not needed for newer ones (or maybe it only works for bridged networking? It clearly just breaks things in my setup). What would be the correct repo to report this issue?

Added:
I reported it here: https://github.com/lxc/distrobuilder/issues/701 :tada:
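
For anyone who wants to apply the same change from a script until the image is fixed, here is a rough sketch. It assumes the generator still contains the exact line fix_nm_link_state eth0, and disable_nm_link_fix is a hypothetical helper name. It swaps the call for the : no-op instead of a bare comment, so the surrounding if block stays valid shell, and it leaves no backup copy in the generators directory, since systemd runs every executable it finds there.

```shell
#!/bin/sh
# Hypothetical helper: neutralise the "fix_nm_link_state eth0" call in the
# given generator script. Safe to run more than once.
disable_nm_link_fix() {
    file="$1"
    tmp=$(mktemp) || return 1
    # Replace the call with the ":" no-op and keep the original text as a
    # trailing comment; the function definition itself is left untouched.
    if sed 's/^\([[:space:]]*\)fix_nm_link_state eth0[[:space:]]*$/\1: # fix_nm_link_state eth0 (disabled)/' \
        "$file" > "$tmp"; then
        # Write back through cat so the file keeps its mode and owner.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Usage inside the container (as root), followed by a reboot:
#   disable_nm_link_fix /etc/systemd/system-generators/lxc
```

An image update may put the original file back, so this is a stopgap rather than a fix.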
