Centos8 containers unable to automatically get ipv4 addresses after update

stgraber · July 19, 2021, 4:25am

Those are the results from our last test run earlier today on the Oracle images:

PASS: IPv4 address: oracle-7-unpriv
PASS: IPv6 address: oracle-7-unpriv
PASS: DNS resolution: oracle-7-unpriv
PASS: systemd clean: oracle-7-unpriv

PASS: IPv4 address: oracle-7-priv
PASS: IPv6 address: oracle-7-priv
PASS: DNS resolution: oracle-7-priv
PASS: systemd clean: oracle-7-priv

PASS: IPv4 address: oracle-7-cloud-unpriv
PASS: IPv6 address: oracle-7-cloud-unpriv
PASS: DNS resolution: oracle-7-cloud-unpriv
PASS: cloud-init user-data provisioning: oracle-7-cloud-unpriv
PASS: cloud-init vendor-data provisioning: oracle-7-cloud-unpriv
PASS: systemd clean: oracle-7-cloud-unpriv

PASS: IPv4 address: oracle-7-cloud-priv
PASS: IPv6 address: oracle-7-cloud-priv
PASS: DNS resolution: oracle-7-cloud-priv
PASS: cloud-init user-data provisioning: oracle-7-cloud-priv
PASS: cloud-init vendor-data provisioning: oracle-7-cloud-priv
PASS: systemd clean: oracle-7-cloud-priv

PASS: IPv4 address: oracle-8-unpriv
PASS: IPv6 address: oracle-8-unpriv
PASS: DNS resolution: oracle-8-unpriv
PASS: systemd clean: oracle-8-unpriv

PASS: IPv4 address: oracle-8-priv
PASS: IPv6 address: oracle-8-priv
PASS: DNS resolution: oracle-8-priv
PASS: systemd clean: oracle-8-priv

PASS: IPv4 address: oracle-8-cloud-unpriv
PASS: IPv6 address: oracle-8-cloud-unpriv
PASS: DNS resolution: oracle-8-cloud-unpriv
PASS: cloud-init user-data provisioning: oracle-8-cloud-unpriv
PASS: cloud-init vendor-data provisioning: oracle-8-cloud-unpriv
PASS: systemd clean: oracle-8-cloud-unpriv

PASS: IPv4 address: oracle-8-cloud-priv
PASS: IPv6 address: oracle-8-cloud-priv
PASS: DNS resolution: oracle-8-cloud-priv
PASS: cloud-init user-data provisioning: oracle-8-cloud-priv
PASS: cloud-init vendor-data provisioning: oracle-8-cloud-priv
PASS: systemd clean: oracle-8-cloud-priv

This shows that our test system, a simple Ubuntu 20.04 host, did get working networking on all Oracle images, privileged or not, cloud or not.
So far, the majority of users who reported this issue turned out to be running kernels with broken network interface ownership, which in turn breaks NetworkManager.

To see if that’s what’s affecting you, check ls -lh /sys/class/net/ inside your container.

It should look like:

stgraber@shell01:~$ ls -lh /sys/class/net/
total 0
lrwxrwxrwx 1 root   root       0 Jul 18 16:10 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx 1 root   root       0 Jul 18 16:10 lo -> ../../devices/virtual/net/lo
stgraber@shell01:~$

If eth0 is owned by nobody:nogroup, this is an indication that your kernel doesn’t properly handle network interface ownership in unprivileged containers and that NetworkManager will therefore refuse to use it.
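
As a rough sketch, the same check can be scripted so it does not depend on reading the ls output by eye. The function names below are made up for illustration, and 65534 is assumed to be the kernel's overflow UID (the default, which is what shows up as "nobody"); adjust it if your system uses a different value.

```shell
# Classify a numeric owner UID as seen from inside the container.
# Assumption: 65534 is the overflow UID shown as "nobody".
classify_owner() {
    case "$1" in
        0)     echo "ok" ;;
        65534) echo "unmapped" ;;
        *)     echo "other" ;;
    esac
}

# Print "<interface>: <status>" for every entry in a sysfs net directory.
# stat is called without -L so it reports the owner of the symlink itself,
# which is the same owner that ls -lh shows.
check_net_ownership() {
    net_dir="${1:-/sys/class/net}"
    for link in "$net_dir"/*; do
        [ -L "$link" ] || [ -e "$link" ] || continue
        uid=$(stat -c '%u' "$link")
        printf '%s: %s\n' "${link##*/}" "$(classify_owner "$uid")"
    done
}
```

Running `check_net_ownership` inside the container should report eth0 as "ok" on a kernel with correct ownership handling, and "unmapped" on an affected one.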

Gilbert_Standen · July 19, 2021, 8:43am

Thanks.

In this system, both the container and the host system are Oracle 8. As mentioned, using a privileged container does work around the issue, and the container successfully gets an IP address from DHCP in the privileged case.

If the Oracle 8 container is unprivileged then it does indeed have “nobody” as the owner of the virtual eth0, as shown below, and then is unable to get a DHCP address.

Results for each case are shown below.

Container (privileged)

[ubuntu@o83sv2 ~]$ lxc exec ora83d14 bash
[root@ora83d14 ~]# ls -lh /sys/class/net
total 0
lrwxrwxrwx. 1 root root 0 Jul 19 03:40 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx. 1 root root 0 Jul 19 03:40 lo -> ../../devices/virtual/net/lo
[root@ora83d14 ~]# cat /etc/oracle-release
Oracle Linux Server release 8.4
[root@ora83d14 ~]# uname -a
Linux ora83d14 5.4.17-2102.203.5.el8uek.x86_64 #2 SMP Mon Jun 28 16:44:26 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@ora83d14 ~]#

Container (unprivileged):

[root@oel83d12 rules.d]# ls -lh /sys/class/net
total 0
lrwxrwxrwx. 1 nobody nobody 0 Jul 19 08:32 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx. 1 root root 0 Jul 19 08:32 lo -> ../../devices/virtual/net/lo
[root@oel83d12 rules.d]# uname -a
Linux oel83d12 5.4.17-2102.203.5.el8uek.x86_64 #2 SMP Mon Jun 28 16:44:26 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@oel83d12 rules.d]# cat /etc/oracle-release
Oracle Linux Server release 8.4
[root@oel83d12 rules.d]#

stgraber · July 19, 2021, 11:29am

Right, so in this setup, NM will indeed not work. Your best bet short of finding a way to run a kernel with the needed fix is to manually switch over to something other than NM for your network configuration.
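
One way to confirm that NM is skipping the interface is to filter its terse device listing. This is only a sketch: it assumes the affected interface shows up in the "unmanaged" state, which has not been confirmed in this thread, and the function name is made up for illustration. The loopback device is excluded because NM commonly lists it as unmanaged anyway.

```shell
# Read `nmcli -t -f DEVICE,STATE device` output (DEVICE:STATE per line) on
# stdin and print every non-loopback device that NM reports as unmanaged.
unmanaged_devices() {
    awk -F: '$1 != "lo" && $2 == "unmanaged" { print $1 }'
}
```

Inside the container this would be used as `nmcli -t -f DEVICE,STATE device | unmanaged_devices`; if eth0 is printed, NM is not handling it.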

jsnjack · July 19, 2021, 4:14pm

Just a note (unfortunately not helpful in the case of Oracle Linux): the CentOS 8 Stream image comes without NetworkManager.

trystan · July 19, 2021, 5:03pm

Could you describe the needed fix? (I’m assuming this isn’t something that can be applied without building a new kernel.)

If I had the details I could at least approach someone at Oracle about incorporating it into their UEK.

stgraber · July 19, 2021, 5:08pm

@brauner should have the link to the relevant kernel pull request and patches

Gilbert_Standen · July 19, 2021, 8:33pm

@trystan I did reach out to Avi Miller (@AviAtOracle) via Twitter. Although he is no longer the Product Director for Oracle Linux, he would be able to comment on it and/or route it internally at Oracle. I haven’t heard back from Avi yet, but it hasn’t even been 24 hours.

brauner · July 20, 2021, 7:10am

Pull request:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebb4a4bf76f164457184a3f43ebc1552416bc823

trystan · July 20, 2021, 11:18am

Thanks so much for getting that info! Much appreciated.

I decided to attempt booting the elrepo mainline kernel and sure enough the appropriate permissions were applied and NM acted on the interface.

However, I’m noticing that with a manually created bridge, the container will not pull an IP (NM says connecting waiting for IP indefinitely) if the host has an IP assigned to the same bridge.

As soon as I set the host’s ipv4 address on the bridge to ‘disabled’ the container is able to assign an IP.

For devices with a single interface this presents an issue as the only two options to bring the container into the same IP space as the host (macvlan or bridged interface) are not available.

stgraber · July 20, 2021, 12:52pm

What platform are you running on?
This behavior of only the host or a container being able to get an address usually points towards the network switch filtering MAC addresses.

trystan · July 20, 2021, 3:20pm

Oracle Linux 8 w/ the bridge (STP off) on top of a team

team details:

{"runner":{"name":"loadbalance","tx_balancer":{"name":"basic"}},"link_watch":{"name":"ethtool","delay_up":"0","delay_down":"0"}}

stgraber · July 20, 2021, 6:13pm

So I’m assuming that “team” is some kind of wrapper around standard Linux bonding. All that is fine. If there is MAC filtering, it’s likely outside of your system (done by the physical switch or a vSwitch in a virtual environment).

trystan · July 20, 2021, 7:52pm

I swapped from Red Hat’s new ‘team’ feature to the legacy ‘bonding’ using the closest comparable settings and this resolved it. Both host and container are now happily sharing the interface via a bridge.

As for the NM inside the LXC container: UEK kernel needs privileged mode, elrepo mainline 5.13 kernel does not. Hopefully I can convince someone at Oracle to add that PR to their kernel.

Djelibeybi · July 21, 2021, 2:00am

Hey folks. As @Gilbert_Standen mentioned, I’m no longer an Oracle Linux PM, but I did send this thread to the current PM for him to review with our engineering team.

stgraber · July 21, 2021, 2:15am

Thanks!

Djelibeybi · July 22, 2021, 2:47am

This has been logged internally as bug 33141684. If any of you have a paid Oracle Linux support subscription, I recommend opening an SR for this issue so that we get some customers attached to the bug (which raises its priority).

nateybobo · July 26, 2021, 4:44am

Can someone please shed some light on how this is a solution? I’m not really certain how quickly I can see this pull request making its way to my lxc host kernel.

I’m running some CentOS 8 containers. Even with static IPs set inside the container, I have to run ifup after each restart. Which is REALLY inconvenient when you need a container to start up and provide DHCP and DNS to your network…

wusikijeronii · November 15, 2021, 6:34pm

Hello.
After updating AlmaLinux (RHEL-based) in LXC I faced the same problem. The update installed NetworkManager (before that, network-scripts managed the network). The solution for me:

systemctl disable NetworkManager.service
reboot

Ozymandias · March 9, 2022, 11:59am

Unfortunately, in EL8 network-scripts are not installed by default…
So the process for EL8 or any of its clone distros is:

systemctl disable NetworkManager
dhclient eth0
yum -y install network-scripts
systemctl enable network --now

Ozymandias · March 11, 2022, 9:57am

I’ve managed to create a cloud-init bootcmd section which automatically works around this issue for RHEL8-derived containers. It disables NetworkManager, which is ignoring veth-derived devices (e.g. eth0), brings up eth0 manually, and then installs and enables old-school networking, which does work.

config:
  raw.idmap: both 1001 1001
  user.user-data: |-
    #cloud-config
    package_update: true
    timezone: Europe/London
    bootcmd:
      - [ cloud-init-per, once, nmdis, systemctl, disable, NetworkManager, --now ]
      - [ cloud-init-per, once, eth0up, dhclient, eth0 ]
      - [ cloud-init-per, once, epel, yum, -y, install, epel-release, network-scripts ]
      - [ cloud-init-per, once, nwup, systemctl, enable, network, --now ]
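
To check whether a workaround like this actually left the interface with an address after boot, something along these lines could be run inside the container. It is a minimal sketch: the function name is made up for illustration, and it only parses the one-line-per-address output of `ip -o -4 addr show`.

```shell
# Read `ip -o -4 addr show dev <iface>` output on stdin and print each
# IPv4 address with its prefix length (the field after "inet").
ipv4_addrs() {
    awk '$3 == "inet" { print $4 }'
}
```

For example, `ip -o -4 addr show dev eth0 | ipv4_addrs` prints nothing if eth0 never obtained an IPv4 address.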