Apparent race between SLAAC and NetworkManager

Good morning all

I’ve hit an unusual issue with running RHEL9-based containers on both RHEL7 and RHEL9 hosts.

Symptoms

When starting the RHEL9 container, at times the network will come online with only an IPv6 address; IPv4 remains unassigned.

Investigation

This appears to be down to the fact that when NetworkManager starts, it believes that eth0 is already being managed and, as such, does not try to manage it.
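
If it helps with diagnosis, a quick way to check how NetworkManager classifies eth0 in the failure case (these are standard nmcli queries, not output I have captured yet):

# Does NetworkManager consider eth0 managed/connected,
# and which addressing did it actually apply?
nmcli device status
nmcli -f GENERAL,IP4,IP6 device show eth0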

I’ve found three workarounds, all of which feel like hacks.

  1. Disable IPv6 entirely in the instance via /etc/sysctl.conf
  2. Remove the IPv6 configuration from the lxdnet0 bridge
  3. Add ExecStartPre=/usr/sbin/ip addr flush dev eth0 to NetworkManager.service (sketched below)
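
For reference, workaround 3 looks roughly like this as a systemd drop-in (the drop-in file name is my own choice; systemctl edit NetworkManager.service achieves the same thing):

mkdir -p /etc/systemd/system/NetworkManager.service.d
cat > /etc/systemd/system/NetworkManager.service.d/flush-eth0.conf <<'EOF'
[Service]
# Flush any kernel-assigned (SLAAC) addresses on eth0 before
# NetworkManager starts, so it configures the interface itself
ExecStartPre=/usr/sbin/ip addr flush dev eth0
EOF
systemctl daemon-reload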

The IPv6 address is a SLAAC-assigned IP, which I believe is negotiated within the kernel (happy to be told I’m wrong).

To help diagnose this, I eventually added the line below to both systemd-sysctl.service and NetworkManager.service.

ExecStartPre=/usr/sbin/ip addr show

I added this to systemd-sysctl.service as well, because setting the values below appeared not to help and, at times, appeared not to take effect.

net.ipv6.conf.all.autoconf=0
net.ipv6.conf.default.autoconf=0
net.ipv6.conf.default.accept_ra=0
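
A quick way to check, from inside the container, whether those values actually applied to eth0. As far as I understand it, net.ipv6.conf.default.* only seeds interfaces created after the value is set, which may be why it sometimes appears not to take effect:

# Runtime values that control kernel SLAAC on eth0
sysctl net.ipv6.conf.all.autoconf \
       net.ipv6.conf.default.autoconf \
       net.ipv6.conf.eth0.autoconf \
       net.ipv6.conf.eth0.accept_ra

# An address that was already assigned stays until it is flushed or expires
ip -6 addr show dev eth0 scope global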

What I can see from running /usr/sbin/ip addr show is that, very early on, there is already a SLAAC-assigned IP address present within the namespace before either systemd-sysctl.service or NetworkManager.service has run.

This is the output prior to systemd-sysctl.service being run, and we can see that no SLAAC address has been assigned yet (only the tentative link-local address).

Nov 30 12:58:50 doug-test2 ip[39]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Nov 30 12:58:50 doug-test2 ip[39]: link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Nov 30 12:58:50 doug-test2 ip[39]: inet 127.0.0.1/8 scope host lo
Nov 30 12:58:50 doug-test2 ip[39]: valid_lft forever preferred_lft forever
Nov 30 12:58:50 doug-test2 ip[39]: inet6 ::1/128 scope host
Nov 30 12:58:50 doug-test2 ip[39]: valid_lft forever preferred_lft forever
Nov 30 12:58:50 doug-test2 ip[39]: 7238: eth0@if7239: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Nov 30 12:58:50 doug-test2 ip[39]: link/ether 00:16:3e:02:9d:c9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Nov 30 12:58:50 doug-test2 ip[39]: inet6 fe80::216:3eff:fe02:9dc9/64 scope link tentative
Nov 30 12:58:50 doug-test2 ip[39]: valid_lft forever preferred_lft forever

Shortly afterwards, NetworkManager.service is started, where I also have the ExecStartPre statement.

Nov 30 12:58:50 doug-test2 ip[62]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Nov 30 12:58:50 doug-test2 ip[62]: link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Nov 30 12:58:50 doug-test2 ip[62]: inet 127.0.0.1/8 scope host lo
Nov 30 12:58:50 doug-test2 ip[62]: valid_lft forever preferred_lft forever
Nov 30 12:58:50 doug-test2 ip[62]: inet6 ::1/128 scope host
Nov 30 12:58:50 doug-test2 ip[62]: valid_lft forever preferred_lft forever
Nov 30 12:58:50 doug-test2 ip[62]: 7238: eth0@if7239: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Nov 30 12:58:50 doug-test2 ip[62]: link/ether 00:16:3e:02:9d:c9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Nov 30 12:58:50 doug-test2 ip[62]: inet6 fd42:5823:ba6e:dbc8:216:3eff:fe02:9dc9/64 scope global tentative dynamic mngtmpaddr
Nov 30 12:58:50 doug-test2 ip[62]: valid_lft forever preferred_lft forever
Nov 30 12:58:50 doug-test2 ip[62]: inet6 fe80::216:3eff:fe02:9dc9/64 scope link tentative
Nov 30 12:58:50 doug-test2 ip[62]: valid_lft forever preferred_lft forever

Here you can see that, before NetworkManager has started, the following is already present.

Nov 30 12:58:50 doug-test2 ip[62]: link/ether 00:16:3e:02:9d:c9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Nov 30 12:58:50 doug-test2 ip[62]: inet6 fd42:5823:ba6e:dbc8:216:3eff:fe02:9dc9/64 scope global tentative dynamic mngtmpaddr
Nov 30 12:58:50 doug-test2 ip[62]: valid_lft forever preferred_lft forever

I have a simple script in place that restarts the instance and looks for an IPv4 address; if one is present, it restarts the container again, and simply loops until there is no IPv4 address.
I can hit the above failure within ~300 reboots, but sometimes in as few as one.
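
For completeness, the reproduction loop is along these lines (a sketch; instance name as in my environment, timings picked arbitrarily):

#!/bin/bash
# Restart the container until it comes up without an IPv4 address on eth0
while true; do
    lxc restart doug-test2
    sleep 20   # give NetworkManager time to (possibly) run DHCP
    if ! lxc exec doug-test2 -- ip -4 addr show dev eth0 | grep -q 'inet '; then
        echo "reproduced: eth0 has no IPv4 address"
        break
    fi
done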

I’m guessing more information will be needed; I’m happy to provide as much as possible.

Bridge configuration below:

config:
  dns.domain: localhost
  ipv4.address: 172.16.81.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:5823:ba6e:dbc8::1/64
  ipv6.nat: "true"
  raw.dnsmasq: conf-file=/etc/dnsmasq.d/lxd
description: ""
name: lxdnet0
type: bridge
used_by:
- /1.0/instances/doug-test
- /1.0/instances/doug-test2
- /1.0/instances/doug-test3
- /1.0/instances/doug-test4
- /1.0/profiles/default
managed: true
status: Created
locations:
- none
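
This is the bridge that workaround 2 touches; removing its IPv6 configuration amounts to roughly the following (note this removes the RA/SLAAC source for every container on lxdnet0):

# Workaround 2: drop IPv6 from the managed bridge entirely
lxc network set lxdnet0 ipv6.address none
lxc network unset lxdnet0 ipv6.nat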

To see whether this is just something with our local deployment: is anyone else using RHEL9?

There are bugs with RHEL 9; same here.
I have a RHEL 9 container with an IPv6 /48 and it works for some time, then starts timing out.
After the provider checked it, they figured the issue is only with Rocky Linux 9 (same as RHEL 9), because the Debian images work fine.