Random IPv6 address dropouts in lxdbr0

tomp · August 5, 2021, 9:03pm

And lxc network show lxdbr0?

BTW, I just checked on my LXD host, and each time an RA is received by the container the IPv6 address lifetime resets back to its maximum value. Check that yours is doing the same; if not, then that's likely the issue. Also, my lifetime is an order of magnitude larger than yours, suggesting something may have set it too low for the RA advertisement interval.

amcduffee · August 5, 2021, 9:05pm

anderson@anderson-ryzen9:~$ lxc network show lxdbr0
config:
  dns.search: lxd,corp.terasci.com
  ipv4.address: 10.11.12.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:c8f3:56ae:8db::1/64
  ipv6.nat: "true"
description: ""
name: lxdbr0
type: bridge
used_by:
- /1.0/instances/build-aosp-1404
- /1.0/instances/build-armbian
- /1.0/instances/build-ipxe
- /1.0/instances/build-stretch
- /1.0/profiles/default
managed: true
status: Created
locations:
- none

amcduffee · August 11, 2021, 9:07pm

I posted my lxdbr0 configuration above.

A few other notes:

The highest lifetime I have seen on the fd42:c8f3:56ae:8db::1/64 addresses is 3600sec or 1hr. This is true both for the interface in the container as well as the lxdbr0 interface on the host. The lifetime counts down and expires on both the host and container interfaces in the same way.
dnsmasq does send an RA every 8-10 minutes, so it is often enough that the 3600sec lifetime above should not be an issue. However, it sends the RA using the fe80:: address on the host lxdbr0 interface. Is it possible this causes the container to ignore it or not apply it due to the RA not coming from the fd42:c8f3:56ae:8db::1/64 network?
I do not see this issue on our production cluster where the containers are connected to a pre-configured br0 bridge to our office network. The production cluster does not use lxdbr0 and therefore there is no dnsmasq involved.
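To make the countdown described above easier to follow, here is a minimal sketch that pulls the lifetimes of dynamic global IPv6 addresses out of `ip -6 addr show` output. The function name `lft_report` is an illustrative assumption and not part of LXD or iproute2; it only reads text from stdin.

```shell
#!/bin/sh
# Print "<address> <valid_lft> <preferred_lft>" for every dynamic global IPv6
# address found in `ip -6 addr show` output read from stdin.
# lft_report is a hypothetical helper name used only for this sketch.
lft_report() {
    awk '
        # An address line, e.g. "inet6 fd42:.../64 scope global dynamic ..."
        $1 == "inet6" && /scope global/ && /dynamic/ { addr = $2; next }
        # The lifetime line that directly follows it.
        addr != "" && $1 == "valid_lft" {
            print addr, $2, $4
            addr = ""
            next
        }
        # Any other line (static address, next interface) resets the state.
        { addr = "" }
    '
}
```

Usage, inside the container or on the host: `ip -6 addr show dev eth0 | lft_report`. Running it before and after an RA shows whether the lifetime was reset.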

tomp · August 11, 2021, 9:11pm

Does it improve if you run lxc network set lxdbr0 ipv6.dhcp.stateful=true?

You may need to restart your containers afterwards so they take an address via DHCPv6.

amcduffee · August 11, 2021, 9:24pm

No change. I set that option, restarted the container and waited for the next RA on tcpdump. After the next RA the lifetime on eth0, in the container, was still dropping.

tomp · August 11, 2021, 9:44pm

OK, let's set that option back to false then.

Try seeing if changing the interval helps:

lxc network set lxdbr0 raw.dnsmasq="ra-param=lxdbr0,10"

You should then see the RAs every 10s. At least inside my Focal container, I can see the lifetime of the IPv6 address reset to 86400sec every 10s:

inet6 fd42:5433:626e:4c0b:216:3eff:fe35:1f9d/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 86400sec preferred_lft 86400sec
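As a rough sanity check on interval versus lifetime, the arithmetic can be sketched like this. The helper name `ras_per_lifetime` is made up for illustration, and it assumes RAs arrive at a fixed interval, which is a simplification.

```shell
#!/bin/sh
# How many RAs are expected to arrive before an address lifetime runs out.
# ras_per_lifetime is a hypothetical helper; both arguments are in seconds.
ras_per_lifetime() {
    lifetime=$1
    interval=$2
    echo $(( lifetime / interval ))
}

ras_per_lifetime 3600 600    # 3600sec lifetime, RA roughly every 10 minutes
ras_per_lifetime 3600 10     # same lifetime with ra-param=lxdbr0,10
ras_per_lifetime 86400 10    # 86400sec lifetime with ra-param=lxdbr0,10
```

This prints 6, 360 and 8640. Even at the slower interval several RAs fit into one lifetime, so a lifetime that still runs down to zero suggests the RAs are not being applied rather than not being sent.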

tomp · August 11, 2021, 9:46pm

No, you would expect to see RAs from the link-local address of the lxdbr0 interface, as mentioned earlier.

tomp · August 11, 2021, 9:53pm

Can you also show output of ip -f inet6 route in your container?

tomp · August 11, 2021, 9:58pm

And finally the output of the following on host and container:

sudo sysctl -a | grep lft

amcduffee · August 11, 2021, 10:03pm

It is working after setting this option. I see the RAs often, and the lifetime in the container resets to 3600sec whenever one arrives.

amcduffee · August 11, 2021, 10:03pm

anderson@build-armbian:~$ ip -f inet6 route
fd42:c8f3:56ae:8db::/64 dev eth0 proto ra metric 100 expires 3595sec pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::216:3eff:fedc:4247 dev eth0 proto ra metric 100 expires 25sec mtu 1500 pref medium

amcduffee · August 11, 2021, 10:06pm

Host:

anderson@anderson-ryzen9:~$ sudo sysctl -a 2>/dev/null | grep lft
net.ipv6.conf.all.temp_prefered_lft = 86400
net.ipv6.conf.all.temp_valid_lft = 604800
net.ipv6.conf.br0.temp_prefered_lft = 86400
net.ipv6.conf.br0.temp_valid_lft = 604800
net.ipv6.conf.default.temp_prefered_lft = 86400
net.ipv6.conf.default.temp_valid_lft = 604800
net.ipv6.conf.enp5s0.temp_prefered_lft = 86400
net.ipv6.conf.enp5s0.temp_valid_lft = 604800
net.ipv6.conf.lo.temp_prefered_lft = 86400
net.ipv6.conf.lo.temp_valid_lft = 604800
net.ipv6.conf.lxdbr0.temp_prefered_lft = 86400
net.ipv6.conf.lxdbr0.temp_valid_lft = 604800
net.ipv6.conf.veth29c6508f.temp_prefered_lft = 86400
net.ipv6.conf.veth29c6508f.temp_valid_lft = 604800
net.ipv6.conf.veth31098ec1.temp_prefered_lft = 86400
net.ipv6.conf.veth31098ec1.temp_valid_lft = 604800
net.ipv6.conf.veth435c3ebf.temp_prefered_lft = 86400
net.ipv6.conf.veth435c3ebf.temp_valid_lft = 604800
net.ipv6.conf.veth62945bd0.temp_prefered_lft = 86400
net.ipv6.conf.veth62945bd0.temp_valid_lft = 604800
net.ipv6.conf.veth80b2d0ea.temp_prefered_lft = 86400
net.ipv6.conf.veth80b2d0ea.temp_valid_lft = 604800

Container:

anderson@build-armbian:~$ sudo sysctl -a 2>/dev/null | grep lft
net.ipv6.conf.all.temp_prefered_lft = 86400
net.ipv6.conf.all.temp_valid_lft = 604800
net.ipv6.conf.default.temp_prefered_lft = 86400
net.ipv6.conf.default.temp_valid_lft = 604800
net.ipv6.conf.eth0.temp_prefered_lft = 86400
net.ipv6.conf.eth0.temp_valid_lft = 604800
net.ipv6.conf.lo.temp_prefered_lft = 86400
net.ipv6.conf.lo.temp_valid_lft = 604800

tomp · August 11, 2021, 10:07pm

Cool, glad it's helped. I honestly don’t know where the 3600 is coming from. I suspect something on your host system or kernel is affecting the behaviour of the RAs.

tomp · August 11, 2021, 10:09pm

Does the expires value on the default route ever go above 25sec? If not, then that is likely the problem; it's very low.
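To check the default route's expiry repeatedly without reading the whole routing table by eye, a small filter along these lines could be used. The function name `default_route_expiry` is an assumption made for this sketch; it only parses `ip -f inet6 route` output from stdin.

```shell
#!/bin/sh
# Print the remaining lifetime, in seconds, of the first IPv6 default route
# found on stdin. Prints nothing if the route has no "expires" field.
# default_route_expiry is a hypothetical helper name.
default_route_expiry() {
    awk '$1 == "default" {
        for (i = 1; i < NF; i++)
            if ($i == "expires") {
                sub(/sec$/, "", $(i + 1))   # "25sec" -> "25"
                print $(i + 1)
                exit
            }
    }'
}
```

Usage inside the container: `ip -f inet6 route | default_route_expiry`. For the route table posted above it would print 25.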

amcduffee · August 11, 2021, 10:10pm

Yes:

anderson@build-armbian:~$ ip -f inet6 route
fd42:c8f3:56ae:8db::/64 dev eth0 proto ra metric 100 expires 3516sec pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::216:3eff:fedc:4247 dev eth0 proto ra metric 100 expires 1716sec mtu 1500 pref medium

amcduffee · August 11, 2021, 10:23pm

Yea, I am scratching my head on some of these things too. I have done the following to see what happens:

lxc network unset lxdbr0 raw.dnsmasq

It appears the RAs are still working as expected but back to the 8-10min interval from before. It seems as if the lxc network set lxdbr0 ipv6.dhcp.stateful=true might have affected something important.

I am curious if it will revert back to being broken if I stop all my containers and let the address lifetimes expire on lxdbr0.

tomp · August 11, 2021, 10:31pm

Really, lxdbr0 should only have static addresses assigned to it and no dynamic ones, so that too is something unusual in the way your system is behaving or configured.

Here’s mine:

4: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 00:16:3e:f6:f3:4a brd ff:ff:ff:ff:ff:ff
    inet 10.156.236.1/24 scope global lxdbr1
       valid_lft forever preferred_lft forever
    inet6 fd42:5433:626e:4c0b::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fef6:f34a/64 scope link 
       valid_lft forever preferred_lft forever

amcduffee · August 11, 2021, 10:44pm

I have no idea where the dynamic ones are coming from. It isn’t any configuration that I recall setting myself. I use netplan to configure the host interface, here is that config:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp5s0:
      optional: true
      match:
        macaddress: a8:5e:45:e1:9c:ab

  bridges:
    br0:
      dhcp4: yes
      dhcp6: no
      interfaces: [enp5s0]
      parameters:
        forward-delay: 2
        stp: true

Also, I did the following:

anderson@anderson-ryzen9:~$ lxc stop --all
anderson@anderson-ryzen9:~$ lxc start build-armbian

So now the build-armbian container is the only one running, and the IPv6 addresses aren’t working at all:

anderson@anderson-ryzen9:~$ ip addr show lxdbr0
4: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:dc:42:47 brd ff:ff:ff:ff:ff:ff
    inet 10.11.12.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:c8f3:56ae:8db:216:3eff:fedc:4247/64 scope global deprecated dynamic mngtmpaddr noprefixroute 
       valid_lft 2228sec preferred_lft 0sec
    inet6 fe80::216:3eff:fedc:4247/64 scope link 
       valid_lft forever preferred_lft forever

anderson@build-armbian:~$ ip addr show eth0
15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:fa:f9:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.11.12.199/24 brd 10.11.12.255 scope global dynamic eth0
       valid_lft 2722sec preferred_lft 2722sec
    inet6 fd42:c8f3:56ae:8db:216:3eff:fefa:f95e/64 scope global deprecated dynamic mngtmpaddr noprefixroute 
       valid_lft 2083sec preferred_lft 0sec
    inet6 fe80::216:3eff:fefa:f95e/64 scope link 
       valid_lft forever preferred_lft forever

0sec preferred and marked as deprecated on host and in container…
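An address whose preferred lifetime has run out but whose valid lifetime has not is flagged as deprecated: it can still be used, but it should not be picked as the source for new connections. A minimal sketch for spotting that state in `ip -6 addr show` output follows; the function name `deprecated_addrs` is an illustrative assumption.

```shell
#!/bin/sh
# List IPv6 addresses flagged "deprecated" together with their remaining
# valid lifetime, reading `ip -6 addr show` output from stdin.
# deprecated_addrs is a hypothetical helper name.
deprecated_addrs() {
    awk '
        # Remember a deprecated address until its lifetime line arrives.
        $1 == "inet6" && / deprecated / { addr = $2; next }
        addr != "" && $1 == "valid_lft" {
            print addr, "valid_lft=" $2
            addr = ""
            next
        }
        { addr = "" }
    '
}
```

Usage: `ip -6 addr show dev lxdbr0 | deprecated_addrs` on the host, or the same against eth0 in the container. No output means no address is currently deprecated.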

amcduffee · August 12, 2021, 12:35am

I had to do a snap restart lxd a couple of times to get things back to what looked like a sane state. The first restart took a very long time (4-5 mins) and came back with lxdbr0 not having an inet6 fd42:c8f3:56ae:8db::1/64 scope global assignment at all. Odd…

After the second snap restart:

anderson@anderson-ryzen9:~$ ip addr show lxdbr0
30: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:dc:42:47 brd ff:ff:ff:ff:ff:ff
    inet 10.11.12.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:c8f3:56ae:8db::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fedc:4247/64 scope link 
       valid_lft forever preferred_lft forever

Looks ok, and then after the first RA was sent:

anderson@anderson-ryzen9:~$ ip addr show lxdbr0
30: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:dc:42:47 brd ff:ff:ff:ff:ff:ff
    inet 10.11.12.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:c8f3:56ae:8db:216:3eff:fedc:4247/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 3389sec preferred_lft 3389sec
    inet6 fd42:c8f3:56ae:8db::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fedc:4247/64 scope link 
       valid_lft forever preferred_lft forever

Back to having a dynamic assignment on lxdbr0:
inet6 fd42:c8f3:56ae:8db:216:3eff:fedc:4247/64

The container did refresh its IPv6 address back to 3600sec on the first RA, but it did not on subsequent ones.

After the second RA didn’t do anything I reapplied:
lxc network set lxdbr0 raw.dnsmasq="ra-param=lxdbr0,10"

And everything fixed itself with all lifetimes on host and container immediately starting at 3600sec and being refreshed often.

Then I stopped build-armbian which was the only container running and lxdbr0 lost both the IPv4 and IPv6 addresses:

anderson@anderson-ryzen9:~$ ip addr show lxdbr0
30: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 00:16:3e:dc:42:47 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::216:3eff:fedc:4247/64 scope link 
       valid_lft forever preferred_lft forever

If I start build-armbian again I see the following on the host and container:

anderson@anderson-ryzen9:~$ ip addr show lxdbr0
30: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:dc:42:47 brd ff:ff:ff:ff:ff:ff
    inet 10.11.12.1/24 brd 10.11.12.255 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fedc:4247/64 scope link 
       valid_lft forever preferred_lft forever

anderson@build-armbian:~$ ip addr show eth0
37: eth0@if38: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:fa:f9:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.11.12.199/24 brd 10.11.12.255 scope global dynamic eth0
       valid_lft 3560sec preferred_lft 3560sec
    inet6 fd42:c8f3:56ae:8db:216:3eff:fefa:f95e/64 scope global deprecated dynamic mngtmpaddr noprefixroute 
       valid_lft 3189sec preferred_lft 0sec
    inet6 fe80::216:3eff:fefa:f95e/64 scope link 
       valid_lft forever preferred_lft forever

Multiple start/stop cycles on build-armbian don’t fix it once the interfaces get into the preferred_lft 0sec / deprecated state, but the addresses do continue to work, I presume because the valid_lft is non-zero:

anderson@anderson-ryzen9:~$ ping6 fd42:c8f3:56ae:8db:216:3eff:fefa:f95e
PING fd42:c8f3:56ae:8db:216:3eff:fefa:f95e(fd42:c8f3:56ae:8db:216:3eff:fefa:f95e) 56 data bytes
64 bytes from fd42:c8f3:56ae:8db:216:3eff:fefa:f95e: icmp_seq=1 ttl=64 time=0.096 ms
64 bytes from fd42:c8f3:56ae:8db:216:3eff:fefa:f95e: icmp_seq=2 ttl=64 time=0.077 ms

So, another sudo snap restart lxd to try to get things back to sane:

anderson@anderson-ryzen9:~$ lxc stop build-armbian
anderson@anderson-ryzen9:~$ sudo snap restart lxd
Restarted.
anderson@anderson-ryzen9:~$ ip addr show lxdbr0
41: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 00:16:3e:dc:42:47 brd ff:ff:ff:ff:ff:ff
    inet 10.11.12.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:c8f3:56ae:8db::1/64 scope global 
       valid_lft forever preferred_lft forever
anderson@anderson-ryzen9:~$ pgrep -fa dnsmasq
4277 dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --pid-file= --no-ping --interface=lxdbr0 --dhcp-rapid-commit --listen-address=10.11.12.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-option-force=119,lxd,corp.terasci.com --dhcp-range 10.11.12.2,10.11.12.254,1h --listen-address=fd42:c8f3:56ae:8db::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd --interface-name _gateway.lxd,lxdbr0 -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd -g lxd
anderson@anderson-ryzen9:~$ cat /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw
ra-param=lxdbr0,60
anderson@anderson-ryzen9:~$ lxc start build-armbian

And back to not refreshing again:

anderson@build-armbian:~$ ip addr show eth0
42: eth0@if43: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:fa:f9:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.11.12.199/24 brd 10.11.12.255 scope global dynamic eth0
       valid_lft 3153sec preferred_lft 3153sec
    inet6 fd42:c8f3:56ae:8db:216:3eff:fefa:f95e/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 3156sec preferred_lft 3156sec
    inet6 fe80::216:3eff:fefa:f95e/64 scope link 
       valid_lft forever preferred_lft forever

Applying a different raw.dnsmasq value fixes it, but only for a single RA:

anderson@anderson-ryzen9:~$ lxc network set lxdbr0 raw.dnsmasq="ra-param=lxdbr0,30"
anderson@anderson-ryzen9:~$ cat /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw
ra-param=lxdbr0,30

anderson@build-armbian:~$ ip addr show eth0
42: eth0@if43: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:fa:f9:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.11.12.199/24 brd 10.11.12.255 scope global dynamic eth0
       valid_lft 3082sec preferred_lft 3082sec
    inet6 fd42:c8f3:56ae:8db:216:3eff:fefa:f95e/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 3587sec preferred_lft 3587sec
    inet6 fe80::216:3eff:fefa:f95e/64 scope link 
       valid_lft forever preferred_lft forever

// wait a while...

anderson@build-armbian:~$ ip addr show eth0
42: eth0@if43: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:fa:f9:5e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.11.12.199/24 brd 10.11.12.255 scope global dynamic eth0
       valid_lft 2482sec preferred_lft 2482sec
    inet6 fd42:c8f3:56ae:8db:216:3eff:fefa:f95e/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 2987sec preferred_lft 2987sec
    inet6 fe80::216:3eff:fefa:f95e/64 scope link 
       valid_lft forever preferred_lft forever

So, still trying to figure out what action in particular gets it to refresh correctly and consistently.
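One way to tell "refreshed once" apart from "refreshed consistently" is to record the valid_lft at intervals and count how often it jumps upwards. This is only a sketch; the function name `count_refreshes` and the one-sample-per-line input format are assumptions made for illustration.

```shell
#!/bin/sh
# Count how many times a lifetime went up between consecutive samples.
# Input: one valid_lft sample per line, e.g. "3156sec".
# count_refreshes is a hypothetical helper name.
count_refreshes() {
    awk '
        { sub(/sec$/, "", $1) }            # "3156sec" -> "3156"
        NR > 1 && $1 + 0 > prev { n++ }    # lifetime rose: an RA was applied
        { prev = $1 + 0 }
        END { print n + 0 }
    '
}

# The three container samples from this post, oldest first.
printf '%s\n' 3156sec 3587sec 2987sec | count_refreshes
```

For those samples it prints 1: the lifetime was reset once after the raw.dnsmasq change and then only counted down.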

tomp · August 12, 2021, 8:08am

You don’t happen to have something running inside that container that is also advertising IPv6 routes, do you?

How about this:

Stop all containers.
Restart LXD so you get just the static IPv6 assigned and not the dynamic one on lxdbr0.
Wait for an RA from dnsmasq and check no dynamic assignments get added on lxdbr0.
Now launch a fresh container using images:ubuntu/focal and check A) that no dynamic IPs get added to lxdbr0, and B) that the dynamic address inside the new container gets refreshed.

If that all works then we can be more confident there is something unusual about your existing container.
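The four steps above could be scripted roughly as follows. This is a sketch under assumptions: the container name passed in is arbitrary, the 600-second default wait is a guess based on the 8-10 minute RA interval mentioned earlier in the thread, and the `DRY_RUN` switch and function names are made up here so the commands can be reviewed before anything is stopped.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repro_steps NAME [WAIT]
#   NAME: hypothetical name for the fresh test container
#   WAIT: seconds to wait for an RA (assumed default: 600)
repro_steps() {
    name=$1
    wait_secs=${2:-600}
    run lxc stop --all                          # 1. stop all containers
    run sudo snap restart lxd                   # 2. restart LXD
    run sleep "$wait_secs"                      # 3. wait for an RA from dnsmasq
    run ip -6 addr show dev lxdbr0              #    expect only static addresses
    run lxc launch images:ubuntu/focal "$name"  # 4. launch a fresh container
    run sleep "$wait_secs"
    run ip -6 addr show dev lxdbr0              #    A) still no dynamic address
    run lxc exec "$name" -- ip -6 addr show dev eth0   # B) lifetime refreshed?
}
```

Running `DRY_RUN=1 repro_steps ra-test` prints the commands without executing them; dropping `DRY_RUN=1` runs them for real, which stops every container on the host.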