Hello! I am seeing strange behaviour in LXC container networking. I use my custom bridge device (br0) to provide network connectivity as follows:
When a container is started, a vethXXXXXX interface is created and added to the br0 bridge.
So far so good.
When I stop the container, this interface is not deleted, and on the next startup a new one is created. After that, weird things start to happen: duplicate IPv6 addresses are reported, and when the container is stopped its IPv6 address can still be pinged, though nmap does not discover any port on it. IPv6 connectivity to the container is lost after such a restart.
When I manually identify and remove all leftover interfaces (using the brctl delif and ip link del commands), the container starts working properly.
I am using lxc version 3.1.0+really3.0.3-8 from Debian sid.
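The manual identification step described above can be sketched roughly as follows. This is only an illustration, not how lxc itself tracks devices: it lists the ports attached to a bridge by reading /sys/class/net/<bridge>/brif and reports veth-named ports that are not in a caller-supplied set of names known to be in use. The function names and the in_use set are hypothetical, and how you collect the in-use names depends on your setup. The sketch only builds the ip link del commands; it does not run them.

```python
import os

def bridge_ports(bridge, sysfs_root="/sys/class/net"):
    """Return the sorted names of interfaces attached to `bridge`."""
    brif = os.path.join(sysfs_root, bridge, "brif")
    return sorted(os.listdir(brif))

def stale_veth_ports(bridge, in_use, sysfs_root="/sys/class/net"):
    """Ports named veth* on `bridge` that are not in the `in_use` set.

    `in_use` is an assumption: the host-side veth names that running
    containers are known to be using right now.
    """
    return [port for port in bridge_ports(bridge, sysfs_root)
            if port.startswith("veth") and port not in in_use]

def removal_commands(ports):
    """Build (but do not run) one `ip link del` command per port."""
    return [["ip", "link", "del", port] for port in ports]
```

For example, removal_commands(stale_veth_ports("br0", in_use)) returns the commands to review before running anything as root.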
Normally when the last process in a container dies, the kernel destroys all the namespaces, including the network namespace which contains the container side interface of the veth pair used to give your container connectivity.
In your case, something is keeping that network namespace active, which then keeps that veth device active in the container including its IP address.
Unfortunately, deleting the host-side device just papers over the issue: you’re still leaking kernel resources in the background, with little you can really do to track down and fix this.
We have planned kernel work which should make it easier to identify such issues in the future.
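One partial check for what might be holding a network namespace open is to group processes by the namespace they live in. The sketch below reads the /proc/<pid>/ns/net symlinks and is only a starting point: the function name is made up for illustration, and a namespace can also be kept alive by things that do not show up this way (a bind mount of the namespace file, an open file descriptor, or a reference held inside the kernel), so an empty result proves nothing.

```python
import os
from collections import defaultdict

def pids_by_netns(proc_root="/proc"):
    """Map each network namespace id (e.g. 'net:[4026531992]') to its PIDs."""
    groups = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            ns = os.readlink(os.path.join(proc_root, entry, "ns", "net"))
        except OSError:
            # The process exited, or we lack permission to inspect it.
            continue
        groups[ns].append(int(entry))
    return {ns: sorted(pids) for ns, pids in groups.items()}
```

Run as root after stopping the container: any namespace other than the host’s that still lists PIDs is a candidate for whatever is keeping the container-side interface alive.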
Having the same issue on the latest LXD on Arch Linux. I’ve stopped all the containers but still have a ton of veth* interfaces from stopped (and even removed) containers.
I’m sorry, I cleared all dangling interfaces manually and can no longer reproduce it for now. I will post any required info when it happens again. As I mentioned, I use Arch Linux with the newest version of the kernel (6.3.2) and LXD (5.13).
I was having a similar issue. After I had chased down a number of dead ends, lxc still kept leaving the old devices lying behind and, worse, I couldn’t stop NetworkManager from picking up the veth device and managing it (which then stuffed up networking for all other containers).
I kept seeing this entry in syslog:
lxd.daemon[2941]: time="...." level=error msg="Failed to stop device" device=eth0 err="Failed clearing netprio rules for instance \"default\" in project \"nickg-test-platform-test-admin\": device name is empty" instance=nickg-test-platform-test-admin instanceType=container project=default
After this, NetworkManager would start managing the device, and things went downhill from there.
It would be nice if lxd could actually delete these and clean them up, but in the meantime I’ve given up and added some config to tell NetworkManager to leave the device alone:
[keyfile]
unmanaged-devices=interface-name:veth*
I then periodically run a script to clean up all the old veth devices that lxc has left lying behind.
And here’s what’s in syslog right after the container is stopped:
Nov 1 09:45:01 nickg-lp CRON[1471700]: (root) CMD (/usr/local/bin/lldp2facts)
Nov 1 09:45:03 nickg-lp systemd[3457]: Started snap.lxd.lxc-42ad6bef-0966-42d5-9644-2b33bd44563a.scope.
Nov 1 09:45:03 nickg-lp systemd[3457]: snap.lxd.lxc-42ad6bef-0966-42d5-9644-2b33bd44563a.scope: Succeeded.
Nov 1 09:45:47 nickg-lp kernel: [56810.622022] physkmbr73: renamed from eth0
Nov 1 09:45:47 nickg-lp NetworkManager[674263]: <info> [1698785147.8496] manager: (eth0): new Veth device (/org/freedesktop/NetworkManager/Devices/125)
Nov 1 09:45:47 nickg-lp NetworkManager[674263]: <info> [1698785147.8563] device (eth0): interface index 256 renamed iface from 'eth0' to 'physkmbr73'
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: could not get ethtool features for eth0
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: Could not set offload features of eth0: No such device
Nov 1 09:45:47 nickg-lp NetworkManager[674263]: <info> [1698785147.8778] device (physkmbr73): interface index 256 renamed iface from 'physkmbr73' to 'vetha3a93bb0'
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 1 09:45:47 nickg-lp lxd.daemon[2941]: time="2023-11-01T09:45:47+13:00" level=error msg="Failed to stop device" device=eth0 err="Failed clearing netprio rules for instance \"default\" in project \"nickg-test-platform-test-image-build\": device name is empty" instance=nickg-test-platform-test-image-build instanceType=container project=default
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: could not get ethtool features for eth0
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: Could not set offload features of eth0: No such device
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: could not get ethtool features for physkmbr73
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: Could not set offload features of physkmbr73: No such device
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 1 09:45:47 nickg-lp systemd-udevd[1472493]: Using default interface naming scheme 'v245'.
Nov 1 09:45:48 nickg-lp kernel: [56811.507470] kauditd_printk_skb: 1 callbacks suppressed
Nov 1 09:45:48 nickg-lp kernel: [56811.507473] audit: type=1400 audit(1698785148.704:2487): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-nickg-test-platform-test-image-build_</var/snap/lxd/common/lxd>" pid=1472560 comm="apparmor_parser"
Note that I’ve told NetworkManager to leave all veth devices alone; without that config change I’d see DHCP starting up, setting up routes, etc.
Also, here are the two relevant spammy entries I see in ip addr:
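To correlate leftover devices with the instances that failed to clean up, the "Failed to stop device" errors can be pulled out of syslog. This is a minimal sketch that assumes the lxd.daemon lines keep the key=value layout shown in the log above; the function names are made up for illustration.

```python
import re

# Matches key=value and key="quoted value with \" escapes".
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

def parse_fields(line):
    """Return the key=value fields of one log line as a dict."""
    fields = {}
    for key, raw in _FIELD.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Strip the surrounding quotes and unescape inner quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields

def failed_device_stops(lines):
    """Yield (instance, device, err) for each "Failed to stop device" error."""
    for line in lines:
        fields = parse_fields(line)
        if fields.get("msg") == "Failed to stop device":
            yield fields.get("instance"), fields.get("device"), fields.get("err")
```

Feeding it the lines of /var/log/syslog (or journalctl output) gives one tuple per failed device stop, which can then be matched against the veth devices still present on the host.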
256: vetha3a93bb0@veth37f4aa29: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:16:3e:4d:2c:e4 brd ff:ff:ff:ff:ff:ff
257: veth37f4aa29@vetha3a93bb0: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master lxdbr0 state LOWERLAYERDOWN group default qlen 1000
link/ether aa:1e:ae:ea:f0:75 brd ff:ff:ff:ff:ff:ff
And here’s my lxd bridge:
94: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:9b:14:aa brd ff:ff:ff:ff:ff:ff
inet 192.168.110.1/24 scope global lxdbr0
valid_lft forever preferred_lft forever
It’s strange you don’t see a volatile.eth0.host_name setting, as this records the host-side name of the veth interface used at start time. This is then used as part of the cleanup, and if it is missing, as it is in this case, then you will get cleanup issues and errors.
Can you launch a fresh instance and see if that volatile key appears, and whether it gets removed before the instance is stopped (it should only be removed during the instance stop process)?
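A small helper for checking that key on a running instance might look like the sketch below. It shells out to lxc config get, assumes the NIC device is called eth0 as in this thread, and assumes that an unset key yields empty output; the function names and the injectable run parameter are there for illustration and testing only.

```python
import subprocess

def host_name_key(nic="eth0"):
    """Name of the volatile key that records the host-side veth."""
    return f"volatile.{nic}.host_name"

def get_host_veth(instance, nic="eth0", run=subprocess.run):
    """Return the recorded host-side veth name, or None if the key is unset."""
    result = run(
        ["lxc", "config", "get", instance, host_name_key(nic)],
        capture_output=True, text=True, check=True,
    )
    # Assumption: an unset key prints nothing but a newline.
    return result.stdout.strip() or None
```

Calling get_host_veth on the instance name right after launch, and again just before stopping it, shows whether the key is present while the instance is running.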
Thanks. Could you please open an issue for this on the canonical/lxd GitHub issue tracker, with the info you have posted here, so we can keep track of it? Thanks.
I could reproduce the issue on openSUSE Leap 15.6 using Incus 6.5. My container failed to start (due to a setuid misconfiguration), and I stopped Incus to fix the misconfiguration, but the network interface stayed there. Now that Incus has forked from LXD, I wonder if we should file the issue again on the Incus GitHub to keep track of it.