Is that reproducible independent of the exec issue, I’d like to know more about that.
Ah I don’t have any lxc tools in my instance… So I can’t run that test… I’m just about to have another go with my sharding script… I’ll let you know…
OK, so yes, my script is still creating dead instances which override the host's default route with one via the primary host's lxdbr0 address.
Now it occurs to me that the issue might be that the script is running inside a container on the host it's connecting to. I've no idea if that could cause a problem, as I'm unfamiliar with the code.
In the meantime I'll create an Ubuntu instance so I can see if creation using the LXD client tools causes the issue.
lxdsocket:
  bind: container
  connect: unix:/var/snap/lxd/common/lxd/unix.socket
  gid: "1001"
  listen: unix:/var/run/lxd.socket
  mode: "0660"
  type: proxy
  uid: "0"
There's nothing in LXD, off the top of my head, that would add a default route on the host. It's possible that something on your system is adding a default route when the bridge comes up, though. Or there is a DHCP client running on the host that is somehow talking to the dnsmasq DHCP server on lxdbr0.
It's something out of the ordinary.
The obvious culprit is NetworkManager; it's managing lxdbr0 (that's where the default route is from), after all… But it's strange that it only creates a new default profile when the instance has to be force deleted/restarted.
I've not, to my knowledge, created the problem with the lxc tools… but then 99% of the time the script creates the containers…
Well, I just did an 'lxc stop --force' on three dud instances and the host box has gained three new default routes, one for each instance.
default via 10.208.75.1 dev vethccc9e82e proto dhcp metric 102
default via 10.208.75.1 dev veth2192f10f proto dhcp metric 103
default via 10.208.75.1 dev veth17a757ab proto dhcp metric 105
default via 10.21.75.214 dev bond0 proto static metric 300
10.21.0.0/16 dev bond0 proto kernel scope link src 10.21.75.39 metric 300
10.208.75.0/24 dev lxdbr0 proto kernel scope link src 10.208.75.1
10.208.75.0/24 dev vethccc9e82e proto kernel scope link src 10.208.75.190 metric 102
10.208.75.0/24 dev veth2192f10f proto kernel scope link src 10.208.75.96 metric 103
10.208.75.0/24 dev veth17a757ab proto kernel scope link src 10.208.75.209 metric 105
It’ll be something on your host (NetworkManager likely) that is trying to do DHCP.
Although what's interesting is: do those veth interfaces still exist once the container has stopped? If so, it suggests something is hanging onto them, as we delete the host side of the interface and the kernel should then delete the container side; and if the interface were deleted, then so would be any routes to it.
This would explain why it only happens when the container is stopped, as that's the only point the veth pair returns to the host from the instance.
You're correct, there are a lot of veth (lxdbr0) devices left behind after the container is removed.
[22999.456550] lxdbr0: port 3(veth619fe1dc) entered blocking state
[22999.456553] lxdbr0: port 3(veth619fe1dc) entered forwarding state
[24249.493510] lxdbr0: port 4(veth07be26d1) entered blocking state
[24249.493515] lxdbr0: port 4(veth07be26d1) entered disabled state
[24249.493690] device veth07be26d1 entered promiscuous mode
[24249.740301] physqNtLgx: renamed from veth149e95ea
[24249.759314] eth0: renamed from physqNtLgx
[24249.774023] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[24249.778204] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[24249.778344] lxdbr0: port 4(veth07be26d1) entered blocking state
[24249.778347] lxdbr0: port 4(veth07be26d1) entered forwarding state
[24251.267838] lxdbr0: port 4(veth07be26d1) entered disabled state
[24314.170018] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[24314.170422] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[24314.170565] lxdbr0: port 4(veth07be26d1) entered blocking state
[24314.170568] lxdbr0: port 4(veth07be26d1) entered forwarding state
[24376.259717] lxdbr0: port 5(vetheffcf5af) entered blocking state
[24376.259721] lxdbr0: port 5(vetheffcf5af) entered disabled state
[24376.260035] device vetheffcf5af entered promiscuous mode
[24376.458384] physOcfKGw: renamed from veth86bb66ec
[24376.474620] eth0: renamed from physOcfKGw
[24376.489266] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
I’m seeing a lot of IPv6 eth0 link not ready errors as well…
OK, so it's NetworkManager…
Mar 16 17:27:04 DEV-CACHE-QA1 NetworkManager: [1647451624.1014] policy: set ‘Wired connection 6’ (veth0e739796) as default for IPv4 routing and DNS
Mar 16 17:27:05 DEV-CACHE-QA1
Just a hunch, but were your project and/or instance names quite long, as perhaps your original error was caused by this:
I can confirm that the issue with the default routes was caused by NetworkManager… I've added the following to the /etc/NetworkManager/NetworkManager.conf file and the issue no longer occurs. This configuration instructs NetworkManager to leave all veth devices unmanaged. Alternatively you can just disable NetworkManager, but I note that the network-scripts fallback is deprecated, so that might not be a good long-term choice.
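(The exact snippet isn't quoted above, but the standard way to do this with NetworkManager's keyfile plugin is something along these lines:)

```ini
[keyfile]
unmanaged-devices=interface-name:veth*
```

After editing, reload or restart NetworkManager for the change to take effect.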
Do you still see the host-side veth interfaces left over after stopping btw?
Shortening the name sizes to 13 chars made no difference to the 'Value too large for defined data type' issue…
Yes, virtual devices are left orphaned after I force delete the broken instances…
Sounds like a kernel problem not clearing a reference to the peer veth end.
I can clean up orphaned veth devices with a Python script… But the 'Value too large for defined data type' error is a problem… Let's see if an 'lxc restart --force' can work around the issue… Now that the default routes are sorted…
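For anyone following along, a minimal sketch of that cleanup (assuming the leftover host-side devices all match the veth* naming seen in the route output above — the matching pattern is an assumption, adjust to taste):

```python
import re
import subprocess

def parse_veth_names(ip_link_output):
    """Extract names like 'vethccc9e82e' from 'ip -o link' output.

    Lines look like: '12: vethccc9e82e@if11: <BROADCAST,...> mtu 1500 ...'
    """
    names = []
    for line in ip_link_output.splitlines():
        m = re.match(r"\d+:\s+(veth\w+)(@\S+)?:", line)
        if m:
            names.append(m.group(1))
    return names

def delete_orphaned_veths():
    """List host-side veth devices and delete each with 'ip link delete'."""
    out = subprocess.run(
        ["ip", "-o", "link", "show", "type", "veth"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in parse_veth_names(out):
        subprocess.run(["ip", "link", "delete", name], check=True)
```

Note this deletes every veth on the host, so it's only safe once all wanted instances are stopped (or with extra filtering added).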
Yes, so after an 'lxc restart --force <name>' exec works fine… So at least I have a workaround: I can always restart the instance from my script. I can simply enable the services and they will only be started at boot…
It's not elegant but it will work…
Quick update: I switched from using centos/8-Stream/cloud to centos/8-Stream (which doesn't have cloud-init installed). No difference to the state of the instances.
Interestingly doing a con.restart(wait=True) in my script also didn’t make any difference.
OK, so if I comment out my proxy maintenance lines then I don't have any issues… So maybe it's a timing/race condition.
My script does the following…
If the previous version of an instance has proxies, it removes them by deleting them from con.devices and then saving the container with con.save()
Creates a brand new container via lxd.containers.create(task, wait=True)
Starts the newly created container via con.start(wait=True)
Then it adds 1-2 proxies, depending on its role, by adding entries to con.devices
If devices have been modified, it does a con.save()
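The steps above, sketched with pylxd (the function names, instance name, image alias, and proxy addresses are illustrative assumptions, not the actual script):

```python
def proxy_device(listen, connect):
    """Build an LXD proxy device entry in the dict form pylxd uses."""
    return {"type": "proxy", "listen": listen, "connect": connect}

def recreate_with_proxies(client, name, image_alias, proxies):
    """Recreate instance `name` and attach proxy devices after start.

    `client` is a pylxd.Client() instance (pip install pylxd);
    `proxies` maps device name -> (listen, connect) address pairs.
    """
    if client.containers.exists(name):
        old = client.containers.get(name)
        # Drop any proxy devices left on the previous instance, then save.
        for dev in [n for n, d in old.devices.items() if d.get("type") == "proxy"]:
            del old.devices[dev]
        old.save(wait=True)
        old.stop(wait=True, force=True)
        old.delete(wait=True)

    # Create and start a brand new container.
    con = client.containers.create(
        {"name": name, "source": {"type": "image", "alias": image_alias}},
        wait=True,
    )
    con.start(wait=True)

    # Add the proxy devices to the running instance, then save.
    for dev, (listen, connect) in proxies.items():
        con.devices[dev] = proxy_device(listen, connect)
    con.save(wait=True)
    return con
```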
The issue appears to be the proxy config or the save() calls that are producing unusable instances.
If I comment out the con.save(wait=True) for the new devices, the problem doesn't occur…
I'll try changing the code so the proxies are created with the container, rather than added as a modification after the instance is running.
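A sketch of that change: the LXD create request accepts a devices section, so the proxies can go in up front and no post-start save() is needed (names and addresses are again illustrative):

```python
def create_task_with_proxies(name, image_alias, proxies):
    """Build a create request that includes the proxy devices up front.

    `proxies` maps device name -> (listen, connect) address pairs.
    The resulting dict can be passed straight to
    client.containers.create(task, wait=True).
    """
    return {
        "name": name,
        "source": {"type": "image", "alias": image_alias},
        "devices": {
            dev: {"type": "proxy", "listen": listen, "connect": connect}
            for dev, (listen, connect) in proxies.items()
        },
    }
```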