Is that reproducible independent of the exec issue, I’d like to know more about that.
Ah I don’t have any lxc tools in my instance… So I can’t run that test… I’m just about to have another go with my sharding script… I’ll let you know…
OK, so yes, my script is still creating dead instances which override the host's default route with one via the primary host's lxdbr0 address.
Now it occurs to me that the issue might be that the script is running inside a container on the host it's connecting to. I've no idea if that could cause a problem, as I'm unfamiliar with the code.
In the meantime I'll create an Ubuntu instance so I can see if creation using the LXD client tools causes the issue.
lxdsocket:
  bind: container
  connect: unix:/var/snap/lxd/common/lxd/unix.socket
  gid: "1001"
  listen: unix:/var/run/lxd.socket
  mode: "0660"
  type: proxy
  uid: "0"
There's nothing in LXD, off the top of my head, that would add a default route on the host. It's possible that something on your system is adding a default route when the bridge comes up, though. Or there is a DHCP client running on the host that is somehow talking to the dnsmasq DHCP server on lxdbr0.
It's something out of the ordinary.
The obvious culprit is NetworkManager; it's managing lxdbr0 (that's where the default route is from), after all… But it's strange that it only creates a new default profile when the instance has to be force deleted/restarted.
I've not, to my knowledge, created the problem with the lxc tools… but then 99% of the time the script creates the containers…
Well, I just did an 'lxc stop --force' on three dud instances and the host box has gained three new default routes, one for each instance.
default via 10.208.75.1 dev vethccc9e82e proto dhcp metric 102
default via 10.208.75.1 dev veth2192f10f proto dhcp metric 103
default via 10.208.75.1 dev veth17a757ab proto dhcp metric 105
default via 10.21.75.214 dev bond0 proto static metric 300
10.21.0.0/16 dev bond0 proto kernel scope link src 10.21.75.39 metric 300
10.208.75.0/24 dev lxdbr0 proto kernel scope link src 10.208.75.1
10.208.75.0/24 dev vethccc9e82e proto kernel scope link src 10.208.75.190 metric 102
10.208.75.0/24 dev veth2192f10f proto kernel scope link src 10.208.75.96 metric 103
10.208.75.0/24 dev veth17a757ab proto kernel scope link src 10.208.75.209 metric 105
It’ll be something on your host (NetworkManager likely) that is trying to do DHCP.
Although what's interesting is: do those veth interfaces still exist once the container has stopped? If so, it suggests something is hanging onto them, as we delete the host side of the interface and the kernel should then delete the container side; and if the interface were deleted, then so would be any routes to it.
This would explain why it only happens when the container is stopped, as that's the only point the veth pair returns to the host from the instance.
You're correct, there are a lot of veth (lxdbr0) devices left behind after the container is removed.
[22999.456550] lxdbr0: port 3(veth619fe1dc) entered blocking state
[22999.456553] lxdbr0: port 3(veth619fe1dc) entered forwarding state
[24249.493510] lxdbr0: port 4(veth07be26d1) entered blocking state
[24249.493515] lxdbr0: port 4(veth07be26d1) entered disabled state
[24249.493690] device veth07be26d1 entered promiscuous mode
[24249.740301] physqNtLgx: renamed from veth149e95ea
[24249.759314] eth0: renamed from physqNtLgx
[24249.774023] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[24249.778204] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[24249.778344] lxdbr0: port 4(veth07be26d1) entered blocking state
[24249.778347] lxdbr0: port 4(veth07be26d1) entered forwarding state
[24251.267838] lxdbr0: port 4(veth07be26d1) entered disabled state
[24314.170018] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[24314.170422] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[24314.170565] lxdbr0: port 4(veth07be26d1) entered blocking state
[24314.170568] lxdbr0: port 4(veth07be26d1) entered forwarding state
[24376.259717] lxdbr0: port 5(vetheffcf5af) entered blocking state
[24376.259721] lxdbr0: port 5(vetheffcf5af) entered disabled state
[24376.260035] device vetheffcf5af entered promiscuous mode
[24376.458384] physOcfKGw: renamed from veth86bb66ec
[24376.474620] eth0: renamed from physOcfKGw
[24376.489266] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
I’m seeing a lot of IPv6 eth0 link not ready errors as well…
OK, so it's NetworkManager…
Mar 16 17:27:04 DEV-CACHE-QA1 NetworkManager: [1647451624.1014] policy: set ‘Wired connection 6’ (veth0e739796) as default for IPv4 routing and DNS
Mar 16 17:27:05 DEV-CACHE-QA1
Just a hunch, but were your project and/or instance names quite long, as perhaps your original error was caused by this:
I can confirm that the issue with the default routes was caused by NetworkManager… I've added the following to the /etc/NetworkManager/NetworkManager.conf file and the issue no longer occurs. This configuration instructs NetworkManager to leave all veth devices unmanaged. Alternatively you can just disable NetworkManager, but I note that the network-scripts fallback is deprecated, so that might not be a good long-term choice.
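(The exact snippet isn't quoted above, but the standard way to do this with NetworkManager's keyfile plugin is something along these lines:)

```ini
[keyfile]
unmanaged-devices=interface-name:veth*
```

After editing, reload or restart NetworkManager for the change to take effect.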
Do you still see the host-side veth interfaces left over after stopping btw?
Shortening the name sizes to 13 chars made no difference to the 'Value too large for defined data type' issue…
Yes, virtual devices are left orphaned after I force delete the broken instances…
Sounds like a kernel problem not clearing a reference to the peer veth end.
I can clean up orphaned veth devices with a Python script… But the 'Value too large for defined data type' error is a problem… Let's see if an 'lxc restart --force' can work around the issue… Now that the default routes are sorted…
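For anyone following along, a minimal sketch of that cleanup (assuming the leftover host-side devices all match the veth* naming seen in the route output above — the matching pattern is an assumption, adjust to taste):

```python
import re
import subprocess

def parse_veth_names(ip_link_output):
    """Extract names like 'vethccc9e82e' from 'ip -o link' output.

    Lines look like: '12: vethccc9e82e@if11: <BROADCAST,...> mtu 1500 ...'
    """
    names = []
    for line in ip_link_output.splitlines():
        m = re.match(r"\d+:\s+(veth\w+)(@\S+)?:", line)
        if m:
            names.append(m.group(1))
    return names

def delete_orphaned_veths():
    """List host-side veth devices and delete each with 'ip link delete'."""
    out = subprocess.run(
        ["ip", "-o", "link", "show", "type", "veth"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in parse_veth_names(out):
        subprocess.run(["ip", "link", "delete", name], check=True)
```

Note this deletes every veth on the host, so it's only safe once all wanted instances are stopped (or with extra filtering added).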
Yes, so after an 'lxc restart --force <name>' exec works fine… So at least I have a workaround: I can always restart the instance from my script. I can simply enable the services and they will only be started at boot…
It's not elegant but it will work…
Quick update: I switched from using centos/8-Stream/cloud to centos/8-Stream (which doesn't have cloud-init installed). No difference to the state of the instances.
Interestingly doing a con.restart(wait=True) in my script also didn’t make any difference.
OK, so if I comment out my proxy maintenance lines then I don't have any issues… So maybe it's a timing/race condition.
My script does the following…
If the previous version of an instance has proxies, it removes them by deleting them from con.devices and then saving the container with con.save()
Creates a brand new container via lxd.containers.create(task, wait=True)
Starts the newly created container via con.start(wait=True)
Then it adds 1-2 proxies, depending on its role, by adding entries to con.devices
If devices have been modified, it does a con.save()
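The steps above, sketched with pylxd (the function names, instance name, image alias, and proxy addresses are illustrative assumptions, not the actual script):

```python
def proxy_device(listen, connect):
    """Build an LXD proxy device entry in the dict form pylxd uses."""
    return {"type": "proxy", "listen": listen, "connect": connect}

def recreate_with_proxies(client, name, image_alias, proxies):
    """Recreate instance `name` and attach proxy devices after start.

    `client` is a pylxd.Client() instance (pip install pylxd);
    `proxies` maps device name -> (listen, connect) address pairs.
    """
    if client.containers.exists(name):
        old = client.containers.get(name)
        # Drop any proxy devices left on the previous instance, then save.
        for dev in [n for n, d in old.devices.items() if d.get("type") == "proxy"]:
            del old.devices[dev]
        old.save(wait=True)
        old.stop(wait=True, force=True)
        old.delete(wait=True)

    # Create and start a brand new container.
    con = client.containers.create(
        {"name": name, "source": {"type": "image", "alias": image_alias}},
        wait=True,
    )
    con.start(wait=True)

    # Add the proxy devices to the running instance, then save.
    for dev, (listen, connect) in proxies.items():
        con.devices[dev] = proxy_device(listen, connect)
    con.save(wait=True)
    return con
```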
The issue appears to be the proxy config or the save() calls that are producing unusable instances.
If I comment out the con.save(wait=True) for the new devices, the problem doesn't occur…
I'll try changing the code so the proxies are created with the container, rather than added as a modification after the instance is running.
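A sketch of that change: the LXD create request accepts a devices section, so the proxies can go in up front and no post-start save() is needed (names and addresses are again illustrative):

```python
def create_task_with_proxies(name, image_alias, proxies):
    """Build a create request that includes the proxy devices up front.

    `proxies` maps device name -> (listen, connect) address pairs.
    The resulting dict can be passed straight to
    client.containers.create(task, wait=True).
    """
    return {
        "name": name,
        "source": {"type": "image", "alias": image_alias},
        "devices": {
            dev: {"type": "proxy", "listen": listen, "connect": connect}
            for dev, (listen, connect) in proxies.items()
        },
    }
```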