Is the container operational if you don’t use cloud-init config? I’d first confirm that the container functions with/without cloud-init before looking at the exec problem.
I’ve not noticed any difference whether cloud-init runs or not…
My pylxd script adds proxy devices to the container after cloud-init has completed… I’ve commented those out and it doesn’t appear to lock up any more…
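For reference, a minimal sketch of roughly what that part of the script does (the device name and ports below are placeholders, not my real config; `instances.get()`, the `devices` dict and `save()` are the standard pylxd calls):

```python
def make_proxy_device(listen, connect):
    """Build a proxy device definition in the shape LXD expects."""
    return {"type": "proxy", "listen": listen, "connect": connect}

def add_proxy(instance_name, device_name, listen, connect):
    # Requires pylxd and a reachable LXD daemon; imported here so the
    # helper above stays usable without either.
    from pylxd import Client
    client = Client()  # connects to the local unix socket by default
    inst = client.instances.get(instance_name)
    inst.devices[device_name] = make_proxy_device(listen, connect)
    inst.save(wait=True)  # the update that runs after cloud-init finishes

# Placeholder usage:
# add_proxy("test2", "myproxy", "tcp:0.0.0.0:8080", "tcp:127.0.0.1:80")
```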
I’ll have to run the script a few times to see if it’s a solid lead…
What I mean is your post here 'Error: value too large for defined data type' problem when exec'ing newly created instances - #5 by Ozymandias suggests there are larger issues with the container than not being able to run specific exec commands.
So those need to be resolved first.
Can we back up a bit and clarify the problem?
Can you launch a fresh container with that configuration and then just run
lxc exec <instance> -- bash to get into it?
I created three instances test,test2,test3 using a command like…
lxc launch images:centos/8-Stream/cloud test2
And there are no problems with lxc exec…
However none of them have been assigned an IP address. This is due to the fact that NetworkManager in the cloud image ignores veth devices.
I work around that in my cloud-init by doing the following commands in the bootcmd section…
bootcmd:
  - [ cloud-init-per, once, nmdis, systemctl, disable, NetworkManager, --now ]
  - [ cloud-init-per, once, eth0up, dhclient, eth0 ]
  - [ cloud-init-per, once, epel, yum, -y, install, epel-release, network-scripts ]
  - [ cloud-init-per, once, nwup, systemctl, enable, network, --now ]
And then you can exec into them manually OK?
No problems so far… Though it’s worth noting that sometimes I can exec into the cloud-init’d instances as well… It’s not reproducible 100% of the time…
OK so 'Error: value too large for defined data type' problem when exec'ing newly created instances - #5 by Ozymandias wasn’t really about this thread?
So looking at this, the command being run is just
bash? Is that correct, and that generates the error?
I’m wondering how it is different to what you’re doing when it works in 'Error: value too large for defined data type' problem when exec'ing newly created instances - #12 by Ozymandias
Step by step reproducer steps would be great, thanks!
I don’t understand. There appears to be a problem when I create instances where they stop working and I can’t exec into them.
The NetworkManager issue with the RHEL derivatives has a workaround, and hopefully somebody will fix the images at some point. It isn’t directly relevant to this problem, but since none of the containers have IP addresses it’s a problem in its own right…
I’ve just deleted and recreated the three test instances using cloud-init and cloud-init has succeeded and they are all contactable.
So no difference with or without cloud-init. Which is why I started to look at what else my pylxd script was doing… e.g. the proxies…
Yes the whole networking post confused me, as you should be able to exec into a container even if the network isn’t running. I’ll just discount that part in my head.
But I’m still a bit confused as to the steps that cause the issue, as in 'Error: value too large for defined data type' problem when exec'ing newly created instances - #12 by Ozymandias you said you can exec into the container.
Some more information:-
My pylxd script creates the instance, starts the instance and then execs ‘cloud-init status --wait’ so that I can see the result of the install. That appears to work…
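In outline, the script does something like this (a sketch only; the image details are placeholders, but `instances.create()`, `start()` and `execute()` are the pylxd calls I’m using):

```python
def provision(client, name):
    """Create an instance, start it, and wait for cloud-init to finish."""
    config = {
        "name": name,
        "source": {
            "type": "image",
            "protocol": "simplestreams",
            "server": "https://images.linuxcontainers.org",
            "alias": "centos/8-Stream/cloud",  # placeholder image alias
        },
    }
    inst = client.instances.create(config, wait=True)
    inst.start(wait=True)
    # pylxd's execute() returns (exit_code, stdout, stderr)
    exit_code, stdout, stderr = inst.execute(["cloud-init", "status", "--wait"])
    return exit_code, stdout
```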
However the containers are sometimes broken after that point… Which is odd, as they have IP addresses and the cloud-init run completed.
When I attempt to restart them with 'lxc restart <instance>', that hangs…
When I force delete them with 'lxc restore --force', that adds a default route via lxdbr0 to the host machine’s routing table, and that breaks networking outside the local subnet.
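One way I can watch for that from the host is to check whether a default route points at lxdbr0 (a sketch; it assumes iproute2’s `ip -json route` output, and lxdbr0 is just the standard LXD bridge name):

```python
import json
import subprocess

def default_route_devices(route_json):
    """Interface names carrying a default route, from `ip -json route` output."""
    return [r.get("dev") for r in json.loads(route_json) if r.get("dst") == "default"]

def default_via_lxdbr0():
    """True if the host's default route currently goes via lxdbr0."""
    out = subprocess.run(
        ["ip", "-json", "route", "show", "default"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "lxdbr0" in default_route_devices(out)
```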
A thought has occurred: I’m running my script inside a container which is connecting to the host’s LXD unix socket… Let me repeat the tests from there…
Can you recreate the issue without pylxd out of interest?
lxc restore doesn’t delete? Do you mean
lxc delete -f?
Did you try
lxc restart -f too?
Is that reproducible independent of the exec issue, I’d like to know more about that.
Ah I don’t have any lxc tools in my instance… So I can’t run that test… I’m just about to have another go with my sharding script… I’ll let you know…
OK, so yes, my script is still creating dead instances which override the host’s default route with one to the primary lxdbr0 host.
Now it occurs to me that the issue might be that the script is running inside a container of the host it’s connecting to. I’ve no idea if that could cause a problem, as I’m unfamiliar with the code.
In the meantime I’ll create an Ubuntu instance so I can see whether creation using the LXD client tools causes the issue.
lxdsocket:
  bind: container
  connect: unix:/var/snap/lxd/common/lxd/unix.socket
  gid: "1001"
  listen: unix:/var/run/lxd.socket
  mode: "0660"
  type: proxy
  uid: "0"
There’s nothing in LXD, off the top of my head, that would add a default route on the host. It’s possible that something on your system is adding a default route when the bridge comes up, though. Or there is a DHCP client running on the host that is somehow talking to the dnsmasq DHCP server on lxdbr0.
It’s something out of the ordinary.
The obvious culprit is NetworkManager; it’s managing lxdbr0 (that’s where the default route is from), after all… But it’s strange that it creates a new default profile around the time the instance has to be force deleted/restarted.
I’ve not, to my knowledge, reproduced the problem with the lxc tools… but then 99% of the time it’s the script that creates the containers…