This appears to be a very bad bug in the latest snap LXD package.
Rebooting doesn’t clear the problem, as no containers can be brought back online. The same message is emitted for the base interface, which is enp3s0f0 on this machine.
Since the same problem occurs across many machines, this suggests that rebooting any machine running LXD will result in all of its containers failing to come back online.
It would be great if someone could provide some clue about fixing this.
The link you provided seems unrelated to what I’m seeing, which appears to involve how dnsmasq caches the relations between containers and interfaces.
Since this problem only appears after the snap update from 5.6 → 5.7 (reverting to 5.6 fixes the problem), the likely place to look is the code changes between those versions.
It appears you’ve correctly expressed the exact nature of the bug…
net17 # ip link | head
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether d4:5d:64:3f:ff:24 brd ff:ff:ff:ff:ff:ff
3: enp3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether d4:5d:64:3f:ff:25 brd ff:ff:ff:ff:ff:ff
4: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:16:3e:7f:38:43 brd ff:ff:ff:ff:ff:ff
6: veth23aef49a@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lxdbr0 state UP mode DEFAULT group default qlen 1000
link/ether da:19:20:44:85:0f brd ff:ff:ff:ff:ff:ff link-netnsid 0
So there are no eno* or eth* interfaces on this machine, and there never have been.
Only the above interfaces exist, so this appears to be 5.7 mistakenly “guessing” at what base interface names “should” be, rather than looking them up.
No, in the instance config you have shown there are two NIC devices, both connected to the same bridge, lxdbr0:
devices:
  eno1:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
The devices are named eno1 and eth0.
The instance name will be set up in dnsmasq’s DNS, pointing to the NIC’s DHCP-assigned IP address.
However, if you connect multiple NICs to the same parent bridge (lxdbr0 in this case), then there is the possibility that both NICs will run DHCP, which would result in multiple IPs for the same DNS name and cause unpredictable behaviour.
If you don’t know why you have two NICs on your container, then I suggest removing one (probably eno1, as eth0 is the more conventional name) and seeing if that solves the issue.
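If it helps to confirm that duplicate-record theory, you can compare the addresses LXD reports for the instance with what the dnsmasq listening on lxdbr0 answers for its name. A rough sketch, assuming the default .lxd DNS domain and using 10.x.y.1 as a stand-in for your bridge’s ipv4.address:

lxc list <instance>             # addresses LXD reports for the instance
lxc network show lxdbr0         # note the bridge's ipv4.address
dig @10.x.y.1 <instance>.lxd    # what dnsmasq on lxdbr0 resolves the name to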
Regardless of what LXD did to create this situation, I’m only interested in a solution.
The current solution is to revert to 5.6, which fixes all problems.
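For reference, the revert is just rolling the snap back to the previous revision (or pinning the older channel; the exact channel name is assumed here):

snap revert lxd
# or, pinning the channel explicitly:
snap refresh lxd --channel=5.6/stable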
If you can provide exact commands to attempt a fix, I currently have hundreds of containers in this state where I can try it; for example, it sounds like the fix to try is some sort of lxc config command.
Provide a command to try. I’ll run it, then update this thread with the results.
Indeed. That is what I am trying to get to, but first I need to understand why you have two NICs connected to the same bridge. Without understanding that, I cannot suggest a way forward.
No clue why. This is something LXD has done internally.
This machine has never had an “eth0” or “eno1” interface, so I’m unsure how to proceed.
I still have some machines in this state, so if you can provide commands to kill off the bad interfaces, let me know and I’ll run them on one of the still-broken machines, then report back on what happens.
No, this isn’t correct. LXD never adds a NIC called eno1 automatically.
But it’s possible this was added by you in the past and was never actively used, and it didn’t cause problems until the LXD validation change.
The eth0 NIC is part of the default profile that LXD generates during initialization.
There’s no way for me to know whether your containers are configured to use eno1 or eth0 for their connectivity, but if I were a betting man I would say that, as eth0 is the default NIC, the manually added (and apparently forgotten about) eno1 NIC is the better candidate for removal.
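If you want to verify that eth0 is the profile-provided NIC, something like this should list it under devices (assuming your instances use the standard default profile):

lxc profile show default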
So to remove this from the container use:
lxc config device remove <instance> eno1
If this fails saying the device doesn’t exist, then it’s likely part of a profile.
You can check this by running lxc config show <instance>: if the device doesn’t show up without the --expanded flag, then you can see it’s coming from the profile.
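For example:

lxc config show <instance>               # local devices only
lxc config show <instance> --expanded    # also includes devices inherited from profiles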
To remove it from the profile, you can use:
lxc profile device remove <profile> eno1
Keep in mind this will remove it from all instances using that profile.
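If you want to check which instances would be affected before removing it, something like:

lxc profile show <profile>    # the used_by list shows every instance attached to this profile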