This appears to be a very bad bug in the latest snap LXD package.
Rebooting doesn’t clear the problem, as no containers can be brought back online. The same message is emitted for the base interface, which is enp3s0f0 on this machine.
Since the same problem occurs across many machines, this suggests that rebooting any machine running LXD will result in all of its containers failing to come back online.
It would be great if someone could provide some clue about fixing this.
The link you provided seems unrelated to what I’m seeing, which appears to involve how dnsmasq caches the relations between containers and interfaces.
Since this problem only appears after the snap update from 5.6 → 5.7 (reverting to 5.6 fixes the problem), the likely place to look is the code changes between those versions.
It appears you’ve correctly expressed the exact nature of the bug…
net17 # ip link | head
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether d4:5d:64:3f:ff:24 brd ff:ff:ff:ff:ff:ff
3: enp3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether d4:5d:64:3f:ff:25 brd ff:ff:ff:ff:ff:ff
4: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:16:3e:7f:38:43 brd ff:ff:ff:ff:ff:ff
6: veth23aef49a@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lxdbr0 state UP mode DEFAULT group default qlen 1000
link/ether da:19:20:44:85:0f brd ff:ff:ff:ff:ff:ff link-netnsid 0
So there are no eno* or eth* interfaces on this machine, and there never have been.
Only the above interfaces exist, so this appears to be 5.7 mistakenly “guessing” at what base interface names “should” be, rather than looking them up.
No, in the instance config you have shown there are two NIC devices, both connected to the same bridge, lxdbr0:
devices:
  eno1:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
The devices are named eno1 and eth0.
The instance name will be set up in dnsmasq’s DNS, pointing to the NIC’s DHCP-assigned IP address.
However, if you connect multiple NICs to the same parent bridge (lxdbr0 in this case), then there is the possibility that both NICs will run DHCP, which would result in multiple IPs for the same DNS name and cause unpredictable behaviour.
If you don’t know why you have two NICs on your container, then I suggest removing one (probably eno1, as eth0 is the more conventional name) and seeing if that solves the issue.
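If it helps to confirm that duplicate-record theory, you can compare the addresses LXD reports for the instance with what the dnsmasq listening on lxdbr0 answers for its name. A rough sketch, assuming the default .lxd DNS domain and using 10.x.y.1 as a stand-in for your bridge’s ipv4.address:

lxc list <instance>             # addresses LXD reports for the instance
lxc network show lxdbr0         # note the bridge's ipv4.address
dig @10.x.y.1 <instance>.lxd    # what dnsmasq on lxdbr0 resolves the name to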
Regardless of what LXD did to create this situation, I’m only interested in a solution.
The current solution is to revert to 5.6, which fixes all problems.
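For reference, the revert is just rolling the snap back to the previous revision (or pinning the older channel; the exact channel name is assumed here):

snap revert lxd
# or, pinning the channel explicitly:
snap refresh lxd --channel=5.6/stable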
If you can provide exact commands to attempt a fix, I currently have hundreds of containers in this state where I can try it; for example, it sounds like the fix to try is some sort of lxc config command.
Provide a command to try. I’ll run it, then update this thread with the results.
Indeed. That is what I am trying to get to, but first I need to understand why you have two NICs connected to the same bridge. Without understanding that, I cannot suggest a way forward.
No clue why. This is something LXD has done internally.
This machine has never had an “eth0” or “eno1” interface, so I’m unsure how to proceed.
I still have some machines in this state, so if you can provide commands to kill off the bad interfaces, let me know and I’ll run them on one of the still-broken machines, then report back on what happens.
No, this isn’t correct. LXD never adds a NIC called eno1 automatically.
But it’s possible this was added by you in the past and was never actively used, and it didn’t cause problems until the LXD validation change.
The eth0 NIC is part of the default profile that LXD generates during initialization.
There’s no way for me to know whether your containers are configured to use eno1 or eth0 for their connectivity, but if I were a betting man I would say that, as eth0 is the default NIC, the manually added (and apparently forgotten about) eno1 NIC is the better candidate for removal.
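If you want to verify that eth0 is the profile-provided NIC, something like this should list it under devices (assuming your instances use the standard default profile):

lxc profile show default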
So to remove this from the container use:
lxc config device remove <instance> eno1
If this fails saying the device doesn’t exist, then it’s likely part of a profile.
You can check this by running lxc config show <instance>: if the device doesn’t show up without the --expanded flag, then you can see it’s coming from the profile.
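For example:

lxc config show <instance>               # local devices only
lxc config show <instance> --expanded    # also includes devices inherited from profiles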
To remove it from the profile, you can use:
lxc profile device remove <profile> eno1
Keep in mind this will remove it from all instances using that profile.
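If you want to check which instances would be affected before removing it, something like:

lxc profile show <profile>    # the used_by list shows every instance attached to this profile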