BUG: VMs created with lxc copy get the same IPv4 address from dnsmasq

Using the currently latest LXD v5.6-794016a (rev 23680), I'm seeing that virtual machines created with the lxc copy command get the same IPv4 address as the source VM.

In fact, dnsmasq seems to hand the same IPv4 address to whichever VM started most recently, overwriting the existing lease even though that lease carries the correct MAC address.

Am I missing a step in the cloning workflow, or is this a bug in lxd/dnsmasq?
Is there a fix or workaround so that I can keep the cloning operations automated?

Quick facts:

  • source & destination VMs have different MAC addresses
  • it doesn’t matter whether the source is a VM instance or a VM snapshot
  • it doesn’t matter whether the source instance is stopped or running at copy time

Detailed tests

initial state:

lxc list sqlnode2
+----------+---------+--------------------------+------+-----------------+-----------+
|   NAME   |  STATE  |           IPV4           | IPV6 |      TYPE       | SNAPSHOTS |
+----------+---------+--------------------------+------+-----------------+-----------+
| sqlnode2 | RUNNING | 192.168.111.106 (enp5s0) |      | VIRTUAL-MACHINE | 2         |
+----------+---------+--------------------------+------+-----------------+-----------+

grep sqlnode2 /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases
1665826039 00:16:3e:31:ec:db 192.168.111.106 sqlnode2 ff:49:72:1f:47:00:02:00:00:ab:11:d1:ae:8f:7f:50:7e:b1:88

cat /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/sqlnode2*
00:16:3e:31:ec:db,sqlnode2

lxc exec sqlnode2 -- ip address show dev enp5s0
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:16:3e:31:ec:db brd ff:ff:ff:ff:ff:ff
    inet 192.168.111.106/24 brd 192.168.111.255 scope global dynamic enp5s0
       valid_lft 3471sec preferred_lft 3471sec
    inet6 fe80::216:3eff:fe31:ecdb/64 scope link
       valid_lft forever preferred_lft forever

After lxc copy --instance-only sqlnode2 sqlnode21:

lxc config get sqlnode2 volatile.eth0.hwaddr
00:16:3e:31:ec:db

lxc config get sqlnode21 volatile.eth0.hwaddr
00:16:3e:f7:39:36

cat /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/sqlnode2*
00:16:3e:f7:39:36,sqlnode21
00:16:3e:31:ec:db,sqlnode2

After starting the newly created sqlnode21:

lxc list sqlnode2
+-----------+---------+--------------------------+------+-----------------+-----------+
|   NAME    |  STATE  |           IPV4           | IPV6 |      TYPE       | SNAPSHOTS |
+-----------+---------+--------------------------+------+-----------------+-----------+
| sqlnode2  | RUNNING | 192.168.111.106 (enp5s0) |      | VIRTUAL-MACHINE | 2         |
+-----------+---------+--------------------------+------+-----------------+-----------+
| sqlnode21 | RUNNING | 192.168.111.106 (enp5s0) |      | VIRTUAL-MACHINE | 0         |
+-----------+---------+--------------------------+------+-----------------+-----------+

lxc exec sqlnode21 -- ip address show dev enp5s0
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:16:3e:f7:39:36 brd ff:ff:ff:ff:ff:ff
    inet 192.168.111.106/24 brd 192.168.111.255 scope global dynamic enp5s0
       valid_lft 3554sec preferred_lft 3554sec
    inet6 fe80::216:3eff:fef7:3936/64 scope link
       valid_lft forever preferred_lft forever

grep sqlnode2 /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases
1665827202 00:16:3e:f7:39:36 192.168.111.106 sqlnode21 ff:49:72:1f:47:00:02:00:00:ab:11:d1:ae:8f:7f:50:7e:b1:88

In the last output, we see that dnsmasq simply handed the IPv4 address 192.168.111.106 to the new VM as well.

OK, the root cause seems to be that the DHCP client identifier (last column in the dnsmasq.leases file) is still the same for both the source and the copied VM.
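
That can be confirmed directly from the two lease lines captured above, without touching the VMs. A small sketch (the two lines are sample data copied from the grep outputs in this post; on a real host you would read them out of the leases file instead):

```shell
# The lease lines captured before and after the copy (from the outputs above).
before='1665826039 00:16:3e:31:ec:db 192.168.111.106 sqlnode2 ff:49:72:1f:47:00:02:00:00:ab:11:d1:ae:8f:7f:50:7e:b1:88'
after='1665827202 00:16:3e:f7:39:36 192.168.111.106 sqlnode21 ff:49:72:1f:47:00:02:00:00:ab:11:d1:ae:8f:7f:50:7e:b1:88'

# The client identifier is the last whitespace-separated field.
id_before=${before##* }
id_after=${after##* }

# Different MACs, different hostnames - but the same client identifier,
# which is why dnsmasq treats both VMs as the same DHCP client.
if [ "$id_before" = "$id_after" ]; then
    echo "same client identifier: $id_before"
fi
```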

As I’m using a Debian OS inside the VMs, it seems to come from the /etc/machine-id file, which indeed has the same content on both VMs.

So I’m currently trying to find a proper way to fix that. My workaround so far is the following workflow:

clone vm

lxc copy --instance-only sqlnode2 sqlnode21

regenerate machine-id on new vm

lxc exec sqlnode21 -- rm -v /var/lib/dbus/machine-id /etc/machine-id
lxc exec sqlnode21 -- dbus-uuidgen --ensure
lxc exec sqlnode21 -- systemd-machine-id-setup
lxc exec sqlnode21 -- grep "[a-z]" /var/lib/dbus/machine-id /etc/machine-id

start source vm first

so it can get back its original ipv4 address:

lxc start sqlnode2

wait for leases file to be updated:

while true; do grep " sqlnode2 " /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases && break;sleep 5;done
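
The polling loop above runs forever if the lease never shows up; a bounded variant could look like the sketch below. It is demonstrated against a throwaway file with a sample lease line, so on a real host you would pass the dnsmasq.leases path from above instead:

```shell
# wait_for_lease NAME FILE TIMEOUT_SECONDS
# Poll FILE until a lease line for NAME appears, or give up after TIMEOUT.
wait_for_lease() {
    name=$1; file=$2; timeout=$3; waited=0
    while ! grep -q " $name " "$file"; do
        if [ "$waited" -ge "$timeout" ]; then return 1; fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Demonstration against a sample file (real path would be
# /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases).
leases=$(mktemp)
echo '1665826039 00:16:3e:31:ec:db 192.168.111.106 sqlnode2 ff:49:72:1f:47:00:02:00:00:ab:11:d1:ae:8f:7f:50:7e:b1:88' > "$leases"
wait_for_lease sqlnode2 "$leases" 5 && echo "lease present"
rm -f "$leases"
```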

restart new vm

it will now get its own ipv4 address:

lxc restart sqlnode21
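
For automation, the steps above can be collected into a single shell function. This is only a sketch assembled from the commands in this thread, with minimal error handling; it assumes the lxdbr0 leases path shown earlier and that the guest agent is up before the exec calls, so adapt it to your setup:

```shell
LEASES=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases

# clone_vm SOURCE DEST - copy a VM, regenerate its machine-id, and
# restart things in the order that keeps the DHCP leases correct.
clone_vm() {
    src=$1; dst=$2

    # 1. clone the VM (instance only, no snapshots)
    lxc copy --instance-only "$src" "$dst"

    # 2. the copy is created stopped, so start it, then regenerate
    #    its machine-id (may need a short wait for the guest agent)
    lxc start "$dst"
    lxc exec "$dst" -- rm -v /var/lib/dbus/machine-id /etc/machine-id
    lxc exec "$dst" -- dbus-uuidgen --ensure
    lxc exec "$dst" -- systemd-machine-id-setup

    # 3. start the source VM first so it reclaims its original address
    lxc start "$src" || true   # may already be running

    # 4. wait until the source VM's lease is back in the leases file
    while ! grep -q " $src " "$LEASES"; do sleep 5; done

    # 5. restart the new VM so it requests its own address
    lxc restart "$dst"
}
```

Usage would then be a single call, e.g. clone_vm sqlnode2 sqlnode21.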

Can you show lxc config show <instance> --expanded for the source and the copy, please?

here it is for the source:

architecture: x86_64
config:
  environment.TZ: Europe/Vienna
  image.architecture: amd64
  image.description: Debian bullseye amd64 (20220715_06:06)
  image.os: Debian
  image.release: bullseye
  image.serial: "20220715_06:06"
  image.type: disk-kvm.img
  image.variant: default
  limits.cpu: "4"
  limits.memory: 8GB
  user.user-data: |
    ssh_pwauth: yes
    users:
      - name: automation
        passwd: 'xxx'  # Use a pw in /etc/shadow format!
        lock_passwd: false
        groups: lxd
        shell: /bin/bash
        sudo: ALL=(ALL) NOPASSWD:ALL
    # autogrow root partition
    growpart:
      mode: auto
      devices: ['/']
      ignore_growroot_disabled: false
  volatile.base_image: b6c7a8f75b2cabe42c8bbf13ebc2c364005327e1a18c6dc61fe08a2e43a4fdbf
  volatile.cloud-init.instance-id: 2d90ef76-5240-41e4-99e9-abc2b4dcc840
  volatile.eth0.host_name: tap91b7fa4b
  volatile.eth0.hwaddr: 00:16:3e:31:ec:db
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 630aa3f5-1f39-4be9-b88d-f19d1494aa90
  volatile.vsock_id: "49"
devices:
  config:
    source: cloud-init:config
    type: disk
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    size: 100GB
    type: disk
ephemeral: false
profiles:
- vms
stateful: false
description: ""

And here for the cloned vm:

architecture: x86_64
config:
  environment.TZ: Europe/Vienna
  image.architecture: amd64
  image.description: Debian bullseye amd64 (20220715_06:06)
  image.os: Debian
  image.release: bullseye
  image.serial: "20220715_06:06"
  image.type: disk-kvm.img
  image.variant: default
  limits.cpu: "4"
  limits.memory: 8GB
  user.user-data: |
    ssh_pwauth: yes
    users:
      - name: automation
        passwd: 'xxx'  # Use a pw in /etc/shadow format!
        lock_passwd: false
        groups: lxd
        shell: /bin/bash
        sudo: ALL=(ALL) NOPASSWD:ALL
    # autogrow root partition
    growpart:
      mode: auto
      devices: ['/']
      ignore_growroot_disabled: false
  volatile.base_image: b6c7a8f75b2cabe42c8bbf13ebc2c364005327e1a18c6dc61fe08a2e43a4fdbf
  volatile.cloud-init.instance-id: d6a536f4-f9ff-472c-b69f-364b250e48a5
  volatile.eth0.host_name: tap8ec6a839
  volatile.eth0.hwaddr: 00:16:3e:3b:48:1e
  volatile.last_state.power: RUNNING
  volatile.uuid: 1448c33e-1f3f-4434-a3fe-11ea2d776fd1
  volatile.vsock_id: "60"
devices:
  config:
    source: cloud-init:config
    type: disk
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    size: 100GB
    type: disk
ephemeral: false
profiles:
- vms
stateful: false
description: ""

I’ve just tested this using images:ubuntu/jammy (which doesn’t have cloud-init installed and doesn’t exhibit the problem) and images:ubuntu/jammy/cloud (which has cloud-init installed and does exhibit the problem).

So this seems to be an issue with cloud-init not regenerating the machine-id.

You can see from the configs you pasted above that LXD is doing the right thing, as both VMs have different:

  • volatile.uuid - which is provided to QEMU to set machine ID.
  • volatile.cloud-init.instance-id - which should be used by cloud-init to trigger regenerating configs if it changes.
  • volatile.eth0.hwaddr - the MAC address; if the DHCP client used this as its identifier, the issue would not occur.
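
That per-key comparison can also be scripted. A sketch that checks the three volatile keys differ between the two instances, using the values from the two config dumps above as sample data (on a real host you would pull each value with lxc config get instead):

```shell
# Sample data: the relevant volatile keys from the two config dumps above.
src_cfg='volatile.uuid: 630aa3f5-1f39-4be9-b88d-f19d1494aa90
volatile.cloud-init.instance-id: 2d90ef76-5240-41e4-99e9-abc2b4dcc840
volatile.eth0.hwaddr: 00:16:3e:31:ec:db'
dst_cfg='volatile.uuid: 1448c33e-1f3f-4434-a3fe-11ea2d776fd1
volatile.cloud-init.instance-id: d6a536f4-f9ff-472c-b69f-364b250e48a5
volatile.eth0.hwaddr: 00:16:3e:3b:48:1e'

# For each key, check that the two instances report different values.
all_differ=yes
for key in volatile.uuid volatile.cloud-init.instance-id volatile.eth0.hwaddr; do
    a=$(printf '%s\n' "$src_cfg" | awk -v k="$key:" '$1 == k {print $2}')
    b=$(printf '%s\n' "$dst_cfg" | awk -v k="$key:" '$1 == k {print $2}')
    if [ "$a" = "$b" ]; then all_differ=no; fi
    echo "$key: $a vs $b"
done
echo "all keys differ: $all_differ"
```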

This appears to be the same issue as:

Which then links to:

Perhaps you could post there with your use case.

Thanks very much @tomp for your investigation!
I’ll try to switch to the non-cloud image for now, and I’ll add my use case to the cloud-init issue as well.
