LXD 3.23 - Cluster setup, lxc exec has different behavior for containers and VMs

dig output below …

ubuntu@ctrlr:~$ dig @240.204.0.1 k8s-master1.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 k8s-master1.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57678
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;k8s-master1.lxd.               IN      A

;; ANSWER SECTION:
k8s-master1.lxd.        0       IN      A       240.204.0.82

;; Query time: 0 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:07:04 UTC 2020
;; MSG SIZE  rcvd: 60

ubuntu@ctrlr:~$ dig @240.204.0.1 k8s-worker1.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 k8s-worker1.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41044
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;k8s-worker1.lxd.               IN      A

;; Query time: 5 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:07:37 UTC 2020
;; MSG SIZE  rcvd: 33

ubuntu@ctrlr:~$ dig @240.204.0.1 k8s-lb.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 k8s-lb.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51110
;; flags: qr aa ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;k8s-lb.lxd.                    IN      A

;; ANSWER SECTION:
k8s-lb.lxd.             0       IN      A       240.205.0.177

;; Query time: 5 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:07:51 UTC 2020
;; MSG SIZE  rcvd: 54

ubuntu@ctrlr:~$ dig @240.204.0.1 k8s-worker4.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 k8s-worker4.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 39305
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;k8s-worker4.lxd.               IN      A

;; Query time: 5 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:08:06 UTC 2020
;; MSG SIZE  rcvd: 33

forkdns.servers/servers.conf output below:

akriadmin@c4akri01:~/scripts$ ./run-physical-hosts.sh "sudo cat /var/snap/lxd/common/lxd/networks/lxdfan0/forkdns.servers/servers.conf"
[c4akri01]:
240.221.0.1
240.222.0.1
240.223.0.1
240.197.0.1
240.215.0.1
240.205.0.1
[c4akri02]:
240.215.0.1
240.205.0.1
240.221.0.1
240.222.0.1
240.223.0.1
240.204.0.1
[c4akri03]:
240.221.0.1
240.222.0.1
240.223.0.1
240.204.0.1
240.197.0.1
240.205.0.1
[c4akri04]:
240.204.0.1
240.197.0.1
240.215.0.1
240.221.0.1
240.222.0.1
240.223.0.1
[c4astore01]:
240.205.0.1
240.222.0.1
240.223.0.1
240.204.0.1
240.197.0.1
240.215.0.1
[c4astore02]:
240.215.0.1
240.205.0.1
240.221.0.1
240.223.0.1
240.204.0.1
240.197.0.1
[c4astore03]:
240.215.0.1
240.205.0.1
240.221.0.1
240.222.0.1
240.204.0.1
240.197.0.1
akriadmin@c4akri01:~/scripts$

OK, so we can say that the reason your query for k8s-worker1.lxd is not working is that it is not in any of the dnsmasq leases files.
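
One quick way to confirm that on your side, assuming the same run-physical-hosts.sh helper and snap paths you used above, is to grep the dnsmasq leases file on each cluster member:

./run-physical-hosts.sh "sudo grep k8s-worker1 /var/snap/lxd/common/lxd/networks/lxdfan0/dnsmasq.leases"

If the VM's MAC/IP doesn't show up on any host, dnsmasq never recorded (or has since dropped) the lease, which matches the NXDOMAIN you're seeing.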

Can you show me the output of lxc config show k8s-worker1 --expanded please?

Can you also send me the output of lxc list so I can get a better picture of your instance list.

Can you also confirm that the k8s-worker1 instance is using DHCP to configure its networking?

k8s-worker1 --expanded output:

akriadmin@c4akri01:~/scripts$ lxc config show k8s-worker1 --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu bionic amd64 (20200327_07:42)
  image.os: Ubuntu
  image.release: bionic
  image.serial: "20200327_07:42"
  image.type: disk-kvm.img
  limits.cpu: "24"
  limits.memory: 73728MB
  raw.qemu: -device vfio-pci,host=41:00.0
  volatile.base_image: ed698b985cce87c0527b061d81662009472d5df210cbe297c1d90ed607415c18
  volatile.eth0.host_name: tapdef2ce9c
  volatile.eth0.hwaddr: 00:16:3e:e4:d9:96
  volatile.vm.uuid: c55da613-56d4-4493-822e-68af3d1f119a
devices:
  eth0:
    name: eth0
    network: lxdfan0
    type: nic
  root:
    path: /
    pool: local
    size: 50GB
    type: disk
  sda:
    source: /dev/sda
    type: disk
  sdb:
    source: /dev/sdb
    type: disk
  sdc:
    source: /dev/sdc
    type: disk
  sdd:
    source: /dev/sdd
    type: disk
  sde:
    source: /dev/sde
    type: disk
  sdf:
    source: /dev/sdf
    type: disk
ephemeral: false
profiles:
- default
- compute-vm
stateful: false
description: ""
akriadmin@c4akri01:~/scripts$

Can you show me the ip a output inside k8s-worker1 too, please?

Yes, all instances are picking up IP addresses using DHCP. As I wrote earlier, the VMs are a straight-up launch of images:ubuntu/18.04 and images:ubuntu/19.10.

lxc list output below …

akriadmin@c4akri01:~/scripts$ lxc list
+----------------+---------+------------------------+------+-----------------+-----------+------------+
|      NAME      |  STATE  |          IPV4          | IPV6 |      TYPE       | SNAPSHOTS |  LOCATION  |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| ctrlr          | RUNNING | 240.204.0.186 (eth0)   |      | CONTAINER       | 0         | c4akri01   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| hdfs-datanode1 | RUNNING | 240.221.0.52 (enp5s0)  |      | VIRTUAL-MACHINE | 0         | c4astore01 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| hdfs-datanode2 | RUNNING | 240.222.0.137 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4astore02 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| hdfs-datanode3 | RUNNING | 240.223.0.249 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4astore03 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| hdfs-namenode  | RUNNING | 240.221.0.237 (eth0)   |      | CONTAINER       | 0         | c4astore01 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-lb         | RUNNING | 240.205.0.177 (eth0)   |      | CONTAINER       | 0         | c4akri04   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-master1    | RUNNING | 240.204.0.82 (eth0)    |      | CONTAINER       | 0         | c4akri01   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-master2    | RUNNING | 240.197.0.18 (eth0)    |      | CONTAINER       | 0         | c4akri02   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-master3    | RUNNING | 240.215.0.110 (eth0)   |      | CONTAINER       | 0         | c4akri03   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-worker1    | RUNNING | 240.204.0.142 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri01   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-worker2    | RUNNING | 240.197.0.215 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri02   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-worker3    | RUNNING | 240.215.0.200 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri03   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| k8s-worker4    | RUNNING | 240.205.0.196 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri04   |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| minio1         | RUNNING | 240.221.0.117 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4astore01 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| minio2         | RUNNING | 240.222.0.12 (enp5s0)  |      | VIRTUAL-MACHINE | 0         | c4astore02 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
| minio3         | RUNNING | 240.223.0.187 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4astore03 |
+----------------+---------+------------------------+------+-----------------+-----------+------------+
akriadmin@c4akri01:~/scripts$

ip a output on k8s-worker1

ubuntu@k8s-worker1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:e4:d9:96 brd ff:ff:ff:ff:ff:ff
    inet 240.204.0.142/8 brd 240.255.255.255 scope global dynamic enp5s0
       valid_lft 2796sec preferred_lft 2796sec
    inet6 fe80::216:3eff:fee4:d996/64 scope link 
       valid_lft forever preferred_lft forever
ubuntu@k8s-worker1:~$

Thanks. Is this happening with VMs using both the Ubuntu 18.04 and 19.10 images?

This is very weird: if the VMs are getting a DHCP allocation from dnsmasq, then what is causing them to be removed from the leases file?

Out of interest, what are you using raw.qemu: -device vfio-pci,host=41:00.0 for?

I am passing a GPU through into the VM instance, along with 6 block devices … Stephane and ‘morphis’ guided me towards using this mechanism.


Yes, same issue for both 18.04 and 19.10 images … (hdfs-datanode1 is 18.04, minio1 is 19.10):

ubuntu@ctrlr:~$ dig @240.204.0.1 hdfs-datanode1.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 hdfs-datanode1.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51460
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;hdfs-datanode1.lxd.            IN      A

;; Query time: 5 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:40:21 UTC 2020
;; MSG SIZE  rcvd: 36

ubuntu@ctrlr:~$ dig @240.204.0.1 minio1.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 minio1.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 39275
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;minio1.lxd.                    IN      A

;; Query time: 5 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:40:41 UTC 2020
;; MSG SIZE  rcvd: 28

ubuntu@ctrlr:~$ ssh 240.221.0.52 uname -a
Linux hdfs-datanode1 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@ctrlr:~$ ssh 240.221.0.117 uname -a
Linux minio1 5.3.0-42-generic #34-Ubuntu SMP Fri Feb 28 05:49:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@ctrlr:~$

OK, can you place your LXD daemon into debug mode on one of the hosts running a problem VM (you can just pick one, as long as you test the next part on a VM running on the same host you switched to debug mode).

sudo snap set lxd daemon.debug=true
sudo systemctl reload snap.lxd.daemon

Then run sudo journalctl -f on the host and keep a lookout for dnsmasq-related log entries.

Then, inside the VM, run:

netplan apply

This should make a fresh DHCP request.

Finally from the same LXD host, can you then send the contents of: /var/snap/lxd/common/lxd/networks/lxdfan0/dnsmasq.leases

We need to try to figure out why your setup differs from my local one, and why the dnsmasq.leases file is not being populated with VM DHCP allocations (or is being cleared of them).
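
Putting those steps together, the rough sequence on the affected host would be something like this (k8s-worker1 is just an example; use whichever problem VM lives on the host you put into debug mode):

sudo snap set lxd daemon.debug=true
sudo systemctl reload snap.lxd.daemon
sudo journalctl -f | grep -i dnsmasq   # keep this running to watch for DHCP/dnsmasq entries

# in a second shell, trigger a fresh DHCP request inside the VM
lxc exec k8s-worker1 -- netplan apply

# then dump the leases file on the same host
sudo cat /var/snap/lxd/common/lxd/networks/lxdfan0/dnsmasq.leases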

On a lark, I tried the following:

  • Created an images:ubuntu/16.04 VM on a remote host (c4akri02)
  • Tried seeing whether I could resolve that VM’s name … turns out this works!

I had previously reported an issue with images:ubuntu/18.04, where the same IP address was getting allocated to multiple VM instances launched on the same host. Perhaps these issues are related? (and also impact ubuntu/19.10)

akriadmin@c4akri01:~/scripts$ lxc list | grep test-vm
| test-vm        | RUNNING | 240.197.0.177 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri02   |

ubuntu@ctrlr:~$ dig @240.204.0.1 test-vm.lxd

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @240.204.0.1 test-vm.lxd
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55620
;; flags: qr aa ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;test-vm.lxd.                   IN      A

;; ANSWER SECTION:
test-vm.lxd.            0       IN      A       240.197.0.177

;; Query time: 4 msec
;; SERVER: 240.204.0.1#53(240.204.0.1)
;; WHEN: Tue Mar 31 17:50:34 UTC 2020
;; MSG SIZE  rcvd: 56

ubuntu@ctrlr:~$

Link to the other topic:

Interesting, I feel this is likely the issue.

I’d like to compare the MAC addresses inside the VMs to those allocated by LXD.

Can you send me the output of lxc config get <vm name> volatile.eth0.hwaddr for each VM, and also, for each VM, the output of ip link show enp5s0.

Finally, can you also get me the netplan config file for each VM? It should be in /etc/netplan/50-cloud-init.yaml.
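
If it's easier to gather in one go, a small loop along these lines should do it (the VM names are just taken from your lxc list output, so adjust as needed, and the netplan path may differ per image):

for vm in k8s-worker1 k8s-worker2 k8s-worker3 k8s-worker4 minio1 minio2 minio3 hdfs-datanode1 hdfs-datanode2 hdfs-datanode3; do
    echo "== $vm =="
    lxc config get "$vm" volatile.eth0.hwaddr
    lxc exec "$vm" -- ip link show enp5s0
    lxc exec "$vm" -- cat /etc/netplan/50-cloud-init.yaml
done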

Thanks
Tom

Ah, I missed this post. So yes, it looks like the VM images had an issue. If possible it may be best to regenerate your VMs from fresh images, although @stgraber may be able to advise how to fix existing VMs.

To reset things, you can do:

rm /var/lib/dbus/machine-id
> /etc/machine-id

Okay, running these commands on the host that has both an 18.04 VM (k8s-worker2) and a 16.04 VM (test-vm) …

akriadmin@c4akri02:~$ lxc list | grep c4akri02
| k8s-master2    | RUNNING | 240.197.0.18 (eth0)    |      | CONTAINER       | 0         | c4akri02   |
| k8s-worker2    | RUNNING | 240.197.0.215 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri02   |
| test-vm        | RUNNING | 240.197.0.177 (enp5s0) |      | VIRTUAL-MACHINE | 0         | c4akri02   |

akriadmin@c4akri02:~$ lxc config get k8s-worker2 volatile.eth0.hwaddr
00:16:3e:58:5f:c9
akriadmin@c4akri02:~$ lxc exec k8s-worker2 -- ip link show enp5s0
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:58:5f:c9 brd ff:ff:ff:ff:ff:ff
akriadmin@c4akri02:~$ lxc exec k8s-worker2 -- cat /etc/netplan/50-cloud-init.yaml
cat: /etc/netplan/50-cloud-init.yaml: No such file or directory
akriadmin@c4akri02:~$ lxc exec k8s-worker2 -- ls /etc/netplan
10-lxc.yaml
akriadmin@c4akri02:~$ lxc exec k8s-worker2 -- cat /etc/netplan/10-lxc.yaml
network:
  version: 2
  ethernets:
    enp5s0: {dhcp4: true}
akriadmin@c4akri02:~$ 

akriadmin@c4akri02:~$ lxc config get test-vm volatile.eth0.hwaddr
00:16:3e:30:6a:31
akriadmin@c4akri02:~$ lxc exec test-vm -- ip link show enp5s0
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:30:6a:31 brd ff:ff:ff:ff:ff:ff
akriadmin@c4akri02:~$ lxc exec test-vm -- ls /etc/netplan
ls: cannot access '/etc/netplan': No such file or directory
akriadmin@c4akri02:~$ lxc exec test-vm -- cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

auto enp5s0
iface enp5s0 inet dhcp

source /etc/network/interfaces.d/*.cfg
akriadmin@c4akri02:~$

The commands must be run inside the affected VMs.
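
For context on why the machine ID matters here: systemd-networkd derives its default DHCP client identifier from /etc/machine-id, so if the VM images shipped with a baked-in machine ID, every VM presents the same identifier to dnsmasq, which would explain both the duplicate-address report and leases being overwritten. A rough sketch of applying the reset to one affected VM from the host (k8s-worker1 is just an example; the restart is assumed so that systemd regenerates a fresh ID on the next boot):

lxc exec k8s-worker1 -- sh -c 'rm /var/lib/dbus/machine-id; > /etc/machine-id'
lxc restart k8s-worker1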