Container do not get IP addresses after a reboot — or internet connection

jamesponddotco · December 10, 2019, 6:31pm

Hi , everyone,

I have been using LXD on an Ubuntu 18.04 host to learn for a while now, and it seems I stumbled upon the first issue that I cannot resolve at all.

This dedicated server from Hetzner has been running a single container for a few months now, without any issues. Today I had to shut the server down to install two more drives, and when I booted the server back up, network on the contain was nowhere to be seen.

I went back to the host server, issue a lxc list command and saw that there were no IP addresses assigned to the container.

root@tardis-01.madpony.host:/home/doctor# lxc list
+----------+---------+------+------+------------+-----------+
|   NAME   |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------+---------+------+------+------------+-----------+
| smith-01 | RUNNING |      |      | PERSISTENT | 0         |
+----------+---------+------+------+------------+-----------+

When I deployed this server I was running 4.15.0-72-generic, but the restart got us into 5.0.0-37-generic, so I though it could be a kernel issue, as the issue looked similar to this one reported yesterday — no luck though, still no IP address or any internet connection from inside the container.

I tried setting the old IP address to the container with lxc config device set, but no luck either.

root@tardis-01.madpony.host:~# lxc config device set smith-01 lxdbr0 ipv4.address 10.171.234.205
root@tardis-01.madpony.host:~# lxc list
+----------+---------+------+------+------------+-----------+
|   NAME   |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------+---------+------+------+------------+-----------+
| smith-01 | RUNNING |      |      | PERSISTENT | 0         |
+----------+---------+------+------+------------+-----------+

I can give the container a dedicated IP address how the way I used to do it, but there is still no internet connection, since the bridge doesn’t seem to work.

root@tardis-01.madpony.host:~# ip route add 5.9.194.144/28 via 10.171.234.205 dev lxdbr0
root@tardis-01.madpony.host:~# lxc exec smith-01 -- bash
root@smith-01.madpony.space:~# ip addr add 5.9.194.144/28 dev eth0
root@smith-01.madpony.space:~# ping google.com
ping: google.com: Temporary failure in name resolution

So now I have no idea where to go.

Output of lxc config show smith-01 --expanded:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20190604)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20190604"
  image.version: "18.04"
  limits.memory: 16GB
  security.idmap.isolated: "true"
  volatile.base_image: c234ecee3baaee25db84af8e3565347e948bfceb3bf7c820bb1ce95adcffeaa8
  volatile.eth0.host_name: vethe5bca34d
  volatile.eth0.hwaddr: 00:16:3e:55:a3:5c
  volatile.idmap.base: "1065536"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1065536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1065536,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1065536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1065536,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1065536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1065536,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.lxdbr0.host_name: vethb3a18707
  volatile.lxdbr0.hwaddr: 00:16:3e:45:9c:5b
  volatile.lxdbr0.name: eth1
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  lxdbr0:
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: tardis
    size: 110GB
    type: disk
ephemeral: false
profiles:
- smith
stateful: false
description: ""

The volatile.lxdbr0.name: eth1 part wasn’t something I seen before, I believe.

Output of lxc profile show smith:

config:
  limits.memory: 16GB
  security.idmap.isolated: "true"
description: LXD profile for Smith servers
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: tardis
    size: 110GB
    type: disk
name: smith
used_by:
- /1.0/containers/smith-01

Output of ifconfig lxdbr0:

lxdbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.171.234.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::5c6e:b1ff:fe04:5231  prefixlen 64  scopeid 0x20<link>
        ether 02:9c:b6:5e:be:e1  txqueuelen 1000  (Ethernet)
        RX packets 53  bytes 9904 (9.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 180  bytes 7848 (7.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Output of lxc info smith-01:

Name: smith-01
Location: none
Remote: unix://
Architecture: x86_64
Created: 2019/06/10 18:24 UTC
Status: Running
Type: persistent
Profiles: smith
Pid: 3908
Ips:
  eth0:	inet	5.9.194.144	vethe5bca34d
  eth0:	inet6	fe80::216:3eff:fe55:a35c	vethe5bca34d
  eth1:	inet6	fe80::216:3eff:fe45:9c5b
  lo:	inet	127.0.0.1
  lo:	inet6	::1
Resources:
  Processes: 1131
  Disk usage:
    root: 6.57GB
  CPU usage:
    CPU usage (in seconds): 206
  Memory usage:
    Memory (current): 1.71GB
    Memory (peak): 1.72GB
  Network usage:
    lo:
      Bytes received: 1.97MB
      Bytes sent: 1.97MB
      Packets received: 23396
      Packets sent: 23396
    eth0:
      Bytes received: 23.28kB
      Bytes sent: 12.73kB
      Packets received: 491
      Packets sent: 48
    eth1:
      Bytes received: 34.77kB
      Bytes sent: 1.15kB
      Packets received: 523
      Packets sent: 15

Hm, eth1 here too.

Looking at my bash_history I can see that I disabled IPv6 for the container, but since everything worked before and I did reboot this server a few times before this happened, I believe this is not the cause.

Still, these are the commands I used at the time:

2019-06-10 18:00:41 lxc network set lxdbr0 ipv6.address none
2019-06-10 18:31:29 lxc network set lxdbr0 ipv6.nat false

I created the bridge by running lxd init and LXD is installed using the Snap package, if that matters.

Before I forget, new containers created on this server also lack IP addresses and internet connections.

root@tardis-01.madpony.host:~# lxc launch ubuntu:18.04 testing-network
Creating testing-network
Starting testing-network
root@tardis-01.madpony.host:~# lxc list
+-----------------+---------+--------------------+------+------------+-----------+
|      NAME       |  STATE  |        IPV4        | IPV6 |    TYPE    | SNAPSHOTS |
+-----------------+---------+--------------------+------+------------+-----------+
| smith-01        | RUNNING | 5.9.194.144 (eth0) |      | PERSISTENT | 0         |
+-----------------+---------+--------------------+------+------------+-----------+
| testing-network | RUNNING |                    |      | PERSISTENT | 0         |
+-----------------+---------+--------------------+------+------------+-----------+

Thanks in advance!

gpatel-fr · December 10, 2019, 7:26pm

The Ip address, the name resolution, the route: all this comes from a single source for a container using the default lxdbr0 bridge; the lxd dnsmasq instance. So it seems that your would be well inspired to try;
ps fauxww | grep dnsmasq

and, ah,

can’t possibly do anything, you have to restart the container after setting the ip address. Not that it was a good idea anyway, the problem is either dnsmasq not starting or the container not seeing dnsmasq (not likely but possible) so the proper fix is to restore the normal LXD behaviour.

Possibly

journalctl -u snap.lxd.daemon
sudo grep dnsmasq /var/snap/lxd/common/lxd/logs/lxd.log

could give some insight but I’m not too hopeful.

jamesponddotco · December 10, 2019, 8:24pm

@gpatel-fr Thanks for the help! Unfortunately, dnsmasq seems to be running on both the container and host.

root@tardis-01.madpony.host:~# ps fauxww | grep dnsmasq
lxd        3810  0.0  0.0  49984   396 ?        S    18:00   0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.171.234.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.171.234.2,10.171.234.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd
root     107918  0.0  0.0  17240  1112 pts/0    S+   20:18   0:00                          \_ grep --color=auto dnsmasq
root@tardis-01.madpony.host:~# lxc exec smith-01 -- bash
root@smith-01.madpony.space:~# ps fauxww | grep dnsmasq
root      12602  0.0  0.0  14856   904 ?        S+   20:18   0:00  \_ grep --color=auto dnsmasq

I went looking through logs to see if I can find anything related to dnsmasq, but this is all I could find, which doesn’t seem to be the issue either?

Output from cat /var/log/syslog | grep dnsmasq:

Dec 10 18:27:07 tardis-01 dnsmasq[3810]: reading /etc/resolv.conf
Dec 10 18:27:07 tardis-01 dnsmasq[3810]: using local addresses only for domain lxd
Dec 10 18:27:07 tardis-01 dnsmasq[3810]: using nameserver 127.0.0.53#53

gpatel-fr · December 10, 2019, 9:24pm

Wait, you don’t have something before, with first dnsmasq version and then dnsmasq-dhcp related messages in your syslog ?

jamesponddotco · December 10, 2019, 9:49pm

There is, now that I look at it.

Dec 10 18:27:12 tardis-01 dnsmasq-dhcp[3810]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/smith-01

Output of the file:

root@tardis-01.madpony.host:/var/log# cat /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/smith-01
00:16:3e:45:9c:5b,10.171.234.205,smith-01

gpatel-fr · December 10, 2019, 10:35pm

I think that this is the effect of your instruction

lxc config device set smith-01 lxdbr0 ipv4.address 10.171.234.205

and the MAC address is the one used for your new eth1.
So your container has already an eth0 and it’s trying to create a new one from the dnsmasq dhcp server.

Well, maybe something has changed in the host configuration and was only validated by the reboot and then it’s fighting with the LXD dnsmasq. I’d try to add some tracing to the LXD dnsmasq config by using something like that - see man dnsmasq for the available syntax.

yajrendrag · December 11, 2019, 1:52pm

I saw the same thing - see my similar post “ LXD Container not getting ip address from DHCP using linux bridge” a few days ago.

You are using lxd bridge but i was using a Linux bridge i defined on the host. I could not get an IP address for the container.

& i noticed the problem with kernel 5.0.0-37… after downgrading back to the 4.15.0-72 kernel, everything worked as it had before.

I didn’t know what else to look at to try & troubleshot it further, but seems to me there is something going on between lxd & 5.0.0-37 & newer kernels. IIRC, it all worked on 5.0.0-36 too…

gpatel-fr · December 11, 2019, 3:29pm

gerard@q1900:~$ uname -a
Linux q1900 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
gerard@q1900:~$ lxc launch ubuntu:18.04 test
Creating test
Starting test                               
gerard@q1900:~$ lxc list test
+------+---------+--------------------+------+------------+-----------+
| NAME |  STATE  |        IPV4        | IPV6 |    TYPE    | SNAPSHOTS |
+------+---------+--------------------+------+------------+-----------+
| test | RUNNING | 10.10.0.175 (eth0) |      | PERSISTENT | 0         |
+------+---------+--------------------+------+------------+-----------+

jamesponddotco · December 11, 2019, 3:40pm

Yeah, that is what I have been thinking, where the hell did eth1 come from? I will try to figure it out today after work and see what I can come up with.

If all goes wrong, I will probably just start over as I have backups for what is inside the container anyway. I would prefer to solve the puzzle, though.

Doesn’t seem to be my issue, since I booted with 4.15 and 4.18 for testing as well, with the same issue.

gpatel-fr · December 11, 2019, 4:34pm

That’s the spirit !
You can try also on the host, to find if there some other software listening on dns)

ss -tapnu | grep -v 127.0 | grep :53

and on a new container (one without your manual ip address)
lxc console your-new-container --show-log