Networking problem : unstable bridge connection

Hi all,

I’m experiencing a very annoying issue with LXD on a specific server. Everything runs smoothly for a while but after some time (hours, days or sometimes weeks) the bridge interface on every container loses its ipv4 address (there is no ipv6 bridge configured). Stopping and restarting the container(s) has not effect on restoring connectivity but restarting the lxd daemon (snap) solves the issue for a while.

I run the same configuration on several instances with no problem but this is a dedicated physical server (not a vm). I tried with both Ubuntu 18.04 and went back to 16.04 in case 18.04 wasn’t stable enough yet but the behaviour is identical.

I would appreciate any help in diagnosing this issue since I care about this server, it’s the most powerful one I have access to…

For now I’m running an up to date Ubuntu 16.04 with the following kernel
Linux lxd-metal 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

LXD is also up to date, running as a single host (no clustering):

Name  Version    Rev   Tracking  Publisher  Notes
lxd   3.4        8393  stable    canonical  -

There is obviously something specific to this server (Online.net START-2-M-SSD), I don’t think it is a LXD bug but probably a configuration problem since it can run flawlessly for several days in a row.

Thanks in advance.

I wonder if it could be linked to the snap update process. Here are two log sequences, the first one when I simply restart the daemon and the second one when a snap refresh occurs (after a system restart for instance):

Daemon restart

Aug 19 06:00:02 lxd-metal lxd.daemon[7495]: => Stop reason is: host shutdown
Aug 19 06:00:02 lxd-metal lxd.daemon[7495]: => Stopping LXD (with container shutdown)
Aug 19 06:00:32 lxd-metal lxd.daemon[25500]: action=shutdown created=2018-07-27T15:43:16+0200 ephemeral=false lvl=eror msg="Failed shutting down container" name=quickbox t=2018-08-19T06:00:32+0200 timeout=30s used=2018-08-18T21:51:06+0200
Aug 19 06:00:33 lxd-metal lxd.daemon[25500]: No such file or directory - Failed to receive file descriptor
Aug 19 06:00:33 lxd-metal lxd.daemon[25500]: lvl=warn msg="Unable to update backup.yaml at this time" name=quickbox t=2018-08-19T06:00:33+0200
Aug 19 06:00:33 lxd-metal lxd.daemon[25500]: => LXD exited cleanly
Aug 19 06:00:33 lxd-metal lxd.daemon[7495]: => Stopping LXCFS
Aug 19 06:00:34 lxd-metal lxd.daemon[7495]: umount: /var/snap/lxd/common/ns: not mounted
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: => Preparing the system
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: ==> Loading snap configuration
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: ==> Setting up mntns symlink (mnt:[4026532197])
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: ==> Setting up kmod wrapper
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: ==> Preparing /boot
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: ==> Preparing a clean copy of /run
Aug 19 06:00:34 lxd-metal lxd.daemon[7757]: ==> Preparing a clean copy of /etc
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: ==> Setting up ceph configuration
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: ==> Setting up LVM configuration
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: ==> Rotating logs
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: ==> Setting up ZFS (0.6)
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: ==> Escaping the systemd cgroups
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: ==> Escaping the systemd process resource limits
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: => Starting LXCFS
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: => Starting LXD
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: mount namespace: 5
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: hierarchies:
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   0: fd:   6: devices
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   1: fd:   7: memory
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   2: fd:   8: perf_event
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   3: fd:   9: blkio
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   4: fd:  10: freezer
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   5: fd:  11: pids
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   6: fd:  12: cpuset
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   7: fd:  13: hugetlb
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   8: fd:  14: cpu,cpuacct
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:   9: fd:  15: net_cls,net_prio
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]:  10: fd:  16: name=systemd
Aug 19 06:00:35 lxd-metal lxd.daemon[7757]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-19T06:00:35+0200
Aug 19 06:00:36 lxd-metal lxd.daemon[7757]: lvl=warn msg="Unable to update backup.yaml at this time" name=quickbox t=2018-08-19T06:00:36+0200
Aug 19 06:00:37 lxd-metal lxd.daemon[7757]: => LXD is ready

Snap refresh

Aug 18 06:02:17 lxd-metal lxd.daemon[2424]: => Stop reason is: snap refresh
Aug 18 06:02:17 lxd-metal lxd.daemon[2424]: => Stopping LXD
Aug 18 06:02:17 lxd-metal lxd.daemon[914]: lvl=warn msg="Failed to update instance types: Get https://images.linuxcontainers.org/meta/instance-types/.yaml: context canceled" t=2018-08-18T06:02:17+0200
Aug 18 06:02:17 lxd-metal lxd.daemon[914]: => LXD exited cleanly
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: => Preparing the system
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: ==> Loading snap configuration
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: ==> Setting up mntns symlink (mnt:[4026532197])
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: ==> Setting up kmod wrapper
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: ==> Preparing /boot
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: ==> Preparing a clean copy of /run
Aug 18 06:02:23 lxd-metal lxd.daemon[3514]: ==> Preparing a clean copy of /etc
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: ==> Setting up ceph configuration
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: ==> Setting up LVM configuration
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: ==> Rotating logs
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: ==> Setting up ZFS (0.6)
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: ==> Escaping the systemd cgroups
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: ==> Escaping the systemd process resource limits
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: => Re-using existing LXCFS
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: => Starting LXD
Aug 18 06:02:24 lxd-metal lxd.daemon[3514]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-18T06:02:24+0200
Aug 18 06:02:25 lxd-metal lxd.daemon[914]: mount namespace: 7
Aug 18 06:02:25 lxd-metal lxd.daemon[914]: hierarchies:
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   0: fd:   8: devices
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   1: fd:   9: memory
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   2: fd:  10: perf_event
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   3: fd:  11: blkio
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   4: fd:  12: freezer
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   5: fd:  13: pids
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   6: fd:  14: cpuset
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   7: fd:  15: hugetlb
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   8: fd:  16: cpu,cpuacct
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:   9: fd:  17: net_cls,net_prio
Aug 18 06:02:25 lxd-metal lxd.daemon[914]:  10: fd:  18: name=systemd
Aug 18 06:02:25 lxd-metal lxd.daemon[914]: lxcfs.c: 105: do_reload: lxcfs: reloaded
Aug 18 06:02:48 lxd-metal lxd.daemon[3514]: _err="Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.10.10.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.10.10.2,10.10.10.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd: dnsmasq: failed to create listening socket for fe80::6c00:d8ff:fea8:6782%lxdbr0: Address already in use" lvl=eror msg="Failed to bring up network" name=lxdbr0 t=2018-08-18T06:02:48+0200_
Aug 18 06:02:48 lxd-metal lxd.daemon[3514]: => LXD is ready

So this error “err=“Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.10.10.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.10.10.2,10.10.10.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd: dnsmasq: failed to create listening socket for fe80::6c00:d8ff:fea8:6782%lxdbr0: Address already in use” lvl=eror msg=“Failed to bring up network” name=lxdbr0 t=2018-08-18T06:02:48+0200_” seems like a clue but I have no idea on how to further diagnose this.

And this is not a new problem with LXD 3.4, I’ve been dealing with this ever since I started using this server three months ago (so I guess LXD was at v3.1 back then).

I’ve searched this forum and of course asked Google but found nothing similar which really makes me think it is related to my server configuration.

The closest issue I found was this thread and its associated Github issue. But it’s not really the same since most of the time dnsmasq works normally.

Just in case you ask, here is further information.

ls -lh /etc/resolv.conf
lrwxrwxrwx 1 root root 29 Jul 27 15:20 /etc/resolv.conf -> ../run/resolvconf/resolv.conf

readlink -f /etc/resolv.conf
/run/resolvconf/resolv.conf

ls -lh /proc/$(pgrep daemon.start)/root/etc/
total 20K
lrwxrwxrwx 1 root root 35 Aug 19 06:00 alternatives -> /snap/core/current/etc/alternatives
lrwxrwxrwx 1 root root 31 Aug 19 06:00 apparmor -> /snap/core/current/etc/apparmor
lrwxrwxrwx 1 root root 33 Aug 19 06:00 apparmor.d -> /snap/core/current/etc/apparmor.d
lrwxrwxrwx 1 root root 30 Aug 19 06:00 ceph -> /var/lib/snapd/hostfs/etc/ceph
lrwxrwxrwx 1 root root 31 Aug 19 06:00 group -> /var/lib/snapd/hostfs/etc/group
lrwxrwxrwx 1 root root 34 Aug 19 06:00 hostname -> /var/lib/snapd/hostfs/etc/hostname
lrwxrwxrwx 1 root root 31 Aug 19 06:00 hosts -> /var/lib/snapd/hostfs/etc/hosts
-rw-r–r-- 1 root root 14K Aug 19 06:00 ld.so.cache
lrwxrwxrwx 1 root root 35 Aug 19 06:00 localtime -> /var/lib/snapd/hostfs/etc/localtime
-rw-r–r-- 1 root root 86 Aug 19 06:00 logrotate.status
drwxr-xr-x 2 root root 60 Aug 19 06:00 lvm
lrwxrwxrwx 1 root root 12 Aug 19 06:00 mtab -> /proc/mounts
lrwxrwxrwx 1 root root 39 Aug 19 06:00 nsswitch.conf -> /var/lib/snapd/hostfs/etc/nsswitch.conf
lrwxrwxrwx 1 root root 36 Aug 19 06:00 os-release -> /var/lib/snapd/hostfs/etc/os-release
lrwxrwxrwx 1 root root 32 Aug 19 06:00 passwd -> /var/lib/snapd/hostfs/etc/passwd
lrwxrwxrwx 1 root root 37 Aug 19 06:00 resolv.conf -> /var/lib/snapd/hostfs/etc/resolv.conf
lrwxrwxrwx 1 root root 36 Aug 19 06:00 resolvconf -> /var/lib/snapd/hostfs/etc/resolvconf
lrwxrwxrwx 1 root root 29 Aug 19 06:00 ssl -> /var/lib/snapd/hostfs/etc/ssl
lrwxrwxrwx 1 root root 34 Aug 19 06:00 timezone -> /var/lib/snapd/hostfs/etc/timezone
lrwxrwxrwx 1 root root 26 Aug 19 06:00 vim -> /snap/core/current/etc/vim

ls -lh /proc/$(pgrep daemon.start)/root/run/
total 0
drwxr-xr-x 3 root root 60 Aug 19 06:00 lxc
drwx------ 4 root root 80 Aug 19 06:00 lxcfs
drwxr-xr-x 2 root root 60 Aug 19 06:00 mount
lrwxrwxrwx 1 root root 37 Aug 19 06:00 openvswitch -> /var/lib/snapd/hostfs/run/openvswitch
lrwxrwxrwx 1 root root 36 Aug 19 06:00 resolvconf -> /var/lib/snapd/hostfs/run/resolvconf
lrwxrwxrwx 1 root root 31 Aug 19 06:00 snapd -> /var/lib/snapd/hostfs/run/snapd
lrwxrwxrwx 1 root root 43 Aug 19 06:00 snapd-snap.socket -> /var/lib/snapd/hostfs/run/snapd-snap.socket
lrwxrwxrwx 1 root root 38 Aug 19 06:00 snapd.socket -> /var/lib/snapd/hostfs/run/snapd.socket
lrwxrwxrwx 1 root root 33 Aug 19 06:00 systemd -> /var/lib/snapd/hostfs/run/systemd
-rw------- 1 root root 0 Aug 19 06:00 xtables.lock

And the logs as recorded in /var/snap/lxd/common/lxd/logs (instead of journalctl output)
lvl=info msg=“LXD 3.4 is starting in normal mode” path=/var/snap/lxd/common/lxd t=2018-08-18T06:02:24+0200
lvl=info msg=“Kernel uid/gid map:” t=2018-08-18T06:02:24+0200
lvl=info msg=" - u 0 0 4294967295" t=2018-08-18T06:02:24+0200
lvl=info msg=" - g 0 0 4294967295" t=2018-08-18T06:02:24+0200
lvl=info msg=“Configured LXD uid/gid map:” t=2018-08-18T06:02:24+0200
lvl=info msg=" - u 0 1000000 1000000000" t=2018-08-18T06:02:24+0200
lvl=info msg=" - g 0 1000000 1000000000" t=2018-08-18T06:02:24+0200
lvl=warn msg=“CGroup memory swap accounting is disabled, swap limits will be ignored.” t=2018-08-18T06:02:24+0200
lvl=info msg=“Initializing local database” t=2018-08-18T06:02:24+0200
lvl=info msg=“Initializing database gateway” t=2018-08-18T06:02:25+0200
address= id=1 lvl=info msg=“Start database node” t=2018-08-18T06:02:25+0200
lvl=info msg=“Raft: Restored from snapshot 1-133744-1534564769446” t=2018-08-18T06:02:25+0200
lvl=info msg=“Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}]” t=2018-08-18T06:02:25+0200
lvl=info msg=“Dqlite: starting event loop” t=2018-08-18T06:02:25+0200
lvl=info msg=“LXD isn’t socket activated” t=2018-08-18T06:02:25+0200
lvl=info msg=“Starting /dev/lxd handler:” t=2018-08-18T06:02:25+0200
lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock t=2018-08-18T06:02:25+0200
lvl=info msg=“REST API daemon:” t=2018-08-18T06:02:25+0200
lvl=info msg=" - binding Unix socket" socket=/var/snap/lxd/common/lxd/unix.socket t=2018-08-18T06:02:25+0200
lvl=info msg=“Initializing global database” t=2018-08-18T06:02:25+0200
lvl=info msg=“Dqlite: handling new connection (fd=20)” t=2018-08-18T06:02:25+0200
lvl=info msg=“Raft: Node at 0 [Leader] entering Leader state” t=2018-08-18T06:02:25+0200
lvl=info msg=“Dqlite: connected address=0 attempt=0” t=2018-08-18T06:02:25+0200
lvl=info msg=“Initializing storage pools” t=2018-08-18T06:02:25+0200
lvl=info msg=“Initializing networks” t=2018-08-18T06:02:25+0200
err=“Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.10.10.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.10.10.2,10.10.10.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd: dnsmasq: failed to create listening socket for fe80::6c00:d8ff:fea8:6782%lxdbr0: Address already in use” lvl=eror msg=“Failed to bring up network” name=lxdbr0 t=2018-08-18T06:02:48+0200
lvl=info msg=“Pruning leftover image files” t=2018-08-18T06:02:48+0200
lvl=info msg=“Done pruning leftover image files” t=2018-08-18T06:02:48+0200
lvl=info msg=“Loading configuration” t=2018-08-18T06:02:48+0200
lvl=info msg=“Connected to MAAS controller” t=2018-08-18T06:02:48+0200
lvl=info msg=“Pruning expired images” t=2018-08-18T06:02:48+0200
lvl=info msg=“Done pruning expired images” t=2018-08-18T06:02:48+0200
lvl=info msg=“Expiring log files” t=2018-08-18T06:02:48+0200
lvl=info msg=“Updating instance types” t=2018-08-18T06:02:48+0200
lvl=info msg=“Done expiring log files” t=2018-08-18T06:02:48+0200
lvl=info msg=“Updating images” t=2018-08-18T06:02:48+0200
lvl=info msg=“Done updating images” t=2018-08-18T06:02:48+0200
lvl=info msg=“Done updating instance types” t=2018-08-18T06:02:52+0200

It says there that the error detail is Address already in use. That means that for some reason, there is some other service on your server that binds (listens) to port 53 (DNS, domain), and does that early enough (on booting up) so that LXD’s dnsmasq does not get a chance to bind/listen to that port itself. Therefore, dnsmasq from LXD does not start at all.

Run the following command when you get the problem to see what process is listen to port 53 (domain):

sudo lsof -i -n | grep domain | grep LISTEN

The output should be similar to

COMMAND     PID            USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq    3041          nobody    5u  IPv4  40163      0t0  TCP 127.0.1.1:domain (LISTEN)
dnsmasq    4031 libvirt-dnsmasq    6u  IPv4  45162      0t0  TCP 192.168.12.1:domain (LISTEN)
dnsmasq    9024          nobody    7u  IPv4  44615      0t0  TCP 10.80.80.1:domain (LISTEN)
...

In the above case, you would note the line that mentions 10.80.80.1 (the subnet of the LXD containers), and then write down the PID (process ID) 9024. Use ps to check what process has that 9024 PID. Show those details here.

The referenced thread that you linked is about an issue that relates to resolve.conf.

Thanks for the answer. I don’t know how you can tell from this error message that this is a conflict on binding port 53 but I’ll trust you on that. I’ve disabled the daemon restart I programmed daily as a temporary patch and the next time the network is down I’ll run the commands and post the output.

The problem doesn’t only occur at boot time, I think it might occur at boot time sometimes because of the snap refresh being triggered then, but it can also happen after the system has been up and running for days. The containers keep running but lose their ip address on the bridge interface. Then a LXD daemon restart is necessary to get networking running again.

The server does nothing else at the moment : vanilla ubuntu 16.04 and one LXD container (also 16.04)…

I’ll keep you posted, thanks again.

So here we go again: last night there was a snap refresh and the problem came back. It doesn’t seem any process is preventing dnsmasq from servicing the LXD container network 10.10.10.0/24

lxc ls

+----------+---------+------+------+------------+-----------+
|   NAME   |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------+---------+------+------+------------+-----------+
| quickbox | RUNNING |      |      | PERSISTENT | 1         |
+----------+---------+------+------+------------+-----------+

sudo lsof -i -n | grep domain | grep LISTEN

named     929  bind   21u  IPv6  12039      0t0  TCP *:domain (LISTEN)
named     929  bind   22u  IPv4  12043      0t0  TCP 127.0.0.1:domain (LISTEN)

ps 929

  PID TTY      STAT   TIME COMMAND
  929 ?        Ssl    0:00 /usr/sbin/named -f -u bind

/var/snap/lxd/common/lxd/logs# cat lxd.log

  lvl=info msg="LXD 3.4 is starting in normal mode" path=/var/snap/lxd/common/lxd t=2018-08-20T22:17:22+0200
  lvl=info msg="Kernel uid/gid map:" t=2018-08-20T22:17:22+0200
  lvl=info msg=" - u 0 0 4294967295" t=2018-08-20T22:17:22+0200
  lvl=info msg=" - g 0 0 4294967295" t=2018-08-20T22:17:22+0200
  lvl=info msg="Configured LXD uid/gid map:" t=2018-08-20T22:17:22+0200
  lvl=info msg=" - u 0 1000000 1000000000" t=2018-08-20T22:17:22+0200
  lvl=info msg=" - g 0 1000000 1000000000" t=2018-08-20T22:17:22+0200
  lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-20T22:17:22+0200
  lvl=info msg="Initializing local database" t=2018-08-20T22:17:22+0200
  lvl=info msg="Initializing database gateway" t=2018-08-20T22:17:22+0200
  address= id=1 lvl=info msg="Start database node" t=2018-08-20T22:17:22+0200
  lvl=info msg="Raft: Restored from snapshot 1-358942-1534621635284" t=2018-08-20T22:17:22+0200
  lvl=info msg="Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}]" t=2018-08-20T22:17:22+0200
  lvl=info msg="Raft: Node at 0 [Leader] entering Leader state" t=2018-08-20T22:17:22+0200
  lvl=info msg="Dqlite: starting event loop" t=2018-08-20T22:17:22+0200
  lvl=info msg="LXD isn't socket activated" t=2018-08-20T22:17:22+0200
  lvl=info msg="Starting /dev/lxd handler:" t=2018-08-20T22:17:22+0200
  lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock t=2018-08-20T22:17:22+0200
  lvl=info msg="REST API daemon:" t=2018-08-20T22:17:22+0200
  lvl=info msg=" - binding Unix socket" socket=/var/snap/lxd/common/lxd/unix.socket t=2018-08-20T22:17:22+0200
  lvl=info msg="Initializing global database" t=2018-08-20T22:17:22+0200
  lvl=info msg="Dqlite: handling new connection (fd=20)" t=2018-08-20T22:17:22+0200
  lvl=info msg="Dqlite: connected address=0 attempt=0" t=2018-08-20T22:17:22+0200
  lvl=info msg="Initializing storage pools" t=2018-08-20T22:17:22+0200
  lvl=info msg="Initializing networks" t=2018-08-20T22:17:22+0200
  err="Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.10.10.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.10.10.2,10.10.10.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd: dnsmasq: failed to create listening socket for fe80::9452:fcff:fe8e:c11b%lxdbr0: Address already in use" lvl=eror msg="Failed to bring up network" name=lxdbr0 t=2018-08-20T22:17:46+0200
  lvl=info msg="Pruning leftover image files" t=2018-08-20T22:17:46+0200
  lvl=info msg="Done pruning leftover image files" t=2018-08-20T22:17:46+0200
  lvl=info msg="Loading configuration" t=2018-08-20T22:17:46+0200
  lvl=info msg="Connected to MAAS controller" t=2018-08-20T22:17:46+0200
  lvl=info msg="Pruning expired images" t=2018-08-20T22:17:46+0200
  lvl=info msg="Done pruning expired images" t=2018-08-20T22:17:46+0200
  lvl=info msg="Expiring log files" t=2018-08-20T22:17:46+0200
  lvl=info msg="Updating instance types" t=2018-08-20T22:17:46+0200
  lvl=info msg="Updating images" t=2018-08-20T22:17:46+0200
  lvl=info msg="Done updating images" t=2018-08-20T22:17:46+0200
  lvl=info msg="Done expiring log files" t=2018-08-20T22:17:46+0200
  lvl=info msg="Done updating instance types" t=2018-08-20T22:17:50+0200
  lvl=info msg="Updating images" t=2018-08-21T04:17:46+0200
  lvl=info msg="Done updating images" t=2018-08-21T04:17:46+0200

journalctl -u snap.lxd.daemon | cat

  Aug 20 22:17:15 lxd-metal systemd[1]: Stopping Service for snap application lxd.daemon...
  Aug 20 22:17:16 lxd-metal lxd.daemon[15342]: => Stop reason is: snap refresh
  Aug 20 22:17:16 lxd-metal lxd.daemon[15342]: => Stopping LXD
  Aug 20 22:17:16 lxd-metal lxd.daemon[7757]: => LXD exited cleanly
  Aug 20 22:17:17 lxd-metal systemd[1]: Stopped Service for snap application lxd.daemon.
  Aug 20 22:17:21 lxd-metal systemd[1]: Started Service for snap application lxd.daemon.
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: => Preparing the system
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Loading snap configuration
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Setting up mntns symlink (mnt:[4026532197])
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Setting up kmod wrapper
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Preparing /boot
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Preparing a clean copy of /run
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Preparing a clean copy of /etc
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Setting up ceph configuration
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Setting up LVM configuration
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Rotating logs
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Setting up ZFS (0.6)
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Escaping the systemd cgroups
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: ==> Escaping the systemd process resource limits
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: => Re-using existing LXCFS
  Aug 20 22:17:21 lxd-metal lxd.daemon[15565]: => Starting LXD
  Aug 20 22:17:22 lxd-metal lxd.daemon[15565]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-20T22:17:22+0200
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]: mount namespace: 7
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]: hierarchies:
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   0: fd:   8: devices
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   1: fd:   9: memory
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   2: fd:  10: perf_event
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   3: fd:  11: blkio
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   4: fd:  12: freezer
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   5: fd:  13: pids
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   6: fd:  14: cpuset
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   7: fd:  15: hugetlb
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   8: fd:  16: cpu,cpuacct
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:   9: fd:  17: net_cls,net_prio
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]:  10: fd:  18: name=systemd
  Aug 20 22:17:31 lxd-metal lxd.daemon[7757]: lxcfs.c: 105: do_reload: lxcfs: reloaded
  Aug 20 22:17:46 lxd-metal lxd.daemon[15565]: err="Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.10.10.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.10.10.2,10.10.10.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd: dnsmasq: failed to create listening socket for fe80::9452:fcff:fe8e:c11b%lxdbr0: Address already in use" lvl=eror msg="Failed to bring up network" name=lxdbr0 t=2018-08-20T22:17:46+0200
  Aug 20 22:17:46 lxd-metal lxd.daemon[15565]: => LXD is ready

I didn’t restart the lxd daemon so I can proceed with further diagnostic if you have any other idea…

Thanks in advance.

So what you have here is BIND9 (named) that is listening to TCP *:domain (LISTEN), meaning to all interfaces. BIND9 should not listen to all interfaces (*:domain) because dnsmasq from LXD should handle exclusively the lxdbr0 network interface.

The BIND9 configuration file has support to either listen-on to specific interfaces or exclude specific interfaces. The default is to listen on any network interface.

My suggestion is to edit the BIND9 configuration file to specifically exclude lxdbr0 from named.

Ok I will try and let you know.

Meanwhile I have restarted the LXD daemon using:

sudo systemctl restart snap.lxd.daemon

and the container is back online with its ip address

lxc ls
+----------+---------+---------------------+------+------------+-----------+
|   NAME   |  STATE  |        IPV4         | IPV6 |    TYPE    | SNAPSHOTS |
+----------+---------+---------------------+------+------------+-----------+
| quickbox | RUNNING | 10.10.10.102 (eth0) |      | PERSISTENT | 1         |
+----------+---------+---------------------+------+------------+-----------+

and BIND9 is still listening using the same configuration (the pid remains the same, I didn’t restart it or anything) but somehow dnsmasq manages to listen on the 10.10.10.0/24 network again as it should and as it always does until the snap refresh occurs:

sudo lsof -i -n | grep domain | grep LISTEN
named      929  bind   21u  IPv6  12039      0t0  TCP *:domain (LISTEN)
named      929  bind   22u  IPv4  12043      0t0  TCP 127.0.0.1:domain (LISTEN)
dnsmasq  14407   lxd    7u  IPv4 667396      0t0  TCP 10.10.10.1:domain (LISTEN)

And no error is reported in the logs:

Aug 21 15:13:17 lxd-metal lxd.daemon[13842]: => Stop reason is: host shutdown
Aug 21 15:13:17 lxd-metal lxd.daemon[13842]: => Stopping LXD (with container shutdown)
Aug 21 15:13:21 lxd-metal lxd.daemon[15565]: lvl=warn msg="Unable to update backup.yaml at this time" name=quickbox t=2018-08-21T15:13:21+0200
Aug 21 15:13:22 lxd-metal lxd.daemon[15565]: => LXD exited cleanly
Aug 21 15:13:22 lxd-metal lxd.daemon[13842]: => Stopping LXCFS
Aug 21 15:13:23 lxd-metal lxd.daemon[13842]: umount: /var/snap/lxd/common/ns: not mounted
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: => Preparing the system
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Loading snap configuration
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Setting up mntns symlink (mnt:[4026532197])
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Setting up kmod wrapper
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Preparing /boot
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Preparing a clean copy of /run
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Preparing a clean copy of /etc
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Setting up ceph configuration
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Setting up LVM configuration
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Rotating logs
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Setting up ZFS (0.6)
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Escaping the systemd cgroups
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: ==> Escaping the systemd process resource limits
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: => Starting LXCFS
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: => Starting LXD
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: mount namespace: 5
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: hierarchies:
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   0: fd:   6: devices
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   1: fd:   7: memory
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   2: fd:   8: perf_event
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   3: fd:   9: blkio
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   4: fd:  10: freezer
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   5: fd:  11: pids
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   6: fd:  12: cpuset
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   7: fd:  13: hugetlb
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   8: fd:  14: cpu,cpuacct
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:   9: fd:  15: net_cls,net_prio
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]:  10: fd:  16: name=systemd
Aug 21 15:13:23 lxd-metal lxd.daemon[14242]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-21T15:13:23+0200
Aug 21 15:13:24 lxd-metal lxd.daemon[14242]: lvl=warn msg="Unable to update backup.yaml at this time" name=quickbox t=2018-08-21T15:13:24+0200
Aug 21 15:13:25 lxd-metal lxd.daemon[14242]: => LXD is ready

So I will try modifying the BIND9 configuration but I find it odd that it would only be a problem when LXD restarts after a snap refresh…

Actually it makes a lot of sense.

When a refresh happens, the LXD daemon is restarted WITHOUT any container being brought down or any interface being deleted/recreated. dnsmasq gets respawned as part of that, and fails because bind9 has since claimed it.

When you do a restart, the interface is destroyed, getting bind9 out of it, then LXD can spawn dnsmasq which happily binds it.

So yeah, you need to configure bind9 to only listen on the interfaces it’s supposed to listen on :slight_smile:

Ok thank you both for the information and explanations.

I reconfigured BIND9, I’ll let you know how it goes once the next refresh occurs!

You can simulate that code path with systemctl reload snap.lxd.daemon

Well, if this is the same process as a snap refresh, then all looks good now!

Aug 21 18:36:40 lxd-metal lxd.daemon[14242]: => LXD exited cleanly
Aug 21 18:36:40 lxd-metal lxd.daemon[22319]: => Stop reason is: reload
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: => Preparing the system
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Loading snap configuration
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Setting up mntns symlink (mnt:[4026532197])
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Setting up kmod wrapper
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Preparing /boot
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Preparing a clean copy of /run
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Preparing a clean copy of /etc
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Setting up ceph configuration
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Setting up LVM configuration
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Rotating logs
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Setting up ZFS (0.6)
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Escaping the systemd cgroups
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: ==> Escaping the systemd process resource limits
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: => Re-using existing LXCFS
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: => Starting LXD
Aug 21 18:36:41 lxd-metal lxd.daemon[22351]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-08-21T18:36:41+0200
Aug 21 18:36:43 lxd-metal lxd.daemon[22351]: => LXD is ready
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]: mount namespace: 7
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]: hierarchies:
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   0: fd:   8: devices
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   1: fd:   9: memory
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   2: fd:  10: perf_event
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   3: fd:  11: blkio
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   4: fd:  12: freezer
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   5: fd:  13: pids
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   6: fd:  14: cpuset
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   7: fd:  15: hugetlb
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   8: fd:  16: cpu,cpuacct
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:   9: fd:  17: net_cls,net_prio
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]:  10: fd:  18: name=systemd
Aug 21 18:36:59 lxd-metal lxd.daemon[14242]: lxcfs.c: 105: do_reload: lxcfs: reloaded

Thank you very much

Great!

Can you paste the relevant fragment of the named.conf configuration file that worked for you?

It would be useful for others that may have the same issue and want a tested solution.

Interesting you should ask, I’m not 100% certain what is going on so you might help me figure it out.

As I said, I don’t use IPv6 even though it’s enabled on the host server. So I didn’t configure the LXD bridge to use it. My network configuration is very simple:

lxc network show lxdbr0
config:
  ipv4.address: 10.10.10.1/24
  ipv4.nat: "true"
  ipv6.address: none
description: ""
name: lxdbr0
type: bridge
used_by:
- /1.0/containers/quickbox
managed: true
status: Created
locations:
- none

Since you mentioned my problem was with named binding on all interfaces here:

named     929  bind   21u  IPv6  12039      0t0  TCP *:domain (LISTEN)

I noticed this binding was only for IPv6 so I figured I could simply disable IPv6 support for BIND9 which I don’t need. I’m not sure this is the best way to do it but I edited /lib/systemd/system/bind9.service to add -4 to named start options:

ExecStart=/usr/sbin/named -4 -f $OPTIONS

Then I also edited the /etc/bind/named.conf.options file to explicitly restrict bind9 listening on the loopback interface:

listen-on { 127.0.0.1; };

And then I restarted bind9

sudo systemctl daemon-reload
sudo systemctl restart bind9

After which I have restarted the lxd daemon and also reloaded it to simulate the snap refresh code path.

I ended up with the following. Notice that dnsmasq now binds itself to an IPv6 interface which it didn’t do prior to disabling IPv6 for bind9.

sudo lsof -i -n | grep domain | grep LISTEN
named    28403  bind   21u  IPv4 744189      0t0  TCP 127.0.0.1:domain (LISTEN)
dnsmasq  29948   lxd    7u  IPv4 761889      0t0  TCP 10.10.10.1:domain (LISTEN)
dnsmasq  29948   lxd    9u  IPv6 761891      0t0  TCP [fe80::20e5:81ff:fee3:541c]:domain (LISTEN) 

And the initial error I reported from the logs was:
lxd: dnsmasq: failed to create listening socket for fe80::6c00:d8ff:fea8:6782%lxdbr0: Address already in use” lvl=eror msg=“Failed to bring up network” name=lxdbr0 t=2018-08-18T06:02:48+0200_

So maybe when dnsmasq can’t manage to bind on any interface this is the error it reports but when it does manage to bind at least on an IPv4 interface it silently accepts its fate on the lack of IPv6 binding… Maybe it’s unrelated (the IPv6 addresses do not match between the error message and actual binding). But anyway, this is where I’m at.

Again I don’t need IPv6 but I don’t mind digging further if I can be of help / use for others.

What do you make of this?