Lxd containers lose dnsmasq connection (was: networking disappears)


#1

I haven’t been working with or on my lxd containers for a while (maybe a couple weeks). When I left things the following was true.

lxc list
+----------------------+---------+----------------------+---------------------------------------------+------------+-----------+
|         NAME         |  STATE  |         IPV4         |                    IPV6                     |    TYPE    | SNAPSHOTS |
+----------------------+---------+----------------------+---------------------------------------------+------------+-----------+
| debian               | RUNNING | 10.15.189.253 (eth0) | fd42:a80e:56f1:cd:216:3eff:fea2:1fbc (eth0) | PERSISTENT | 0         |
+----------------------+---------+----------------------+---------------------------------------------+------------+-----------+
| debiancalendar       | RUNNING | 10.15.189.114 (eth0) | fd42:a80e:56f1:cd:216:3eff:fe6e:4bcf (eth0) | PERSISTENT | 0         |
+----------------------+---------+----------------------+---------------------------------------------+------------+-----------+
| debianinternalserver | RUNNING | 10.15.189.99 (eth0)  | fd42:a80e:56f1:cd:216:3eff:fe05:4dcc (eth0) | PERSISTENT | 0         |
+----------------------+---------+----------------------+---------------------------------------------+------------+-----------+
| debiansmallserver    | RUNNING | 10.15.189.100 (eth0) | fd42:a80e:56f1:cd:216:3eff:fea5:af4 (eth0)  | PERSISTENT | 0         |
+----------------------+---------+----------------------+---------------------------------------------+------------+-----------+

Today I wanted to examine some files in debianrecordkeeping, but there are no network connections, and
this is what I get.

lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+

Any ideas as to what I have changed?
(Nothing advertently!!)

How do I reestablish connections?
I have tried restarting the container in question. Have also tried stop and then start. Not sure what to look for although I spent some time browsing in the /var/log/memyself file (is there a mountain of stuff there!) but I’m not clear on what to do to reattach the containers to the virtual network.

TIA


#2

What happens here is that the containers did not manage to get an IP address from the LXD dnsmasq DHCP server.

Run on the host

ps ax | grep dnsmasq

Then, try to identify a process that looks like

4484 ? S 0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 ......

  1. If you cannot find such a dnsmasq process, restart LXD and it should spawn it again.

  2. If the process is running but the containers still get no address, look at the full command line. It points to the directory where dnsmasq keeps the DHCP leases. Check in there to verify that dnsmasq is working.
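
Both checks can be scripted. The following is only a minimal sketch: `find_lxd_dnsmasq` is a made-up helper name, and it assumes the LXD bridge is called lxdbr0 as in the example line above. It reads `ps ax` output on standard input and keeps only the dnsmasq line that belongs to that bridge.

```shell
#!/bin/sh
# Hypothetical helper (not part of LXD): filter `ps ax` output from stdin
# down to the dnsmasq process that was started for the given bridge.
find_lxd_dnsmasq() {
    bridge="${1:-lxdbr0}"   # assumed default bridge name
    # '[d]nsmasq' matches "dnsmasq" but not the text of this grep command,
    # and -F -e lets us search for a literal string that starts with dashes.
    grep '[d]nsmasq' | grep -F -e "--interface=$bridge"
}
```

On the host you would run `ps ax | find_lxd_dnsmasq lxdbr0`. No output (and a non-zero exit status) suggests that LXD's dnsmasq is not running.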


#3

ps ax | grep dnsmasq
23519 pts/2 S+ 0:00 grep dnsmasq

It would appear that I do have dnsmasq working - - - but all that other stuff- - - nope!

Looking into the ‘file’ you pointed out - - - well there is no such thing. There is a dnsmasq.hosts, dnsmasq.leases and dnsmasq.raw but no dnsmasq.pid . Not sure what that means though.

Thanks for the help!


#4

This means that there is no dnsmasq running. And since there is no dnsmasq running, there is no dnsmasq.pid.
If you really had a dnsmasq running, you would get a line that mentions both dnsmasq and lxd (as I show above).

Therefore, your task is to figure out how to get dnsmasq running (the dnsmasq that LXD is launching).


#5

Ok - - - some results.
Network-manager service needed a restart.
Dnsmasq needed to be restarted.

ps ax | grep dnsmasq
26600 ? S 0:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -r /run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,19036,8,2,49aac11d7b6f6446702e54a1607371607a1a41855200fd2ce1cdde32f24e8fb5 --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
29642 pts/2 S+ 0:00 grep dnsmasq

except that does not give me a working dnsmasq in the /snap/lxd instance.

I have been trying a number of different things (same things as worked for the host system) and things ‘look normal’ from the host side.
lxdbr0 is listed in ifconfig and has (what I think is all) the correct values.

But I can’t seem to get the containers to use dnsmasq or connect.
What could I be doing to affect connection between lxdbr0 and my containers?


#6

That dnsmasq is not the one that LXD is managing. That dnsmasq is the one provided by the distribution. The command line should mention lxd and lxdbr0.

I suggest filing an issue at https://github.com/lxc/lxd. By doing so, you will be asked to produce a lot of info that helps in debugging the problem.


#7

Issue filed.

Any ideas for resolution from this group?

These containers were set up (and in the process of being set up) for managing and doing things for various parts of my businesses. For business stuff - - - - I get really stressed when things die inexplicably. Also usually means that I need to find a different solution.

Is this possibly as a result of upgrading the system? (bind9 packages were upgraded on the last upgrade)


#8

Add the URL to the issue on github here, so that people can figure out what is going on from the details you provided there.

Also, edit the title of the issue on github to something descriptive like “LXD no longer able to spawn dnsmasq process”.

There is a dnsmasq package that the distribution provides, and there is a copy of dnsmasq that is specific to LXD and ships inside the LXD package. You do not need to install the dnsmasq package manually.
The problem is that the LXD dnsmasq executable is not running for some reason.

Normally, if you restart LXD, the LXD dnsmasq may start. I assume that you have tried this.


#9

is the connection to github.

Unable to edit title only able to edit the body of the text.

Oh yes tried doing restart all and also individual containers.


#10

Some hopefully pertinent information from the host /var/log

Jan 20 14:02:01 debianserver CRON[31689]: (logcheck) CMD ( if [ -x /usr/sbin/logcheck ]; then nice -n10 /usr/sbin/logcheck; fi)
Jan 20 14:02:02 debianserver lxd.daemon[31244]: err="Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.15.189.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.15.189.2,10.15.189.254,1h --listen-address=fd42:a80e:56f1:cd::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u nobody: dnsmasq: directory /etc/resolv.conf for resolv-file is missing, cannot poll" lvl=eror msg="Failed to bring up network" name=lxdbr0 t=2018-01-20T20:02:02+0000
Jan 20 14:02:02 debianserver lxd.daemon[31244]: lvl=warn msg="Failed to update instance types: Get https://images.linuxcontainers.org/meta/instance-types/.yaml: lookup images.linuxcontainers.org on [::1]:53: read udp [::1]:37899->[::1]:53: read: connection refused" t=2018-01-20T20:02:02+0000
Jan 20 14:02:02 debianserver kernel: [3310848.371168] lxdbr0: port 1(vethRY1O8U) entered blocking state

Is there some way to reverse that ‘Failed to run: dnsmasq . . . .’ ?

Jan 20 14:01:48 debianserver NetworkManager[1849]: [1516478508.1871] device (lxdbr0): Activation: successful, device activated.
Jan 20 14:01:48 debianserver nm-dispatcher: req:2 ‘up’ [lxdbr0]: new request (2 scripts)
Jan 20 14:01:48 debianserver nm-dispatcher: req:2 ‘up’ [lxdbr0]: start running ordered scripts…
Jan 20 14:01:48 debianserver avahi-daemon[1771]: Leaving mDNS multicast group on interface lxdbr0.IPv6 with address fe80::146b:ceff:fe8c:53fe.
Jan 20 14:01:48 debianserver avahi-daemon[1771]: Joining mDNS multicast group on interface lxdbr0.IPv6 with address fd42:a80e:56f1:cd::1.
Jan 20 14:01:48 debianserver avahi-daemon[1771]: Registering new address record for fd42:a80e:56f1:cd::1 on lxdbr0.*.
Jan 20 14:01:48 debianserver avahi-daemon[1771]: Withdrawing address record for fe80::146b:ceff:fe8c:53fe on lxdbr0.
Jan 20 14:01:48 debianserver systemd[1]: Reloading OpenBSD Secure Shell server.
Jan 20 14:01:48 debianserver systemd[1]: Reloaded OpenBSD Secure Shell server.
Jan 20 14:01:48 debianserver dnsmasq[31439]: directory /etc/resolv.conf for resolv-file is missing, cannot poll
Jan 20 14:01:48 debianserver dnsmasq[31439]: FAILED to start up
Jan 20 14:01:48 debianserver systemd[1]: Reloading OpenBSD Secure Shell server.
Jan 20 14:01:48 debianserver systemd[1]: Reloaded OpenBSD Secure Shell server.
Jan 20 14:01:49 debianserver dnsmasq[31526]: directory /etc/resolv.conf for resolv-file is missing, cannot poll
Jan 20 14:01:49 debianserver dnsmasq[31526]: FAILED to start up

This might also be pertinent (I think!!).


#11

This looks to be the issue.
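
To pull such lines out of a large log, a filter like the one below may help. It is a rough sketch, not an official tool: `dnsmasq_errors` is a hypothetical name, and the keywords it matches are simply the ones visible in the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: read syslog-style lines on stdin and keep only the
# ones where dnsmasq reports a problem.
dnsmasq_errors() {
    # First keep messages written by dnsmasq ("dnsmasq[1234]: " or "dnsmasq: "),
    # then keep those that mention a failure keyword (case-insensitive).
    grep -E 'dnsmasq(\[[0-9]+\])?: ' | grep -E -i 'failed|missing|cannot'
}
```

Usage would be something like `dnsmasq_errors < /var/log/syslog`; the log file location is an assumption and differs between systems.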

I do not know whether you made other changes to the server that affect LXD. For the following, I assume that there might be some issue with the latest snap of LXD, as used on Debian.
If that is the case, here is how you would revert to an earlier (or later) revision of a snap:

$ snap list --all lxd
Name  Version  Rev   Developer  Notes
lxd   2.21     5373  canonical  disabled
lxd   2.21     5408  canonical  disabled
lxd   2.21     5447  canonical  -

There are three revisions of LXD on your system (snap always keeps the last three revisions handy). Rev 5447 is the latest, and is in use. Let’s revert back to Rev 5408 and see how it looks.

$ snap revert lxd
lxd reverted to 2.21

$ snap list --all lxd
Name  Version  Rev   Developer  Notes
lxd   2.21     5373  canonical  disabled
lxd   2.21     5408  canonical  -
lxd   2.21     5447  canonical  disabled

Now, try again and see if dnsmasq manages to start.
If it still does not work, then you can revert back once more and test again.

If at the end it still does not work, then it is likely that something else is messing up the system.
In any case, you can switch back to the latest revision Rev 5447 and continue looking for a solution, by running

$ snap refresh --revision 5447 lxd
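
When switching back and forth it is easy to lose track of which revision is active. Here is a small sketch, assuming the column layout shown in the `snap list --all lxd` output above (revision in the third column, `disabled` in the Notes column for inactive revisions). `active_revision` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `snap list --all <name>` output on stdin and
# print the revision whose Notes column is not marked "disabled".
active_revision() {
    # NR > 1 skips the header row; $NF is the last column (Notes),
    # $3 is the Rev column in the layout shown above.
    awk 'NR > 1 && $NF !~ /disabled/ { print $3 }'
}
```

Usage: `snap list --all lxd | active_revision`, run before and after each `snap revert lxd`.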


#12

Started as suggested

snap list --all lxd

Name Version Rev Developer Notes
lxd 2.21 5408 canonical disabled
lxd 2.21 5447 canonical disabled
lxd 2.21 5490 canonical -

Somehow I have a newer than expected version

root@debianserver:/# snap revert lxd
lxd reverted to 2.21
root@debianserver:/# snap list --all lxd
Name Version Rev Developer Notes
lxd 2.21 5408 canonical disabled
lxd 2.21 5447 canonical -
lxd 2.21 5490 canonical disabled

Checking on things:

$ lxc start debian
error: The container is already running
Try lxc info --show-log debian for more info
$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
$ lxc stop debian
$ lxc start debian
$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+

Things are still not where they should be. Still no networking. On to a 2nd reversion after checking things.

$ lxc stop debian

snap list --all lxd

Name Version Rev Developer Notes
lxd 2.21 5408 canonical disabled
lxd 2.21 5447 canonical -
lxd 2.21 5490 canonical disabled

snap revert lxd

lxd reverted to 2.21

snap list --all lxd

Name Version Rev Developer Notes
lxd 2.21 5408 canonical -
lxd 2.21 5447 canonical disabled
lxd 2.21 5490 canonical disabled
$ lxc start debian
error: The container is already running
Try lxc info --show-log debian for more info

This time ALL the containers are running!

$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+

Trying, one more time, to get networking started.

$ lxc stop debian
$ lxc start debian
$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+

Thinking that perhaps if I stop all containers that maybe I’ll have better luck.

$ lxc stop debian
$ lxc stop debiancalendar
$ lxc stop debianinternalserver
$ lxc stop debianrecordkeeping
$ lxc stop debiansmallserver
$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
$ lxc restart debian
error: The container is already stopped
Try lxc info --show-log debian for more info
$ lxc start debian
$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | RUNNING |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
$ lxc stop debian
$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+

Moving things back up to the previously newest version.

snap refresh --revision 5490 lxd

lxd 2.21 from ‘canonical’ refreshed

This next result is, imo, somewhat unexpected and therefore interesting.

$ lxc list
LXD socket not found; is LXD installed and running?
$ lxc info
LXD socket not found; is LXD installed and running?

snap revert lxd

lxd reverted to 2.21

snap list --all lxd

Name Version Rev Developer Notes
lxd 2.21 5408 canonical -
lxd 2.21 5490 canonical disabled

I’m down to two versions now. Don’t know what happened to the third!

$ lxc list
+----------------------+---------+------+------+------------+-----------+
|         NAME         |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+----------------------+---------+------+------+------------+-----------+
| debian               | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiancalendar       | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianinternalserver | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debianrecordkeeping  | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+
| debiansmallserver    | STOPPED |      |      | PERSISTENT | 0         |
+----------------------+---------+------+------+------------+-----------+

So - - - doesn’t seem like the exercise changed my situation and don’t really know what to make of the stored version that disappeared.

Where to next please?


#13

Your goal is to get the LXD dnsmasq to run.
When you switch between LXD snap versions, you need some way to verify whether dnsmasq is running.

Command to check whether LXD’s dnsmasq is running: ps ax | grep dnsmasq | grep lxd
Command to restart LXD: sudo systemctl restart snap.lxd.daemon.service

Also, note that when you start/restart a container, it takes a few seconds to actually start and get an IP from DHCP from LXD’s dnsmasq. Therefore, the conclusive test is to see whether LXD’s dnsmasq is running.
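
Since it takes a moment for things to come up, the check can be repeated for a few seconds after restarting LXD. This is a sketch under assumptions: `retry` and `lxd_dnsmasq_running` are hypothetical helper names, and the one-second pause between attempts is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: run a check command up to N times, one second apart,
# and succeed as soon as the check succeeds.
retry() {
    attempts="$1"
    shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# The check from this post: is there a dnsmasq process that mentions lxd?
# '[d]nsmasq' keeps the grep command itself out of the result.
lxd_dnsmasq_running() {
    ps ax | grep '[d]nsmasq' | grep -q lxd
}
```

For example: `sudo systemctl restart snap.lxd.daemon.service && retry 10 lxd_dnsmasq_running`. If that still fails after ten tries, dnsmasq most likely did not start at all.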

In your case, the problem appears to be:

Jan 20 14:02:02 debianserver lxd.daemon[31244]: err="Failed to run: dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.15.189.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.15.189.2,10.15.189.254,1h --listen-address=fd42:a80e:56f1:cd::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u nobody: dnsmasq: directory /etc/resolv.conf for resolv-file is missing, cannot poll" lvl=eror msg="Failed to bring up network" name=lxdbr0 t=2018-01-20T20:02:02+0000

Therefore, this is what you should work on.

If you search for “dnsmasq: directory /etc/resolv.conf for resolv-file is missing, cannot poll”, you get quite a few results from the Debian bug tracking system. It might be an issue related to Debian. Read through the results to figure out what workaround exists.

Finally, when you paste text from the terminal, it is good to select it and then apply the appropriate style (preformatted text or blockquote). It makes it easier to read.


(Stéphane Graber) #14

Sorry about the delay, have been travelling. I suspect this is a snap-related issue for yet another /etc/resolv.conf setup that we haven’t seen yet :)

I’ve asked for some additional information in the Github issue which should help me track this down.


#15

Information delivered over the weekend. Am looking forward to getting these particular gremlins punted!!


#16

OK - - - I was able to do snap refresh and I would appear to have a working lxd instance again.
Thank you for the fix.

How do I find out what went wrong?
What caused the issue?


(Stéphane Graber) #17

I’m not sure what would have caused the initial break, but the reason for the issue was that /etc/resolv.conf on your host is a symlink to somewhere under /etc/resolvconf/…

This path doesn’t exist in the restricted /etc that the snap has access to and so /etc/resolv.conf appeared as a broken symlink inside the container, causing DNS resolution to fail.

I’ve now whitelisted /etc/resolvconf in the LXD snap, which fixes that issue.
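
For anyone wanting to check their own host, the sketch below reports how a path such as /etc/resolv.conf is set up. The helper names `is_broken_symlink` and `describe_resolv_conf` are made up for illustration, and it relies on the `readlink` utility being available. Keep in mind that on the host the symlink will normally look fine; it only appeared broken from within the snap's restricted view of /etc.

```shell
#!/bin/sh
# Hypothetical helper: true if the path is a symlink whose target is missing.
is_broken_symlink() {
    # -L tests the link itself; -e follows the link and fails if the
    # target does not exist.
    [ -L "$1" ] && [ ! -e "$1" ]
}

# Hypothetical helper: print a one-line description of the given path
# (defaults to /etc/resolv.conf).
describe_resolv_conf() {
    path="${1:-/etc/resolv.conf}"
    if is_broken_symlink "$path"; then
        echo "broken symlink -> $(readlink "$path")"
    elif [ -L "$path" ]; then
        echo "symlink -> $(readlink "$path")"
    elif [ -e "$path" ]; then
        echo "not a symlink"
    else
        echo "missing"
    fi
}
```

Running `describe_resolv_conf` on the host shows where the link points; a target under /etc/resolvconf matches the setup described above.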


#18

Thank you for your reply.
I know that by me asking questions I get to understand what’s behind the scenes a little better. I may not ever become fluent in those structures but hopefully I’ll be less ‘lost’ when problems come up, as they always seem to given time!

Thanks for your assistance!!