Ubuntu 18.04 infinite loop between dnsmasq and systemd-resolved

OK, let's do it!

It was a fresh installation of Xubuntu 18.04 64-bit with LUKS and ext4.

Installed with apt: lxd, strongswan.

I did have Docker installed but purged it just in case; I also tried debugging by installing the dnsmasq daemon, then purged it afterwards.

These are my config files

/etc/dnsmasq.d/lxd
server=/lxd/10.146.38.1
bind-interfaces
except-interface=lxdbr0

lxd network
config:
  ipv4.address: 10.146.38.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:82c5:af81:fa56::1/64
  ipv6.nat: "true"
description: ""
name: lxdbr0
type: bridge
used_by:
- /1.0/containers/d0
- /1.0/containers/d1
- /1.0/containers/d2
managed: true
status: Created
locations:
- none

I just tried on a fresh system with Xubuntu 18.04 and got the same thing; I only installed the LXD package, nothing else.
Something to do with NetworkManager, maybe?

I already tried your approach on DNS for LXC containers, but it didn't work.
My limitation is not knowing a thing about systemd-resolved or NetworkManager and how they interact. My only other choice, which I am not a fan of, is disabling systemd-resolved and installing dnsmasq for the whole system.

dnsmasq has a loop-detection option, but I do not know how to activate it on the LXD side of things.

My understanding of the problem is that resolved asks dnsmasq, and since dnsmasq also resolves names for the containers, it might ask resolved again and end up in a loop. I will run Wireshark on the lxdbr0 interface to confirm my suspicion.
This does not happen only when I dig d0.lxd; it goes into a loop just by digging google.com or any uncached DNS record.
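A lighter-weight way to confirm this suspicion, assuming tcpdump is available on the host, is to capture port-53 traffic on the bridge and the loopback interfaces while triggering an uncached lookup (interface and container names are the ones from the config above):

```shell
# Watch DNS traffic on the LXD bridge (run as root or with sudo)
sudo tcpdump -ni lxdbr0 port 53

# In a second terminal, watch the stub resolver and dnsmasq on loopback
sudo tcpdump -ni lo port 53

# In a third terminal, trigger a lookup that is not cached
dig google.com
```

If the loop is real, the same query will be seen bouncing repeatedly between 127.0.0.53, 127.0.0.1, and the lxdbr0 address instead of going out once and returning.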

I will search for an option to only query a DNS server for a specific TLD.
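On 18.04, one way to scope lookups like this is the per-link settings of systemd-resolved: point only the lxd domain at the bridge's dnsmasq. This is a sketch of the temporary (lost on reboot or interface changes) approach, using the bridge IP from the config above:

```shell
# Tell systemd-resolved: for the 'lxd' domain, ask the dnsmasq on lxdbr0
sudo systemd-resolve --interface=lxdbr0 --set-dns=10.146.38.1 --set-domain=lxd

# Verify the per-link DNS settings took effect
systemd-resolve --status lxdbr0
```

Because the DNS server is attached to the lxdbr0 link and scoped to the lxd domain, queries for other names should not be forwarded to it, which is exactly what avoids the loop.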

Found something: the loop is an IPv6 query that cannot be answered (because the record does not exist) and goes into an infinite loop between 127.0.0.1, 127.0.0.53, and the lxdbr0 IP.
There is also an extra set of initial IPv4 queries between these three, with correct answers, so a DNS query is made to everyone regardless.

Nice!

I just tried with IPv6 working, and I did not see any extra queries.

So the issue is to figure out what kind of IPv6 misconfiguration makes this problem appear.

So, disable IPv6 in NetworkManager?
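An alternative worth trying before touching NetworkManager, since the looping query is an IPv6 one on the bridge, is to disable IPv6 only on the LXD-managed bridge (a sketch; container names d0..d2 are the ones from the listing above):

```shell
# Turn off IPv6 on the LXD bridge only, leaving the rest of the system alone
lxc network set lxdbr0 ipv6.address none
lxc network set lxdbr0 ipv6.nat false

# Restart the containers so they pick up the new bridge configuration
lxc restart d0 d1 d2
```

This keeps IPv6 working for the host's real interfaces while removing it from the path where the loop was observed.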

OK, so I found these two bug reports on Ubuntu's Launchpad.

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1694156
https://bugs.launchpad.net/ubuntu/+source/dnsmasq/+bug/1672099

One solution provided was to use:

no-resolv
bind-interfaces
interface=lo
server=127.0.0.53

The issue is I don't have dnsmasq installed on my machine; I have dnsmasq-base, which comes as an LXD dependency.
I tried every configuration possible in /etc/dnsmasq.d/lxd and it doesn't seem to pick up any of it.

So how do I solve this? Do I try some stuff in /var/lib/lxd/networks/lxdbr0/dnsmasq.raw?
Is that the "correct" config file to mess around with, or do I install the dnsmasq package and mess around in /etc/dnsmasq.d/lxd?

dnsmasq --strict-order --bind-interfaces \
  --pid-file=/var/lib/lxd/networks/lxdbr0/dnsmasq.pid \
  --except-interface=lo --interface=lxdbr0 \
  --quiet-dhcp --quiet-dhcp6 --quiet-ra \
  --listen-address=10.146.38.1 \
  --dhcp-no-override --dhcp-authoritative \
  --dhcp-leasefile=/var/lib/lxd/networks/lxdbr0/dnsmasq.leases \
  --dhcp-hostsfile=/var/lib/lxd/networks/lxdbr0/dnsmasq.hosts \
  --dhcp-range 10.146.38.2,10.146.38.254,1h \
  -s lxd -S /lxd/ \
  --conf-file=/var/lib/lxd/networks/lxdbr0/dnsmasq.raw -u lxd

Sorry for being a pain; I've just been at this for the last couple of days.

I had a look at those two bug reports on Launchpad. The first references an additional report,
which describes an issue similar to the one we have with LXD, but they hit it with libvirt.
Here's the text:

In debugging bug #1694156, I found that ultimately my problem was triggered by a hard-coded /etc/resolvconf/resolv.conf.d/tail I had set once upon a time pointing to my libvirt dnsmasq server. It should not be necessary to manually edit /etc/resolvconf/resolv.conf.d/tail to register dnsmasq; instead, on a system where systemd-resolved is running, libvirt should use the DBUS protocol to register its dnsmasq with systemd-resolved, specifying both SetLinkDNS and SetLinkDomains. This would enable properly-scoped DNS lookups for only the hosts on the libvirt bridge, avoiding any possibility of DNS loops and avoiding the need for manual configuration.

To do this properly, libvirt does need to declare a link domain (SetLinkDomains) that doesn’t conflict with other public DNS, or other non-authoritative DNS that may be configured on the system. I would suggest using just ‘libvirt.’ as a TLD, by default.

For example implementation, please see ./src/dns-manager/nm-dns-systemd-resolved.c:send_updates() in the network-manager source.

It describes how to do this programmatically, which means there should be a way to translate it into additional command-line options in the configuration that LXD generates.

A quick fix for now would be to add --dns-loop-detect to the dnsmasq command at startup.
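Since LXD passes --conf-file pointing at dnsmasq.raw (visible in the command line above), the flag can likely be injected through LXD's raw.dnsmasq network key without editing anything by hand (a sketch; in the config-file syntax the option is written without leading dashes):

```shell
# Append dns-loop-detect to the dnsmasq configuration LXD generates;
# LXD restarts its dnsmasq automatically when the key changes
lxc network set lxdbr0 raw.dnsmasq dns-loop-detect

# Confirm the raw config landed in the file dnsmasq reads
cat /var/lib/lxd/networks/lxdbr0/dnsmasq.raw
```

The path above assumes the deb-packaged LXD; the snap keeps its networks directory elsewhere.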

As for the rest, I do not know where to start fixing this in LXD; as much as I would like to, it would probably take me more than a few weeks full time.

Should I open a ticket on GitHub?

Here is a similar report regarding K8s,

It explains the issue and it’s relevant to LXD.

Below is the man page entry for dnsmasq that explains what --dns-loop-detect does.

--dns-loop-detect

Enable code to detect DNS forwarding loops; ie the situation where a query sent to one of the upstream servers eventually returns as a new query to the dnsmasq instance. The process works by generating TXT queries of the form <hex>.test and sending them to each upstream server. The hex is a UID which encodes the instance of dnsmasq sending the query and the upstream server to which it was sent. If the query returns to the server which sent it, then the upstream server through which it was sent is disabled and this event is logged. Each time the set of upstream servers changes, the test is re-run on all of them, including ones which were previously disabled.
Source: Man page of DNSMASQ

I do not think that adding --dns-loop-detect would have a detrimental effect on the dnsmasq run by LXD.

(The rest is about being able to programmatically access systemd-resolved over D-Bus and set the configuration there to fix this.)

Therefore, do open a ticket on GitHub.

As the title, you may use something like: "Please add --dns-loop-detect option to dnsmasq run by LXD".
Mention that you have tried this and it worked for you. Also, point to this discussion.

Thanks for the help, will do :wink:


The solution posted by Stuart Langridge on Stack Exchange worked for me:

lxc network edit lxdbr0:

config:
  ipv4.address: 10.216.134.1/24
  ipv4.nat: "true"
  ipv6.address: none
  ipv6.nat: "true"
  raw.dnsmasq: |
    auth-zone=lxd
    dns-loop-detect
name: lxdbr0
type: bridge

Add the three lines starting with raw.dnsmasq.
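To check that the change actually reached the dnsmasq instance, something like the following should work (a sketch; the /var/lib/lxd path assumes the deb-packaged LXD, and the bridge IP is the one from the config above):

```shell
# Show the raw.dnsmasq value stored in the network config
lxc network get lxdbr0 raw.dnsmasq

# The same lines should appear in the file dnsmasq reads via --conf-file
cat /var/lib/lxd/networks/lxdbr0/dnsmasq.raw

# Both container names and external names should now resolve without looping
dig d0.lxd @10.216.134.1
dig google.com
```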

@simos,

the /lib/systemd/system/lxd-host-dns.service syntax you put in your blog post didn't work for me under Ubuntu 18.04. Here is what worked:

[Unit]
Description=LXD host DNS service
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/local/bin/lxdhostdns_start.sh
RemainAfterExit=true
ExecStop=/usr/local/bin/lxdhostdns_stop.sh
StandardOutput=journal

[Install]
WantedBy=multi-user.target

Notice the Type=simple and After= change.
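After editing a unit file like this, systemd needs to be told to re-read it before the change takes effect; a typical sequence would be:

```shell
# Reload unit definitions, then restart and inspect the service
sudo systemctl daemon-reload
sudo systemctl restart lxd-host-dns.service
systemctl status lxd-host-dns.service

# Make sure it starts on boot as well
sudo systemctl enable lxd-host-dns.service
```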

Thank you for your blog. There is little information out there about LXD, and what you provide is very informative.

Thanks for this.

I updated the post at https://blog.simos.info/how-to-use-lxd-container-hostnames-on-the-host-in-ubuntu-18-04/
to include

  1. the addition of auth-zone=lxd and dns-loop-detect in the LXD-managed network interface lxdbr0.
  2. the changes to the systemd service file.

Please report if I missed anything from the blog post or I need to explain something better.

You can do this, and it will work out of the box without any services for resolved.

mkdir /etc/systemd/resolved.conf.d
nano /etc/systemd/resolved.conf.d/lxdbr0.conf

and paste this inside

[Resolve]
# 10.146.38.1 is the IP of lxdbr0 (systemd config files do not support
# inline comments after a value, so the comment goes on its own line)
DNS=10.146.38.1
Domains=lxd
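For the drop-in to take effect, systemd-resolved has to be restarted; a quick way to apply and verify (d0 being one of the container names used earlier):

```shell
# Restart resolved so it reads the new drop-in under resolved.conf.d
sudo systemctl restart systemd-resolved

# Check that the DNS server and the lxd domain are now listed
systemd-resolve --status

# Resolve a container name through the stub resolver
systemd-resolve d0.lxd
```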

Also, what I found worked was to add a nameserver to raw.dnsmasq (server=8.8.8.8), because with dns-loop-detect it will otherwise also block normal queries to any other DNS (Google, Ubuntu, CentOS, etc.). I haven't tried auth-zone=lxd yet.


I tried the solutions in the two immediately preceding comments on this page.

@simos, the (permanent, updated) solution that appears in your blog post failed (on my fresh Ubuntu MATE 18.04 with snap LXD 3.6).

@rudiservo, your 3-line solution seems to work on my (completely fresh, again) system.

@simos, the failure of the "permanent solution" from the blog post can be seen here (I executed this command right after a boot):

root@ubuntu:~# systemctl status lxd-host-dns.service 
● lxd-host-dns.service - LXD host DNS service
   Loaded: loaded (/etc/systemd/system/lxd-host-dns.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2018-10-20 10:46:39 PDT; 4min 39s ago
  Process: 1745 ExecStart=/usr/local/bin/lxdhostdns_start.sh (code=exited, status=1/FAILURE)
 Main PID: 1745 (code=exited, status=1/FAILURE)

Oct 20 10:46:39 ubuntu systemd[1]: Started LXD host DNS service.
Oct 20 10:46:39 ubuntu lxdhostdns_start.sh[1745]: Device "lxdbr0" does not exist.
Oct 20 10:46:39 ubuntu lxdhostdns_start.sh[1745]: Unknown interface lxdbr0: No such device
Oct 20 10:46:39 ubuntu systemd[1]: lxd-host-dns.service: Main process exited, code=exited, status=1/FAILURE
Oct 20 10:46:39 ubuntu systemd[1]: lxd-host-dns.service: Failed with result 'exit-code'.
root@ubuntu:~# 

@simos, would it be possible please to examine @rudiservo's solution, to see if it's good?

It says that it cannot find an lxdbr0 network interface. Perhaps it is lxdbr1 in your case, or something else?

No, that is not the case. See here for proof (operating on the VM that was based on your blog post). Also, your temporary solution works, just not your permanent one:

user@ubuntu:~$ systemctl status lxd-host-dns.service 
● lxd-host-dns.service - LXD host DNS service
   Loaded: loaded (/etc/systemd/system/lxd-host-dns.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2018-10-20 12:46:19 PDT; 1min 5s ago
  Process: 1645 ExecStart=/usr/local/bin/lxdhostdns_start.sh (code=exited, status=1/FAILURE)
 Main PID: 1645 (code=exited, status=1/FAILURE)

Oct 20 12:46:19 ubuntu systemd[1]: Started LXD host DNS service.
Oct 20 12:46:19 ubuntu lxdhostdns_start.sh[1645]: Device "lxdbr0" does not exist.
Oct 20 12:46:19 ubuntu lxdhostdns_start.sh[1645]: Unknown interface lxdbr0: No such device
Oct 20 12:46:19 ubuntu systemd[1]: lxd-host-dns.service: Main process exited, code=exited, status=1/FAILURE
Oct 20 12:46:19 ubuntu systemd[1]: lxd-host-dns.service: Failed with result 'exit-code'.
user@ubuntu:~$ lxc network list
+--------+----------+---------+-------------+---------+
|  NAME  |   TYPE   | MANAGED | DESCRIPTION | USED BY |
+--------+----------+---------+-------------+---------+
| ens33  | physical | NO      |             | 0       |
+--------+----------+---------+-------------+---------+
| lxdbr0 | bridge   | YES     |             | 1       |
+--------+----------+---------+-------------+---------+
user@ubuntu:~$ 

Actually, even this hogs the CPU at 100% after a while. Start pinging .lxd domains, wait 2-3 minutes, and you'll see the CPU spike.

After having used simos's solution from his blog post, together with the dns-loop-detect option, I cannot nslookup any non-lxd domain from inside an LXD container. I use Ubuntu 18.04 server. Is that your experience too? Can you nslookup non-lxd domains from inside an LXD container?

Thanks

Hi guys

I've been heavily tinkering with this over the last month: basically, how to configure an Ubuntu 18.04 LXD host so it can resolve the container names from dnsmasq. The new thing in 18.04 (and also some 17.xx releases) is that it uses the new systemd-resolved, which has to be configured to "also use" dnsmasq.

I think I found the solution, but I am still trying to organize the information about how I ran the tests, and the results, so I can share it here in the forum. I ran the tests on Ubuntu 18.04 with Ansible (which automates ssh execution on remote IPs), so everything can be easily changed and reproduced. I plan to clean up the test code and upload it all to GitHub, so others can repeat and extend it as needed. This part is still ongoing.

In any case, the final conclusion was to have the following configuration done on the LXD host, to resolve the containers connected to lxdbr0:

set -o errexit
set -o pipefail
set -o nounset
                                                                                                                                                               
[[ -r /etc/systemd/resolved.conf.d/lxdbr0.conf ]] && exit 0
                                                                                                                                                               
# Config systemd-resolve, via drop-in file /etc/systemd/resolved.conf.d/lxdbr0.conf (better than via global /etc/systemd/resolved.conf)
# to also use the lxdbr0-dnsmasq as an additional dns
                                                                                                                                                               
DNSMASQ_LISTENNING_IP=$(/snap/bin/lxc network get lxdbr0 ipv4.address | sed 's_/.*__g')
  # 10.99.99.1

mkdir -p /etc/systemd/resolved.conf.d

cat <<EOT > /etc/systemd/resolved.conf.d/lxdbr0.conf
[Resolve]
DNS=${DNSMASQ_LISTENNING_IP} 8.8.8.8
  # ${DNSMASQ_LISTENNING_IP} - the ip of the lxbr0, where dnsmasq for lxdbr0 is listening
  #
  # 8.8.8.8    - a usable internet dns server
  #              Because setting this DNS= option disables the implicit
  #              fallback DNS (which is only in effect when DNS= was never defined).
  #              So now that we define DNS=, the fallback DNS will not be used
  #              anymore, and so we should also provide a usable internet DNS server,
  #              for example the public Google DNS, 8.8.8.8

#Domains=lxd
  # When this option is set, "nslookup C1" and "nslookup C1.lxd" will both work
  # When this option is not set, "nslookup C1" will not work, and only "nslookup C1.lxd" will work
  # I prefer to use fqdn C1.lxd to avoid possible confusions
EOT

# Apply config changes 
systemctl restart systemd-resolved.service
                                                                                                                                                               
# Ugly hack: restart systemd-resolved.service again after a short time.
# It seems that after the previous restart, sometimes systemd-resolved becomes aware of xxx.lxd and works normally,
# but sometimes (many times) it will not yet resolve the xxx.lxd domains consistently (or it will work for a while and then start failing).
# After a second restart, it seems to work correctly and consistently, resolving xxx.lxd as expected (and without failures).
# So I'll just add a "safeguard" second restart here, even though it should not be necessary and is an ugly hack.
# My personal guess is that maybe there is a minor bug in the current version of systemd-resolved. It's just a crazy, unfounded guess, though.
sleep 10 ; systemctl restart systemd-resolved.service

After the configuration is done, it can be tested with the following:

set -o errexit
set -o pipefail
set -o nounset
#
# Check changes made
cat /etc/systemd/resolved.conf.d/lxdbr0.conf
#journalctl -xeu systemd-resolved.service
systemd-resolve --status
                                                                                                                                                        
# Check resolution works both for xxx.lxd and internet hostnames
# For that, we will use www.google.com, and also create a group of ephemeral alpine containers to resolve their names
systemd-resolve www.google.com
TmpTestContainerList=$(echo ResolveTmpTest{1..10})
echo "${TmpTestContainerList}" | xargs -n1 /snap/bin/lxc launch --ephemeral images:alpine/edge
sleep 10 ; lxc list
for container_running in $(/snap/bin/lxc list -c=ns4 --format=csv | grep RUNNING | cut -f1 -d,); do 
  # C1
  systemd-resolve ${container_running}.lxd
done
echo "${TmpTestContainerList}" | xargs -n1 /snap/bin/lxc stop 
sleep 10 ; lxc list
                                                                                                                                                        
####### NOTES
## NOTE1:
##   When the LXD snap is disabled and then re-enabled, it's also necessary to restart systemd-resolved.service to make resolution of xxx.lxd work again.
##   This also happens if lxd is disabled, then the LXD host is rebooted, and lxd is enabled again; it will then be necessary to
##   "systemctl restart systemd-resolved" for resolution of .lxd to work again.
#
#       ub@lxcHost:~$ nslookup C1.lxd
#       Server:         127.0.0.53
#       Address:        127.0.0.53#53
#       
#       Non-authoritative answer:
#       Name:   C1.lxd
#       Address: 10.99.99.209
#       
#       ub@lxcHost:~$
#       ub@lxcHost:~$ sudo snap disable lxd
#       lxd disabled
#       ub@lxcHost:~$ nslookup C1.lxd
#       Server:         127.0.0.53
#       Address:        127.0.0.53#53
#       
#       ** server can't find C1.lxd: NXDOMAIN
#       
#       ub@lxcHost:~$ sudo snap enable lxd
#       lxd enabled
#       ub@lxcHost:~$
#       ub@lxcHost:~$ nslookup C1.lxd
#       Server:         127.0.0.53
#       Address:        127.0.0.53#53
#       
#       ** server can't find C1.lxd: NXDOMAIN
#       
#       ub@lxcHost:~$ sudo systemctl restart systemd-resolved
#       ub@lxcHost:~$ nslookup C1.lxd
#       Server:         127.0.0.53
#       Address:        127.0.0.53#53
#       
#       Non-authoritative answer:
#       Name:   C1.lxd
#       Address: 10.99.99.209
#       ** server can't find C1.lxd: NXDOMAIN
#       
#       ub@lxcHost:~$ nslookup C1.lxd
#       Server:         127.0.0.53
#       Address:        127.0.0.53#53
#       
#       Non-authoritative answer:
#       Name:   C1.lxd
#       Address: 10.99.99.209
#       
#       ub@lxcHost:~$ nslookup C1.lxd
#       Server:         127.0.0.53
#       Address:        127.0.0.53#53
#       
#       Non-authoritative answer:
#       Name:   C1.lxd
#       Address: 10.99.99.209

Basically, I got the impression that the documentation for systemd-resolved is not yet as good as it should be for something of this importance, but it can be done.

Also, when a DNS server is defined for systemd-resolved, the default implicit fallback public DNS servers are not used anymore, so we also need to add 8.8.8.8 (for example). On the other hand, I didn't detect any problems with high CPU, but I did notice (and commented about it) that under certain situations systemd-resolved needs to be restarted for the changes to be applied. Hopefully, interacting via D-Bus as @simos mentioned would work better and avoid that; it should work this way too without restarts, but the reality is that it does need to be restarted in some situations. :confused:

I'm trying to find some free time to clean it up and upload it, to share the details and hopefully make it possible for others to test it as needed.

I just saw now that there was some activity in this thread, and wanted to share this upfront.

Br

Hi @rudiservo, I would like to enable DNS for a 3-node cluster with a FAN network.
Are there any further steps I should do to enable your solution on my 3 nodes?
Thank you!