Small number of DNS requests in LXD cluster fail

I found that a small number of DNS requests (~0.2%) fail in my LXD cluster; has anyone seen this before?

Running LXD 4.9 installed from the snap, with a fan network.

Can reproduce with:

while sleep 0.1; do systemd-resolve minio-03.lxd &>/dev/null; echo $? >> resolv-status; done

then count failures with grep 1 resolv-status | wc -l and total attempts with wc -l resolv-status.
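
For a quick failure rate from the same file, something like this should also work (rough sketch):

fails=$(grep -c -v '^0$' resolv-status)   # runs whose exit status was not 0
total=$(wc -l < resolv-status)            # total attempts
echo "$fails failures out of $total requests"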

Please can you describe your cluster setup, specifically:

  • Are you using LXD’s built in dnsmasq DNS server?
  • Which host/container were you running the DNS requests from?
  • Which host was the container minio-03 running on (and was that different from the host you were running the DNS request test from)?
  • Are you using LXD’s built in dnsmasq DNS server?

Yes

  • Which host/container were you running the DNS requests from?

Running it from the minio-02 container on the hetzner-02 host.

  • Which host was the container minio-03 running on (and was that different from the host you were running the DNS request test from)?

Yes, it’s a different host. That container is running on the hetzner-03 host.

Thanks, and are you running a fan network between the hosts?

Can you show the output of resolvectl status?

And can you also show the output of the command you had in your original post that shows the errors?

And do you see the same issue when running dig rather than systemd-resolve? The reason I ask is that if you can recreate it with the dig command, we can then explore directly querying the other DNS servers in the cluster to try and locate the issue.

Yes, I’m running a fan network between hosts.

On minio-02:

Global
       LLMNR setting: no
MulticastDNS setting: no
  DNSOverTLS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 32 (eth0)
      Current Scopes: DNS
DefaultRoute setting: yes
       LLMNR setting: yes
MulticastDNS setting: no
  DNSOverTLS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
  Current DNS Server: 240.198.210.1
         DNS Servers: 240.198.210.1
          DNS Domain: lxd

The following file just shows the exit status of the systemd-resolve command: https://pb.theblazehen.com/resolv-status.txt

Will try doing the same test with dig; as it’s intermittent, it might take a while to show up.

Thanks, let me know.

The process of resolving DNS queries in a cluster is as follows:

  1. Container -> local dnsmasq process.
  2. The dnsmasq process tries to answer the query for local instances, and for non-local instances forwards it to the local forkdns process.
  3. The local forkdns process then forwards the DNS request to each of the remote cluster hosts’ forkdns processes sequentially, trying to get an answer.
  4. When the request is received by a remote forkdns process, it inspects the dnsmasq leases file on that host and tries to answer the query. The answer is then returned to the original forkdns process and back to the local dnsmasq process, which relays it to the requestor.

So you can see there are a few moving parts there, and using dig will allow you to use the @ notation to query each part directly.

For instance, dig @<lxd host IP> -p 1053 will allow you to query the local and remote forkdns processes directly.
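
For example, these two queries hit the two different stages on a given host (using the same placeholder for the host's fan address):

dig minio-03.lxd A @<lxd host IP>           # dnsmasq, on the standard port 53
dig minio-03.lxd A @<lxd host IP> -p 1053   # forkdns, on port 1053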

Thanks,

It seems like the forkdns service on hetzner-03 isn’t responding to any requests for containers on that node.

root@minio-02:/home/ubuntu# for c in caddy-03 minio-03 postgres-03; do dig $c.lxd @240.200.13.1 -p 1053; done

; <<>> DiG 9.16.1-Ubuntu <<>> caddy-03.lxd @240.200.13.1 -p 1053
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34828
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;caddy-03.lxd.                  IN      A

;; Query time: 79 msec
;; SERVER: 240.200.13.1#1053(240.200.13.1)
;; WHEN: Mon Jan 11 09:57:42 UTC 2021
;; MSG SIZE  rcvd: 30


; <<>> DiG 9.16.1-Ubuntu <<>> minio-03.lxd @240.200.13.1 -p 1053
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 63242
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;minio-03.lxd.                  IN      A

;; Query time: 83 msec
;; SERVER: 240.200.13.1#1053(240.200.13.1)
;; WHEN: Mon Jan 11 09:57:42 UTC 2021
;; MSG SIZE  rcvd: 30


; <<>> DiG 9.16.1-Ubuntu <<>> postgres-03.lxd @240.200.13.1 -p 1053
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61125
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;postgres-03.lxd.               IN      A

;; Query time: 79 msec
;; SERVER: 240.200.13.1#1053(240.200.13.1)
;; WHEN: Mon Jan 11 09:57:42 UTC 2021
;; MSG SIZE  rcvd: 33

Will see if I can find anything in the logs.

Well, if that were the case you’d never get a response, so I suspect it’s something else at play.

Check sudo ss -ulpn for the forkdns listener on the host running minio-03 and confirm it’s listening on the IP you expect.
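
For example (forkdns listens on UDP port 1053, as in the dig commands above):

sudo ss -ulpn | grep 1053   # should show the forkdns listener bound to the IP you expect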

Yep, I’ve got it listening on the correct IP and port. I see /var/snap/lxd/common/lxd/logs/forkdns.lxdfan0.log is empty on all the hosts; is it possible to increase the log level, perhaps?

Can you query it using dig locally from minio-03?

Ah, you have to disable the recursion flag when querying the forkdns server; let me see how to do that with dig.

Here’s an example of querying a remote forkdns process from a container:

You need to disable recursion.

lxc exec c1 -- dig @240.7.0.1 -p 1053 +norecurse A c2.lxd

We do this to stop loops in requests. When a query is forwarded from dnsmasq to forkdns, the recursion flag is enabled. The local forkdns process then strips it and marks recursion as disabled when forwarding to the remote forkdns process.

The remote forkdns process will only answer non-recursive requests from its local database.
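
For instance, with the hypothetical addresses from the example above, the difference looks like this:

lxc exec c1 -- dig @240.7.0.1 -p 1053 A c2.lxd              # recursion requested: the remote forkdns won't answer this from its local database
lxc exec c1 -- dig @240.7.0.1 -p 1053 +norecurse A c2.lxd   # recursion disabled: answered from the leases on the remote host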

Just realized that my previous output was querying the DNS server on hetzner-02, not hetzner-03.

A table showing the output of e.g. dig +short minio-02.lxd -p 1053 @240.198.210.1:


+----------------+------------+------------------+
| Container name | DNS server |      Status      |
+----------------+------------+------------------+
| minio-01.lxd   | hetzner-01 | resolves         |
| minio-02.lxd   | hetzner-01 | resolves         |
| minio-03.lxd   | hetzner-01 | resolves         |
| minio-01.lxd   | hetzner-02 | resolves         |
| minio-02.lxd   | hetzner-02 | does not resolve |
| minio-03.lxd   | hetzner-02 | resolves         |
| minio-01.lxd   | hetzner-03 | resolves         |
| minio-02.lxd   | hetzner-03 | resolves         |
| minio-03.lxd   | hetzner-01 | resolves         |
+----------------+------------+------------------+
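
For anyone wanting to repeat this, nested loops along these lines should reproduce the matrix (the IPs are placeholders for each host's fan DNS address):

for server in <hetzner-01 IP> <hetzner-02 IP> <hetzner-03 IP>; do
        for c in minio-01 minio-02 minio-03; do
                echo -n "$c.lxd via $server: "
                dig +short $c.lxd -p 1053 @$server
        done
done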

I can repeatedly get no result with the minio-02 and hetzner-02 resolver combination. Testing with +norecurse now.

OK, it seems like each forkdns does correctly give the IP for the container running on its host in the last few manual tests I ran. Going to try to automate the test now.


Running

while :; do
        output=$(dig minio-01.lxd -p 1053 +norecurse +short A @240.197.151.1)
        if [ -z "$output" ]; then echo minio-01.lxd / hetzner-01.lxd fails; fi

        output=$(dig minio-02.lxd -p 1053 +norecurse +short A @240.198.210.1)
        if [ -z "$output" ]; then echo minio-02.lxd / hetzner-02.lxd fails; fi

        output=$(dig minio-03.lxd -p 1053 +norecurse +short A @240.200.13.1)
        if [ -z "$output" ]; then echo minio-03.lxd / hetzner-03.lxd fails; fi
done

For a few minutes now, no errors yet

Still nothing. Could it be on the dnsmasq side, perhaps?

Can you recreate the issue using dig to dnsmasq?
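
For example, something along these lines against the dnsmasq listener (reusing the DNS server address from your resolvectl output, port 53 rather than 1053):

while :; do
        output=$(dig minio-03.lxd +short A @240.198.210.1)
        if [ -z "$output" ]; then echo minio-03.lxd via dnsmasq fails; fi
done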

Testing that now, nothing yet

Maybe an intermittent network issue?