I’m trying to understand an intermittent DNS failure I’m seeing on an OVN-backed Incus cluster, and whether this is expected behavior or a misconfiguration on my end.
My understanding is that OVN works by intercepting UDP DNS requests headed to the DNS server and answering them itself.
I’m running a 3-node Incus cluster (all hosts and affected instances on Debian trixie). The upstream DNS server is outside the OVN network. Instances use systemd-resolved.
What I’m seeing intermittently is:
- Instance-to-instance name resolution for .incus names suddenly starts returning NXDOMAIN.
- The NXDOMAIN result is then cached by systemd-resolved.
- Once cached, instances fail to connect to each other by name until I manually run: resolvectl flush-caches.
My questions are:
- Is this behavior expected given OVN’s UDP-only DNS interception model?
- Is there a recommended configuration to avoid negative caching of .incus records in this setup (for example, exposing an authoritative DNS service for .incus over both UDP and TCP)?
- Or is this a case where relying on OVN DNS interception for .incus names is not intended to be robust with systemd-resolved?
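One way to test the UDP-only hypothesis behind these questions would be to ask the same upstream server for the same name over UDP and over TCP and compare the response codes. This is a rough sketch, not something I have confirmed against OVN's behavior: it assumes dig is installed inside the instance (it may not be by default), and dig_rcode and compare_transports are hypothetical helper names.

```shell
#!/bin/sh
# Command used to run dig; override to run it inside an instance,
# e.g. DIG="incus exec caddy -- dig" (assumes dig exists in the instance).
DIG=${DIG:-dig}

# Print the response code (NOERROR, NXDOMAIN, ...) from dig output on stdin.
dig_rcode() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query one name against one server over UDP, then TCP, and print both codes.
compare_transports() {
    name=$1
    server=$2
    udp=$($DIG +notcp +noall +comments "$name" @"$server" | dig_rcode)
    tcp=$($DIG +tcp +noall +comments "$name" @"$server" | dig_rcode)
    echo "$name via $server: udp=$udp tcp=$tcp"
}

# Example (not run automatically):
#   DIG="incus exec caddy -- dig" compare_transports pocket-id.incus 192.168.1.2
```

If UDP returns NOERROR while TCP returns NXDOMAIN, that would be consistent with only UDP queries being intercepted. Matching results would point somewhere else.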
Details
All relevant instances are on the same OVN network:
caddy 10.22.22.100 (OVN)
pocket-id 10.22.22.114 (OVN)
lldap 10.22.22.103 (OVN)
(Networking works by IP; the issue is name resolution.)
Observed failure
From one instance (caddy), name resolution for another instance intermittently fails:
incus exec caddy -- ping -c3 pocket-id
ping: pocket-id: Temporary failure in name resolution
At the same time, resolution for another .incus name succeeds:
incus exec caddy -- ping -c3 lldap
PING lldap.incus (10.22.22.103) ...
64 bytes from 10.22.22.103: icmp_seq=1 ttl=64 time=2.95 ms
(This shows OVN connectivity is fine and .incus resolution is not globally broken.)
Resolver state during failure
Querying the resolver directly shows NXDOMAIN for the failing name:
incus exec caddy -- resolvectl query pocket-id
pocket-id: Name 'pocket-id' not found
While a working name is served from cache:
incus exec caddy -- resolvectl query lldap
lldap: 10.22.22.103 -- link: eth0
(lldap.incus)
-- Data from: cache
Proof of negative caching
The NXDOMAIN result appears to be cached by systemd-resolved:
incus exec caddy -- resolvectl statistics
incus exec caddy -- resolvectl query pocket-id
incus exec caddy -- resolvectl statistics
Relevant delta:
Cache Hits: 1340 → 1342
Cache Misses: 441 → 441
(This indicates the NXDOMAIN is being served from cache, with no new upstream lookup.)
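For anyone who wants to repeat this measurement, the before/after comparison can be scripted. This is a minimal sketch that assumes the statistics output contains "Cache Hits:" and "Cache Misses:" lines as in the delta above; stat_value and cache_delta are hypothetical helper names.

```shell
#!/bin/sh
# Extract one counter (e.g. "Cache Hits") from `resolvectl statistics`
# text on stdin. Assumes "Label: number" lines, possibly indented.
stat_value() {
    awk -F': *' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }        # strip leading indentation
        $1 == key { print $2 + 0; exit }  # print the counter as a number
    '
}

# Print how the cache counters changed between two statistics snapshots.
cache_delta() {
    before=$1
    after=$2
    for key in "Cache Hits" "Cache Misses"; do
        b=$(printf '%s\n' "$before" | stat_value "$key")
        a=$(printf '%s\n' "$after" | stat_value "$key")
        echo "$key: $b -> $a (delta $((a - b)))"
    done
}

# Example (not run automatically):
#   before=$(incus exec caddy -- resolvectl statistics)
#   incus exec caddy -- resolvectl query pocket-id
#   after=$(incus exec caddy -- resolvectl statistics)
#   cache_delta "$before" "$after"
```

A rise in hits with no change in misses suggests the answer came from the cache rather than from a new upstream lookup. Other lookups on the instance during the same window also move these counters.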
Resolver configuration context
incus exec caddy -- resolvectl status
Current DNS Server: 192.168.1.2
DNS Domain: incus
(The upstream DNS server is outside the OVN network.)
Cache flush immediately resolves the issue
incus exec caddy -- resolvectl flush-caches
incus exec caddy -- ping -c3 pocket-id
PING pocket-id.incus (10.22.22.114) ...
64 bytes from 10.22.22.114: icmp_seq=1 ttl=64 time=1.92 ms
(This demonstrates direct causality: cached NXDOMAIN → failure; cache flush → resolution restored.)