LXD DNS seems to cache negative results after container restart

janxb · August 1, 2018, 1:59pm

Hi everyone,
since some days I have a strange issue with the LXD DNS server. But first, let me describe my use-case:

I have multiple containers, which are connected to the outside world via an HTTP proxy running NGINX. This proxy dynamically resolves the container hostnames without any caching. So when a container is stopped or started, this is instantly reflected in the web.

Every day, I stop each container, create a snapshot and start it back up. During this time, NGINX correctly responds with an error 502, because it can’t resolve the container hostnames while they are offline. The problem now is: After the container is started again, the LXD DNS still returns no result for the hostname. When I am quick enough, this is reproducible in my live setup. When I query the LXD DNS server, it does not return an IP address for appr. five minutes after container restart.

Could this be an issue in my individual setup, or was there something in the DNS handling changed in recent versions of LXD?

I am using the latest stable snap package on Ubuntu 18.04. Current package version is 8011.

simos · August 1, 2018, 2:34pm

This could be an issue with the dnsmasq configuration, however I was not able to reproduce.

Here is what a I tried:

$ lxc launch ubuntu:18.04 mydnstest
$ while true; do echo "====================================";date; host mydnstest.lxd 10.50.250.1; ping -W 1 -c 1 10.50.250.50; sleep 1; done

What I notice is that when I stop the container, the DNS server continues to respond back with the DNS values as if the container is still running.

I suppose you have configure nginx in the container (as reverse proxy). Did you put in there the hostname of the container or did you put the IP address? It could be an issue with nginx as reverse proxy that sees that an upstream website is not responding, therefore it disables it momentarily. Check for any option on heartbit like HAProxy has.

janxb · August 1, 2018, 6:14pm

I did a little more testing on this, here is what I found out:

The issue is not related to NGINX, or even to using another container for the DNS lookup. Here is the procedure I was able to use for reproducing the problem:

lxc launch ubuntu:18.04 c1
lxc snapshot c1 snap0
lxc delete c1/snap0

Monitor the DNS results using the following command (change the IP address to match your LXD bridge gateway)

watch -n1 dig c1.lxd @10.154.181.1

After deleting the snapshot in the last command, the IP address of the container disappears from the DNS. Only a container restart fixes this.

Edit: I was able to reproduce this issue on two different machines. I also created a new one on-the-fly and installed LXD from scratch. There i was not able to reproduce this. So right now I am still not sure what I could have done to cause this. I compared the output of “lxc network edit lxdbr0”, it is identical on the affected and the unaffected machine, no custom dnsmasq configuration.

simos · August 1, 2018, 6:49pm

I managed to replicate this (LXD 3.3 from snap), and indeed LXD’s dnsmasq loses the IP address of the container that goes its snapshot removed.

I noticed that when you remove a snapshot of a container, LXD restarts dnsmasq and the process gets a new PID. It looks weird that LXD would restart dnsmasq when merely a snapshot gets deleted and not the container itself. As is something went wrong and LXD would think that the container is gone.

You mention that your fresh LXD installation does not exhibit this issue. Can you check in that installation whether LXD respawns dnsmasq as soon as you delete the snapshot of the container?

I notice this line in my lxd.log:

lvl=dbug msg="Database error: &errors.errorString{s:\"sql: no rows in result set\"}"

Can you check whether you get this on your fresh installation, and whether you get this on your two existing installations of LXD that exhibit the issue?

janxb · August 1, 2018, 7:32pm

I made a mistake when trying to replicate this issue on the fresh machine. I used the pre-installed LXD 3.0.1 from Ubuntu. When switching to the LXD provided by the snap package, the problem was also existing on a fresh install. So as of now, all of the tested machines seem to contain this bug.

Below you can find the logs of snapshot creation and deletion.

Debug Logs

snapshot creation
-----------------
ip=@ lvl=dbug method=GET msg=handling t=2018-08-01T19:20:35+0000 url=/1.0
lvl=dbug msg="Disconnected event listener: 9d6cdf7b-81ba-4d0b-97ed-54b614430bd5" t=2018-08-01T19:20:35+0000
ip=@ lvl=dbug method=GET msg=handling t=2018-08-01T19:20:35+0000 url=/1.0/events
lvl=dbug msg="New event listener: 387e41f4-d4b3-479a-81c2-0232a0c5aa4c" t=2018-08-01T19:20:35+0000
ip=@ lvl=dbug method=POST msg=handling t=2018-08-01T19:20:35+0000 url=/1.0/containers/c1/snapshots
lvl=dbug msg="Initializing a BTRFS driver" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Mounting BTRFS storage volume for container \"c1\" on storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Mounting BTRFS storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Mounted BTRFS storage volume for container \"c1\" on storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="New task operation: 51e0e782-6bff-42cf-96f5-1f840b4922d2" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Started task operation: 51e0e782-6bff-42cf-96f5-1f840b4922d2" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Database error: &errors.errorString{s:\"sql: no rows in result set\"}" t=2018-08-01T19:20:35+0000
ip=@ lvl=dbug method=GET msg=handling t=2018-08-01T19:20:35+0000 url=/1.0/operations/51e0e782-6bff-42cf-96f5-1f840b4922d2
ephemeral=false lvl=info msg="Creating container" name=c1/snap0 t=2018-08-01T19:20:35+0000
lvl=dbug msg="Initializing a BTRFS driver" t=2018-08-01T19:20:35+0000
ephemeral=false lvl=info msg="Created container" name=c1/snap0 t=2018-08-01T19:20:35+0000
lvl=dbug msg="Creating BTRFS storage volume for snapshot \"c1\" on storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Mounting BTRFS storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Created BTRFS storage volume for snapshot \"c1\" on storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Initializing BTRFS storage volume for snapshot \"c1/snap0\" on storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Mounting BTRFS storage pool \"default\"" t=2018-08-01T19:20:35+0000
lvl=dbug msg="Initialized BTRFS storage volume for snapshot \"c1/snap0\" on storage pool \"default\"" t=2018-08-01T19:20:36+0000
lvl=dbug msg="Stopping BTRFS storage volume for snapshot \"c1/snap0\" on storage pool \"default\"" t=2018-08-01T19:20:36+0000
lvl=dbug msg="Mounting BTRFS storage pool \"default\"" t=2018-08-01T19:20:36+0000
lvl=dbug msg="Stopped BTRFS storage volume for snapshot \"c1/snap0\" on storage pool \"default\"" t=2018-08-01T19:20:36+0000
lvl=dbug msg="Success for task operation: 51e0e782-6bff-42cf-96f5-1f840b4922d2" t=2018-08-01T19:20:36+0000

snapshot deletion
-----------------
ip=@ lvl=dbug method=GET msg=handling t=2018-08-01T19:21:14+0000 url=/1.0
lvl=dbug msg="Disconnected event listener: 387e41f4-d4b3-479a-81c2-0232a0c5aa4c" t=2018-08-01T19:21:14+0000
ip=@ lvl=dbug method=GET msg=handling t=2018-08-01T19:21:14+0000 url=/1.0/events
lvl=dbug msg="New event listener: 2239bcb0-5046-45a4-a451-0b177890361b" t=2018-08-01T19:21:14+0000
ip=@ lvl=dbug method=DELETE msg=handling t=2018-08-01T19:21:14+0000 url=/1.0/containers/c1/snapshots/snap0
lvl=dbug msg="New task operation: 93d8fd18-0695-45e7-abe5-5d251e8cf7c9" t=2018-08-01T19:21:14+0000
lvl=dbug msg="Started task operation: 93d8fd18-0695-45e7-abe5-5d251e8cf7c9" t=2018-08-01T19:21:14+0000
created=2018-08-01T19:20:35+0000 ephemeral=false lvl=info msg="Deleting container" name=c1/snap0 t=2018-08-01T19:21:14+0000 used=1970-01-01T00:00:00+0000
ip=@ lvl=dbug method=GET msg=handling t=2018-08-01T19:21:14+0000 url=/1.0/operations/93d8fd18-0695-45e7-abe5-5d251e8cf7c9
lvl=dbug msg="Initializing a BTRFS driver" t=2018-08-01T19:21:14+0000
lvl=dbug msg="Deleting BTRFS storage volume for snapshot \"c1/snap0\" on storage pool \"default\"" t=2018-08-01T19:21:14+0000
lvl=dbug msg="Mounting BTRFS storage pool \"default\"" t=2018-08-01T19:21:14+0000
lvl=dbug msg="Deleted BTRFS storage volume for snapshot \"c1/snap0\" on storage pool \"default\"" t=2018-08-01T19:21:15+0000
created=2018-08-01T19:20:35+0000 ephemeral=false lvl=info msg="Deleted container" name=c1/snap0 t=2018-08-01T19:21:15+0000 used=1970-01-01T00:00:00+0000
lvl=dbug msg="Success for task operation: 93d8fd18-0695-45e7-abe5-5d251e8cf7c9" t=2018-08-01T19:21:15+0000

Edit: To directly answer your questions:

Yes, dnsmasq is restarted after the snapshot deletion, based on the log message I receive in syslog.
Yes, as you can see in the attached logs, I have this database error in my logs, too.

simos · August 1, 2018, 8:41pm

Thanks!

Can you file an issue at https://github.com/lxc/lxd/issues about this?

We have reproduced this in several installations of LXD 3.x.

I also tried in LXD 2.0 and the bug is not present there.

$ lxc launch ubuntu:16.04 snapbug --config security.nesting=true
$ lxc exec snapbug -- sudo --user ubuntu --login
ubuntu@snapp:~$ sudo lxd init
...
ubuntu@snapp:~$ lxc launch images:alpine/3.6 myalpine
...
# now check that it has an IP, and also check the PID of dnsmasq.
ubuntu@snapp:~$ lxc snapshot myalpine snap0
ubuntu@snapp:~$ lxc delete myalpine/snap0

# The container still has an IP address and dnsmasq was not spawned.

stgraber · August 1, 2018, 8:54pm

Doh, I suspect the fix will be quite easy, looking at that now.

stgraber · August 1, 2018, 9:01pm

Got a fix ready, will send it for review in a bit.

janxb · August 2, 2018, 6:47am

If you still wanted me to create an issue after the comments from @stgraber, here it is: https://github.com/lxc/lxd/issues/4866

simos · August 2, 2018, 7:34am

This is great. @stgraber created the pull request at https://github.com/lxc/lxd/pull/4863
and will match it to the issue you just created.

janxb · August 7, 2018, 9:54am

Hi everyone,
I don’t want to be that guy who’s always asking for an ETA, but do we have to wait for another big LXD release, or will you push a new snap package version which includes the fix for this issue?

stgraber · August 7, 2018, 2:37pm

I intend to cherry-pick this fix soon, but we’ve had some major reshuffling of database code happening over the past week which will need to be included at the same time and I wanted more testing on that code before we roll it out to all our stable users.