Dnsmasq die some time

I am running snap lxd cluster mode on ubuntu 18.04. Usually 1 day or less, dnsmasq on some nodes may die, so the containers on the host has no ipv4 address. After snap restart lxd, the dnsmasq back and it continue to work.

Since lxd starts the dnsmasq, when it’s gone, should lxd take some actions? like restart it? or log? dnsmasq log by default is in /var/log, from the log I can not see any clue why it’s end.

You should check jourjournalctl -u snap.lxd.daemon.servicenalctl for logs or /var/snap/lxd/common/lxd, /var/log does not have much logging nowadays.

I’m running into the same issue, where dnsmasq dies and I can’t find anything in the log to point to why it’s happening.

lxd 3.12, installed via snap
os: Ubuntu 16.04.6 (fresh install)

journalctl -u snap.lxd.daemon.servicenalctl returns “-- No entries --”

None of the logs under /var/snap/lxd/common/lxd have any messages referring to dnsmasq. /var/log/syslog has some logs, but nothing about the crash.

May 13 14:42:58 osmr5 dnsmasq[8497]: started, version 2.75 cachesize 150
May 13 14:42:58 osmr5 dnsmasq[8497]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify
May 13 14:42:58 osmr5 dnsmasq-dhcp[8497]: DHCP, IP range 10.249.75.2 – 10.249.75.254, lease time 1h
May 13 14:42:58 osmr5 dnsmasq-dhcp[8497]: DHCP, sockets bound exclusively to interface lxdbr0
May 13 14:42:58 osmr5 dnsmasq[8497]: using local addresses only for domain lxd
May 13 14:42:58 osmr5 dnsmasq[8497]: reading /etc/resolv.conf
May 13 14:42:58 osmr5 dnsmasq[8497]: using local addresses only for domain lxd
May 13 14:42:58 osmr5 dnsmasq[8497]: using nameserver 10.244.192.1#53
May 13 14:42:58 osmr5 dnsmasq[8497]: using nameserver 10.208.0.2#53
May 13 14:42:58 osmr5 dnsmasq[8497]: read /etc/hosts - 7 addresses
May 13 14:42:58 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-bionic-base
May 13 14:42:58 osmr5 dnsmasq[8497]: read /etc/hosts - 7 addresses
May 13 14:42:58 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-bionic-base
May 13 14:42:59 osmr5 dnsmasq[8497]: read /etc/hosts - 7 addresses
May 13 14:43:09 osmr5 dnsmasq[8497]: read /etc/hosts - 7 addresses
May 13 14:43:09 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-2810aa-0
May 13 17:41:40 osmr5 dnsmasq[8497]: read /etc/hosts - 7 addresses
May 13 17:41:40 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-fa0336-0
May 13 17:41:40 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-2810aa-0
May 13 17:42:06 osmr5 dnsmasq[8497]: read /etc/hosts - 7 addresses
May 13 17:42:06 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-fa0336-1
May 13 17:42:06 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-fa0336-0
May 13 17:42:06 osmr5 dnsmasq-dhcp[8497]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-2810aa-0

I’m seeing dnsmasq crash within minutes or hours of a snap restart lxd.

The only thing I can think of readily is dnsmasq thinks there’s a nameserver at 10.208.0.2, but there’s not. I’m not sure why it thinks there is, either. That network (OpenStack) shouldn’t be returning anything for DNS. OpenStack was returning that as a default, because I didn’t specify a DNS server. I’ve remedied that, so only valid DNS is returned.

well, that is happening for me too, however when I remove the ‘nalctl’ at the end it works better :slight_smile:

So it does! Here is what it looks like after a restart until dnsmasq dies:

May 13 21:37:13 osmr5 systemd[1]: Started Service for snap application lxd.daemon.
May 13 21:37:13 osmr5 lxd.daemon[28159]: => Preparing the system
May 13 21:37:13 osmr5 lxd.daemon[28159]: ==> Loading snap configuration
May 13 21:37:13 osmr5 lxd.daemon[28159]: ==> Setting up mntns symlink (mnt:[4026532166])
May 13 21:37:13 osmr5 lxd.daemon[28159]: ==> Setting up kmod wrapper
May 13 21:37:13 osmr5 lxd.daemon[28159]: ==> Preparing /boot
May 13 21:37:13 osmr5 lxd.daemon[28159]: ==> Preparing a clean copy of /run
May 13 21:37:13 osmr5 lxd.daemon[28159]: ==> Preparing a clean copy of /etc
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Setting up ceph configuration
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Setting up LVM configuration
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Rotating logs
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Setting up ZFS (0.6)
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Escaping the systemd cgroups
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Escaping the systemd process resource limits
May 13 21:37:14 osmr5 lxd.daemon[28159]: ==> Disabling shiftfs on this kernel (auto)
May 13 21:37:14 osmr5 lxd.daemon[28159]: => Starting LXCFS
May 13 21:37:14 osmr5 lxd.daemon[28159]: mount namespace: 6
May 13 21:37:14 osmr5 lxd.daemon[28159]: hierarchies:
May 13 21:37:14 osmr5 lxd.daemon[28159]: 0: fd: 7: freezer
May 13 21:37:14 osmr5 lxd.daemon[28159]: 1: fd: 8: hugetlb
May 13 21:37:14 osmr5 lxd.daemon[28159]: 2: fd: 9: devices
May 13 21:37:14 osmr5 lxd.daemon[28159]: 3: fd: 10: cpu,cpuacct
May 13 21:37:14 osmr5 lxd.daemon[28159]: 4: fd: 11: perf_event
May 13 21:37:14 osmr5 lxd.daemon[28159]: 5: fd: 12: blkio
May 13 21:37:14 osmr5 lxd.daemon[28159]: 6: fd: 13: cpuset
May 13 21:37:14 osmr5 lxd.daemon[28159]: 7: fd: 14: net_cls,net_prio
May 13 21:37:14 osmr5 lxd.daemon[28159]: 8: fd: 15: pids
May 13 21:37:14 osmr5 lxd.daemon[28159]: 9: fd: 16: memory
May 13 21:37:14 osmr5 lxd.daemon[28159]: 10: fd: 17: name=systemd
May 13 21:37:14 osmr5 lxd.daemon[28159]: => Starting LXD
May 13 21:37:14 osmr5 lxd.daemon[28159]: t=2019-05-13T21:37:14+0000 lvl=warn msg=“CGroup memory swap accounting is disabled, swap limits will be ignored.”
May 13 21:37:16 osmr5 lxd.daemon[28159]: => LXD is ready
May 13 21:38:30 osmr5 lxd.daemon[28159]: No such file or directory - Failed to receive file descriptor
May 13 22:15:54 osmr5 lxd.daemon[28159]: No such file or directory - Failed to receive file descriptor
May 13 22:16:03 osmr5 lxd.daemon[28159]: No such file or directory - Failed to receive file descriptor

At this point, dnsmasq has done away and I have to restart lxd.

Hmm, anything in /var/log/syslog? dnsmasq doesn’t usually write stuff to the LXD log but is somewhat chatty in syslog.

Nothing of interest, unfortunately. I’m happy to bump up verbosity, if there’s a way to do so.

May 13 21:38:32 osmr5 dnsmasq[29787]: started, version 2.75 cachesize 150
May 13 21:38:32 osmr5 dnsmasq[29787]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify
May 13 21:38:32 osmr5 dnsmasq-dhcp[29787]: DHCP, IP range 10.249.75.2 – 10.249.75.254, lease time 1h
May 13 21:38:32 osmr5 dnsmasq-dhcp[29787]: DHCP, sockets bound exclusively to interface lxdbr0
May 13 21:38:32 osmr5 dnsmasq[29787]: using local addresses only for domain lxd
May 13 21:38:32 osmr5 dnsmasq[29787]: reading /etc/resolv.conf
May 13 21:38:32 osmr5 dnsmasq[29787]: using local addresses only for domain lxd
May 13 21:38:32 osmr5 dnsmasq[29787]: using nameserver 10.244.192.1#53
May 13 21:38:32 osmr5 dnsmasq[29787]: read /etc/hosts - 8 addresses
May 13 21:38:32 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-13a35e-0
May 13 21:38:32 osmr5 dnsmasq[29787]: read /etc/hosts - 8 addresses
May 13 21:38:32 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-13a35e-0
May 13 21:38:32 osmr5 dnsmasq[29787]: read /etc/hosts - 8 addresses
May 13 21:39:37 osmr5 dnsmasq[29787]: read /etc/hosts - 8 addresses
May 13 21:39:37 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-59feb8-0
May 13 22:05:49 osmr5 dnsmasq[29787]: read /etc/hosts - 8 addresses
May 13 22:05:49 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-fb367f-0
May 13 22:05:49 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-59feb8-0
May 13 22:06:10 osmr5 dnsmasq[29787]: read /etc/hosts - 8 addresses
May 13 22:06:10 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-fb367f-1
May 13 22:06:10 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-fb367f-0
May 13 22:06:10 osmr5 dnsmasq-dhcp[29787]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/juju-59feb8-0

Each time I’ve caught this, it was while deploying charms to Juju (2.5.7) but I don’t see anything in the debug-log that might cast blame on it.

Do you see report of a crash/segfault for dnsmasq in dmesg?

Nope, it’s clean as far as I can tell.

Hmm, that’s a bit frustrating… If you can easily reproduce it, maybe try to strace -p PID the dnsmasq process and see what happens when it dies? Not sure it’s going to provide very useful output but we can always hope :slight_smile:

uh, I have never seen this message in my logs. Looking at the lxd sources it’s easy to find it unfortunatly it’s in a part of the code that is commented out, so it don’t help much ! Look at syslog then as said by @stgraber

Here’s the last lines from strace before it crashed:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 8, -1) = 1 ([{fd=6, revents=POLLIN}])
recvmsg(6, {msg_name(16)={sa_family=AF_INET, sin_port=htons(51090), sin_addr=inet_addr(“10.249.75.100”)}, msg_iov(1)=[{“B\271\1\0\0\1\0\0\0\0\0\0\003100\00275\003249\00210\7in-ad”…, 4096}], msg_controllen=0, msg_flags=0}, 0) = 44
sendmsg(6, {msg_name(16)={sa_family=AF_INET, sin_port=htons(51090), sin_addr=inet_addr(“10.249.75.100”)}, msg_iov(1)=[{“B\271\205\200\0\1\0\1\0\0\0\0\003100\00275\003249\00210\7in-ad”…, 75}], msg_controllen=0, msg_flags=0}, 0) = 75
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 8, -1
^[[A) = 1 ([{fd=5, revents=POLLIN}])
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{“t\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n@\0\0\376\2\0\1\20\0\0\0\10\0\17\0”…, 1368}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 116
recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{“t\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n@\0\0\376\2\0\1\20\0\0\0\10\0\17\0”…, 1368}], msg_controllen=0, msg_flags=0}, 0) = 116
fcntl(5, F_SETFL, O_RDWR) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 8, -1) = 1 ([{fd=5, revents=POLLIN}])
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{“t\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\10\0\0\377\3\0\1\20\0\0\0\10\0\17\0”…, 1368}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 116
recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{“t\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\10\0\0\377\3\0\1\20\0\0\0\10\0\17\0”…, 1368}], msg_controllen=0, msg_flags=0}, 0) = 116
fcntl(5, F_SETFL, O_RDWR) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 8, -1) = 1 ([{fd=5, revents=POLLIN}])
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{“t\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\200\0\0\377\0\0\2\0\0\0\0\10\0\17\0”…, 1368}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 116
recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{“t\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\200\0\0\377\0\0\2\0\0\0\0\10\0\17\0”…, 1368}], msg_controllen=0, msg_flags=0}, 0) = 116
fcntl(5, F_SETFL, O_RDWR) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 8, -1 <unfinished …>
+++ killed by SIGKILL +++

https://github.com/lxc/lxd/pull/5757 may help with that