DNS resolution no longer works; dnsmasq uses 100% CPU

Bug #1314697 reported by Adam Smith
90
This bug affects 17 people
Affects Status Importance Assigned to Milestone
dnsmasq (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

Upgraded from Trusty latest to Utopic latest, and rebooted.

DNS resolution no longer worked. I.e. ping google.com failed, ping 8.8.8.8 suceeded.

dnsmasq is running, and port 53 is open. Turned on log queries and log dhcp then attempted to ping google. Extract from syslog is attached.

I have now worked around this by adding "nameserver 8.8.8.8" to /etc/resolv.conf. I am aware this is not a sustainable workaround as this file is automatically generated.

ProblemType: Bug
DistroRelease: Ubuntu 14.10
Package: dnsmasq 2.70-1 [modified: etc/dnsmasq.conf]
ProcVersionSignature: Ubuntu 3.13.0-24.46-generic 3.13.9
Uname: Linux 3.13.0-24-generic x86_64
ApportVersion: 2.14.1-0ubuntu3
Architecture: amd64
Date: Wed Apr 30 16:52:25 2014
InstallationDate: Installed on 2011-12-04 (878 days ago)
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64 (20111011)
PackageArchitecture: all
ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/zsh
SourcePackage: dnsmasq
SystemImageInfo: Error: [Errno 2] No such file or directory: 'system-image-cli'
UpgradeStatus: Upgraded to utopic on 2014-01-31 (88 days ago)
mtime.conffile..etc.dnsmasq.conf: 2014-04-30T16:46:18.803393

Revision history for this message
Adam Smith (adam-disc0tech) wrote :
Revision history for this message
Adam Smith (adam-disc0tech) wrote :
Revision history for this message
Adam Smith (adam-disc0tech) wrote :

Also note that dnsmasq is running under user=nobody and running at 99-100% CPU.

Revision history for this message
Thomas Hood (jdthood) wrote :

Assuming you are using NetworkManager, a better way to work around the problem than editing resolv.conf by hand is to edit /etc/NetworkManager/NetworkManager.conf and comment out the line `dns=dnsmasq`, save, and then `restart network-manager`. This will cause NetworkManager not to use a slave instance of dnsmasq as a local forwarding nameserver.

Revision history for this message
Adam Smith (adam-disc0tech) wrote :

Thanks for the workaround Thomas.

I've checked a few more things, firstly I found this in syslog at startup, missing from my earlier attachment:

Apr 30 18:57:06 donkey NetworkManager[1162]: <info> DNS: starting dnsmasq...
Apr 30 18:57:06 donkey NetworkManager[1162]: <warn> dnsmasq not available on the bus, can't update servers.
Apr 30 18:57:06 donkey NetworkManager[1162]: <error> [1398880626.164382] [nm-dns-dnsmasq.c:396] update(): dnsmasq owner not found on bus: Could not get owner of name 'org.freedesktop.NetworkManager.dnsmasq': no such name
Apr 30 18:57:06 donkey NetworkManager[1162]: <warn> DNS: plugin dnsmasq update failed

I've also noticed a that there is no -u arg passed to the service on startup, and that the --conf-file arg points to /var/lib/libvirt/dnsmasq/default.conf - which is an empty file.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dnsmasq (Ubuntu):
status: New → Confirmed
Revision history for this message
Colin Ian King (colin-king) wrote :

I'm seeing this too, strace show it spinning on:

select(8, [0 3 6 7], [], [6], NULL) = 1 (in [0])
recvmsg(0, 0x7fffdb2aa6d0, 0) = -1 ENOTSOCK (Socket operation on non-socket)
accept(0, 0, NULL) = -1 ENOTSOCK (Socket operation on non-socket)
select(8, [0 3 6 7], [], [6], NULL) = 1 (in [0])
recvmsg(0, 0x7fffdb2aa6d0, 0) = -1 ENOTSOCK (Socket operation on non-socket)
accept(0, 0, NULL) = -1 ENOTSOCK (Socket operation on non-socket)

So we're getting ENOTSOCK on fd 0, which is /dev/null :

Revision history for this message
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 1314697] Re: Upon upgrade, DNS resolution no longer works

On 01/05/14 07:45, Colin King wrote:
> I'm seeing this too, strace show it spinning on:
>
> select(8, [0 3 6 7], [], [6], NULL) = 1 (in [0])
> recvmsg(0, 0x7fffdb2aa6d0, 0) = -1 ENOTSOCK (Socket operation on non-socket)
> accept(0, 0, NULL) = -1 ENOTSOCK (Socket operation on non-socket)
> select(8, [0 3 6 7], [], [6], NULL) = 1 (in [0])
> recvmsg(0, 0x7fffdb2aa6d0, 0) = -1 ENOTSOCK (Socket operation on non-socket)
> accept(0, 0, NULL) = -1 ENOTSOCK (Socket operation on non-socket)
>
> So we're getting ENOTSOCK on fd 0, which is /dev/null :
>

It would be really useful to get an strace of dnsmasq from startup to
when it's in this state, since I've no yet been able to reproduce this.
Also the output of lsof | grep dnsmasq

The bug looks superficially like an earlier one which was triggered by
starting dnsmasq without stdout/stdin/stderr (ie fd 0,1,2). Sockets
would be opened which used these fds, and subsequently the "connect std*
to /dev/null" code would trash them. However that doesn't explain why
both the UDP and TCP listening sockets seem to be zero here.

Cheers,

Simon.

Revision history for this message
Adam Smith (adam-disc0tech) wrote : Re: Upon upgrade, DNS resolution no longer works

LSOF output below. I tried to put a strace in init.d but failed miserably

lsof | grep dnsmasq

dnsmasq 1430 dnsmasq cwd unknown /proc/1430/cwd (readlink: Permission denied)
dnsmasq 1430 dnsmasq rtd unknown /proc/1430/root (readlink: Permission denied)
dnsmasq 1430 dnsmasq txt unknown /proc/1430/exe (readlink: Permission denied)
dnsmasq 1430 dnsmasq NOFD /proc/1430/fd (opendir: Permission denied)
dnsmasq 2066 nobody cwd unknown /proc/2066/cwd (readlink: Permission denied)
dnsmasq 2066 nobody rtd unknown /proc/2066/root (readlink: Permission denied)
dnsmasq 2066 nobody txt unknown /proc/2066/exe (readlink: Permission denied)
dnsmasq 2066 nobody NOFD /proc/2066/fd (opendir: Permission denied)
dnsmasq 3176 libvirt-dnsmasq cwd unknown /proc/3176/cwd (readlink: Permission denied)
dnsmasq 3176 libvirt-dnsmasq rtd unknown /proc/3176/root (readlink: Permission denied)
dnsmasq 3176 libvirt-dnsmasq txt unknown /proc/3176/exe (readlink: Permission denied)
dnsmasq 3176 libvirt-dnsmasq NOFD /proc/3176/fd (opendir: Permission denied)

Revision history for this message
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 1314697] Re: Upon upgrade, DNS resolution no longer works

On 02/05/14 12:00, Adam Smith wrote:
> LSOF output below. I tried to put a strace in init.d but failed
> miserably
>
> lsof | grep dnsmasq
>
> dnsmasq 1430 dnsmasq cwd unknown /proc/1430/cwd (readlink: Permission denied)
> dnsmasq 1430 dnsmasq rtd unknown /proc/1430/root (readlink: Permission denied)
> dnsmasq 1430 dnsmasq txt unknown /proc/1430/exe (readlink: Permission denied)
> dnsmasq 1430 dnsmasq NOFD /proc/1430/fd (opendir: Permission denied)
> dnsmasq 2066 nobody cwd unknown /proc/2066/cwd (readlink: Permission denied)
> dnsmasq 2066 nobody rtd unknown /proc/2066/root (readlink: Permission denied)
> dnsmasq 2066 nobody txt unknown /proc/2066/exe (readlink: Permission denied)
> dnsmasq 2066 nobody NOFD /proc/2066/fd (opendir: Permission denied)
> dnsmasq 3176 libvirt-dnsmasq cwd unknown /proc/3176/cwd (readlink: Permission denied)
> dnsmasq 3176 libvirt-dnsmasq rtd unknown /proc/3176/root (readlink: Permission denied)
> dnsmasq 3176 libvirt-dnsmasq txt unknown /proc/3176/exe (readlink: Permission denied)
> dnsmasq 3176 libvirt-dnsmasq NOFD /proc/3176/fd (opendir: Permission denied)
>

It looks like that was done as an ordinary user. Please could you repeat
as root

sudo lsof | grep dnsmasq.

I note that there's two instances of dnsmasq here, one started by
libvirt (?) and one, I guess, started by network-manager. I wonder if
some interaction between the two is what's causing the problem. Which
one is the which is spinning? Can you disable the libvirt stuff and see
if that makes a difference?

Cheers,

Simon.

Revision history for this message
Chris Halse Rogers (raof) wrote : Re: Upon upgrade, DNS resolution no longer works

Observations:

This is not dependent on multiple instances of dnsmasq running - I've disabled the libvirt and lxc-net instances, and the network-manager dnsmasq still spins at 100% CPU and fails to resolve DNS queries.

Attached is the strace output from boot of the dnsmasq instance spawned by network-manager

Revision history for this message
Chris Halse Rogers (raof) wrote :

And, for good measure, the lsof output.

Thomas Hood (jdthood)
summary: - Upon upgrade, DNS resolution no longer works
+ Upon upgrade, DNS resolution no longer works; dnsmasq uses 100% CPU
Revision history for this message
James Hunt (jamesodhunt) wrote : Re: Upon upgrade, DNS resolution no longer works; dnsmasq uses 100% CPU

Attached is a full strace from dnsmasq startup showing the problem.

summary: - Upon upgrade, DNS resolution no longer works; dnsmasq uses 100% CPU
+ DNS resolution no longer works; dnsmasq uses 100% CPU
Revision history for this message
James Hunt (jamesodhunt) wrote :

Updated description since problem is unrelated to upgrade - it just occurs with the version of dnsmasq currently in utopic.

Revision history for this message
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 1314697] Re: DNS resolution no longer works; dnsmasq uses 100% CPU

Annoyingly, I still can't reproduce this on the systems I have
available. On a system where the problem occurs, can it be reproduced
when dnsmasq is started standalone with the same command-line
parameters? The idea situation would be to get the bug to show up in a
dnsmasq instance running under gdb.

The strace gives lots of valuable information:

during inialisation in network.c, create_bound_listeners() calls
create_listeners() once. The creates a UDP socket and a TCP socket each
bound to 127.0.1.1:53, with file-descriptors 4 and 5. Those file
descriptors are stored in a struct listener object which will be the
only one in a chain pointed to by daemon->listeners (or
dnsmasq_daemon->listeners, in gdb). The file descriptors are stored in
the ->fd and ->tcpfd fields of the struct.

By the time dnsmasq gets to the select loop in dnsmasq.c, those two
fields have somehow been zeroed - that's enough to exactly match what's
in the strace. dnsmasq selects for read events on fd 0 instead of fd 4
and fd 5 and when select says that reading is OK, is makes the syscalls
is would make for fd 4 (recvfrom) and fd 5 (accept) but to fd 0 instead.

If someone can reproduce this in gdb I suggest doing the following.

1) Set a breakpoint in create_bound_listeners() and trace through until
dnsmasq_daemon->listeners->fd and dnsmasq_daemon->listeners->tcpfd have
the correct values (4 and 5)

2) Set watchpoints on those two expressions, and then continue
execution. Gdb should then tell us where those locations are being
overwritten.

Cheers,

Simon.

Revision history for this message
James Hunt (jamesodhunt) wrote :

A bit of debugging shows that the culprit is blockdata_expand() which is being called via blockdata_init(). The issue seems to be that blockdata_expand() is passed a parameter of zero. That function then mallocs zero bytes (successfully seemingly), the proceeds to overwrite data before the returned address resulting the the 2 fds being set to zero.

blockdata_expand() is passed zero since daemon->cachesize (aka dnsmasq_daemon->cachesize) is zero. This is confirmed by looking at syslog which shows:

May 8 21:56:54 utopic dnsmasq[10812]: started, version 2.70 cache disabled

Revision history for this message
James Hunt (jamesodhunt) wrote :

BTW - the problem is recreatable for me every time simply by spinning up a utopic kvm instance.

Revision history for this message
Simon Kelley (simon-thekelleys) wrote :

On 08/05/14 22:18, James Hunt wrote:
> A bit of debugging shows that the culprit is blockdata_expand() which is
> being called via blockdata_init(). The issue seems to be that
> blockdata_expand() is passed a parameter of zero. That function then
> mallocs zero bytes (successfully seemingly), the proceeds to overwrite
> data before the returned address resulting the the 2 fds being set to
> zero.
>
> blockdata_expand() is passed zero since daemon->cachesize (aka
> dnsmasq_daemon->cachesize) is zero. This is confirmed by looking at
> syslog which shows:
>
> May 8 21:56:54 utopic dnsmasq[10812]: started, version 2.70 cache
> disabled
>

Excellent. Many thanks for doing that. I've pushed the fix to the git
repo and a test release, 2.71test2.

I'm minded to make a 2.71 release (which has this and a few other
bugfixes) in the next couple of days.

> BTW - the problem is recreatable for me every time simply by spinning
> up a utopic kvm instance.

I'm on the end of a dodgy 3G connection that won't support a netinstall
or image download in reasonable time/cost.

Cheers,

Simon.

Revision history for this message
Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1314697

tags: added: iso-testing
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package dnsmasq - 2.71-1

---------------
dnsmasq (2.71-1) unstable; urgency=low

   * New upstream.
   * Fix 100% CPU-usage bug when dnsmasq started with cachesize
     set to zero. (LP: #1314697)

 -- Simon Kelley <email address hidden> Fri, 16 May 2014 20:17:10 +0000

Changed in dnsmasq (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.