Merge ~blake-rouse/maas:improved-dns-reload into maas:master

Proposed by Blake Rouse on 2017-08-18
Status: Merged
Approved by: Mike Pontillo on 2017-08-19
Approved revision: 83d0b4200ba97eacc5f56b1c26a97928172bf959
Merge reported by: MAAS Lander
Merged at revision: not available
Proposed branch: ~blake-rouse/maas:improved-dns-reload
Merge into: maas:master
Diff against target: 332 lines (+196/-5)
4 files modified
src/maasserver/dns/config.py (+6/-0)
src/maasserver/dns/tests/test_config.py (+19/-0)
src/maasserver/region_controller.py (+46/-3)
src/maasserver/tests/test_region_controller.py (+125/-2)
Reviewer Review Type Date Requested Status
Mike Pontillo (community) 2017-08-18 Approve on 2017-08-18
Review via email: mp+329260@code.launchpad.net

Commit message

Fixes LP: #1710308, #1710278 - After BIND reload verify that the serial for all authoritative domains the new value before continuing.

To post a comment you must log in.
Mike Pontillo (mpontillo) wrote :

Looks good so far; I have some comments and questions inline below.

review: Needs Information
Blake Rouse (blake-rouse) :
Mike Pontillo (mpontillo) wrote :

Thanks for the replies. Sounds good.

review: Approve
Felipe Reyes (freyes) wrote :

Hello,

I'm trying to understand how this patch "throttles bind9 reloads", if I'm reading this correctly, this will only check (after rndc reload is ran) that the reload succeeded querying and verifying the serial is the expected one.

If there is a storm of dhcp requests and produces tenths of events (e.g. commissioning many nodes at the same time, or a rack coming back from a power failure), bind9 will still be reloaded for everyone of those events even if they are within the same second.

Best,

Mike Pontillo (mpontillo) wrote :

@freyes, it was my understanding after looking at the code with Blake that the surrounding code in MAAS already prevents concurrent reloads. (That is, the `rdnc` command sends a message to the BIND process, the BIND process asynchronously processes it, and control is returned to MAAS almost immediately.) This patch just ensures that the next reload starts after the previous reload has completed.

Felipe Reyes (freyes) wrote :

there are no concurrent reloads, but there is no throttling, e.g. if 2 events within the same second arrive, maas will run 'rndc reload' twice, while it could just run it once at the end of processing of the second event (the last that finishes it task), that's what it's suggested in the bug description:

""" - Throttle consecutive updates. (For example, when the first DNS change occurs, schedule a task for a few seconds in the future which will update DNS, and don't ever schedule more additional updates while that task is pending.)"""

Mike Pontillo (mpontillo) wrote :

That's true. My bug description did overstep a little bit in that it described how I would have personally fixed the issue. Since this patch doesn't address that aspect of the solution, I'll probably need to change the title/description and file another bug for the "excessive reloads" aspect.

Mike Pontillo (mpontillo) wrote :

I've filed bug #1712205 to address the idea that we should batch multiple consecutive DNS changes into a single reload.

Andres Rodriguez (andreserl) wrote :

We should double check, but I thought Maas already did that (before this
bugfix?)

On Mon, Aug 21, 2017 at 5:36 PM Mike Pontillo <email address hidden>
wrote:

> I've filed bug #1712205 to address the idea that we should batch multiple
> consecutive DNS changes into a single reload.
> --
> https://code.launchpad.net/~blake-rouse/maas/+git/maas/+merge/329260
> Your team MAAS Committers is subscribed to branch maas:master.
>
--
Andres Rodriguez
Engineering Manager, MAAS
Canonical USA, Inc.

Mike Pontillo (mpontillo) wrote :

What MAAS already did was to prevent **concurrent outstanding reload requests**. That is, if the "rndc reload" command was already being executed and another DNS change came in, MAAS would (even before this branch) not try to issue *another* reload at the same time. However, the "rndc reload" returns so quickly (as it asynchronously causes BIND to start reloading) that we could still effectively have two concurrent reloads prior to this branch landing.

Mike Pontillo (mpontillo) wrote :

After discussing this again, Blake reminded me that there is an implicit throttle here when we pause for a second to await the DNS query timeout while we check, then pause for two seconds while we wait for the zone to reload.

So this branch *will* implicitly thottle reload requests, unless BIND reloads extremely fast and MAAS checks and confirms the zone before any of the pauses occur.

Felipe Reyes (freyes) wrote :

Mike, thanks for addressing my comments, and specially for filing another bug for the throttling part. I will keep an eye on that one ;)

Best,

Blake Rouse (blake-rouse) wrote :

Felipe,

The throttling bug has been changed, as its not a throttling issue. My branch introduces throttling based on how the DNS service handles the culling of updates.

But I am looking into improving the DNS triggers so less DNS updates are performed.

Felipe Reyes (freyes) wrote :

Yes, I got it ;)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
diff --git a/src/maasserver/dns/config.py b/src/maasserver/dns/config.py
index 855cc47..e27773a 100644
--- a/src/maasserver/dns/config.py
+++ b/src/maasserver/dns/config.py
@@ -88,6 +88,12 @@ def dns_update_all_zones(reload_retry=False):
88 else:88 else:
89 bind_reload()89 bind_reload()
9090
91 # Return the current serial and list of domain names.
92 return serial, [
93 domain.name
94 for domain in domains
95 ]
96
9197
92def get_upstream_dns():98def get_upstream_dns():
93 """Return the IP addresses of configured upstream DNS servers.99 """Return the IP addresses of configured upstream DNS servers.
diff --git a/src/maasserver/dns/tests/test_config.py b/src/maasserver/dns/tests/test_config.py
index 029e874..c163420 100644
--- a/src/maasserver/dns/tests/test_config.py
+++ b/src/maasserver/dns/tests/test_config.py
@@ -53,6 +53,7 @@ from testtools.matchers import (
53 Contains,53 Contains,
54 Equals,54 Equals,
55 FileContains,55 FileContains,
56 MatchesSetwise,
56 MatchesStructure,57 MatchesStructure,
57)58)
5859
@@ -302,6 +303,24 @@ class TestDNSConfigModifications(TestDNSServer):
302 port=self.bind.config.port, commands=[ns_record, '+short'])303 port=self.bind.config.port, commands=[ns_record, '+short'])
303 self.assertEqual(ip, ip_of_ns_record)304 self.assertEqual(ip, ip_of_ns_record)
304305
306 def test_dns_update_all_zones_returns_serial_and_domains(self):
307 self.patch(settings, 'DNS_CONNECT', True)
308 domain = factory.make_Domain()
309 # These domains should not show up. Just to test we create them.
310 for _ in range(3):
311 factory.make_Domain(authoritative=False)
312 node, static = self.create_node_with_static_ip(domain=domain)
313 fake_serial = random.randint(1, 1000)
314 self.patch(
315 dns_config_module,
316 "current_zone_serial").return_value = fake_serial
317 serial, domains = dns_update_all_zones()
318 self.assertThat(serial, Equals(fake_serial))
319 self.assertThat(domains, MatchesSetwise(*[
320 Equals(domain.name)
321 for domain in Domain.objects.filter(authoritative=True)
322 ]))
323
305324
306class TestDNSDynamicIPAddresses(TestDNSServer):325class TestDNSDynamicIPAddresses(TestDNSServer):
307 """Allocated nodes with IP addresses in the dynamic range get a DNS326 """Allocated nodes with IP addresses in the dynamic range get a DNS
diff --git a/src/maasserver/region_controller.py b/src/maasserver/region_controller.py
index f3a51f3..0f9e181 100644
--- a/src/maasserver/region_controller.py
+++ b/src/maasserver/region_controller.py
@@ -31,16 +31,25 @@ from provisioningserver.logger import LegacyLogger
31from provisioningserver.utils.twisted import (31from provisioningserver.utils.twisted import (
32 asynchronous,32 asynchronous,
33 FOREVER,33 FOREVER,
34 pause,
34)35)
35from twisted.application.service import Service36from twisted.application.service import Service
36from twisted.internet import reactor37from twisted.internet import reactor
37from twisted.internet.defer import DeferredList38from twisted.internet.defer import (
39 DeferredList,
40 inlineCallbacks,
41)
38from twisted.internet.task import LoopingCall42from twisted.internet.task import LoopingCall
43from twisted.names.client import Resolver
3944
4045
41log = LegacyLogger()46log = LegacyLogger()
4247
4348
49class DNSReloadError(Exception):
50 """Error raised when the bind never fully reloads the zone."""
51
52
44class RegionControllerService(Service):53class RegionControllerService(Service):
45 """54 """
46 A service that controllers external services that are in MAAS's control on55 A service that controllers external services that are in MAAS's control on
@@ -64,6 +73,9 @@ class RegionControllerService(Service):
64 self.needsDNSUpdate = False73 self.needsDNSUpdate = False
65 self.needsProxyUpdate = False74 self.needsProxyUpdate = False
66 self.postgresListener = postgresListener75 self.postgresListener = postgresListener
76 self.dnsResolver = Resolver(
77 resolv=None, servers=[('127.0.0.1', 53)],
78 timeout=(1,), reactor=clock)
6779
68 @asynchronous(timeout=FOREVER)80 @asynchronous(timeout=FOREVER)
69 def startService(self):81 def startService(self):
@@ -108,9 +120,12 @@ class RegionControllerService(Service):
108 if self.needsDNSUpdate:120 if self.needsDNSUpdate:
109 self.needsDNSUpdate = False121 self.needsDNSUpdate = False
110 d = deferToDatabase(transactional(dns_update_all_zones))122 d = deferToDatabase(transactional(dns_update_all_zones))
123 d.addCallback(self._check_serial)
111 d.addCallback(124 d.addCallback(
112 lambda _: log.msg(125 lambda domains: log.msg(
113 "Successfully configured DNS."))126 "Successfully configured DNS "
127 "authoritative zones: %s." % (
128 ', '.join(domains))))
114 d.addErrback(129 d.addErrback(
115 log.err,130 log.err,
116 "Failed configuring DNS.")131 "Failed configuring DNS.")
@@ -131,3 +146,31 @@ class RegionControllerService(Service):
131 self.processingDefer = None146 self.processingDefer = None
132 else:147 else:
133 return DeferredList(defers)148 return DeferredList(defers)
149
150 @inlineCallbacks
151 def _check_serial(self, result):
152 """Check that the serial of the domain is updated."""
153 serial, domain_names = result
154 not_matching_domains = set(domain_names)
155 loop = 0
156 while len(not_matching_domains) > 0 and loop != 30:
157 for domain in list(not_matching_domains):
158 try:
159 answers, _, _ = yield self.dnsResolver.lookupAuthority(
160 domain)
161 except (ValueError, TimeoutError):
162 answers = []
163 if len(answers) > 0:
164 if int(answers[0].payload.serial) == int(serial):
165 not_matching_domains.remove(domain)
166 loop += 1
167 yield pause(2)
168 # 30 retries with 2 second pauses (aka. 60 seconds) has passed and
169 # there still is a domain that has the wrong serial. For now just
170 # raise the error, in the future we should take action and force
171 # restart bind.
172 if len(not_matching_domains) > 0:
173 raise DNSReloadError(
174 "Failed to reload DNS; serial mismatch "
175 "on domains %s" % ', '.join(not_matching_domains))
176 return domain_names
diff --git a/src/maasserver/tests/test_region_controller.py b/src/maasserver/tests/test_region_controller.py
index 9dbe4da..a4e83d2 100644
--- a/src/maasserver/tests/test_region_controller.py
+++ b/src/maasserver/tests/test_region_controller.py
@@ -5,6 +5,7 @@
55
6__all__ = []6__all__ = []
77
8import random
8from unittest.mock import (9from unittest.mock import (
9 ANY,10 ANY,
10 call,11 call,
@@ -14,7 +15,10 @@ from unittest.mock import (
1415
15from crochet import wait_for16from crochet import wait_for
16from maasserver import region_controller17from maasserver import region_controller
17from maasserver.region_controller import RegionControllerService18from maasserver.region_controller import (
19 DNSReloadError,
20 RegionControllerService,
21)
18from maasserver.testing.factory import factory22from maasserver.testing.factory import factory
19from maasserver.testing.testcase import MAASServerTestCase23from maasserver.testing.testcase import MAASServerTestCase
20from maastesting.matchers import (24from maastesting.matchers import (
@@ -22,6 +26,7 @@ from maastesting.matchers import (
22 MockCallsMatch,26 MockCallsMatch,
23 MockNotCalled,27 MockNotCalled,
24)28)
29from testtools import ExpectedException
25from testtools.matchers import MatchesStructure30from testtools.matchers import MatchesStructure
26from twisted.internet import reactor31from twisted.internet import reactor
27from twisted.internet.defer import (32from twisted.internet.defer import (
@@ -29,6 +34,12 @@ from twisted.internet.defer import (
29 inlineCallbacks,34 inlineCallbacks,
30 succeed,35 succeed,
31)36)
37from twisted.names.dns import (
38 A,
39 Record_SOA,
40 RRHeader,
41 SOA,
42)
3243
3344
34wait_for_reactor = wait_for(30) # 30 seconds.45wait_for_reactor = wait_for(30) # 30 seconds.
@@ -166,16 +177,28 @@ class TestRegionControllerService(MAASServerTestCase):
166 def test_process_updates_zones(self):177 def test_process_updates_zones(self):
167 service = RegionControllerService(sentinel.listener)178 service = RegionControllerService(sentinel.listener)
168 service.needsDNSUpdate = True179 service.needsDNSUpdate = True
180 dns_result = (
181 random.randint(1, 1000), [
182 factory.make_name('domain')
183 for _ in range(3)
184 ])
169 mock_dns_update_all_zones = self.patch(185 mock_dns_update_all_zones = self.patch(
170 region_controller, "dns_update_all_zones")186 region_controller, "dns_update_all_zones")
187 mock_dns_update_all_zones.return_value = dns_result
188 mock_check_serial = self.patch(service, "_check_serial")
189 mock_check_serial.return_value = succeed(dns_result[1])
171 mock_msg = self.patch(190 mock_msg = self.patch(
172 region_controller.log, "msg")191 region_controller.log, "msg")
173 service.startProcessing()192 service.startProcessing()
174 yield service.processingDefer193 yield service.processingDefer
175 self.assertThat(mock_dns_update_all_zones, MockCalledOnceWith())194 self.assertThat(mock_dns_update_all_zones, MockCalledOnceWith())
195 self.assertThat(mock_check_serial, MockCalledOnceWith(dns_result))
176 self.assertThat(196 self.assertThat(
177 mock_msg,197 mock_msg,
178 MockCalledOnceWith("Successfully configured DNS."))198 MockCalledOnceWith(
199 "Successfully configured DNS "
200 "authoritative zones: %s." % (
201 ', '.join(dns_result[1]))))
179202
180 @wait_for_reactor203 @wait_for_reactor
181 @inlineCallbacks204 @inlineCallbacks
@@ -236,8 +259,16 @@ class TestRegionControllerService(MAASServerTestCase):
236 service = RegionControllerService(sentinel.listener)259 service = RegionControllerService(sentinel.listener)
237 service.needsDNSUpdate = True260 service.needsDNSUpdate = True
238 service.needsProxyUpdate = True261 service.needsProxyUpdate = True
262 dns_result = (
263 random.randint(1, 1000), [
264 factory.make_name('domain')
265 for _ in range(3)
266 ])
239 mock_dns_update_all_zones = self.patch(267 mock_dns_update_all_zones = self.patch(
240 region_controller, "dns_update_all_zones")268 region_controller, "dns_update_all_zones")
269 mock_dns_update_all_zones.return_value = dns_result
270 mock_check_serial = self.patch(service, "_check_serial")
271 mock_check_serial.return_value = succeed(dns_result[1])
241 mock_proxy_update_config = self.patch(272 mock_proxy_update_config = self.patch(
242 region_controller, "proxy_update_config")273 region_controller, "proxy_update_config")
243 mock_proxy_update_config.return_value = succeed(None)274 mock_proxy_update_config.return_value = succeed(None)
@@ -245,5 +276,97 @@ class TestRegionControllerService(MAASServerTestCase):
245 yield service.processingDefer276 yield service.processingDefer
246 self.assertThat(277 self.assertThat(
247 mock_dns_update_all_zones, MockCalledOnceWith())278 mock_dns_update_all_zones, MockCalledOnceWith())
279 self.assertThat(mock_check_serial, MockCalledOnceWith(dns_result))
248 self.assertThat(280 self.assertThat(
249 mock_proxy_update_config, MockCalledOnceWith(reload_proxy=True))281 mock_proxy_update_config, MockCalledOnceWith(reload_proxy=True))
282
283 def make_soa_result(self, serial):
284 return RRHeader(
285 type=SOA, cls=A, ttl=30, payload=Record_SOA(serial=serial))
286
287 @wait_for_reactor
288 def test__check_serial_doesnt_raise_error_on_successful_serial_match(self):
289 service = RegionControllerService(sentinel.listener)
290 result_serial = random.randint(1, 1000)
291 formatted_serial = '{0:10d}'.format(result_serial)
292 dns_names = [
293 factory.make_name('domain')
294 for _ in range(3)
295 ]
296 # Mock pause so test runs faster.
297 self.patch(region_controller, "pause").return_value = succeed(None)
298 mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
299 mock_lookup.side_effect = [
300 # First pass no results.
301 succeed(([], [], [])),
302 succeed(([], [], [])),
303 succeed(([], [], [])),
304 # First domain valid result.
305 succeed(([self.make_soa_result(result_serial)], [], [])),
306 succeed(([], [], [])),
307 succeed(([], [], [])),
308 # Second domain wrong serial.
309 succeed(([self.make_soa_result(result_serial - 1)], [], [])),
310 succeed(([], [], [])),
311 # Third domain correct serial.
312 succeed(([], [], [])),
313 succeed(([self.make_soa_result(result_serial)], [], [])),
314 # Second domain correct serial.
315 succeed(([self.make_soa_result(result_serial)], [], [])),
316 ]
317 # Error should not be raised.
318 return service._check_serial((formatted_serial, dns_names))
319
320 @wait_for_reactor
321 @inlineCallbacks
322 def test__check_serial_raise_error_after_30_tries(self):
323 service = RegionControllerService(sentinel.listener)
324 result_serial = random.randint(1, 1000)
325 formatted_serial = '{0:10d}'.format(result_serial)
326 dns_names = [
327 factory.make_name('domain')
328 for _ in range(3)
329 ]
330 # Mock pause so test runs faster.
331 self.patch(region_controller, "pause").return_value = succeed(None)
332 mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
333 mock_lookup.side_effect = lambda *args: succeed(([], [], []))
334 # Error should not be raised.
335 with ExpectedException(DNSReloadError):
336 yield service._check_serial((formatted_serial, dns_names))
337
338 @wait_for_reactor
339 @inlineCallbacks
340 def test__check_serial_handles_ValueError(self):
341 service = RegionControllerService(sentinel.listener)
342 result_serial = random.randint(1, 1000)
343 formatted_serial = '{0:10d}'.format(result_serial)
344 dns_names = [
345 factory.make_name('domain')
346 for _ in range(3)
347 ]
348 # Mock pause so test runs faster.
349 self.patch(region_controller, "pause").return_value = succeed(None)
350 mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
351 mock_lookup.side_effect = ValueError()
352 # Error should not be raised.
353 with ExpectedException(DNSReloadError):
354 yield service._check_serial((formatted_serial, dns_names))
355
356 @wait_for_reactor
357 @inlineCallbacks
358 def test__check_serial_handles_TimeoutError(self):
359 service = RegionControllerService(sentinel.listener)
360 result_serial = random.randint(1, 1000)
361 formatted_serial = '{0:10d}'.format(result_serial)
362 dns_names = [
363 factory.make_name('domain')
364 for _ in range(3)
365 ]
366 # Mock pause so test runs faster.
367 self.patch(region_controller, "pause").return_value = succeed(None)
368 mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
369 mock_lookup.side_effect = TimeoutError()
370 # Error should not be raised.
371 with ExpectedException(DNSReloadError):
372 yield service._check_serial((formatted_serial, dns_names))

Subscribers

People subscribed via source and target branches

to all changes: