Merge ~blake-rouse/maas:improved-dns-reload into maas:master

Proposed by Blake Rouse on 2017-08-18
Status: Merged
Approved by: Mike Pontillo on 2017-08-19
Approved revision: 83d0b4200ba97eacc5f56b1c26a97928172bf959
Merge reported by: MAAS Lander
Merged at revision: not available
Proposed branch: ~blake-rouse/maas:improved-dns-reload
Merge into: maas:master
Diff against target: 332 lines (+196/-5)
4 files modified
src/maasserver/dns/config.py (+6/-0)
src/maasserver/dns/tests/test_config.py (+19/-0)
src/maasserver/region_controller.py (+46/-3)
src/maasserver/tests/test_region_controller.py (+125/-2)
Reviewer Review Type Date Requested Status
Mike Pontillo (community) 2017-08-18 Approve on 2017-08-18
Review via email: mp+329260@code.launchpad.net

Commit message

Fixes LP: #1710308, #1710278 - After BIND reload verify that the serial for all authoritative domains the new value before continuing.

To post a comment you must log in.
Mike Pontillo (mpontillo) wrote :

Looks good so far; I have some comments and questions inline below.

review: Needs Information
Blake Rouse (blake-rouse) :
Mike Pontillo (mpontillo) wrote :

Thanks for the replies. Sounds good.

review: Approve
Felipe Reyes (freyes) wrote :

Hello,

I'm trying to understand how this patch "throttles bind9 reloads", if I'm reading this correctly, this will only check (after rndc reload is ran) that the reload succeeded querying and verifying the serial is the expected one.

If there is a storm of dhcp requests and produces tenths of events (e.g. commissioning many nodes at the same time, or a rack coming back from a power failure), bind9 will still be reloaded for everyone of those events even if they are within the same second.

Best,

Mike Pontillo (mpontillo) wrote :

@freyes, it was my understanding after looking at the code with Blake that the surrounding code in MAAS already prevents concurrent reloads. (That is, the `rdnc` command sends a message to the BIND process, the BIND process asynchronously processes it, and control is returned to MAAS almost immediately.) This patch just ensures that the next reload starts after the previous reload has completed.

Felipe Reyes (freyes) wrote :

there are no concurrent reloads, but there is no throttling, e.g. if 2 events within the same second arrive, maas will run 'rndc reload' twice, while it could just run it once at the end of processing of the second event (the last that finishes it task), that's what it's suggested in the bug description:

""" - Throttle consecutive updates. (For example, when the first DNS change occurs, schedule a task for a few seconds in the future which will update DNS, and don't ever schedule more additional updates while that task is pending.)"""

Mike Pontillo (mpontillo) wrote :

That's true. My bug description did overstep a little bit in that it described how I would have personally fixed the issue. Since this patch doesn't address that aspect of the solution, I'll probably need to change the title/description and file another bug for the "excessive reloads" aspect.

Mike Pontillo (mpontillo) wrote :

I've filed bug #1712205 to address the idea that we should batch multiple consecutive DNS changes into a single reload.

Andres Rodriguez (andreserl) wrote :

We should double check, but I thought Maas already did that (before this
bugfix?)

On Mon, Aug 21, 2017 at 5:36 PM Mike Pontillo <email address hidden>
wrote:

> I've filed bug #1712205 to address the idea that we should batch multiple
> consecutive DNS changes into a single reload.
> --
> https://code.launchpad.net/~blake-rouse/maas/+git/maas/+merge/329260
> Your team MAAS Committers is subscribed to branch maas:master.
>
--
Andres Rodriguez
Engineering Manager, MAAS
Canonical USA, Inc.

Mike Pontillo (mpontillo) wrote :

What MAAS already did was to prevent **concurrent outstanding reload requests**. That is, if the "rndc reload" command was already being executed and another DNS change came in, MAAS would (even before this branch) not try to issue *another* reload at the same time. However, the "rndc reload" returns so quickly (as it asynchronously causes BIND to start reloading) that we could still effectively have two concurrent reloads prior to this branch landing.

Mike Pontillo (mpontillo) wrote :

After discussing this again, Blake reminded me that there is an implicit throttle here when we pause for a second to await the DNS query timeout while we check, then pause for two seconds while we wait for the zone to reload.

So this branch *will* implicitly thottle reload requests, unless BIND reloads extremely fast and MAAS checks and confirms the zone before any of the pauses occur.

Felipe Reyes (freyes) wrote :

Mike, thanks for addressing my comments, and specially for filing another bug for the throttling part. I will keep an eye on that one ;)

Best,

Blake Rouse (blake-rouse) wrote :

Felipe,

The throttling bug has been changed, as its not a throttling issue. My branch introduces throttling based on how the DNS service handles the culling of updates.

But I am looking into improving the DNS triggers so less DNS updates are performed.

Felipe Reyes (freyes) wrote :

Yes, I got it ;)

Update scan failed

At least one of the branches involved have failed to scan. You can manually schedule a rescan if required.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/src/maasserver/dns/config.py b/src/maasserver/dns/config.py
2index 855cc47..e27773a 100644
3--- a/src/maasserver/dns/config.py
4+++ b/src/maasserver/dns/config.py
5@@ -88,6 +88,12 @@ def dns_update_all_zones(reload_retry=False):
6 else:
7 bind_reload()
8
9+ # Return the current serial and list of domain names.
10+ return serial, [
11+ domain.name
12+ for domain in domains
13+ ]
14+
15
16 def get_upstream_dns():
17 """Return the IP addresses of configured upstream DNS servers.
18diff --git a/src/maasserver/dns/tests/test_config.py b/src/maasserver/dns/tests/test_config.py
19index 029e874..c163420 100644
20--- a/src/maasserver/dns/tests/test_config.py
21+++ b/src/maasserver/dns/tests/test_config.py
22@@ -53,6 +53,7 @@ from testtools.matchers import (
23 Contains,
24 Equals,
25 FileContains,
26+ MatchesSetwise,
27 MatchesStructure,
28 )
29
30@@ -302,6 +303,24 @@ class TestDNSConfigModifications(TestDNSServer):
31 port=self.bind.config.port, commands=[ns_record, '+short'])
32 self.assertEqual(ip, ip_of_ns_record)
33
34+ def test_dns_update_all_zones_returns_serial_and_domains(self):
35+ self.patch(settings, 'DNS_CONNECT', True)
36+ domain = factory.make_Domain()
37+ # These domains should not show up. Just to test we create them.
38+ for _ in range(3):
39+ factory.make_Domain(authoritative=False)
40+ node, static = self.create_node_with_static_ip(domain=domain)
41+ fake_serial = random.randint(1, 1000)
42+ self.patch(
43+ dns_config_module,
44+ "current_zone_serial").return_value = fake_serial
45+ serial, domains = dns_update_all_zones()
46+ self.assertThat(serial, Equals(fake_serial))
47+ self.assertThat(domains, MatchesSetwise(*[
48+ Equals(domain.name)
49+ for domain in Domain.objects.filter(authoritative=True)
50+ ]))
51+
52
53 class TestDNSDynamicIPAddresses(TestDNSServer):
54 """Allocated nodes with IP addresses in the dynamic range get a DNS
55diff --git a/src/maasserver/region_controller.py b/src/maasserver/region_controller.py
56index f3a51f3..0f9e181 100644
57--- a/src/maasserver/region_controller.py
58+++ b/src/maasserver/region_controller.py
59@@ -31,16 +31,25 @@ from provisioningserver.logger import LegacyLogger
60 from provisioningserver.utils.twisted import (
61 asynchronous,
62 FOREVER,
63+ pause,
64 )
65 from twisted.application.service import Service
66 from twisted.internet import reactor
67-from twisted.internet.defer import DeferredList
68+from twisted.internet.defer import (
69+ DeferredList,
70+ inlineCallbacks,
71+)
72 from twisted.internet.task import LoopingCall
73+from twisted.names.client import Resolver
74
75
76 log = LegacyLogger()
77
78
79+class DNSReloadError(Exception):
80+ """Error raised when the bind never fully reloads the zone."""
81+
82+
83 class RegionControllerService(Service):
84 """
85 A service that controllers external services that are in MAAS's control on
86@@ -64,6 +73,9 @@ class RegionControllerService(Service):
87 self.needsDNSUpdate = False
88 self.needsProxyUpdate = False
89 self.postgresListener = postgresListener
90+ self.dnsResolver = Resolver(
91+ resolv=None, servers=[('127.0.0.1', 53)],
92+ timeout=(1,), reactor=clock)
93
94 @asynchronous(timeout=FOREVER)
95 def startService(self):
96@@ -108,9 +120,12 @@ class RegionControllerService(Service):
97 if self.needsDNSUpdate:
98 self.needsDNSUpdate = False
99 d = deferToDatabase(transactional(dns_update_all_zones))
100+ d.addCallback(self._check_serial)
101 d.addCallback(
102- lambda _: log.msg(
103- "Successfully configured DNS."))
104+ lambda domains: log.msg(
105+ "Successfully configured DNS "
106+ "authoritative zones: %s." % (
107+ ', '.join(domains))))
108 d.addErrback(
109 log.err,
110 "Failed configuring DNS.")
111@@ -131,3 +146,31 @@ class RegionControllerService(Service):
112 self.processingDefer = None
113 else:
114 return DeferredList(defers)
115+
116+ @inlineCallbacks
117+ def _check_serial(self, result):
118+ """Check that the serial of the domain is updated."""
119+ serial, domain_names = result
120+ not_matching_domains = set(domain_names)
121+ loop = 0
122+ while len(not_matching_domains) > 0 and loop != 30:
123+ for domain in list(not_matching_domains):
124+ try:
125+ answers, _, _ = yield self.dnsResolver.lookupAuthority(
126+ domain)
127+ except (ValueError, TimeoutError):
128+ answers = []
129+ if len(answers) > 0:
130+ if int(answers[0].payload.serial) == int(serial):
131+ not_matching_domains.remove(domain)
132+ loop += 1
133+ yield pause(2)
134+ # 30 retries with 2 second pauses (aka. 60 seconds) has passed and
135+ # there still is a domain that has the wrong serial. For now just
136+ # raise the error, in the future we should take action and force
137+ # restart bind.
138+ if len(not_matching_domains) > 0:
139+ raise DNSReloadError(
140+ "Failed to reload DNS; serial mismatch "
141+ "on domains %s" % ', '.join(not_matching_domains))
142+ return domain_names
143diff --git a/src/maasserver/tests/test_region_controller.py b/src/maasserver/tests/test_region_controller.py
144index 9dbe4da..a4e83d2 100644
145--- a/src/maasserver/tests/test_region_controller.py
146+++ b/src/maasserver/tests/test_region_controller.py
147@@ -5,6 +5,7 @@
148
149 __all__ = []
150
151+import random
152 from unittest.mock import (
153 ANY,
154 call,
155@@ -14,7 +15,10 @@ from unittest.mock import (
156
157 from crochet import wait_for
158 from maasserver import region_controller
159-from maasserver.region_controller import RegionControllerService
160+from maasserver.region_controller import (
161+ DNSReloadError,
162+ RegionControllerService,
163+)
164 from maasserver.testing.factory import factory
165 from maasserver.testing.testcase import MAASServerTestCase
166 from maastesting.matchers import (
167@@ -22,6 +26,7 @@ from maastesting.matchers import (
168 MockCallsMatch,
169 MockNotCalled,
170 )
171+from testtools import ExpectedException
172 from testtools.matchers import MatchesStructure
173 from twisted.internet import reactor
174 from twisted.internet.defer import (
175@@ -29,6 +34,12 @@ from twisted.internet.defer import (
176 inlineCallbacks,
177 succeed,
178 )
179+from twisted.names.dns import (
180+ A,
181+ Record_SOA,
182+ RRHeader,
183+ SOA,
184+)
185
186
187 wait_for_reactor = wait_for(30) # 30 seconds.
188@@ -166,16 +177,28 @@ class TestRegionControllerService(MAASServerTestCase):
189 def test_process_updates_zones(self):
190 service = RegionControllerService(sentinel.listener)
191 service.needsDNSUpdate = True
192+ dns_result = (
193+ random.randint(1, 1000), [
194+ factory.make_name('domain')
195+ for _ in range(3)
196+ ])
197 mock_dns_update_all_zones = self.patch(
198 region_controller, "dns_update_all_zones")
199+ mock_dns_update_all_zones.return_value = dns_result
200+ mock_check_serial = self.patch(service, "_check_serial")
201+ mock_check_serial.return_value = succeed(dns_result[1])
202 mock_msg = self.patch(
203 region_controller.log, "msg")
204 service.startProcessing()
205 yield service.processingDefer
206 self.assertThat(mock_dns_update_all_zones, MockCalledOnceWith())
207+ self.assertThat(mock_check_serial, MockCalledOnceWith(dns_result))
208 self.assertThat(
209 mock_msg,
210- MockCalledOnceWith("Successfully configured DNS."))
211+ MockCalledOnceWith(
212+ "Successfully configured DNS "
213+ "authoritative zones: %s." % (
214+ ', '.join(dns_result[1]))))
215
216 @wait_for_reactor
217 @inlineCallbacks
218@@ -236,8 +259,16 @@ class TestRegionControllerService(MAASServerTestCase):
219 service = RegionControllerService(sentinel.listener)
220 service.needsDNSUpdate = True
221 service.needsProxyUpdate = True
222+ dns_result = (
223+ random.randint(1, 1000), [
224+ factory.make_name('domain')
225+ for _ in range(3)
226+ ])
227 mock_dns_update_all_zones = self.patch(
228 region_controller, "dns_update_all_zones")
229+ mock_dns_update_all_zones.return_value = dns_result
230+ mock_check_serial = self.patch(service, "_check_serial")
231+ mock_check_serial.return_value = succeed(dns_result[1])
232 mock_proxy_update_config = self.patch(
233 region_controller, "proxy_update_config")
234 mock_proxy_update_config.return_value = succeed(None)
235@@ -245,5 +276,97 @@ class TestRegionControllerService(MAASServerTestCase):
236 yield service.processingDefer
237 self.assertThat(
238 mock_dns_update_all_zones, MockCalledOnceWith())
239+ self.assertThat(mock_check_serial, MockCalledOnceWith(dns_result))
240 self.assertThat(
241 mock_proxy_update_config, MockCalledOnceWith(reload_proxy=True))
242+
243+ def make_soa_result(self, serial):
244+ return RRHeader(
245+ type=SOA, cls=A, ttl=30, payload=Record_SOA(serial=serial))
246+
247+ @wait_for_reactor
248+ def test__check_serial_doesnt_raise_error_on_successful_serial_match(self):
249+ service = RegionControllerService(sentinel.listener)
250+ result_serial = random.randint(1, 1000)
251+ formatted_serial = '{0:10d}'.format(result_serial)
252+ dns_names = [
253+ factory.make_name('domain')
254+ for _ in range(3)
255+ ]
256+ # Mock pause so test runs faster.
257+ self.patch(region_controller, "pause").return_value = succeed(None)
258+ mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
259+ mock_lookup.side_effect = [
260+ # First pass no results.
261+ succeed(([], [], [])),
262+ succeed(([], [], [])),
263+ succeed(([], [], [])),
264+ # First domain valid result.
265+ succeed(([self.make_soa_result(result_serial)], [], [])),
266+ succeed(([], [], [])),
267+ succeed(([], [], [])),
268+ # Second domain wrong serial.
269+ succeed(([self.make_soa_result(result_serial - 1)], [], [])),
270+ succeed(([], [], [])),
271+ # Third domain correct serial.
272+ succeed(([], [], [])),
273+ succeed(([self.make_soa_result(result_serial)], [], [])),
274+ # Second domain correct serial.
275+ succeed(([self.make_soa_result(result_serial)], [], [])),
276+ ]
277+ # Error should not be raised.
278+ return service._check_serial((formatted_serial, dns_names))
279+
280+ @wait_for_reactor
281+ @inlineCallbacks
282+ def test__check_serial_raise_error_after_30_tries(self):
283+ service = RegionControllerService(sentinel.listener)
284+ result_serial = random.randint(1, 1000)
285+ formatted_serial = '{0:10d}'.format(result_serial)
286+ dns_names = [
287+ factory.make_name('domain')
288+ for _ in range(3)
289+ ]
290+ # Mock pause so test runs faster.
291+ self.patch(region_controller, "pause").return_value = succeed(None)
292+ mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
293+ mock_lookup.side_effect = lambda *args: succeed(([], [], []))
294+ # Error should not be raised.
295+ with ExpectedException(DNSReloadError):
296+ yield service._check_serial((formatted_serial, dns_names))
297+
298+ @wait_for_reactor
299+ @inlineCallbacks
300+ def test__check_serial_handles_ValueError(self):
301+ service = RegionControllerService(sentinel.listener)
302+ result_serial = random.randint(1, 1000)
303+ formatted_serial = '{0:10d}'.format(result_serial)
304+ dns_names = [
305+ factory.make_name('domain')
306+ for _ in range(3)
307+ ]
308+ # Mock pause so test runs faster.
309+ self.patch(region_controller, "pause").return_value = succeed(None)
310+ mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
311+ mock_lookup.side_effect = ValueError()
312+ # Error should not be raised.
313+ with ExpectedException(DNSReloadError):
314+ yield service._check_serial((formatted_serial, dns_names))
315+
316+ @wait_for_reactor
317+ @inlineCallbacks
318+ def test__check_serial_handles_TimeoutError(self):
319+ service = RegionControllerService(sentinel.listener)
320+ result_serial = random.randint(1, 1000)
321+ formatted_serial = '{0:10d}'.format(result_serial)
322+ dns_names = [
323+ factory.make_name('domain')
324+ for _ in range(3)
325+ ]
326+ # Mock pause so test runs faster.
327+ self.patch(region_controller, "pause").return_value = succeed(None)
328+ mock_lookup = self.patch(service.dnsResolver, "lookupAuthority")
329+ mock_lookup.side_effect = TimeoutError()
330+ # Error should not be raised.
331+ with ExpectedException(DNSReloadError):
332+ yield service._check_serial((formatted_serial, dns_names))

Subscribers

People subscribed via source and target branches

to all changes: