Merge lp:~gmb/maas/catch-errors-in-power-callsites into lp:~maas-committers/maas/trunk
Status: Merged
Approved by: Graham Binns
Approved revision: no longer in the source branch
Merged at revision: 3228
Proposed branch: lp:~gmb/maas/catch-errors-in-power-callsites
Merge into: lp:~maas-committers/maas/trunk
Diff against target: 508 lines (+368/-17), 2 files modified:
  src/maasserver/models/node.py (+76/-13)
  src/maasserver/models/tests/test_node.py (+292/-4)
To merge this branch: bzr merge lp:~gmb/maas/catch-errors-in-power-callsites
Related bugs: none
Reviewer | Status
---|---
Raphaël Badin (community) | Approve
Gavin Panella (community) | Approve

Review via email: mp+237619@code.launchpad.net
Commit message
Refactor the call sites of Node.objects.start_nodes() and stop_nodes():
- Handle errors raised by those methods.
- Ensure that model changes are committed before power actions are sent
  (when starting nodes) and after power actions are sent (when stopping
  nodes). This avoids race conditions when carrying out actions in bulk.
- When errors occur in start_nodes(), the call sites explicitly roll the
  node back to a sane status (either its old status or a suitable FAILED
  status).
- Re-raise those errors once the call sites have finished with them.
- Don't change the state of the node if an error occurred while starting
  or stopping it.
This is a further fix for bug 1375980.
I've also removed the check for len(nodes_
I've also added tests covering the atomicity of the call sites and their logging and re-raising of exceptions.
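The commit-then-manual-rollback pattern described above can be sketched schematically. This is not the MAAS code itself: `FakeDB` and `power_on` are stand-ins for Django's transaction machinery and the RPC power call, and the status strings are illustrative.

```python
# Schematic sketch of the commit-then-manual-rollback pattern from the
# commit message. FakeDB stands in for Django's transaction machinery;
# power_on stands in for the RPC power action. Not the actual MAAS code.

class FakeDB:
    """Toy transactional store: remembers the last committed status."""

    def __init__(self):
        self.committed_status = None

    def commit(self, status):
        self.committed_status = status


def start_with_rollback(db, node, new_status, power_on):
    """Commit new_status before powering on; restore the old one on failure."""
    old_status = node["status"]
    node["status"] = new_status
    db.commit(new_status)  # committed before the power action is sent
    try:
        power_on(node)
    except Exception:
        # Manual rollback: the new status is already committed, so the
        # old status must be written and committed back explicitly.
        node["status"] = old_status
        db.commit(old_status)
        raise  # let the UI or API deal with the error
```

The key property is that the status visible to other transactions is always a committed one, both before the power action and after a failure.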
Description of the change
Graham Binns (gmb) wrote:
On 8 October 2014 18:05, Gavin Panella <email address hidden> wrote:
> The new node status needs to be committed before we try to start the node. If there's an error we need to reset the status and commit again.
>
> However, you may have some fun getting Django to let you commit when it's in one of its special "atomic" blocks. Where "fun" is not the word I would choose.
Can you help me understand why the new status needs to be committed?…
Ah, wait, because it's possible that two users could try to commission
the node at the same time? Or is it something else entirely?
I tried using atomic() and even savepoint() but couldn't get them to
work, before I realised that I could do things this way… The only
other option is to do a manual rollback – i.e. on error, set the node
status back to whatever it was. Would that be okay with you?
Gavin Panella (allenap) wrote:
> On 8 October 2014 18:05, Gavin Panella <email address hidden> wrote:
> > The new node status needs to be committed before we try to start the
> > node. If there's an error we need to reset the status and commit
> > again.
> >
> > However, you may have some fun getting Django to let you commit when
> > it's in one of its special "atomic" blocks. Where "fun" is not the
> > word I would choose.
>
> Can you help me understand why the new status needs to be committed?…
> Ah, wait, because it's possible that two users could try to commission
> the node at the same time? Or is it something else entirely?
I was thinking about a node coming up and trying to boot from TFTP
before we've committed its status to the database. For a single node I
doubt it'll be a problem because, in most cases, it'll be a fraction of
a second between the PowerOn call returning and the status being saved.
However, when starting multiple nodes it could be longer because the
transaction doesn't get committed until the end of the web/API request.
If a node does its first TFTP request within that window then it'll be
given boot instructions based on its old status.
We use TransactionMiddleware, and our slowness to migrate off it may be
a blessing: it /may/ allow us to call transaction.commit() in the middle
of a request. With ATOMIC_REQUESTS, Django's replacement for
TransactionMiddleware, we would definitely not be able to commit in the
middle of a request.
> I tried using atomic() and even savepoint() but couldn't get them to
> work, before I realised that I could do things this way… The only
> other option is to do a manual rollback – i.e. on error, set the node
> status back to whatever it was. Would that be okay with you?
Yeah, it'll have to be a manual rollback because we need to have that
status committed.
I remember we've talked about some of this stuff before and you talked
about manual rollbacks. If I misunderstood and gave bad advice, I'm
sorry, and hopefully I'm thinking straight now.
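Gavin's point about ATOMIC_REQUESTS can be illustrated with a toy model. This is not Django itself: the class and function names below merely mimic Django's, to show why an explicit commit inside an atomic block (and hence inside an ATOMIC_REQUESTS request) cannot work.

```python
# Toy model (not Django) of the behaviour described above: inside an
# atomic block, explicit commits are refused, which is why a
# mid-request commit is impossible under ATOMIC_REQUESTS. The names
# only mimic django.db.transaction's.

from contextlib import contextmanager


class TransactionManagementError(Exception):
    """Mimics django.db.transaction.TransactionManagementError."""


class Connection:
    def __init__(self):
        self.in_atomic_block = False

    def commit(self):
        if self.in_atomic_block:
            raise TransactionManagementError(
                "This is forbidden when an 'atomic' block is active.")


@contextmanager
def atomic(connection):
    """Crude stand-in for django.db.transaction.atomic()."""
    connection.in_atomic_block = True
    try:
        yield
    finally:
        connection.in_atomic_block = False
```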
Graham Binns (gmb) wrote:
On 8 October 2014 20:56, Gavin Panella <email address hidden> wrote:
> I was thinking about a node coming up and trying to boot from TFTP
> before we've committed its status to the database. For a single node I
> doubt it'll be a problem because, in most cases, it'll be a fraction of
> a second between the PowerOn call returning and the status being saved.
>
> However, when starting multiple nodes it could be longer because the
> transaction doesn't get committed until the end of the web/API request.
> If a node does its first TFTP request within that window then it'll be
> given boot instructions based on its old status.
>
> We use TransactionMiddleware, and our slowness to migrate off it may
> be a blessing: it /may/ allow us to call transaction.commit() in the
> middle of a request. With ATOMIC_REQUESTS, Django's replacement for
> TransactionMiddleware, we would definitely not be able to commit in
> the middle of a request.
Aah, okay. Now I'm with you (on both the problem and the
why-it-won't-work side of things).
>> I tried using atomic() and even savepoint() but couldn't get them to
>> work, before I realised that I could do things this way… The only
>> other option is to do a manual rollback – i.e. on error, set the node
>> status back to whatever it was. Would that be okay with you?
>
> Yeah, it'll have to be a manual rollback because we need to have that
> status committed.
>
> I remember we've talked about some of this stuff before and you talked
> about manual rollbacks. If I misunderstood and gave bad advice, I'm
> sorry, and hopefully I'm thinking straight now.
Okay, manual rollback it is. That's relatively easy, actually. And you
didn't give bad advice — you gave really good advice that turned out
not to work properly; no fault of yours.
Gavin Panella (allenap) wrote:
Looks good. I'm still a bit uncomfortable about suppress-all-errors in the tests because it might be hiding something important. Can you narrow it down to just the error or errors you expect?
Graham Binns (gmb) wrote:
On 9 October 2014 14:09, Gavin Panella <email address hidden> wrote:
>
> Looks good. I'm still a bit uncomfortable about suppress-all-errors in the tests because it might be hiding something important. Can you narrow it down to just the error or errors you expect?
Sure.
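The narrowing Gavin asks for amounts to catching only a tuple of expected exception types in the bulk-action test loops, rather than suppressing everything. A sketch, with a stand-in exception class (the real tests use maasserver.node_action.RPC_EXCEPTIONS and provisioningserver's exception types):

```python
# Illustrative sketch only: suppress just the expected RPC failures in
# a bulk-action test loop, so anything unexpected still fails the test.
# NoConnectionsAvailable is a stand-in for the real RPC exception.

class NoConnectionsAvailable(Exception):
    """Stand-in for the provisioningserver RPC exception."""


# In the real tests this tuple is maasserver.node_action.RPC_EXCEPTIONS.
RPC_EXCEPTIONS = (NoConnectionsAvailable,)


def run_bulk(actions):
    """Run each action, suppressing only the expected RPC failures."""
    for action in actions:
        try:
            action()
        except RPC_EXCEPTIONS:
            # Expected failure mode; the tests assert on it elsewhere.
            pass
```

An unexpected exception type, such as a ValueError, propagates out of `run_bulk()` and surfaces as a genuine test failure.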
Raphaël Badin (rvb) wrote:
Looks good. Of course we need to extensively test this (I guess you've done some of it already).
Couple of remarks inline.
One additional question: I don't see the same pattern being applied to deployment… is that on purpose? (Maybe that is because the deployment action is "embedded" in start_node()).
Graham Binns (gmb) wrote:
> Looks good. Of course we need to extensively test this (I guess you've done
> some of it already).
>
> Couple of remarks inline.
>
> One additional question: I don't see the same pattern being applied to
> deployment… is that on purpose? (Maybe that is because the deployment action
> is "embedded" in start_node()).
Exactly. And although start_nodes() still needs some love, I don't think we're going to get it in this cycle.
Preview Diff
1 | === modified file 'src/maasserver/models/node.py' |
2 | --- src/maasserver/models/node.py 2014-10-09 18:47:50 +0000 |
3 | +++ src/maasserver/models/node.py 2014-10-10 09:02:08 +0000 |
4 | @@ -945,13 +945,34 @@ |
5 | |
6 | commissioning_user_data = generate_user_data(node=self) |
7 | NodeResult.objects.clear_results(self) |
8 | - self.status = NODE_STATUS.COMMISSIONING |
9 | - self.save() |
10 | # The commissioning profile is handled in start_nodes. |
11 | maaslog.info( |
12 | "%s: Starting commissioning", self.hostname) |
13 | - Node.objects.start_nodes( |
14 | - [self.system_id], user, user_data=commissioning_user_data) |
15 | + # We need to mark the node as COMMISSIONING now to avoid a race |
16 | + # when starting multiple nodes. We hang on to old_status just in |
17 | + # case the power action fails. |
18 | + old_status = self.status |
19 | + self.status = NODE_STATUS.COMMISSIONING |
20 | + self.save() |
21 | + transaction.commit() |
22 | + try: |
23 | + # We don't check for which nodes we've started here, because |
24 | + # it's possible we can't start the node - its power type may not |
25 | + # allow us to do that. |
26 | + Node.objects.start_nodes( |
27 | + [self.system_id], user, user_data=commissioning_user_data) |
28 | + except Exception as ex: |
29 | + maaslog.error( |
30 | + "%s: Unable to start node: %s", |
31 | + self.hostname, unicode(ex)) |
32 | + self.status = old_status |
33 | + self.save() |
34 | + transaction.commit() |
35 | + # Let the exception bubble up, since the UI or API will have to |
36 | + # deal with it. |
37 | + raise |
38 | + else: |
39 | + maaslog.info("%s: Commissioning started", self.hostname) |
40 | |
41 | def abort_commissioning(self, user): |
42 | """Power off a commissioning node and set its status to 'declared'.""" |
43 | @@ -962,10 +983,20 @@ |
44 | % (self.system_id, NODE_STATUS_CHOICES_DICT[self.status])) |
45 | maaslog.info( |
46 | "%s: Aborting commissioning", self.hostname) |
47 | - stopped_node = Node.objects.stop_nodes([self.system_id], user) |
48 | - if len(stopped_node) == 1: |
49 | + try: |
50 | + # We don't check for which nodes we've stopped here, because |
51 | + # it's possible we can't stop the node - its power type may |
52 | + # not allow us to do that. |
53 | + Node.objects.stop_nodes([self.system_id], user) |
54 | + except Exception as ex: |
55 | + maaslog.error( |
56 | + "%s: Unable to shut node down: %s", |
57 | + self.hostname, unicode(ex)) |
58 | + raise |
59 | + else: |
60 | self.status = NODE_STATUS.NEW |
61 | self.save() |
62 | + maaslog.info("%s: Commissioning aborted", self.hostname) |
63 | |
64 | def delete(self): |
65 | """Delete this node. |
66 | @@ -1227,12 +1258,31 @@ |
67 | from metadataserver.user_data.disk_erasing import generate_user_data |
68 | |
69 | disk_erase_user_data = generate_user_data(node=self) |
70 | + maaslog.info( |
71 | + "%s: Starting disk erasure", self.hostname) |
72 | + # Change the status of the node now to avoid races when starting |
73 | + # nodes in bulk. |
74 | self.status = NODE_STATUS.DISK_ERASING |
75 | self.save() |
76 | - maaslog.info( |
77 | - "%s: Starting disk erasing", self.hostname) |
78 | - Node.objects.start_nodes( |
79 | - [self.system_id], user, user_data=disk_erase_user_data) |
80 | + transaction.commit() |
81 | + try: |
82 | + Node.objects.start_nodes( |
83 | + [self.system_id], user, user_data=disk_erase_user_data) |
84 | + except Exception as ex: |
85 | + maaslog.error( |
86 | + "%s: Unable to start node: %s", |
87 | + self.hostname, unicode(ex)) |
88 | + # We always mark the node as failed here, although we could |
89 | + # potentially move it back to the state it was in |
90 | + # previously. For now, though, this is safer, since it marks |
91 | + # the node as needing attention. |
92 | + self.status = NODE_STATUS.FAILED_DISK_ERASING |
93 | + self.save() |
94 | + transaction.commit() |
95 | + raise |
96 | + else: |
97 | + maaslog.info( |
98 | + "%s: Disk erasure started.", self.hostname) |
99 | |
100 | def abort_disk_erasing(self, user): |
101 | """ |
102 | @@ -1246,8 +1296,14 @@ |
103 | % (self.system_id, NODE_STATUS_CHOICES_DICT[self.status])) |
104 | maaslog.info( |
105 | "%s: Aborting disk erasing", self.hostname) |
106 | - stopped_node = Node.objects.stop_nodes([self.system_id], user) |
107 | - if len(stopped_node) == 1: |
108 | + try: |
109 | + Node.objects.stop_nodes([self.system_id], user) |
110 | + except Exception as ex: |
111 | + maaslog.error( |
112 | + "%s: Unable to shut node down: %s", |
113 | + self.hostname, unicode(ex)) |
114 | + raise |
115 | + else: |
116 | self.status = NODE_STATUS.FAILED_DISK_ERASING |
117 | self.save() |
118 | |
119 | @@ -1270,7 +1326,14 @@ |
120 | :raises MultipleFailures: If host maps cannot be deleted. |
121 | """ |
122 | maaslog.info("%s: Releasing node", self.hostname) |
123 | - Node.objects.stop_nodes([self.system_id], self.owner) |
124 | + try: |
125 | + Node.objects.stop_nodes([self.system_id], self.owner) |
126 | + except Exception as ex: |
127 | + maaslog.error( |
128 | + "%s: Unable to shut node down: %s", self.hostname, |
129 | + unicode(ex)) |
130 | + raise |
131 | + |
132 | deallocated_ips = StaticIPAddress.objects.deallocate_by_node(self) |
133 | self.delete_host_maps(deallocated_ips) |
134 | from maasserver.dns.config import change_dns_zones |
135 | |
136 | === modified file 'src/maasserver/models/tests/test_node.py' |
137 | --- src/maasserver/models/tests/test_node.py 2014-10-09 23:35:05 +0000 |
138 | +++ src/maasserver/models/tests/test_node.py 2014-10-10 09:02:08 +0000 |
139 | @@ -23,6 +23,7 @@ |
140 | |
141 | import crochet |
142 | from django.core.exceptions import ValidationError |
143 | +from django.db import transaction |
144 | from maasserver import preseed as preseed_module |
145 | from maasserver.clusterrpc.power_parameters import get_power_types |
146 | from maasserver.clusterrpc.testing.boot_images import make_rpc_boot_image |
147 | @@ -59,6 +60,7 @@ |
148 | StaticIPAddressManager, |
149 | ) |
150 | from maasserver.models.user import create_auth_token |
151 | +from maasserver.node_action import RPC_EXCEPTIONS |
152 | from maasserver.node_status import ( |
153 | get_failed_status, |
154 | MONITORED_STATUSES, |
155 | @@ -82,6 +84,7 @@ |
156 | from maastesting.matchers import ( |
157 | MockAnyCall, |
158 | MockCalledOnceWith, |
159 | + MockCallsMatch, |
160 | MockNotCalled, |
161 | ) |
162 | from maastesting.testcase import MAASTestCase |
163 | @@ -91,9 +94,13 @@ |
164 | NodeResult, |
165 | NodeUserData, |
166 | ) |
167 | -from metadataserver.user_data import commissioning |
168 | +from metadataserver.user_data import ( |
169 | + commissioning, |
170 | + disk_erasing, |
171 | + ) |
172 | from mock import ( |
173 | ANY, |
174 | + call, |
175 | Mock, |
176 | sentinel, |
177 | ) |
178 | @@ -101,7 +108,10 @@ |
179 | from provisioningserver.power_schema import JSON_POWER_TYPE_PARAMETERS |
180 | from provisioningserver.rpc import cluster as cluster_module |
181 | from provisioningserver.rpc.cluster import StartMonitors |
182 | -from provisioningserver.rpc.exceptions import MultipleFailures |
183 | +from provisioningserver.rpc.exceptions import ( |
184 | + MultipleFailures, |
185 | + NoConnectionsAvailable, |
186 | + ) |
187 | from provisioningserver.rpc.power import QUERY_POWER_TYPES |
188 | from provisioningserver.rpc.testing import ( |
189 | always_succeed_with, |
190 | @@ -743,6 +753,65 @@ |
191 | self.assertThat(stop_nodes, MockCalledOnceWith( |
192 | [node.system_id], owner)) |
193 | |
194 | + def test_start_disk_erasing_reverts_to_sane_state_on_error(self): |
195 | + # If start_disk_erasing encounters an error when calling |
196 | + # start_nodes(), it will transition the node to a sane state. |
197 | + # Failures encountered in one call to start_disk_erasing() won't |
198 | + # affect subsequent calls. |
199 | + admin = factory.make_admin() |
200 | + nodes = [ |
201 | + factory.make_Node( |
202 | + status=NODE_STATUS.ALLOCATED, power_type="virsh") |
203 | + for _ in range(3) |
204 | + ] |
205 | + generate_user_data = self.patch(disk_erasing, 'generate_user_data') |
206 | + start_nodes = self.patch(Node.objects, 'start_nodes') |
207 | + start_nodes.side_effect = [ |
208 | + None, |
209 | + MultipleFailures( |
210 | + Failure(NoConnectionsAvailable())), |
211 | + None, |
212 | + ] |
213 | + |
214 | + with transaction.atomic(): |
215 | + for node in nodes: |
216 | + try: |
217 | + node.start_disk_erasing(admin) |
218 | + except RPC_EXCEPTIONS: |
219 | + # Suppress all the expected errors coming out of |
220 | + # start_disk_erasing() because they're tested |
221 | + # elsewhere. |
222 | + pass |
223 | + |
224 | + expected_calls = ( |
225 | + call( |
226 | + [node.system_id], admin, |
227 | + user_data=generate_user_data.return_value) |
228 | + for node in nodes) |
229 | + self.assertThat( |
230 | + start_nodes, MockCallsMatch(*expected_calls)) |
231 | + self.assertEqual( |
232 | + [ |
233 | + NODE_STATUS.DISK_ERASING, |
234 | + NODE_STATUS.FAILED_DISK_ERASING, |
235 | + NODE_STATUS.DISK_ERASING, |
236 | + ], |
237 | + [node.status for node in nodes]) |
238 | + |
239 | + def test_start_disk_erasing_logs_and_raises_errors_in_starting(self): |
240 | + admin = factory.make_admin() |
241 | + node = factory.make_Node(status=NODE_STATUS.ALLOCATED) |
242 | + maaslog = self.patch(node_module, 'maaslog') |
243 | + exception = NoConnectionsAvailable(factory.make_name()) |
244 | + self.patch(Node.objects, 'start_nodes').side_effect = exception |
245 | + self.assertRaises( |
246 | + NoConnectionsAvailable, node.start_disk_erasing, admin) |
247 | + self.assertEqual(NODE_STATUS.FAILED_DISK_ERASING, node.status) |
248 | + self.assertThat( |
249 | + maaslog.error, MockCalledOnceWith( |
250 | + "%s: Unable to start node: %s", |
251 | + node.hostname, unicode(exception))) |
252 | + |
253 | def test_abort_operation_aborts_disk_erasing(self): |
254 | agent_name = factory.make_name('agent-name') |
255 | owner = factory.make_User() |
256 | @@ -761,6 +830,63 @@ |
257 | agent_name=agent_name) |
258 | self.assertRaises(NodeStateViolation, node.abort_operation, owner) |
259 | |
260 | + def test_abort_disk_erasing_reverts_to_sane_state_on_error(self): |
261 | + # If start_disk_erasing encounters an error when calling |
262 | + # start_nodes(), it will transition the node to a sane state. |
263 | + # Failures encountered in one call to start_disk_erasing() won't |
264 | + # affect subsequent calls. |
265 | + admin = factory.make_admin() |
266 | + nodes = [ |
267 | + factory.make_Node( |
268 | + status=NODE_STATUS.DISK_ERASING, power_type="virsh") |
269 | + for _ in range(3) |
270 | + ] |
271 | + stop_nodes = self.patch(Node.objects, 'stop_nodes') |
272 | + stop_nodes.return_value = [ |
273 | + [node] for node in nodes |
274 | + ] |
275 | + stop_nodes.side_effect = [ |
276 | + None, |
277 | + MultipleFailures( |
278 | + Failure(NoConnectionsAvailable())), |
279 | + None, |
280 | + ] |
281 | + |
282 | + with transaction.atomic(): |
283 | + for node in nodes: |
284 | + try: |
285 | + node.abort_disk_erasing(admin) |
286 | + except RPC_EXCEPTIONS: |
287 | + # Suppress all the expected errors coming out of |
288 | + # abort_disk_erasing() because they're tested |
289 | + # elsewhere. |
290 | + pass |
291 | + |
292 | + self.assertThat( |
293 | + stop_nodes, MockCallsMatch( |
294 | + *(call([node.system_id], admin) for node in nodes))) |
295 | + self.assertEqual( |
296 | + [ |
297 | + NODE_STATUS.FAILED_DISK_ERASING, |
298 | + NODE_STATUS.DISK_ERASING, |
299 | + NODE_STATUS.FAILED_DISK_ERASING, |
300 | + ], |
301 | + [node.status for node in nodes]) |
302 | + |
303 | + def test_abort_disk_erasing_logs_and_raises_errors_in_stopping(self): |
304 | + admin = factory.make_admin() |
305 | + node = factory.make_Node(status=NODE_STATUS.DISK_ERASING) |
306 | + maaslog = self.patch(node_module, 'maaslog') |
307 | + exception = NoConnectionsAvailable(factory.make_name()) |
308 | + self.patch(Node.objects, 'stop_nodes').side_effect = exception |
309 | + self.assertRaises( |
310 | + NoConnectionsAvailable, node.abort_disk_erasing, admin) |
311 | + self.assertEqual(NODE_STATUS.DISK_ERASING, node.status) |
312 | + self.assertThat( |
313 | + maaslog.error, MockCalledOnceWith( |
314 | + "%s: Unable to shut node down: %s", |
315 | + node.hostname, unicode(exception))) |
316 | + |
317 | def test_release_node_that_has_power_on_and_controlled_power_type(self): |
318 | self.patch(node_module, 'wait_for_power_commands') |
319 | agent_name = factory.make_name('agent-name') |
320 | @@ -1080,6 +1206,57 @@ |
321 | node.release() |
322 | self.assertThat(change_dns_zones, MockCalledOnceWith([node.nodegroup])) |
323 | |
324 | + def test_release_logs_and_raises_errors_in_stopping(self): |
325 | + node = factory.make_Node(status=NODE_STATUS.DEPLOYED) |
326 | + maaslog = self.patch(node_module, 'maaslog') |
327 | + exception = NoConnectionsAvailable(factory.make_name()) |
328 | + self.patch(Node.objects, 'stop_nodes').side_effect = exception |
329 | + self.assertRaises(NoConnectionsAvailable, node.release) |
330 | + self.assertEqual(NODE_STATUS.DEPLOYED, node.status) |
331 | + self.assertThat( |
332 | + maaslog.error, MockCalledOnceWith( |
333 | + "%s: Unable to shut node down: %s", |
334 | + node.hostname, unicode(exception))) |
335 | + |
336 | + def test_release_reverts_to_sane_state_on_error(self): |
337 | + # If release() encounters an error when stopping the node, it |
338 | + # will leave the node in its previous state (i.e. DEPLOYED). |
339 | + nodes = [ |
340 | + factory.make_Node( |
341 | + status=NODE_STATUS.DEPLOYED, power_type="virsh") |
342 | + for _ in range(3) |
343 | + ] |
344 | + stop_nodes = self.patch(Node.objects, 'stop_nodes') |
345 | + stop_nodes.return_value = [ |
346 | + [node] for node in nodes |
347 | + ] |
348 | + stop_nodes.side_effect = [ |
349 | + None, |
350 | + MultipleFailures( |
351 | + Failure(NoConnectionsAvailable())), |
352 | + None, |
353 | + ] |
354 | + |
355 | + with transaction.atomic(): |
356 | + for node in nodes: |
357 | + try: |
358 | + node.release() |
359 | + except RPC_EXCEPTIONS: |
360 | + # Suppress all expected errors; we test for them |
361 | + # elsewhere. |
362 | + pass |
363 | + |
364 | + self.assertThat( |
365 | + stop_nodes, MockCallsMatch( |
366 | + *(call([node.system_id], None) for node in nodes))) |
367 | + self.assertEqual( |
368 | + [ |
369 | + NODE_STATUS.RELEASING, |
370 | + NODE_STATUS.DEPLOYED, |
371 | + NODE_STATUS.RELEASING, |
372 | + ], |
373 | + [node.status for node in nodes]) |
374 | + |
375 | def test_accept_enlistment_gets_node_out_of_declared_state(self): |
376 | # If called on a node in New state, accept_enlistment() |
377 | # changes the node's status, and returns the node. |
378 | @@ -1141,10 +1318,10 @@ |
379 | {status: node.status for status, node in nodes.items()}) |
380 | |
381 | def test_start_commissioning_changes_status_and_starts_node(self): |
382 | - start_nodes = self.patch(Node.objects, "start_nodes") |
383 | - |
384 | node = factory.make_Node( |
385 | status=NODE_STATUS.NEW, power_type='ether_wake') |
386 | + start_nodes = self.patch(Node.objects, "start_nodes") |
387 | + start_nodes.return_value = [node] |
388 | factory.make_MACAddress(node=node) |
389 | admin = factory.make_admin() |
390 | node.start_commissioning(admin) |
391 | @@ -1192,6 +1369,117 @@ |
392 | self.assertEqual( |
393 | data, NodeResult.objects.get_data(node, filename)) |
394 | |
395 | + def test_start_commissioning_reverts_to_sane_state_on_error(self): |
396 | + # When start_commissioning encounters an error when trying to |
397 | + # start the node, it will revert the node to its previous |
398 | + # status. |
399 | + admin = factory.make_admin() |
400 | + nodes = [ |
401 | + factory.make_Node(status=NODE_STATUS.NEW, power_type="ether_wake") |
402 | + for _ in range(3) |
403 | + ] |
404 | + generate_user_data = self.patch(commissioning, 'generate_user_data') |
405 | + start_nodes = self.patch(Node.objects, 'start_nodes') |
406 | + start_nodes.side_effect = [ |
407 | + None, |
408 | + MultipleFailures( |
409 | + Failure(NoConnectionsAvailable())), |
410 | + None, |
411 | + ] |
412 | + |
413 | + with transaction.atomic(): |
414 | + for node in nodes: |
415 | + try: |
416 | + node.start_commissioning(admin) |
417 | + except RPC_EXCEPTIONS: |
418 | + # Suppress all expected errors; we test for them |
419 | + # elsewhere. |
420 | + pass |
421 | + |
422 | + expected_calls = ( |
423 | + call( |
424 | + [node.system_id], admin, |
425 | + user_data=generate_user_data.return_value) |
426 | + for node in nodes) |
427 | + self.assertThat( |
428 | + start_nodes, MockCallsMatch(*expected_calls)) |
429 | + self.assertEqual( |
430 | + [ |
431 | + NODE_STATUS.COMMISSIONING, |
432 | + NODE_STATUS.NEW, |
433 | + NODE_STATUS.COMMISSIONING |
434 | + ], |
435 | + [node.status for node in nodes]) |
436 | + |
437 | + def test_start_commissioning_logs_and_raises_errors_in_starting(self): |
438 | + admin = factory.make_admin() |
439 | + node = factory.make_Node(status=NODE_STATUS.NEW) |
440 | + maaslog = self.patch(node_module, 'maaslog') |
441 | + exception = NoConnectionsAvailable(factory.make_name()) |
442 | + self.patch(Node.objects, 'start_nodes').side_effect = exception |
443 | + self.assertRaises( |
444 | + NoConnectionsAvailable, node.start_commissioning, admin) |
445 | + self.assertEqual(NODE_STATUS.NEW, node.status) |
446 | + self.assertThat( |
447 | + maaslog.error, MockCalledOnceWith( |
448 | + "%s: Unable to start node: %s", |
449 | + node.hostname, unicode(exception))) |
450 | + |
451 | + def test_abort_commissioning_reverts_to_sane_state_on_error(self): |
452 | + # If abort commissioning hits an error when trying to stop the |
453 | + # node, it will revert the node to the state it was in before |
454 | + # abort_commissioning() was called. |
455 | + admin = factory.make_admin() |
456 | + nodes = [ |
457 | + factory.make_Node( |
458 | + status=NODE_STATUS.COMMISSIONING, power_type="virsh") |
459 | + for _ in range(3) |
460 | + ] |
461 | + stop_nodes = self.patch(Node.objects, 'stop_nodes') |
462 | + stop_nodes.return_value = [ |
463 | + [node] for node in nodes |
464 | + ] |
465 | + stop_nodes.side_effect = [ |
466 | + None, |
467 | + MultipleFailures( |
468 | + Failure(NoConnectionsAvailable())), |
469 | + None, |
470 | + ] |
471 | + |
472 | + with transaction.atomic(): |
473 | + for node in nodes: |
474 | + try: |
475 | + node.abort_commissioning(admin) |
476 | + except RPC_EXCEPTIONS: |
477 | + # Suppress all expected errors; we test for them |
478 | + # elsewhere. |
479 | + pass |
480 | + |
481 | + self.assertThat( |
482 | + stop_nodes, MockCallsMatch( |
483 | + *(call([node.system_id], admin) for node in nodes))) |
484 | + self.assertEqual( |
485 | + [ |
486 | + NODE_STATUS.NEW, |
487 | + NODE_STATUS.COMMISSIONING, |
488 | + NODE_STATUS.NEW, |
489 | + ], |
490 | + [node.status for node in nodes]) |
491 | + |
492 | + def test_abort_commissioning_logs_and_raises_errors_in_stopping(self): |
493 | + admin = factory.make_admin() |
494 | + node = factory.make_Node(status=NODE_STATUS.COMMISSIONING) |
495 | + maaslog = self.patch(node_module, 'maaslog') |
496 | + exception = NoConnectionsAvailable(factory.make_name()) |
497 | + self.patch(Node.objects, 'stop_nodes').side_effect = exception |
498 | + self.assertRaises( |
499 | + NoConnectionsAvailable, node.abort_commissioning, admin) |
500 | + self.assertEqual(NODE_STATUS.COMMISSIONING, node.status) |
501 | + self.assertThat( |
502 | + maaslog.error, MockCalledOnceWith( |
503 | + "%s: Unable to shut node down: %s", |
504 | + node.hostname, unicode(exception))) |
505 | + |
506 | def test_abort_commissioning_changes_status_and_stops_node(self): |
507 | node = factory.make_Node( |
508 | status=NODE_STATUS.COMMISSIONING, power_type='virsh') |
It's good, but I think there's a race in there that we need to fix. It's not your fault, and this branch doesn't make it worse, but it's still not the right approach.