Merge lp:~gmb/maas/catch-errors-in-power-callsites into lp:~maas-committers/maas/trunk

Proposed by Graham Binns
Status: Merged
Approved by: Graham Binns
Approved revision: no longer in the source branch.
Merged at revision: 3228
Proposed branch: lp:~gmb/maas/catch-errors-in-power-callsites
Merge into: lp:~maas-committers/maas/trunk
Diff against target: 508 lines (+368/-17)
2 files modified
src/maasserver/models/node.py (+76/-13)
src/maasserver/models/tests/test_node.py (+292/-4)
To merge this branch: bzr merge lp:~gmb/maas/catch-errors-in-power-callsites
Reviewer Review Type Date Requested Status
Raphaël Badin (community) Approve
Gavin Panella (community) Approve
Review via email: mp+237619@code.launchpad.net

Commit message

Refactor call sites of Node.objects.(start|stop)_nodes() so that they:
 - Handle errors raised by those methods.
 - Ensure that model changes are committed before power actions are sent
   (when starting nodes) and after power actions are sent (when stopping
   nodes). This avoids race conditions when carrying out actions in bulk
   (see the sketch below).
 - Explicitly roll the node back to a sane status (either its old status
   or a suitable FAILED status) when errors occur in start_nodes().
 - Re-raise those errors once they've been handled.
 - Leave the node's state unchanged if an error occurs while starting or
   stopping it.
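
In outline, the start-side call sites now follow this shape (condensed from the start_commissioning() change in the diff below; start_disk_erasing() differs only in the status values used):

    old_status = self.status
    self.status = NODE_STATUS.COMMISSIONING
    self.save()
    # Commit now so that a node booting immediately sees its new
    # status; the request's transaction won't commit until much later.
    transaction.commit()
    try:
        Node.objects.start_nodes(
            [self.system_id], user, user_data=commissioning_user_data)
    except Exception as ex:
        maaslog.error(
            "%s: Unable to start node: %s", self.hostname, unicode(ex))
        # Manual rollback: restore and commit the old status.
        self.status = old_status
        self.save()
        transaction.commit()
        raise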

This is a further fix for bug 1375980.

I've also removed the check for len(nodes_(started|stopped)) > 0 in the methods that used it. This check made no sense: there are power types which we just *can't* power off (ether_wake, for example), so a node with such a power type would never pass the check and would be stuck in whatever state it was in. The removed pattern is shown below.
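
For reference, this is the removed pattern (condensed from abort_commissioning() before this branch):

    stopped_node = Node.objects.stop_nodes([self.system_id], user)
    if len(stopped_node) == 1:
        self.status = NODE_STATUS.NEW
        self.save()

An ether_wake node never appears in stop_nodes()'s return value, so its status would never be updated.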

I've also added tests that check the atomicity of the call sites and their logging and re-raising of exceptions.

Revision history for this message
Gavin Panella (allenap) wrote :

It's good, but I think there's a race in there that we need to fix. It's not your fault, and this branch doesn't make it worse, but it's still not the right approach.

review: Needs Fixing
Revision history for this message
Graham Binns (gmb) wrote :

On 8 October 2014 18:05, Gavin Panella <email address hidden> wrote:
> The new node status needs to be committed before we try to start the node. If there's an error we need to reset the status and commit again.
>
> However, you may have some fun getting Django to let you commit when it's in one of its special "atomic" blocks. Where "fun" is not the word I would choose.

Can you help me understand why the new status needs to be committed?…
Ah, wait, because it's possible that two users could try to commission
the node at the same time? Or is it something else entirely?

I tried using atomic() and even savepoint() but couldn't get them to
work, before I realised that I could do things this way… The only
other option is to do a manual rollback – i.e. on error, set the node
status back to whatever it was. Would that be okay with you?

Revision history for this message
Gavin Panella (allenap) wrote :

> On 8 October 2014 18:05, Gavin Panella <email address hidden> wrote:
> > The new node status needs to be committed before we try to start the
> > node. If there's an error we need to reset the status and commit
> > again.
> >
> > However, you may have some fun getting Django to let you commit when
> > it's in one of its special "atomic" blocks. Where "fun" is not the
> > word I would choose.
>
> Can you help me understand why the new status needs to be committed?…
> Ah, wait, because it's possible that two users could try to commission
> the node at the same time? Or is it something else entirely?

I was thinking about a node coming up and trying to boot from TFTP
before we've committed its status to the database. For a single node I
doubt it'll be a problem because, in most cases, it'll be a fraction of
a second between the PowerOn call returning and the status being saved.

However, when starting multiple nodes it could be longer because the
transaction doesn't get committed until the end of the web/API request.
If a node does its first TFTP request within that window then it'll be
given boot instructions based on its old status.

We use TransactionMiddleware, which has been deprecated, but our
slowness to migrate off it may be a blessing: it /may/ allow us to call
transaction.commit() in the middle of a request. If we were using
ATOMIC_REQUESTS, Django's replacement for TransactionMiddleware, we
would definitely not be able to commit in the middle of a request.

> I tried using atomic() and even savepoint() but couldn't get them to
> work, before I realised that I could do things this way… The only
> other option is to do a manual rollback – i.e. on error, set the node
> status back to whatever it was. Would that be okay with you?

Yeah, it'll have to be a manual rollback because we need to have that
status committed.

I remember we've talked about some of this stuff before and you talked
about manual rollbacks. If I misunderstood and gave bad advice, I'm
sorry, and hopefully I'm thinking straight now.
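
In code terms, the difference is roughly this (a sketch, assuming
Django 1.6-era transaction handling):

    from django.db import transaction

    node.status = NODE_STATUS.COMMISSIONING
    node.save()
    # Under the deprecated TransactionMiddleware this commits the
    # status change mid-request, making it visible to concurrent
    # requests such as the node's first TFTP hit.
    transaction.commit()
    # Under ATOMIC_REQUESTS the whole request runs inside atomic(),
    # and the same call raises TransactionManagementError instead.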

Revision history for this message
Graham Binns (gmb) wrote :

On 8 October 2014 20:56, Gavin Panella <email address hidden> wrote:
> I was thinking about a node coming up and trying to boot from TFTP
> before we've committed its status to the database. For a single node I
> doubt it'll be a problem because, in most cases, it'll be a fraction of
> a second between the PowerOn call returning and the status being saved.
>
> However, when starting multiple nodes it could be longer because the
> transaction doesn't get committed until the end of the web/API request.
> If a node does its first TFTP request within that window then it'll be
> given boot instructions based on its old status.
>
> We use TransactionMiddleware, which has been deprecated, but our
> slowness to migrate off it may be a blessing: it /may/ allow us to call
> transaction.commit() in the middle of a request. If we were using
> ATOMIC_REQUESTS, Django's replacement for TransactionMiddleware, we
> would definitely not be able to commit in the middle of a request.

Aah, okay. Now I'm with you (on both the problem and the
why-won't-you-be-atomic() fronts).

>> I tried using atomic() and even savepoint() but couldn't get them to
>> work, before I realised that I could do things this way… The only
>> other option is to do a manual rollback – i.e. on error, set the node
>> status back to whatever it was. Would that be okay with you?
>
> Yeah, it'll have to be a manual rollback because we need to have that
> status committed.
>
> I remember we've talked about some of this stuff before and you talked
> about manual rollbacks. If I misunderstood and gave bad advice, I'm
> sorry, and hopefully I'm thinking straight now.

Okay, manual rollback it is. That's relatively easy, actually. And you
didn't give bad advice — you gave really good advice that turned out
not to work properly; no fault of yours.

Revision history for this message
Gavin Panella (allenap) wrote :

Looks good. I'm still a bit uncomfortable about suppress-all-errors in the tests because it might be hiding something important. Can you narrow it down to just the error or errors you expect?

review: Approve
Revision history for this message
Graham Binns (gmb) wrote :

On 9 October 2014 14:09, Gavin Panella <email address hidden> wrote:
>
> Looks good. I'm still a bit uncomfortable about suppress-all-errors in the tests because it might be hiding something important. Can you narrow it down to just the error or errors you expect?

Sure.
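
For the record, the narrowed form that landed (see the tests in the
diff below) suppresses only the RPC errors we expect, rather than a
bare except:

    try:
        node.start_disk_erasing(admin)
    except RPC_EXCEPTIONS:
        # Suppress only the expected errors; they're tested elsewhere.
        pass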

Revision history for this message
Raphaël Badin (rvb) wrote :

Looks good. Of course we need to extensively test this (I guess you've done some of it already).

Couple of remarks inline.

One additional question: I don't see the same pattern being applied to deployment… is that on purpose? (Maybe that is because the deployment action is "embedded" in start_node()).

review: Approve
Revision history for this message
Graham Binns (gmb) wrote :

> Looks good. Of course we need to extensively test this (I guess you've done
> some of it already).
>
> Couple of remarks inline.
>
> One additional question: I don't see the same pattern being applied to
> deployment… is that on purpose? (Maybe that is because the deployment action
> is "embedded" in start_node()).

Exactly. And although start_nodes() still needs some love, I don't think we're going to get to it in this cycle.

Preview Diff

=== modified file 'src/maasserver/models/node.py'
--- src/maasserver/models/node.py 2014-10-09 18:47:50 +0000
+++ src/maasserver/models/node.py 2014-10-10 09:02:08 +0000
@@ -945,13 +945,34 @@
 
         commissioning_user_data = generate_user_data(node=self)
         NodeResult.objects.clear_results(self)
-        self.status = NODE_STATUS.COMMISSIONING
-        self.save()
         # The commissioning profile is handled in start_nodes.
         maaslog.info(
             "%s: Starting commissioning", self.hostname)
-        Node.objects.start_nodes(
-            [self.system_id], user, user_data=commissioning_user_data)
+        # We need to mark the node as COMMISSIONING now to avoid a race
+        # when starting multiple nodes. We hang on to old_status just in
+        # case the power action fails.
+        old_status = self.status
+        self.status = NODE_STATUS.COMMISSIONING
+        self.save()
+        transaction.commit()
+        try:
+            # We don't check for which nodes we've started here, because
+            # it's possible we can't start the node - its power type may not
+            # allow us to do that.
+            Node.objects.start_nodes(
+                [self.system_id], user, user_data=commissioning_user_data)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to start node: %s",
+                self.hostname, unicode(ex))
+            self.status = old_status
+            self.save()
+            transaction.commit()
+            # Let the exception bubble up, since the UI or API will have to
+            # deal with it.
+            raise
+        else:
+            maaslog.info("%s: Commissioning started", self.hostname)
 
     def abort_commissioning(self, user):
         """Power off a commissioning node and set its status to 'declared'."""
@@ -962,10 +983,20 @@
                 % (self.system_id, NODE_STATUS_CHOICES_DICT[self.status]))
         maaslog.info(
             "%s: Aborting commissioning", self.hostname)
-        stopped_node = Node.objects.stop_nodes([self.system_id], user)
-        if len(stopped_node) == 1:
+        try:
+            # We don't check for which nodes we've stopped here, because
+            # it's possible we can't stop the node - its power type may
+            # not allow us to do that.
+            Node.objects.stop_nodes([self.system_id], user)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to shut node down: %s",
+                self.hostname, unicode(ex))
+            raise
+        else:
             self.status = NODE_STATUS.NEW
             self.save()
+            maaslog.info("%s: Commissioning aborted", self.hostname)
 
     def delete(self):
         """Delete this node.
@@ -1227,12 +1258,31 @@
         from metadataserver.user_data.disk_erasing import generate_user_data
 
         disk_erase_user_data = generate_user_data(node=self)
+        maaslog.info(
+            "%s: Starting disk erasure", self.hostname)
+        # Change the status of the node now to avoid races when starting
+        # nodes in bulk.
         self.status = NODE_STATUS.DISK_ERASING
         self.save()
-        maaslog.info(
-            "%s: Starting disk erasing", self.hostname)
-        Node.objects.start_nodes(
-            [self.system_id], user, user_data=disk_erase_user_data)
+        transaction.commit()
+        try:
+            Node.objects.start_nodes(
+                [self.system_id], user, user_data=disk_erase_user_data)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to start node: %s",
+                self.hostname, unicode(ex))
+            # We always mark the node as failed here, although we could
+            # potentially move it back to the state it was in
+            # previously. For now, though, this is safer, since it marks
+            # the node as needing attention.
+            self.status = NODE_STATUS.FAILED_DISK_ERASING
+            self.save()
+            transaction.commit()
+            raise
+        else:
+            maaslog.info(
+                "%s: Disk erasure started.", self.hostname)
 
     def abort_disk_erasing(self, user):
         """
@@ -1246,8 +1296,14 @@
                 % (self.system_id, NODE_STATUS_CHOICES_DICT[self.status]))
         maaslog.info(
             "%s: Aborting disk erasing", self.hostname)
-        stopped_node = Node.objects.stop_nodes([self.system_id], user)
-        if len(stopped_node) == 1:
+        try:
+            Node.objects.stop_nodes([self.system_id], user)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to shut node down: %s",
+                self.hostname, unicode(ex))
+            raise
+        else:
             self.status = NODE_STATUS.FAILED_DISK_ERASING
             self.save()
 
@@ -1270,7 +1326,14 @@
         :raises MultipleFailures: If host maps cannot be deleted.
         """
         maaslog.info("%s: Releasing node", self.hostname)
-        Node.objects.stop_nodes([self.system_id], self.owner)
+        try:
+            Node.objects.stop_nodes([self.system_id], self.owner)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to shut node down: %s", self.hostname,
+                unicode(ex))
+            raise
+
         deallocated_ips = StaticIPAddress.objects.deallocate_by_node(self)
         self.delete_host_maps(deallocated_ips)
         from maasserver.dns.config import change_dns_zones

=== modified file 'src/maasserver/models/tests/test_node.py'
--- src/maasserver/models/tests/test_node.py 2014-10-09 23:35:05 +0000
+++ src/maasserver/models/tests/test_node.py 2014-10-10 09:02:08 +0000
@@ -23,6 +23,7 @@
 
 import crochet
 from django.core.exceptions import ValidationError
+from django.db import transaction
 from maasserver import preseed as preseed_module
 from maasserver.clusterrpc.power_parameters import get_power_types
 from maasserver.clusterrpc.testing.boot_images import make_rpc_boot_image
@@ -59,6 +60,7 @@
     StaticIPAddressManager,
     )
 from maasserver.models.user import create_auth_token
+from maasserver.node_action import RPC_EXCEPTIONS
 from maasserver.node_status import (
     get_failed_status,
     MONITORED_STATUSES,
@@ -82,6 +84,7 @@
 from maastesting.matchers import (
     MockAnyCall,
     MockCalledOnceWith,
+    MockCallsMatch,
     MockNotCalled,
     )
 from maastesting.testcase import MAASTestCase
@@ -91,9 +94,13 @@
     NodeResult,
     NodeUserData,
     )
-from metadataserver.user_data import commissioning
+from metadataserver.user_data import (
+    commissioning,
+    disk_erasing,
+    )
 from mock import (
     ANY,
+    call,
     Mock,
     sentinel,
     )
@@ -101,7 +108,10 @@
 from provisioningserver.power_schema import JSON_POWER_TYPE_PARAMETERS
 from provisioningserver.rpc import cluster as cluster_module
 from provisioningserver.rpc.cluster import StartMonitors
-from provisioningserver.rpc.exceptions import MultipleFailures
+from provisioningserver.rpc.exceptions import (
+    MultipleFailures,
+    NoConnectionsAvailable,
+    )
 from provisioningserver.rpc.power import QUERY_POWER_TYPES
 from provisioningserver.rpc.testing import (
     always_succeed_with,
@@ -743,6 +753,65 @@
         self.assertThat(stop_nodes, MockCalledOnceWith(
             [node.system_id], owner))
 
+    def test_start_disk_erasing_reverts_to_sane_state_on_error(self):
+        # If start_disk_erasing encounters an error when calling
+        # start_nodes(), it will transition the node to a sane state.
+        # Failures encountered in one call to start_disk_erasing() won't
+        # affect subsequent calls.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.ALLOCATED, power_type="virsh")
+            for _ in range(3)
+            ]
+        generate_user_data = self.patch(disk_erasing, 'generate_user_data')
+        start_nodes = self.patch(Node.objects, 'start_nodes')
+        start_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.start_disk_erasing(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all the expected errors coming out of
+                    # start_disk_erasing() because they're tested
+                    # elsewhere.
+                    pass
+
+        expected_calls = (
+            call(
+                [node.system_id], admin,
+                user_data=generate_user_data.return_value)
+            for node in nodes)
+        self.assertThat(
+            start_nodes, MockCallsMatch(*expected_calls))
+        self.assertEqual(
+            [
+                NODE_STATUS.DISK_ERASING,
+                NODE_STATUS.FAILED_DISK_ERASING,
+                NODE_STATUS.DISK_ERASING,
+                ],
+            [node.status for node in nodes])
+
+    def test_start_disk_erasing_logs_and_raises_errors_in_starting(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.ALLOCATED)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'start_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.start_disk_erasing, admin)
+        self.assertEqual(NODE_STATUS.FAILED_DISK_ERASING, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to start node: %s",
+                node.hostname, unicode(exception)))
+
     def test_abort_operation_aborts_disk_erasing(self):
         agent_name = factory.make_name('agent-name')
         owner = factory.make_User()
@@ -761,6 +830,63 @@
             agent_name=agent_name)
         self.assertRaises(NodeStateViolation, node.abort_operation, owner)
 
+    def test_abort_disk_erasing_reverts_to_sane_state_on_error(self):
+        # If abort_disk_erasing encounters an error when calling
+        # stop_nodes(), it will transition the node to a sane state.
+        # Failures encountered in one call to abort_disk_erasing() won't
+        # affect subsequent calls.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.DISK_ERASING, power_type="virsh")
+            for _ in range(3)
+            ]
+        stop_nodes = self.patch(Node.objects, 'stop_nodes')
+        stop_nodes.return_value = [
+            [node] for node in nodes
+            ]
+        stop_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.abort_disk_erasing(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all the expected errors coming out of
+                    # abort_disk_erasing() because they're tested
+                    # elsewhere.
+                    pass
+
+        self.assertThat(
+            stop_nodes, MockCallsMatch(
+                *(call([node.system_id], admin) for node in nodes)))
+        self.assertEqual(
+            [
+                NODE_STATUS.FAILED_DISK_ERASING,
+                NODE_STATUS.DISK_ERASING,
+                NODE_STATUS.FAILED_DISK_ERASING,
+                ],
+            [node.status for node in nodes])
+
+    def test_abort_disk_erasing_logs_and_raises_errors_in_stopping(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.DISK_ERASING)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'stop_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.abort_disk_erasing, admin)
+        self.assertEqual(NODE_STATUS.DISK_ERASING, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to shut node down: %s",
+                node.hostname, unicode(exception)))
+
     def test_release_node_that_has_power_on_and_controlled_power_type(self):
         self.patch(node_module, 'wait_for_power_commands')
         agent_name = factory.make_name('agent-name')
@@ -1080,6 +1206,57 @@
         node.release()
         self.assertThat(change_dns_zones, MockCalledOnceWith([node.nodegroup]))
 
+    def test_release_logs_and_raises_errors_in_stopping(self):
+        node = factory.make_Node(status=NODE_STATUS.DEPLOYED)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'stop_nodes').side_effect = exception
+        self.assertRaises(NoConnectionsAvailable, node.release)
+        self.assertEqual(NODE_STATUS.DEPLOYED, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to shut node down: %s",
+                node.hostname, unicode(exception)))
+
+    def test_release_reverts_to_sane_state_on_error(self):
+        # If release() encounters an error when stopping the node, it
+        # will leave the node in its previous state (i.e. DEPLOYED).
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.DEPLOYED, power_type="virsh")
+            for _ in range(3)
+            ]
+        stop_nodes = self.patch(Node.objects, 'stop_nodes')
+        stop_nodes.return_value = [
+            [node] for node in nodes
+            ]
+        stop_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.release()
+                except RPC_EXCEPTIONS:
+                    # Suppress all expected errors; we test for them
+                    # elsewhere.
+                    pass
+
+        self.assertThat(
+            stop_nodes, MockCallsMatch(
+                *(call([node.system_id], None) for node in nodes)))
+        self.assertEqual(
+            [
+                NODE_STATUS.RELEASING,
+                NODE_STATUS.DEPLOYED,
+                NODE_STATUS.RELEASING,
+                ],
+            [node.status for node in nodes])
+
     def test_accept_enlistment_gets_node_out_of_declared_state(self):
         # If called on a node in New state, accept_enlistment()
         # changes the node's status, and returns the node.
@@ -1141,10 +1318,10 @@
             {status: node.status for status, node in nodes.items()})
 
     def test_start_commissioning_changes_status_and_starts_node(self):
-        start_nodes = self.patch(Node.objects, "start_nodes")
-
         node = factory.make_Node(
             status=NODE_STATUS.NEW, power_type='ether_wake')
+        start_nodes = self.patch(Node.objects, "start_nodes")
+        start_nodes.return_value = [node]
         factory.make_MACAddress(node=node)
         admin = factory.make_admin()
         node.start_commissioning(admin)
@@ -1192,6 +1369,117 @@
         self.assertEqual(
             data, NodeResult.objects.get_data(node, filename))
 
+    def test_start_commissioning_reverts_to_sane_state_on_error(self):
+        # When start_commissioning encounters an error when trying to
+        # start the node, it will revert the node to its previous
+        # status.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(status=NODE_STATUS.NEW, power_type="ether_wake")
+            for _ in range(3)
+            ]
+        generate_user_data = self.patch(commissioning, 'generate_user_data')
+        start_nodes = self.patch(Node.objects, 'start_nodes')
+        start_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.start_commissioning(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all expected errors; we test for them
+                    # elsewhere.
+                    pass
+
+        expected_calls = (
+            call(
+                [node.system_id], admin,
+                user_data=generate_user_data.return_value)
+            for node in nodes)
+        self.assertThat(
+            start_nodes, MockCallsMatch(*expected_calls))
+        self.assertEqual(
+            [
+                NODE_STATUS.COMMISSIONING,
+                NODE_STATUS.NEW,
+                NODE_STATUS.COMMISSIONING,
+                ],
+            [node.status for node in nodes])
+
+    def test_start_commissioning_logs_and_raises_errors_in_starting(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.NEW)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'start_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.start_commissioning, admin)
+        self.assertEqual(NODE_STATUS.NEW, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to start node: %s",
+                node.hostname, unicode(exception)))
+
+    def test_abort_commissioning_reverts_to_sane_state_on_error(self):
+        # If abort_commissioning() hits an error when trying to stop the
+        # node, it will revert the node to the state it was in before
+        # abort_commissioning() was called.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.COMMISSIONING, power_type="virsh")
+            for _ in range(3)
+            ]
+        stop_nodes = self.patch(Node.objects, 'stop_nodes')
+        stop_nodes.return_value = [
+            [node] for node in nodes
+            ]
+        stop_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.abort_commissioning(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all expected errors; we test for them
+                    # elsewhere.
+                    pass
+
+        self.assertThat(
+            stop_nodes, MockCallsMatch(
+                *(call([node.system_id], admin) for node in nodes)))
+        self.assertEqual(
+            [
+                NODE_STATUS.NEW,
+                NODE_STATUS.COMMISSIONING,
+                NODE_STATUS.NEW,
+                ],
+            [node.status for node in nodes])
+
+    def test_abort_commissioning_logs_and_raises_errors_in_stopping(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.COMMISSIONING)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'stop_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.abort_commissioning, admin)
+        self.assertEqual(NODE_STATUS.COMMISSIONING, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to shut node down: %s",
+                node.hostname, unicode(exception)))
+
     def test_abort_commissioning_changes_status_and_stops_node(self):
         node = factory.make_Node(
             status=NODE_STATUS.COMMISSIONING, power_type='virsh')