Merge lp:~gmb/maas/catch-errors-in-power-callsites into lp:~maas-committers/maas/trunk

Proposed by Graham Binns
Status: Merged
Approved by: Graham Binns
Approved revision: no longer in the source branch.
Merged at revision: 3228
Proposed branch: lp:~gmb/maas/catch-errors-in-power-callsites
Merge into: lp:~maas-committers/maas/trunk
Diff against target: 508 lines (+368/-17)
2 files modified
src/maasserver/models/node.py (+76/-13)
src/maasserver/models/tests/test_node.py (+292/-4)
To merge this branch: bzr merge lp:~gmb/maas/catch-errors-in-power-callsites
Reviewer Review Type Date Requested Status
Raphaël Badin (community) Approve
Gavin Panella (community) Approve
Review via email: mp+237619@code.launchpad.net

Commit message

Refactor call sites of Node.objects.(start|stop)_nodes() so that they:
 - Handle errors raised by those methods.
 - Ensure that model changes are committed before power actions are sent
   (when starting nodes) and after power actions are sent (when stopping
   nodes). This avoids race conditions when carrying out actions in bulk
   (see the sketch below).
 - Explicitly roll the node back to a sane status (either its old status
   or a suitable FAILED status) when errors occur in start_nodes().
 - Re-raise those errors once they've been handled.
 - Leave the node's state unchanged if an error occurs while starting or
   stopping it.
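
In outline, the start-side call sites now follow this shape (condensed from the start_commissioning() change in the diff below; start_disk_erasing() differs only in the status values used):

    old_status = self.status
    self.status = NODE_STATUS.COMMISSIONING
    self.save()
    # Commit now so that a node booting immediately sees its new
    # status; the request's transaction won't commit until much later.
    transaction.commit()
    try:
        Node.objects.start_nodes(
            [self.system_id], user, user_data=commissioning_user_data)
    except Exception as ex:
        maaslog.error(
            "%s: Unable to start node: %s", self.hostname, unicode(ex))
        # Manual rollback: restore and commit the old status.
        self.status = old_status
        self.save()
        transaction.commit()
        raise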

This is a further fix for bug 1375980.

I've also removed the check for len(nodes_(started|stopped)) > 0 in the methods that used it. This check made no sense: there are power types which we just *can't* power off (ether_wake, for example), so a node with such a power type would never pass the check and would be stuck in whatever state it was in. The removed pattern is shown below.
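
For reference, this is the removed pattern (condensed from abort_commissioning() before this branch):

    stopped_node = Node.objects.stop_nodes([self.system_id], user)
    if len(stopped_node) == 1:
        self.status = NODE_STATUS.NEW
        self.save()

An ether_wake node never appears in stop_nodes()'s return value, so its status would never be updated.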

I've also added tests that check the atomicity of the call sites and their logging and re-raising of exceptions.

Revision history for this message
Gavin Panella (allenap) wrote :

It's good, but I think there's a race in there that we need to fix. It's not your fault, and this branch doesn't make it worse, but it's still not the right approach.

review: Needs Fixing
Revision history for this message
Graham Binns (gmb) wrote :

On 8 October 2014 18:05, Gavin Panella <email address hidden> wrote:
> The new node status needs to be committed before we try to start the node. If there's an error we need to reset the status and commit again.
>
> However, you may have some fun getting Django to let you commit when it's in one of its special "atomic" blocks. Where "fun" is not the word I would choose.

Can you help me understand why the new status needs to be committed?…
Ah, wait, because it's possible that two users could try to commission
the node at the same time? Or is it something else entirely?

I tried using atomic() and even savepoint() but couldn't get them to
work, before I realised that I could do things this way… The only
other option is to do a manual rollback – i.e. on error, set the node
status back to whatever it was. Would that be okay with you?

Revision history for this message
Gavin Panella (allenap) wrote :

> On 8 October 2014 18:05, Gavin Panella <email address hidden> wrote:
> > The new node status needs to be committed before we try to start the
> > node. If there's an error we need to reset the status and commit
> > again.
> >
> > However, you may have some fun getting Django to let you commit when
> > it's in one of its special "atomic" blocks. Where "fun" is not the
> > word I would choose.
>
> Can you help me understand why the new status needs to be committed?…
> Ah, wait, because it's possible that two users could try to commission
> the node at the same time? Or is it something else entirely?

I was thinking about a node coming up and trying to boot from TFTP
before we've committed its status to the database. For a single node I
doubt it'll be a problem because, in most cases, it'll be a fraction of
a second between the PowerOn call returning and the status being saved.

However, when starting multiple nodes it could be longer because the
transaction doesn't get committed until the end of the web/API request.
If a node does its first TFTP request within that window then it'll be
given boot instructions based on its old status.

We use TransactionMiddleware, which has been deprecated, but our
slowness to migrate off it may be a blessing: it /may/ allow us to call
transaction.commit() in the middle of a request. If we were using
ATOMIC_REQUESTS, Django's replacement for TransactionMiddleware, we
would definitely not be able to commit in the middle of a request.

> I tried using atomic() and even savepoint() but couldn't get them to
> work, before I realised that I could do things this way… The only
> other option is to do a manual rollback – i.e. on error, set the node
> status back to whatever it was. Would that be okay with you?

Yeah, it'll have to be a manual rollback because we need to have that
status committed.

I remember we've talked about some of this stuff before and you talked
about manual rollbacks. If I misunderstood and gave bad advice, I'm
sorry, and hopefully I'm thinking straight now.
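
In code terms, the difference is roughly this (a sketch, assuming
Django 1.6-era transaction handling):

    from django.db import transaction

    node.status = NODE_STATUS.COMMISSIONING
    node.save()
    # Under the deprecated TransactionMiddleware this commits the
    # status change mid-request, making it visible to concurrent
    # requests such as the node's first TFTP hit.
    transaction.commit()
    # Under ATOMIC_REQUESTS the whole request runs inside atomic(),
    # and the same call raises TransactionManagementError instead.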

Revision history for this message
Graham Binns (gmb) wrote :

On 8 October 2014 20:56, Gavin Panella <email address hidden> wrote:
> I was thinking about a node coming up and trying to boot from TFTP
> before we've committed its status to the database. For a single node I
> doubt it'll be a problem because, in most cases, it'll be a fraction of
> a second between the PowerOn call returning and the status being saved.
>
> However, when starting multiple nodes it could be longer because the
> transaction doesn't get committed until the end of the web/API request.
> If a node does its first TFTP request within that window then it'll be
> given boot instructions based on its old status.
>
> We use TransactionMiddleware, which has been deprecated, but our
> slowness to migrate off it may be a blessing: it /may/ allow us to call
> transaction.commit() in the middle of a request. If we were using
> ATOMIC_REQUESTS, Django's replacement for TransactionMiddleware, we
> would definitely not be able to commit in the middle of a request.

Aah, okay. Now I'm with you (on both the problem and the
why-won't-you-be-atomic() fronts).

>> I tried using atomic() and even savepoint() but couldn't get them to
>> work, before I realised that I could do things this way… The only
>> other option is to do a manual rollback – i.e. on error, set the node
>> status back to whatever it was. Would that be okay with you?
>
> Yeah, it'll have to be a manual rollback because we need to have that
> status committed.
>
> I remember we've talked about some of this stuff before and you talked
> about manual rollbacks. If I misunderstood and gave bad advice, I'm
> sorry, and hopefully I'm thinking straight now.

Okay, manual rollback it is. That's relatively easy, actually. And you
didn't give bad advice — you gave really good advice that turned out
not to work properly; no fault of yours.

Revision history for this message
Gavin Panella (allenap) wrote :

Looks good. I'm still a bit uncomfortable about suppress-all-errors in the tests because it might be hiding something important. Can you narrow it down to just the error or errors you expect?

review: Approve
Revision history for this message
Graham Binns (gmb) wrote :

On 9 October 2014 14:09, Gavin Panella <email address hidden> wrote:
>
> Looks good. I'm still a bit uncomfortable about suppress-all-errors in the tests because it might be hiding something important. Can you narrow it down to just the error or errors you expect?

Sure.
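
For the record, the narrowed form that landed (see the tests in the
diff below) suppresses only the RPC errors we expect, rather than a
bare except:

    try:
        node.start_disk_erasing(admin)
    except RPC_EXCEPTIONS:
        # Suppress only the expected errors; they're tested elsewhere.
        pass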

Revision history for this message
Raphaël Badin (rvb) wrote :

Looks good. Of course we need to extensively test this (I guess you've done some of it already).

Couple of remarks inline.

One additional question: I don't see the same pattern being applied to deployment… is that on purpose? (Maybe that is because the deployment action is "embedded" in start_node()).

review: Approve
Revision history for this message
Graham Binns (gmb) wrote :

> Looks good. Of course we need to extensively test this (I guess you've done
> some of it already).
>
> Couple of remarks inline.
>
> One additional question: I don't see the same pattern being applied to
> deployment… is that on purpose? (Maybe that is because the deployment action
> is "embedded" in start_node()).

Exactly. And although start_nodes() still needs some love, I don't think we're going to get to it in this cycle.

Preview Diff

=== modified file 'src/maasserver/models/node.py'
--- src/maasserver/models/node.py 2014-10-09 18:47:50 +0000
+++ src/maasserver/models/node.py 2014-10-10 09:02:08 +0000
@@ -945,13 +945,34 @@
 
         commissioning_user_data = generate_user_data(node=self)
         NodeResult.objects.clear_results(self)
-        self.status = NODE_STATUS.COMMISSIONING
-        self.save()
         # The commissioning profile is handled in start_nodes.
         maaslog.info(
             "%s: Starting commissioning", self.hostname)
-        Node.objects.start_nodes(
-            [self.system_id], user, user_data=commissioning_user_data)
+        # We need to mark the node as COMMISSIONING now to avoid a race
+        # when starting multiple nodes. We hang on to old_status just in
+        # case the power action fails.
+        old_status = self.status
+        self.status = NODE_STATUS.COMMISSIONING
+        self.save()
+        transaction.commit()
+        try:
+            # We don't check for which nodes we've started here, because
+            # it's possible we can't start the node - its power type may not
+            # allow us to do that.
+            Node.objects.start_nodes(
+                [self.system_id], user, user_data=commissioning_user_data)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to start node: %s",
+                self.hostname, unicode(ex))
+            self.status = old_status
+            self.save()
+            transaction.commit()
+            # Let the exception bubble up, since the UI or API will have to
+            # deal with it.
+            raise
+        else:
+            maaslog.info("%s: Commissioning started", self.hostname)
 
     def abort_commissioning(self, user):
         """Power off a commissioning node and set its status to 'declared'."""
@@ -962,10 +983,20 @@
                 % (self.system_id, NODE_STATUS_CHOICES_DICT[self.status]))
         maaslog.info(
             "%s: Aborting commissioning", self.hostname)
-        stopped_node = Node.objects.stop_nodes([self.system_id], user)
-        if len(stopped_node) == 1:
+        try:
+            # We don't check for which nodes we've stopped here, because
+            # it's possible we can't stop the node - its power type may
+            # not allow us to do that.
+            Node.objects.stop_nodes([self.system_id], user)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to shut node down: %s",
+                self.hostname, unicode(ex))
+            raise
+        else:
             self.status = NODE_STATUS.NEW
             self.save()
+            maaslog.info("%s: Commissioning aborted", self.hostname)
 
     def delete(self):
         """Delete this node.
@@ -1227,12 +1258,31 @@
         from metadataserver.user_data.disk_erasing import generate_user_data
 
         disk_erase_user_data = generate_user_data(node=self)
+        maaslog.info(
+            "%s: Starting disk erasure", self.hostname)
+        # Change the status of the node now to avoid races when starting
+        # nodes in bulk.
         self.status = NODE_STATUS.DISK_ERASING
         self.save()
-        maaslog.info(
-            "%s: Starting disk erasing", self.hostname)
-        Node.objects.start_nodes(
-            [self.system_id], user, user_data=disk_erase_user_data)
+        transaction.commit()
+        try:
+            Node.objects.start_nodes(
+                [self.system_id], user, user_data=disk_erase_user_data)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to start node: %s",
+                self.hostname, unicode(ex))
+            # We always mark the node as failed here, although we could
+            # potentially move it back to the state it was in
+            # previously. For now, though, this is safer, since it marks
+            # the node as needing attention.
+            self.status = NODE_STATUS.FAILED_DISK_ERASING
+            self.save()
+            transaction.commit()
+            raise
+        else:
+            maaslog.info(
+                "%s: Disk erasure started.", self.hostname)
 
     def abort_disk_erasing(self, user):
         """
@@ -1246,8 +1296,14 @@
                 % (self.system_id, NODE_STATUS_CHOICES_DICT[self.status]))
         maaslog.info(
             "%s: Aborting disk erasing", self.hostname)
-        stopped_node = Node.objects.stop_nodes([self.system_id], user)
-        if len(stopped_node) == 1:
+        try:
+            Node.objects.stop_nodes([self.system_id], user)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to shut node down: %s",
+                self.hostname, unicode(ex))
+            raise
+        else:
             self.status = NODE_STATUS.FAILED_DISK_ERASING
             self.save()
 
@@ -1270,7 +1326,14 @@
         :raises MultipleFailures: If host maps cannot be deleted.
         """
         maaslog.info("%s: Releasing node", self.hostname)
-        Node.objects.stop_nodes([self.system_id], self.owner)
+        try:
+            Node.objects.stop_nodes([self.system_id], self.owner)
+        except Exception as ex:
+            maaslog.error(
+                "%s: Unable to shut node down: %s", self.hostname,
+                unicode(ex))
+            raise
+
         deallocated_ips = StaticIPAddress.objects.deallocate_by_node(self)
         self.delete_host_maps(deallocated_ips)
         from maasserver.dns.config import change_dns_zones

=== modified file 'src/maasserver/models/tests/test_node.py'
--- src/maasserver/models/tests/test_node.py 2014-10-09 23:35:05 +0000
+++ src/maasserver/models/tests/test_node.py 2014-10-10 09:02:08 +0000
@@ -23,6 +23,7 @@
 
 import crochet
 from django.core.exceptions import ValidationError
+from django.db import transaction
 from maasserver import preseed as preseed_module
 from maasserver.clusterrpc.power_parameters import get_power_types
 from maasserver.clusterrpc.testing.boot_images import make_rpc_boot_image
@@ -59,6 +60,7 @@
     StaticIPAddressManager,
     )
 from maasserver.models.user import create_auth_token
+from maasserver.node_action import RPC_EXCEPTIONS
 from maasserver.node_status import (
     get_failed_status,
     MONITORED_STATUSES,
@@ -82,6 +84,7 @@
 from maastesting.matchers import (
     MockAnyCall,
     MockCalledOnceWith,
+    MockCallsMatch,
     MockNotCalled,
     )
 from maastesting.testcase import MAASTestCase
@@ -91,9 +94,13 @@
     NodeResult,
     NodeUserData,
     )
-from metadataserver.user_data import commissioning
+from metadataserver.user_data import (
+    commissioning,
+    disk_erasing,
+    )
 from mock import (
     ANY,
+    call,
     Mock,
     sentinel,
     )
@@ -101,7 +108,10 @@
 from provisioningserver.power_schema import JSON_POWER_TYPE_PARAMETERS
 from provisioningserver.rpc import cluster as cluster_module
 from provisioningserver.rpc.cluster import StartMonitors
-from provisioningserver.rpc.exceptions import MultipleFailures
+from provisioningserver.rpc.exceptions import (
+    MultipleFailures,
+    NoConnectionsAvailable,
+    )
 from provisioningserver.rpc.power import QUERY_POWER_TYPES
 from provisioningserver.rpc.testing import (
     always_succeed_with,
@@ -743,6 +753,65 @@
         self.assertThat(stop_nodes, MockCalledOnceWith(
             [node.system_id], owner))
 
+    def test_start_disk_erasing_reverts_to_sane_state_on_error(self):
+        # If start_disk_erasing encounters an error when calling
+        # start_nodes(), it will transition the node to a sane state.
+        # Failures encountered in one call to start_disk_erasing() won't
+        # affect subsequent calls.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.ALLOCATED, power_type="virsh")
+            for _ in range(3)
+            ]
+        generate_user_data = self.patch(disk_erasing, 'generate_user_data')
+        start_nodes = self.patch(Node.objects, 'start_nodes')
+        start_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.start_disk_erasing(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all the expected errors coming out of
+                    # start_disk_erasing() because they're tested
+                    # elsewhere.
+                    pass
+
+        expected_calls = (
+            call(
+                [node.system_id], admin,
+                user_data=generate_user_data.return_value)
+            for node in nodes)
+        self.assertThat(
+            start_nodes, MockCallsMatch(*expected_calls))
+        self.assertEqual(
+            [
+                NODE_STATUS.DISK_ERASING,
+                NODE_STATUS.FAILED_DISK_ERASING,
+                NODE_STATUS.DISK_ERASING,
+                ],
+            [node.status for node in nodes])
+
+    def test_start_disk_erasing_logs_and_raises_errors_in_starting(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.ALLOCATED)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'start_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.start_disk_erasing, admin)
+        self.assertEqual(NODE_STATUS.FAILED_DISK_ERASING, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to start node: %s",
+                node.hostname, unicode(exception)))
+
     def test_abort_operation_aborts_disk_erasing(self):
         agent_name = factory.make_name('agent-name')
         owner = factory.make_User()
@@ -761,6 +830,63 @@
             agent_name=agent_name)
         self.assertRaises(NodeStateViolation, node.abort_operation, owner)
 
+    def test_abort_disk_erasing_reverts_to_sane_state_on_error(self):
+        # If abort_disk_erasing encounters an error when calling
+        # stop_nodes(), it will transition the node to a sane state.
+        # Failures encountered in one call to abort_disk_erasing() won't
+        # affect subsequent calls.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.DISK_ERASING, power_type="virsh")
+            for _ in range(3)
+            ]
+        stop_nodes = self.patch(Node.objects, 'stop_nodes')
+        stop_nodes.return_value = [
+            [node] for node in nodes
+            ]
+        stop_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.abort_disk_erasing(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all the expected errors coming out of
+                    # abort_disk_erasing() because they're tested
+                    # elsewhere.
+                    pass
+
+        self.assertThat(
+            stop_nodes, MockCallsMatch(
+                *(call([node.system_id], admin) for node in nodes)))
+        self.assertEqual(
+            [
+                NODE_STATUS.FAILED_DISK_ERASING,
+                NODE_STATUS.DISK_ERASING,
+                NODE_STATUS.FAILED_DISK_ERASING,
+                ],
+            [node.status for node in nodes])
+
+    def test_abort_disk_erasing_logs_and_raises_errors_in_stopping(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.DISK_ERASING)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'stop_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.abort_disk_erasing, admin)
+        self.assertEqual(NODE_STATUS.DISK_ERASING, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to shut node down: %s",
+                node.hostname, unicode(exception)))
+
     def test_release_node_that_has_power_on_and_controlled_power_type(self):
         self.patch(node_module, 'wait_for_power_commands')
         agent_name = factory.make_name('agent-name')
@@ -1080,6 +1206,57 @@
         node.release()
         self.assertThat(change_dns_zones, MockCalledOnceWith([node.nodegroup]))
 
+    def test_release_logs_and_raises_errors_in_stopping(self):
+        node = factory.make_Node(status=NODE_STATUS.DEPLOYED)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'stop_nodes').side_effect = exception
+        self.assertRaises(NoConnectionsAvailable, node.release)
+        self.assertEqual(NODE_STATUS.DEPLOYED, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to shut node down: %s",
+                node.hostname, unicode(exception)))
+
+    def test_release_reverts_to_sane_state_on_error(self):
+        # If release() encounters an error when stopping the node, it
+        # will leave the node in its previous state (i.e. DEPLOYED).
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.DEPLOYED, power_type="virsh")
+            for _ in range(3)
+            ]
+        stop_nodes = self.patch(Node.objects, 'stop_nodes')
+        stop_nodes.return_value = [
+            [node] for node in nodes
+            ]
+        stop_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.release()
+                except RPC_EXCEPTIONS:
+                    # Suppress all expected errors; we test for them
+                    # elsewhere.
+                    pass
+
+        self.assertThat(
+            stop_nodes, MockCallsMatch(
+                *(call([node.system_id], None) for node in nodes)))
+        self.assertEqual(
+            [
+                NODE_STATUS.RELEASING,
+                NODE_STATUS.DEPLOYED,
+                NODE_STATUS.RELEASING,
+                ],
+            [node.status for node in nodes])
+
     def test_accept_enlistment_gets_node_out_of_declared_state(self):
         # If called on a node in New state, accept_enlistment()
         # changes the node's status, and returns the node.
@@ -1141,10 +1318,10 @@
             {status: node.status for status, node in nodes.items()})
 
     def test_start_commissioning_changes_status_and_starts_node(self):
-        start_nodes = self.patch(Node.objects, "start_nodes")
-
         node = factory.make_Node(
             status=NODE_STATUS.NEW, power_type='ether_wake')
+        start_nodes = self.patch(Node.objects, "start_nodes")
+        start_nodes.return_value = [node]
         factory.make_MACAddress(node=node)
         admin = factory.make_admin()
         node.start_commissioning(admin)
@@ -1192,6 +1369,117 @@
         self.assertEqual(
             data, NodeResult.objects.get_data(node, filename))
 
+    def test_start_commissioning_reverts_to_sane_state_on_error(self):
+        # When start_commissioning encounters an error when trying to
+        # start the node, it will revert the node to its previous
+        # status.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(status=NODE_STATUS.NEW, power_type="ether_wake")
+            for _ in range(3)
+            ]
+        generate_user_data = self.patch(commissioning, 'generate_user_data')
+        start_nodes = self.patch(Node.objects, 'start_nodes')
+        start_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.start_commissioning(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all expected errors; we test for them
+                    # elsewhere.
+                    pass
+
+        expected_calls = (
+            call(
+                [node.system_id], admin,
+                user_data=generate_user_data.return_value)
+            for node in nodes)
+        self.assertThat(
+            start_nodes, MockCallsMatch(*expected_calls))
+        self.assertEqual(
+            [
+                NODE_STATUS.COMMISSIONING,
+                NODE_STATUS.NEW,
+                NODE_STATUS.COMMISSIONING,
+                ],
+            [node.status for node in nodes])
+
+    def test_start_commissioning_logs_and_raises_errors_in_starting(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.NEW)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'start_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.start_commissioning, admin)
+        self.assertEqual(NODE_STATUS.NEW, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to start node: %s",
+                node.hostname, unicode(exception)))
+
+    def test_abort_commissioning_reverts_to_sane_state_on_error(self):
+        # If abort_commissioning() hits an error when trying to stop the
+        # node, it will revert the node to the state it was in before
+        # abort_commissioning() was called.
+        admin = factory.make_admin()
+        nodes = [
+            factory.make_Node(
+                status=NODE_STATUS.COMMISSIONING, power_type="virsh")
+            for _ in range(3)
+            ]
+        stop_nodes = self.patch(Node.objects, 'stop_nodes')
+        stop_nodes.return_value = [
+            [node] for node in nodes
+            ]
+        stop_nodes.side_effect = [
+            None,
+            MultipleFailures(
+                Failure(NoConnectionsAvailable())),
+            None,
+            ]
+
+        with transaction.atomic():
+            for node in nodes:
+                try:
+                    node.abort_commissioning(admin)
+                except RPC_EXCEPTIONS:
+                    # Suppress all expected errors; we test for them
+                    # elsewhere.
+                    pass
+
+        self.assertThat(
+            stop_nodes, MockCallsMatch(
+                *(call([node.system_id], admin) for node in nodes)))
+        self.assertEqual(
+            [
+                NODE_STATUS.NEW,
+                NODE_STATUS.COMMISSIONING,
+                NODE_STATUS.NEW,
+                ],
+            [node.status for node in nodes])
+
+    def test_abort_commissioning_logs_and_raises_errors_in_stopping(self):
+        admin = factory.make_admin()
+        node = factory.make_Node(status=NODE_STATUS.COMMISSIONING)
+        maaslog = self.patch(node_module, 'maaslog')
+        exception = NoConnectionsAvailable(factory.make_name())
+        self.patch(Node.objects, 'stop_nodes').side_effect = exception
+        self.assertRaises(
+            NoConnectionsAvailable, node.abort_commissioning, admin)
+        self.assertEqual(NODE_STATUS.COMMISSIONING, node.status)
+        self.assertThat(
+            maaslog.error, MockCalledOnceWith(
+                "%s: Unable to shut node down: %s",
+                node.hostname, unicode(exception)))
+
     def test_abort_commissioning_changes_status_and_stops_node(self):
         node = factory.make_Node(
             status=NODE_STATUS.COMMISSIONING, power_type='virsh')