Launchpad itself

Merge lp:~jml/launchpad/broken-resume-handling into lp:launchpad

broken-resume-handling
Merge into devel

Proposed by Jonathan Lange on 2010-04-23

Status:	Merged
Approved by:	Julian Edwards on 2010-04-23
Approved revision:	no longer in the source branch.
Merged at revision:	not available
Proposed branch:	lp:~jml/launchpad/broken-resume-handling
Merge into:	lp:launchpad
Diff against target:	92 lines (+64/-1), 2 files modified:
	lib/lp/buildmaster/manager.py (+0/-1)
	lib/lp/buildmaster/tests/test_manager.py (+64/-0)
To merge this branch:	bzr merge lp:~jml/launchpad/broken-resume-handling

Reviewer	Review Type	Date Requested	Status
Julian Edwards (community)		2010-04-23	Approve on 2010-04-23
Review via email: mp+24004@code.launchpad.net

Description of the change

This branch fixes a bug in the build master code that Julian discovered while trying to land his own buildmaster branch.

I don't really understand the problem domain very well, but I can string words together meaningfully.

When a slave fails to resume, we need to reset it. I think. This work is done in finishCycle(), which calls an object returned by one of the callbacks; calling this object does the reset.

EXCEPT, if something goes wrong in the callbacks, say a ValueError is raised, then finishCycle() won't even call the object that resets the slave, because it won't *have* the object that resets the slave. It will have a Failure instead.
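
A rough sketch of that failure mode, under stated assumptions: the chain runner below is a simplified stand-in and not Twisted's Deferred, and every name in it (FailureStandIn, ResetResult, run_chain, finish_cycle) is hypothetical rather than taken from manager.py. It only illustrates how an exception in a callback can leave the final step holding a failure object instead of the callable that performs the reset.

```python
# Simplified, hypothetical model of a callback chain. This is NOT Twisted's
# Deferred; it only shows how an exception replaces the expected result.
class FailureStandIn:
    """Wraps an exception, loosely like twisted.python.failure.Failure."""

    def __init__(self, error):
        self.error = error


class ResetResult:
    """Hypothetical dispatch result; calling it records a reset."""

    def __init__(self, slave, log):
        self.slave = slave
        self.log = log

    def __call__(self):
        self.log.append(self.slave)


def run_chain(value, callbacks):
    # Feed each callback the previous result. An exception becomes a
    # FailureStandIn and the remaining callbacks are skipped.
    for callback in callbacks:
        if isinstance(value, FailureStandIn):
            break
        try:
            value = callback(value)
        except Exception as error:
            value = FailureStandIn(error)
    return value


def finish_cycle(results):
    # Only real dispatch results are called; failures are passed over.
    for result in results:
        if isinstance(result, ResetResult):
            result()


resets = []


def make_reset(slave):
    return ResetResult(slave, resets)


def buggy_make_reset(slave):
    [].remove(slave)  # raises ValueError before the result is built
    return ResetResult(slave, resets)


good = run_chain("foo", [make_reset])
bad = run_chain("bar", [buggy_make_reset])
finish_cycle([good, bad])
```

In this model only the slave whose chain completed normally is reset; the other chain ends in a failure object, so nothing resets that slave.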

This branch adds some tests for this whole code path, and fixes the case where a slave was being removed from a list twice, thus raising a ValueError.
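
As a minimal illustration of the ValueError described above (a sketch with hypothetical names, not the actual BuilddManager bookkeeping): Python's list.remove raises ValueError when the item is no longer in the list, so removing the same slave a second time fails.

```python
# Hypothetical stand-in for the manager's bookkeeping; names are illustrative
# and do not come from lib/lp/buildmaster/manager.py.
class TinyManager:
    def __init__(self, slaves):
        self.pending = list(slaves)

    def slave_done(self, slave):
        # list.remove raises ValueError if the item is absent.
        self.pending.remove(slave)


manager = TinyManager(["foo"])
manager.slave_done("foo")  # first removal succeeds

try:
    manager.slave_done("foo")  # second removal of the same slave
except ValueError as error:
    second_removal_error = error
else:
    second_removal_error = None
```

This matches the shape of the fix in the diff, which drops one of the two removals rather than guarding the second one.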

More generally, the build manager could be written more defensively by being more explicitly a state machine and clearly separating success and error paths, saving addBoth only for cleanup cases.

Julian Edwards (julian-edwards) wrote on 2010-04-23:

Thanks for helping me write the tests for this! The change looks good. With any luck my own branch will start working now!

Cheers.

review: Approve

Preview Diff

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py	2010-04-03 03:46:16 +0000
+++ lib/lp/buildmaster/manager.py	2010-04-23 12:14:27 +0000
@@ -367,7 +367,6 @@
         self.logger.error(
             '%s resume failure: %s' % (slave, response.getErrorMessage()))
-        self.slaveDone(slave)
         return self.reset_result(slave)
     def checkDispatch(self, response, method, slave):
=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
--- lib/lp/buildmaster/tests/test_manager.py	2010-04-12 16:23:13 +0000
+++ lib/lp/buildmaster/tests/test_manager.py	2010-04-23 12:14:27 +0000
@@ -236,6 +236,7 @@
         # Stop automatic collection of dispatching results.
         def testSlaveDone(slave):
             pass
+        self._realSlaveDone = self.manager.slaveDone
         self.manager.slaveDone = testSlaveDone
     def testFinishCycle(self):
@@ -316,6 +317,69 @@
         self.assertEqual(
             '<foo:http://foo.buildd:8221/> reset', repr(result))
+    def test_fail_to_resume_slave_resets_slave(self):
+        # If an attempt to resume and dispatch a slave fails, we reset the
+        # slave by calling self.reset_result(slave)().
+
+        reset_result_calls = []
+        class LoggingResetResult(BaseDispatchResult):
+            """A DispatchResult that logs calls to itself.
+
+            This *must* subclass BaseDispatchResult, otherwise finishCycle()
+            won't treat it like a dispatch result.
+            """
+            def __init__(self, slave, info=None):
+                self.slave = slave
+            def __call__(self):
+                reset_result_calls.append(self.slave)
+
+        # Make a failing slave that is requesting a resume.
+        slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
+        slave.resume_requested = True
+        slave.resumeSlave = lambda: defer.fail(Failure(('out', 'err', 1)))
+
+        # Make the manager log the reset result calls.
+        self.manager.reset_result = LoggingResetResult
+        # Restore the slaveDone method. It's very relevant to this test.
+        self.manager.slaveDone = self._realSlaveDone
+        # We only care about this one slave. Reset the list of manager
+        # deferreds in case setUp did something unexpected.
+        self.manager._deferreds = []
+
+        self.manager.resumeAndDispatch([slave])
+        # Note: finishCycle isn't generally called by external users, normally
+        # resumeAndDispatch or slaveDone calls it. However, these calls
+        # swallow the Deferred that finishCycle returns, and we need that
+        # Deferred to make sure this test completes properly.
+        d = self.manager.finishCycle()
+        return d.addCallback(
+            lambda ignored: self.assertEqual([slave], reset_result_calls))
+
+    def test_failed_to_resume_slave_ready_for_reset(self):
+        # When a slave fails to resume, the manager has a Deferred in its
+        # Deferred list that is ready to fire with a ResetDispatchResult.
+
+        # Make a failing slave that is requesting a resume.
+        slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
+        slave.resume_requested = True
+        slave.resumeSlave = lambda: defer.fail(Failure(('out', 'err', 1)))
+
+        # We only care about this one slave. Reset the list of manager
+        # deferreds in case setUp did something unexpected.
+        self.manager._deferreds = []
+        # Restore the slaveDone method. It's very relevant to this test.
+        self.manager.slaveDone = self._realSlaveDone
+        self.manager.resumeAndDispatch([slave])
+        [d] = self.manager._deferreds
+
+        # The Deferred for our failing slave should be ready to fire
+        # successfully with a ResetDispatchResult.
+        def check_result(result):
+            self.assertIsInstance(result, ResetDispatchResult)
+            self.assertEqual(slave, result.slave)
+            self.assertFalse(result.processed)
+        return d.addCallback(check_result)
+
     def testCheckDispatch(self):
         """`BuilddManager.checkDispatch` is chained after dispatch requests.
