Merge lp:~jml/launchpad/broken-resume-handling into lp:launchpad

Proposed by Jonathan Lange
Status: Merged
Approved by: Julian Edwards
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jml/launchpad/broken-resume-handling
Merge into: lp:launchpad
Diff against target: 92 lines (+64/-1)
2 files modified
lib/lp/buildmaster/manager.py (+0/-1)
lib/lp/buildmaster/tests/test_manager.py (+64/-0)
To merge this branch: bzr merge lp:~jml/launchpad/broken-resume-handling
Reviewer Review Type Date Requested Status
Julian Edwards (community) Approve
Review via email: mp+24004@code.launchpad.net

Description of the change

This branch fixes a bug in the build master code that Julian discovered while trying to land his own buildmaster branch.

I don't really understand the problem domain very well, but I can string words together meaningfully.

When a slave fails to resume, we need to reset it. I think. This work is done in finishCycle(), which calls an object returned by one of the callbacks, calling this object does the reset.

EXCEPT, if something goes wrong in the callbacks, say a ValueError is raised, then finishCycle() won't even call the object that resets the slave, because it won't *have* the object that resets the slave. It will have a Failure instead.

This branch adds some tests for this whole code path, and fixes the case where a slave was being removed from a list twice, thus raising a ValueError.

More generally, the build manager could be written more defensively by being more explicitly a state machine and clearly separating success and error paths, saving addBoth only for cleanup cases.

To post a comment you must log in.
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Thanks for helping me write the tests for this! The change looks good. With any luck my own branch will start working now!

Cheers.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/buildmaster/manager.py'
2--- lib/lp/buildmaster/manager.py 2010-04-03 03:46:16 +0000
3+++ lib/lp/buildmaster/manager.py 2010-04-23 12:14:27 +0000
4@@ -367,7 +367,6 @@
5
6 self.logger.error(
7 '%s resume failure: %s' % (slave, response.getErrorMessage()))
8- self.slaveDone(slave)
9 return self.reset_result(slave)
10
11 def checkDispatch(self, response, method, slave):
12
13=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
14--- lib/lp/buildmaster/tests/test_manager.py 2010-04-12 16:23:13 +0000
15+++ lib/lp/buildmaster/tests/test_manager.py 2010-04-23 12:14:27 +0000
16@@ -236,6 +236,7 @@
17 # Stop automatic collection of dispatching results.
18 def testSlaveDone(slave):
19 pass
20+ self._realSlaveDone = self.manager.slaveDone
21 self.manager.slaveDone = testSlaveDone
22
23 def testFinishCycle(self):
24@@ -316,6 +317,69 @@
25 self.assertEqual(
26 '<foo:http://foo.buildd:8221/> reset', repr(result))
27
28+ def test_fail_to_resume_slave_resets_slave(self):
29+ # If an attempt to resume and dispatch a slave fails, we reset the
30+ # slave by calling self.reset_result(slave)().
31+
32+ reset_result_calls = []
33+ class LoggingResetResult(BaseDispatchResult):
34+ """A DispatchResult that logs calls to itself.
35+
36+ This *must* subclass BaseDispatchResult, otherwise finishCycle()
37+ won't treat it like a dispatch result.
38+ """
39+ def __init__(self, slave, info=None):
40+ self.slave = slave
41+ def __call__(self):
42+ reset_result_calls.append(self.slave)
43+
44+ # Make a failing slave that is requesting a resume.
45+ slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
46+ slave.resume_requested = True
47+ slave.resumeSlave = lambda: defer.fail(Failure(('out', 'err', 1)))
48+
49+ # Make the manager log the reset result calls.
50+ self.manager.reset_result = LoggingResetResult
51+ # Restore the slaveDone method. It's very relevant to this test.
52+ self.manager.slaveDone = self._realSlaveDone
53+ # We only care about this one slave. Reset the list of manager
54+ # deferreds in case setUp did something unexpected.
55+ self.manager._deferreds = []
56+
57+ self.manager.resumeAndDispatch([slave])
58+ # Note: finishCycle isn't generally called by external users, normally
59+ # resumeAndDispatch or slaveDone calls it. However, these calls
60+ # swallow the Deferred that finishCycle returns, and we need that
61+ # Deferred to make sure this test completes properly.
62+ d = self.manager.finishCycle()
63+ return d.addCallback(
64+ lambda ignored: self.assertEqual([slave], reset_result_calls))
65+
66+ def test_failed_to_resume_slave_ready_for_reset(self):
67+ # When a slave fails to resume, the manager has a Deferred in its
68+ # Deferred list that is ready to fire with a ResetDispatchResult.
69+
70+ # Make a failing slave that is requesting a resume.
71+ slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
72+ slave.resume_requested = True
73+ slave.resumeSlave = lambda: defer.fail(Failure(('out', 'err', 1)))
74+
75+ # We only care about this one slave. Reset the list of manager
76+ # deferreds in case setUp did something unexpected.
77+ self.manager._deferreds = []
78+ # Restore the slaveDone method. It's very relevant to this test.
79+ self.manager.slaveDone = self._realSlaveDone
80+ self.manager.resumeAndDispatch([slave])
81+ [d] = self.manager._deferreds
82+
83+ # The Deferred for our failing slave should be ready to fire
84+ # successfully with a ResetDispatchResult.
85+ def check_result(result):
86+ self.assertIsInstance(result, ResetDispatchResult)
87+ self.assertEqual(slave, result.slave)
88+ self.assertFalse(result.processed)
89+ return d.addCallback(check_result)
90+
91 def testCheckDispatch(self):
92 """`BuilddManager.checkDispatch` is chained after dispatch requests.
93