Merge lp:~julian-edwards/launchpad/lost-builder-bug-463046 into lp:launchpad

Proposed by Julian Edwards
Status: Merged
Approved by: Julian Edwards
Approved revision: no longer in the source branch.
Merged at revision: 11414
Proposed branch: lp:~julian-edwards/launchpad/lost-builder-bug-463046
Merge into: lp:launchpad
Diff against target: 74 lines (+39/-1)
2 files modified
lib/lp/buildmaster/model/builder.py (+15/-1)
lib/lp/buildmaster/tests/test_builder.py (+24/-0)
To merge this branch: bzr merge lp:~julian-edwards/launchpad/lost-builder-bug-463046
Reviewer Review Type Date Requested Status
Henning Eggers (community) code Approve
Review via email: mp+33369@code.launchpad.net

Commit message

Reset aborted builders that don't have any buildqueue entry so they don't sit idle forever.

Description of the change

= Summary =
Reset aborted builders that don't have any buildqueue entry

== Proposed fix ==
See bug 436046.

Basically, when builders end up building something that we don't know about,
they need to be aborted and cleaned. Right now they sit in the "aborted"
state forever because the current handler for that relies on there being a
buildqueue row for the build job. In the case of comms failures etc., that
may not be the case.

== Pre-implementation notes ==
Talked to wgrant.

== Implementation details ==
There's an existing function that checks for the slave state and does things
based on what it returns to us. I've changed it so that if it sees the
ABORTED state and builder.currentjob is None, then we need to clean it -
slave.clean() just resets it back to IDLE, ready for another job).

== Tests ==
bin/test -cvvt Test_rescueBuilderIfLost

== Demo and Q/A ==
The QA plan on dogfood is as follows:

 * Initiate a build
 * Stop the buildd-manager to give us time to fiddle about.
 * Rip all traces of the build out of dogfood's database by deleting the
buildqueue and job.
 * Start the buildd-manager - when it scans the builder with the orphaned
build it should now reset it successfully.

= Launchpad lint =

(this lint is all crazy nonsense, we need to fix our linter!)

Checking for conflicts and issues in changed files.

Linting changed files:
  lib/lp/buildmaster/model/builder.py
  lib/lp/buildmaster/tests/test_builder.py

./lib/lp/buildmaster/model/builder.py
     191: E302 expected 2 blank lines, found 1
     201: E202 whitespace before '}'
     421: E231 missing whitespace after ','
     432: E231 missing whitespace after ','
     557: E301 expected 1 blank line, found 0
     609: E231 missing whitespace after ','
     730: E231 missing whitespace after ','
./lib/lp/buildmaster/tests/test_builder.py
     354: E301 expected 1 blank line, found 2

To post a comment you must log in.
Revision history for this message
Henning Eggers (henninge) wrote :

Thanks!

review: Approve (code)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/lp/buildmaster/model/builder.py'
--- lib/lp/buildmaster/model/builder.py 2010-08-20 20:31:18 +0000
+++ lib/lp/buildmaster/model/builder.py 2010-08-23 11:01:11 +0000
@@ -206,6 +206,20 @@
206 # IBuilder.slaveStatusSentence().206 # IBuilder.slaveStatusSentence().
207 status = status_sentence[0]207 status = status_sentence[0]
208208
209 # If the cookie test below fails, it will request an abort of the
210 # builder. This will leave the builder in the aborted state and
211 # with no assigned job, and we should now "clean" the slave which
212 # will reset its state back to IDLE, ready to accept new builds.
213 # This situation is usually caused by a temporary loss of
214 # communications with the slave and the build manager had to reset
215 # the job.
216 if status == 'BuilderStatus.ABORTED' and builder.currentjob is None:
217 builder.cleanSlave()
218 if logger is not None:
219 logger.info(
220 "Builder '%s' cleaned up from ABORTED" % builder.name)
221 return
222
209 # If slave is not building nor waiting, it's not in need of rescuing.223 # If slave is not building nor waiting, it's not in need of rescuing.
210 if status not in ident_position.keys():224 if status not in ident_position.keys():
211 return225 return
@@ -220,7 +234,7 @@
220 else:234 else:
221 builder.requestAbort()235 builder.requestAbort()
222 if logger:236 if logger:
223 logger.warn(237 logger.info(
224 "Builder '%s' rescued from '%s': '%s'" %238 "Builder '%s' rescued from '%s': '%s'" %
225 (builder.name, slave_build_id, reason))239 (builder.name, slave_build_id, reason))
226240
227241
=== modified file 'lib/lp/buildmaster/tests/test_builder.py'
--- lib/lp/buildmaster/tests/test_builder.py 2010-08-20 20:31:18 +0000
+++ lib/lp/buildmaster/tests/test_builder.py 2010-08-23 11:01:11 +0000
@@ -29,6 +29,10 @@
29from lp.soyuz.model.binarypackagebuildbehavior import (29from lp.soyuz.model.binarypackagebuildbehavior import (
30 BinaryPackageBuildBehavior,30 BinaryPackageBuildBehavior,
31 )31 )
32from lp.soyuz.tests.soyuzbuilddhelpers import (
33 AbortedSlave,
34 MockBuilder,
35 )
32from lp.soyuz.tests.test_publishing import SoyuzTestPublisher36from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
33from lp.testing import TestCaseWithFactory37from lp.testing import TestCaseWithFactory
34from lp.testing.fakemethod import FakeMethod38from lp.testing.fakemethod import FakeMethod
@@ -90,6 +94,26 @@
90 self.assertEqual(0, builder.handleTimeout.call_count)94 self.assertEqual(0, builder.handleTimeout.call_count)
9195
9296
97class Test_rescueBuilderIfLost(TestCaseWithFactory):
98 """Tests for lp.buildmaster.model.builder.rescueBuilderIfLost."""
99
100 layer = LaunchpadZopelessLayer
101
102 def test_recovery_of_aborted_slave(self):
103 # If a slave is in the ABORTED state, rescueBuilderIfLost should
104 # clean it if we don't think it's currently building anything.
105 # See bug 463046.
106 aborted_slave = AbortedSlave()
107 # The slave's clean() method is normally an XMLRPC call, so we
108 # can just stub it out and check that it got called.
109 aborted_slave.clean = FakeMethod()
110 builder = MockBuilder("mock_builder", aborted_slave)
111 builder.currentjob = None
112 builder.rescueIfLost()
113
114 self.assertEqual(1, aborted_slave.clean.call_count)
115
116
93class TestFindBuildCandidateBase(TestCaseWithFactory):117class TestFindBuildCandidateBase(TestCaseWithFactory):
94 """Setup the test publisher and some builders."""118 """Setup the test publisher and some builders."""
95119