Merge lp:~julian-edwards/launchpad/lost-builder-bug-463046 into lp:launchpad

Proposed by Julian Edwards
Status: Merged
Approved by: Julian Edwards
Approved revision: no longer in the source branch.
Merged at revision: 11414
Proposed branch: lp:~julian-edwards/launchpad/lost-builder-bug-463046
Merge into: lp:launchpad
Diff against target: 74 lines (+39/-1)
2 files modified
lib/lp/buildmaster/model/builder.py (+15/-1)
lib/lp/buildmaster/tests/test_builder.py (+24/-0)
To merge this branch: bzr merge lp:~julian-edwards/launchpad/lost-builder-bug-463046
Reviewer Review Type Date Requested Status
Henning Eggers (community) code Approve
Review via email: mp+33369@code.launchpad.net

Commit message

Reset aborted builders that don't have any buildqueue entry so they don't sit idle forever.

Description of the change

= Summary =
Reset aborted builders that don't have any buildqueue entry

== Proposed fix ==
See bug 436046.

Basically, when builders end up building something that we don't know about,
they need to be aborted and cleaned. Right now they sit in the "aborted"
state forever because the current handler for that relies on there being a
buildqueue row for the build job. In the case of comms failures etc., that
may not be the case.

== Pre-implementation notes ==
Talked to wgrant.

== Implementation details ==
There's an existing function that checks for the slave state and does things
based on what it returns to us. I've changed it so that if it sees the
ABORTED state and builder.currentjob is None, then we need to clean it -
slave.clean() just resets it back to IDLE, ready for another job).

== Tests ==
bin/test -cvvt Test_rescueBuilderIfLost

== Demo and Q/A ==
The QA plan on dogfood is as follows:

 * Initiate a build
 * Stop the buildd-manager to give us time to fiddle about.
 * Rip all traces of the build out of dogfood's database by deleting the
buildqueue and job.
 * Start the buildd-manager - when it scans the builder with the orphaned
build it should now reset it successfully.

= Launchpad lint =

(this lint is all crazy nonsense, we need to fix our linter!)

Checking for conflicts and issues in changed files.

Linting changed files:
  lib/lp/buildmaster/model/builder.py
  lib/lp/buildmaster/tests/test_builder.py

./lib/lp/buildmaster/model/builder.py
     191: E302 expected 2 blank lines, found 1
     201: E202 whitespace before '}'
     421: E231 missing whitespace after ','
     432: E231 missing whitespace after ','
     557: E301 expected 1 blank line, found 0
     609: E231 missing whitespace after ','
     730: E231 missing whitespace after ','
./lib/lp/buildmaster/tests/test_builder.py
     354: E301 expected 1 blank line, found 2

To post a comment you must log in.
Revision history for this message
Henning Eggers (henninge) wrote :

Thanks!

review: Approve (code)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/buildmaster/model/builder.py'
2--- lib/lp/buildmaster/model/builder.py 2010-08-20 20:31:18 +0000
3+++ lib/lp/buildmaster/model/builder.py 2010-08-23 11:01:11 +0000
4@@ -206,6 +206,20 @@
5 # IBuilder.slaveStatusSentence().
6 status = status_sentence[0]
7
8+ # If the cookie test below fails, it will request an abort of the
9+ # builder. This will leave the builder in the aborted state and
10+ # with no assigned job, and we should now "clean" the slave which
11+ # will reset its state back to IDLE, ready to accept new builds.
12+ # This situation is usually caused by a temporary loss of
13+ # communications with the slave and the build manager had to reset
14+ # the job.
15+ if status == 'BuilderStatus.ABORTED' and builder.currentjob is None:
16+ builder.cleanSlave()
17+ if logger is not None:
18+ logger.info(
19+ "Builder '%s' cleaned up from ABORTED" % builder.name)
20+ return
21+
22 # If slave is not building nor waiting, it's not in need of rescuing.
23 if status not in ident_position.keys():
24 return
25@@ -220,7 +234,7 @@
26 else:
27 builder.requestAbort()
28 if logger:
29- logger.warn(
30+ logger.info(
31 "Builder '%s' rescued from '%s': '%s'" %
32 (builder.name, slave_build_id, reason))
33
34
35=== modified file 'lib/lp/buildmaster/tests/test_builder.py'
36--- lib/lp/buildmaster/tests/test_builder.py 2010-08-20 20:31:18 +0000
37+++ lib/lp/buildmaster/tests/test_builder.py 2010-08-23 11:01:11 +0000
38@@ -29,6 +29,10 @@
39 from lp.soyuz.model.binarypackagebuildbehavior import (
40 BinaryPackageBuildBehavior,
41 )
42+from lp.soyuz.tests.soyuzbuilddhelpers import (
43+ AbortedSlave,
44+ MockBuilder,
45+ )
46 from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
47 from lp.testing import TestCaseWithFactory
48 from lp.testing.fakemethod import FakeMethod
49@@ -90,6 +94,26 @@
50 self.assertEqual(0, builder.handleTimeout.call_count)
51
52
53+class Test_rescueBuilderIfLost(TestCaseWithFactory):
54+ """Tests for lp.buildmaster.model.builder.rescueBuilderIfLost."""
55+
56+ layer = LaunchpadZopelessLayer
57+
58+ def test_recovery_of_aborted_slave(self):
59+ # If a slave is in the ABORTED state, rescueBuilderIfLost should
60+ # clean it if we don't think it's currently building anything.
61+ # See bug 463046.
62+ aborted_slave = AbortedSlave()
63+ # The slave's clean() method is normally an XMLRPC call, so we
64+ # can just stub it out and check that it got called.
65+ aborted_slave.clean = FakeMethod()
66+ builder = MockBuilder("mock_builder", aborted_slave)
67+ builder.currentjob = None
68+ builder.rescueIfLost()
69+
70+ self.assertEqual(1, aborted_slave.clean.call_count)
71+
72+
73 class TestFindBuildCandidateBase(TestCaseWithFactory):
74 """Setup the test publisher and some builders."""
75