Merge lp:~julian-edwards/launchpad/builderslave-resume into lp:launchpad

Proposed by Julian Edwards
Status: Merged
Approved by: Julian Edwards
Approved revision: no longer in the source branch.
Merged at revision: 11801
Proposed branch: lp:~julian-edwards/launchpad/builderslave-resume
Merge into: lp:launchpad
Diff against target: 7192 lines (+2211/-3509)
24 files modified
lib/lp/buildmaster/doc/builder.txt (+2/-118)
lib/lp/buildmaster/interfaces/builder.py (+83/-62)
lib/lp/buildmaster/manager.py (+205/-469)
lib/lp/buildmaster/model/builder.py (+240/-224)
lib/lp/buildmaster/model/buildfarmjobbehavior.py (+60/-52)
lib/lp/buildmaster/model/packagebuild.py (+6/-0)
lib/lp/buildmaster/tests/mock_slaves.py (+157/-32)
lib/lp/buildmaster/tests/test_builder.py (+582/-154)
lib/lp/buildmaster/tests/test_manager.py (+248/-782)
lib/lp/buildmaster/tests/test_packagebuild.py (+12/-0)
lib/lp/code/model/recipebuilder.py (+32/-28)
lib/lp/soyuz/browser/tests/test_builder_views.py (+1/-1)
lib/lp/soyuz/doc/buildd-dispatching.txt (+0/-371)
lib/lp/soyuz/doc/buildd-slavescanner.txt (+0/-876)
lib/lp/soyuz/model/binarypackagebuildbehavior.py (+59/-41)
lib/lp/soyuz/tests/test_binarypackagebuildbehavior.py (+290/-8)
lib/lp/soyuz/tests/test_doc.py (+0/-6)
lib/lp/testing/factory.py (+8/-2)
lib/lp/translations/doc/translationtemplatesbuildbehavior.txt (+0/-114)
lib/lp/translations/model/translationtemplatesbuildbehavior.py (+20/-14)
lib/lp/translations/stories/buildfarm/xx-build-summary.txt (+1/-1)
lib/lp/translations/tests/test_translationtemplatesbuildbehavior.py (+202/-153)
lib/lp_sitecustomize.py (+3/-0)
utilities/migrater/file-ownership.txt (+0/-1)
To merge this branch: bzr merge lp:~julian-edwards/launchpad/builderslave-resume
Reviewer: Jonathan Lange (community)
Review status: Approve
Review via email: mp+36351@code.launchpad.net

Description of the change

This is the integration branch for the "fully asynchronous build manager" changes.

Jonathan Lange (jml) wrote :

Looks good. Thanks.

I think that as long as we have resumeSlave, we should keep the tests in 'lib/lp/buildmaster/tests/test_manager.py'. Please revert your changes to that file & land this branch.

Jonathan Lange (jml) :
review: Approve
Jonathan Lange (jml) wrote :

Branch has been pushed to since review. Definitely not approved.

review: Abstain
Julian Edwards (julian-edwards) wrote :

This is now the integration branch. WIP.

Jonathan Lange (jml) wrote :

Some quick comments, almost all of which are shallow. This doesn't count as a proper code review. Importantly, I haven't rigorously checked that the deleted tests have been properly replaced, and I haven't checked a lot of the Deferred-transformation work for correctness.

Will respond soon with a review of the new manager & test_manager.

> === modified file 'lib/lp/buildmaster/interfaces/builder.py'
> --- lib/lp/buildmaster/interfaces/builder.py 2010-09-23 18:17:21 +0000
> +++ lib/lp/buildmaster/interfaces/builder.py 2010-10-08 17:29:52 +0000
> @@ -154,11 +154,6 @@ class IBuilder(IHasOwner):
>

Could you please go through this interface and:

  * Make sure that all methods that return Deferreds are documented as doing
    so.

  * Possibly, group all of the methods that return Deferreds. I think it will
    make the interface clearer.

> === modified file 'lib/lp/buildmaster/model/builder.py'
> --- lib/lp/buildmaster/model/builder.py 2010-09-24 13:39:27 +0000
> +++ lib/lp/buildmaster/model/builder.py 2010-10-18 10:09:54 +0000
...
> @@ -125,24 +112,17 @@ class BuilderSlave(object):
> # many false positives in your test run and will most likely break
> # production.
>

> # XXX: Have a documented interface for the XML-RPC server:
> # - what methods
> # - what return values expected
> # - what faults
> # (see XMLRPCBuildDSlave in lib/canonical/buildd/slave.py).
>

I've filed bug https://bugs.edge.launchpad.net/soyuz/+bug/662599 for this. I
don't think having the XXX here helps.

> # XXX: Once we have a client object with a defined, tested interface, we
> # should make a test double that doesn't do any XML-RPC and can be used to
> # make testing easier & tests faster.
>

This XXX can be safely deleted, I think.

> def getFile(self, sha_sum):
> """Construct a file-like object to return the named file."""
> + # XXX: Change this to do non-blocking IO.

Please file a bug.

...
> + # Twisted API requires string but the configuration provides unicode.
> + resume_argv = [str(term) for term in resume_command.split()]

It's more explicit to do .encode('utf-8'), rather than str().
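
A minimal sketch of that suggestion, using a made-up command string in place of the configured resume command. Under Python 2, str() on a unicode term encodes with the implicit ASCII default and raises UnicodeEncodeError on any non-ASCII character, whereas .encode('utf-8') names the encoding outright:

```python
# Hypothetical stand-in for the value read from
# config.builddmaster.vm_resume_command after 'vm_host' substitution.
resume_command = u"echo localhost-host.ppa"

# Encode each term explicitly rather than relying on str()'s
# implicit default encoding.
resume_argv = [term.encode('utf-8') for term in resume_command.split()]
```

Each element of resume_argv is then a byte string, which is what the process-spawning API is said to require in the quoted comment.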

> def updateStatus(self, logger=None):
> """See `IBuilder`."""
> - updateBuilderStatus(self, logger)
> + # updateBuilderStatus returns a Deferred if the builder timed
> + # out, otherwise it returns a thing that we can wrap in a
> + # defer.succeed. maybeDeferred() handles this for us.
> + return defer.maybeDeferred(updateBuilderStatus, self, logger)
>

This comment seems bogus. As far as I can tell, updateBuilderStatus always
returns a Deferred.

> - if builder_should_be_failed:
> + d = self.resumeSlaveHost()
> + return d
> + else:
> + # XXX: This should really let the failure bubble up to the
> + # scan() method that does the failure counting.
> # Mark builder as 'failed'.

Are you going to fix this in this branch or in another? If so, when?

> logger.warn(
> - "Disabling builder: %s -- %s" % (self.url, error_message),
> - ...

Jonathan Lange (jml) wrote :

Hey Julian,

The new manager is far, far more readable than before. The hard work has paid off.

I mention a lot of small things in the comments below. I'd really appreciate it if you could address them all, since now is the best opportunity we'll have in a while to make this code comprehensible to others.

> # Copyright 2009 Canonical Ltd. This software is licensed under the
> # GNU Affero General Public License version 3 (see the file LICENSE).
>
> """Soyuz buildd slave manager logic."""
>
> __metaclass__ = type
>
> __all__ = [
> 'BuilddManager',
> 'BUILDD_MANAGER_LOG_NAME',
> 'buildd_success_result_map',
> ]
>
> import logging
>
> import transaction
> from twisted.application import service
> from twisted.internet import (
> defer,
> reactor,
> )
> from twisted.internet.task import LoopingCall
> from twisted.python import log
> from zope.component import getUtility
>
> from lp.buildmaster.enums import BuildStatus
> from lp.buildmaster.interfaces.buildfarmjobbehavior import (
> BuildBehaviorMismatch,
> )
> from lp.buildmaster.interfaces.builder import (
> BuildDaemonError,
> BuildSlaveFailure,
> CannotBuild,
> CannotResumeHost,
> )
>
>
> BUILDD_MANAGER_LOG_NAME = "slave-scanner"
>
>
> buildd_success_result_map = {
> 'ensurepresent': True,
> 'build': 'BuilderStatus.BUILDING',
> }
>

You can delete this now. Yay. (Don't forget the __all__ too).

> class SlaveScanner:
> """A manager for a single builder."""
>
> SCAN_INTERVAL = 5
>

Can you please add a comment explaining what this means, what unit it's in,
and hinting at why 5 is a good number for it?

> def startCycle(self):
> """Scan the builder and dispatch to it or deal with failures."""
> self.loop = LoopingCall(self._startCycle)
> self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
> return self.stopping_deferred
>
> def _startCycle(self):
> # Same as _startCycle but the next cycle is not scheduled. This
> # is so tests can initiate a single scan.

This comment is obsolete. Also, there's probably a better name than
"_startCycle", since this is pretty much doing the scan. Perhaps 'oneCycle'.

> def _scanFailed(self, failure):
> # Trap known exceptions and print a message without a
> # stack trace in that case, or if we don't know about it,
> # include the trace.
>

This comment is also obsolete. Although the traceback/no-traceback logic is
still here, it's hardly the point of the method.

> # Paranoia.
> transaction.abort()
>

Can you please explain in the comment exactly what you are being paranoid
about?

> error_message = failure.getErrorMessage()
> if failure.check(
> BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch,
> CannotResumeHost, BuildDaemonError):
> self.logger.info("Scanning failed with: %s" % error_message)
> else:
> self.logger.info("Scanning failed with: %s\n%s" %
> (failure.getErrorMessage(), failure.getTraceback()))
>
> builder = get_builder(self.builder_name)
>

Shouldn...

review: Needs Fixing
Julian Edwards (julian-edwards) wrote :

On Monday 18 October 2010 11:59:57 Jonathan Lange wrote:
> Some quick comments, almost all of which are shallow. This doesn't count as
> a proper code review. Importantly, I haven't rigorously checked that the
> deleted tests have been properly replaced, and I haven't checked a lot of
> the Deferred-transformation work for correctness.
>
> Will respond soon with a review of the new manager & test_manager.

Cheers, I replied inline.

>
> > === modified file 'lib/lp/buildmaster/interfaces/builder.py'
> > --- lib/lp/buildmaster/interfaces/builder.py 2010-09-23 18:17:21 +0000
> > +++ lib/lp/buildmaster/interfaces/builder.py 2010-10-08 17:29:52 +0000
>
> > @@ -154,11 +154,6 @@ class IBuilder(IHasOwner):
> Could you please go through this interface and:
>
> * Make sure that all methods that return Deferreds are documented as
> doing so.
>
> * Possibly, group all of the methods that return Deferreds. I think it
> will make the interface clearer.

I've done both of these as you suggest.

> > === modified file 'lib/lp/buildmaster/model/builder.py'
> > --- lib/lp/buildmaster/model/builder.py 2010-09-24 13:39:27 +0000
> > +++ lib/lp/buildmaster/model/builder.py 2010-10-18 10:09:54 +0000
>
> ...
>
> > @@ -125,24 +112,17 @@ class BuilderSlave(object):
> > # many false positives in your test run and will most likely break
> > # production.
> >
> >
> > # XXX: Have a documented interface for the XML-RPC server:
> > # - what methods
> > # - what return values expected
> > # - what faults
> > # (see XMLRPCBuildDSlave in lib/canonical/buildd/slave.py).
>
> I've filed bug https://bugs.edge.launchpad.net/soyuz/+bug/662599 for this.
> I don't think having the XXX here helps.

Right, I've deleted it.

>
> > # XXX: Once we have a client object with a defined, tested interface, we
> > # should make a test double that doesn't do any XML-RPC and can be used to
> > # make testing easier & tests faster.
>
> This XXX can be safely deleted, I think.

Done.

>
> > def getFile(self, sha_sum):
> > """Construct a file-like object to return the named file."""
> >
> > + # XXX: Change this to do non-blocking IO.
>
> Please file a bug.

https://bugs.edge.launchpad.net/soyuz/+bug/662631

>
> ...
>
> > + # Twisted API requires string but the configuration provides unicode.
> > + resume_argv = [str(term) for term in resume_command.split()]
>
> It's more explicit to do .encode('utf-8'), rather than str().

Not my code, but I've done that.

>
> > def updateStatus(self, logger=None):
> > """See `IBuilder`."""
> >
> > - updateBuilderStatus(self, logger)
> > + # updateBuilderStatus returns a Deferred if the builder timed
> > + # out, otherwise it returns a thing that we can wrap in a
> > + # defer.succeed. maybeDeferred() handles this for us.
> > + return defer.maybeDeferred(updateBuilderStatus, self, logger)
>
> This comment seems bogus. As far as I can tell, updateBuilderStatus always
> returns a Deferred.

It does, so I've removed the comment and the maybeDeferred().

>
> > - if builder_shoul...

Julian Edwards (julian-edwards) wrote :

On Monday 18 October 2010 13:21:45 you wrote:
> Review: Needs Fixing
> Hey Julian,
>
> The new manager is far, far more readable than before. The hard work has
> paid off.

\o/

> I mention a lot of small things in the comments below. I'd really
> appreciate it if you could address them all, since now is the best
> opportunity we'll have in a while to make this code comprehensible to
> others.

I shall endeavour to do so.

> > # Copyright 2009 Canonical Ltd. This software is licensed under the
> > # GNU Affero General Public License version 3 (see the file LICENSE).
> >
> > """Soyuz buildd slave manager logic."""
> >
> > __metaclass__ = type
> >
> > __all__ = [
> >
> > 'BuilddManager',
> > 'BUILDD_MANAGER_LOG_NAME',
> > 'buildd_success_result_map',
> > ]
> >
> > import logging
> >
> > import transaction
> > from twisted.application import service
> > from twisted.internet import (
> >
> > defer,
> > reactor,
> > )
> >
> > from twisted.internet.task import LoopingCall
> > from twisted.python import log
> > from zope.component import getUtility
> >
> > from lp.buildmaster.enums import BuildStatus
> > from lp.buildmaster.interfaces.buildfarmjobbehavior import (
> >
> > BuildBehaviorMismatch,
> > )
> >
> > from lp.buildmaster.interfaces.builder import (
> >
> > BuildDaemonError,
> > BuildSlaveFailure,
> > CannotBuild,
> > CannotResumeHost,
> > )
> >
> > BUILDD_MANAGER_LOG_NAME = "slave-scanner"
> >
> >
> > buildd_success_result_map = {
> >
> > 'ensurepresent': True,
> > 'build': 'BuilderStatus.BUILDING',
> > }
>
> You can delete this now. Yay. (Don't forget the __all__ too).

Right!

>
> > class SlaveScanner:
> > """A manager for a single builder."""
> >
> > SCAN_INTERVAL = 5
>
> Can you please add a comment explaining what this means, what unit it's in,
> and hinting at why 5 is a good number for it?

Done.

>
> > def startCycle(self):
> > """Scan the builder and dispatch to it or deal with failures."""
> > self.loop = LoopingCall(self._startCycle)
> > self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
> > return self.stopping_deferred
> >
> > def _startCycle(self):
> > # Same as _startCycle but the next cycle is not scheduled. This
> > # is so tests can initiate a single scan.
>
> This comment is obsolete. Also, there's probably a better name than
> "_startCycle", since this is pretty much doing the scan. Perhaps
> 'oneCycle'.

I named it singleCycle()

>
> > def _scanFailed(self, failure):
> > # Trap known exceptions and print a message without a
> > # stack trace in that case, or if we don't know about it,
> > # include the trace.
>
> This comment is also obsolete. Although the traceback/no-traceback logic
> is still here, it's hardly the point of the method.

Spruced up as discussed in IRC.

>
> > # Paranoia.
> > transaction.abort()
>
> Can you please explain in the comment exactly what you are being paranoid
> about?

Yup, done.

>
> > error_message = failure.getErrorMessage()
> > if failure.check(
...

Jonathan Lange (jml) wrote :

On Mon, Oct 18, 2010 at 4:49 PM, Julian Edwards
<email address hidden> wrote:
> On Monday 18 October 2010 13:21:45 you wrote:
>> Review: Needs Fixing
>> Hey Julian,
>>
>> The new manager is far, far more readable than before.  The hard work has
>> paid off.
>
> \o/
>
>> I mention a lot of small things in the comments below.  I'd really
>> appreciate it if you could address them all, since now is the best
>> opportunity we'll have in a while to make this code comprehensible to
>> others.
>
> I shall endeavour to do so.
>
...
>>
>> > class SlaveScanner:
>> >     """A manager for a single builder."""
>> >
>> >     SCAN_INTERVAL = 5
>>
>> Can you please add a comment explaining what this means, what unit it's in,
>> and hinting at why 5 is a good number for it?
>
> Done.
>

...
>>
>> >         """
>> >         # We need to re-fetch the builder object on each cycle as the
>> >         # Storm store is invalidated over transaction boundaries.
>>
>> This method is complicated enough that I think it would benefit from a
>> prose summary of what it does.
>>
>> For example:
>>
>>   If the builder is OK [XXX - I don't know what this actually means - jml],
>>   then update its status [XXX - this seems redundant, didn't we just check
>>   that it was OK? - jml].  If we think there's an active build on the
>>   builder, then check to see if it's done (builder.updateBuild).  If it
>>   is done, then check that it's available and not in manual mode, then
>>   dispatch the job to the build.
>>
>> Well, that's my best guess.  I think it's only worth describing the happy
>> case, since the unusual cases will be easy enough to follow in the code.
>
> I've written something - let me know if you think it's easy enough to follow.
> It does require some knowledge of how the farm works and I don't think code
> comments are the best place for explaining that.
>

It's good. Much more helpful, thanks.

>> > class BuilddManager(service.Service):
>> >     """Main Buildd Manager service class."""
>> >
>> >     def __init__(self, clock=None):
>> >         self.builder_slaves = []
>> >         self.logger = self._setupLogger()
>>
>> Given that _setupLogger changes global state, it's better to put it in
>> startService.
>
> Nocando (which is a bit near Katmandu) - the NewBuildersScanner needs it
> there.  (More below on this section of code)
>

>>
>> >         self.new_builders_scanner = NewBuildersScanner(
>> >
>> >             manager=self, clock=clock)
>> >
>> >     def _setupLogger(self):
>> >         """Setup a 'slave-scanner' logger that redirects to twisted.
>>
>> FWIW, "Setup" is a noun. "Set up" is a verb.
>
> Setup is not a word, it's a typo.
>

>> > class TestSlaveScannerScan(TrialTestCase):
>> >     """Tests `SlaveScanner.scan` method.
>> >
>> >     This method uses the old framework for scanning and dispatching
>> >     builds. """
>> >     layer = LaunchpadZopelessLayer
>> >
>> >     def setUp(self):
>> >         """Setup TwistedLayer, TrialTestCase and BuilddSlaveTest.
>> >
>> >         Also adjust the sampledata in a way a build can be dispatched to
>> >         'bob' builder.
>> >         """
>> >         from lp.soyuz.tests.test_publishing import...

Jonathan Lange (jml) wrote :

Add the XXXs as recommended in the last review comment and please land!

review: Approve
Jonathan Lange (jml) wrote :

Oh, forgot about the two WIP things mentioned in your reply.

review: Needs Fixing
Jonathan Lange (jml) wrote :

The new code looks good. You should add an explanation about why you don't disable builders as soon as they've got a failure, i.e. why the threshold exists at all.
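
For readers following along, here is a rough sketch of what a failure threshold does. It is not the code in this branch. FAILURE_THRESHOLD, SketchBuilder, record_scan_result, and the builderok and failnotes attributes are names invented for illustration; only failure_count, resetFailureCount and failBuilder appear in the IBuilder interface shown in the diff. The assumed rationale, that one transient error should not take a builder out of the pool, is a guess at what the requested explanation would say.

```python
FAILURE_THRESHOLD = 5  # hypothetical value, not taken from the branch


class SketchBuilder:
    """Stand-in exposing only the failure-counting parts of IBuilder."""

    def __init__(self):
        self.failure_count = 0   # consecutive failures seen so far
        self.builderok = True    # assumed flag: False once disabled
        self.failnotes = None    # assumed field: reason for disabling

    def resetFailureCount(self):
        self.failure_count = 0

    def failBuilder(self, reason):
        self.builderok = False
        self.failnotes = reason


def record_scan_result(builder, error=None):
    """Count consecutive failures; disable only at the threshold."""
    if error is None:
        # A clean scan wipes out any earlier transient failures.
        builder.resetFailureCount()
        return
    builder.failure_count += 1
    if builder.failure_count >= FAILURE_THRESHOLD:
        builder.failBuilder(error)
```

With this shape, a builder that fails a few times and then recovers is never disabled, while one that fails FAILURE_THRESHOLD scans in a row is marked as failed with the last error as the reason.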

Also, in my previous review just prior to commit r11699, I suggested adding several XXXs. Could you please do that.

Do both of these things, and then land.

review: Approve
Julian Edwards (julian-edwards) wrote :

On Tuesday 19 October 2010 16:39:43 you wrote:
> Review: Approve
> The new code looks good. You should add an explanation about why you don't
> disable builders as soon as they've got a failure, i.e. why the threshold
> exists at all.

Roger.

> Also, in my previous review just prior to commit r11699, I suggested adding
> several XXXs. Could you please do that.

Grar, I forgot to finish that, thanks for reminding me.

> Do both of these things, and then land.

I'm not going to land it right away. I want to soak test it on dogfood first,
borrowing some builders from production. Once I'm happy with it on there,
it's time to let loose the hounds.

Thanks for all your help on this branch! You'll notice the large list of bugs
I linked to it just now. We've fixed all of those in this branch, and
probably more.

Cheers.

Preview Diff

1=== modified file 'lib/lp/buildmaster/doc/builder.txt'
2--- lib/lp/buildmaster/doc/builder.txt 2010-09-23 12:35:21 +0000
3+++ lib/lp/buildmaster/doc/builder.txt 2010-10-25 19:14:01 +0000
4@@ -19,9 +19,6 @@
5 As expected, it implements IBuilder.
6
7 >>> from canonical.launchpad.webapp.testing import verifyObject
8- >>> from lp.buildmaster.interfaces.builder import IBuilder
9- >>> verifyObject(IBuilder, builder)
10- True
11
12 >>> print builder.name
13 bob
14@@ -86,7 +83,7 @@
15 The 'new' method will create a new builder in the database.
16
17 >>> bnew = builderset.new(1, 'http://dummy.com:8221/', 'dummy',
18- ... 'Dummy Title', 'eh ?', 1)
19+ ... 'Dummy Title', 'eh ?', 1)
20 >>> bnew.name
21 u'dummy'
22
23@@ -170,7 +167,7 @@
24 >>> recipe_bq.processor = i386_family.processors[0]
25 >>> recipe_bq.virtualized = True
26 >>> transaction.commit()
27-
28+
29 >>> queue_sizes = builderset.getBuildQueueSizes()
30 >>> print queue_sizes['virt']['386']
31 (1L, datetime.timedelta(0, 64))
32@@ -188,116 +185,3 @@
33
34 >>> print queue_sizes['virt']['386']
35 (2L, datetime.timedelta(0, 128))
36-
37-
38-Resuming buildd slaves
39-======================
40-
41-Virtual slaves are resumed using a command specified in the
42-configuration profile. Production configuration uses a SSH trigger
43-account accessed via a private key available in the builddmaster
44-machine (which used ftpmaster configuration profile) as in:
45-
46-{{{
47-ssh ~/.ssh/ppa-reset-key ppa@%(vm_host)s
48-}}}
49-
50-The test configuration uses a fake command that can be performed in
51-development machine and allow us to tests the important features used
52-in production, as 'vm_host' variable replacement.
53-
54- >>> from canonical.config import config
55- >>> config.builddmaster.vm_resume_command
56- 'echo %(vm_host)s'
57-
58-Before performing the command, it checks if the builder is indeed
59-virtual and raises CannotResumeHost if it isn't.
60-
61- >>> bob = getUtility(IBuilderSet)['bob']
62- >>> bob.resumeSlaveHost()
63- Traceback (most recent call last):
64- ...
65- CannotResumeHost: Builder is not virtualized.
66-
67-For testing purposes resumeSlaveHost returns the stdout and stderr
68-buffer resulted from the command.
69-
70- >>> frog = getUtility(IBuilderSet)['frog']
71- >>> out, err = frog.resumeSlaveHost()
72- >>> print out.strip()
73- localhost-host.ppa
74-
75-If the specified command fails, resumeSlaveHost also raises
76-CannotResumeHost exception with the results stdout and stderr.
77-
78- # The command must have a vm_host dict key and when executed,
79- # have a returncode that is not 0.
80- >>> vm_resume_command = """
81- ... [builddmaster]
82- ... vm_resume_command: test "%(vm_host)s = 'false'"
83- ... """
84- >>> config.push('vm_resume_command', vm_resume_command)
85- >>> frog.resumeSlaveHost()
86- Traceback (most recent call last):
87- ...
88- CannotResumeHost: Resuming failed:
89- OUT:
90- <BLANKLINE>
91- ERR:
92- <BLANKLINE>
93-
94-Restore default value for resume command.
95-
96- >>> config_data = config.pop('vm_resume_command')
97-
98-
99-Rescuing lost slaves
100-====================
101-
102-Builder.rescueIfLost() checks the build ID reported in the slave status
103-against the database. If it isn't building what we think it should be,
104-the current build will be aborted and the slave cleaned in preparation
105-for a new task. The decision about the slave's correctness is left up
106-to IBuildFarmJobBehavior.verifySlaveBuildCookie -- for these examples we
107-will use a special behavior that just checks if the cookie reads 'good'.
108-
109- >>> import logging
110- >>> from lp.buildmaster.interfaces.builder import CorruptBuildCookie
111- >>> from lp.buildmaster.tests.mock_slaves import (
112- ... BuildingSlave, MockBuilder, OkSlave, WaitingSlave)
113-
114- >>> class TestBuildBehavior:
115- ... def verifySlaveBuildCookie(self, cookie):
116- ... if cookie != 'good':
117- ... raise CorruptBuildCookie('Bad value')
118-
119- >>> def rescue_slave_if_lost(slave):
120- ... builder = MockBuilder('mock', slave, TestBuildBehavior())
121- ... builder.rescueIfLost(logging.getLogger())
122-
123-An idle slave is not rescued.
124-
125- >>> rescue_slave_if_lost(OkSlave())
126-
127-Slaves building or having built the correct build are not rescued
128-either.
129-
130- >>> rescue_slave_if_lost(BuildingSlave(build_id='good'))
131- >>> rescue_slave_if_lost(WaitingSlave(build_id='good'))
132-
133-But if a slave is building the wrong ID, it is declared lost and
134-an abort is attempted. MockSlave prints out a message when it is aborted
135-or cleaned.
136-
137- >>> rescue_slave_if_lost(BuildingSlave(build_id='bad'))
138- Aborting slave
139- INFO:root:Builder 'mock' rescued from 'bad': 'Bad value'
140-
141-Slaves having completed an incorrect build are also declared lost,
142-but there's no need to abort a completed build. Such builders are
143-instead simply cleaned, ready for the next build.
144-
145- >>> rescue_slave_if_lost(WaitingSlave(build_id='bad'))
146- Cleaning slave
147- INFO:root:Builder 'mock' rescued from 'bad': 'Bad value'
148-
149
150=== modified file 'lib/lp/buildmaster/interfaces/builder.py'
151--- lib/lp/buildmaster/interfaces/builder.py 2010-09-23 18:17:21 +0000
152+++ lib/lp/buildmaster/interfaces/builder.py 2010-10-25 19:14:01 +0000
153@@ -154,11 +154,6 @@
154
155 currentjob = Attribute("BuildQueue instance for job being processed.")
156
157- is_available = Bool(
158- title=_("Whether or not a builder is available for building "
159- "new jobs. "),
160- required=False)
161-
162 failure_count = Int(
163 title=_('Failure Count'), required=False, default=0,
164 description=_("Number of consecutive failures for this builder."))
165@@ -173,32 +168,74 @@
166 def resetFailureCount():
167 """Set the failure_count back to zero."""
168
169- def checkSlaveAlive():
170- """Check that the buildd slave is alive.
171-
172- This pings the slave over the network via the echo method and looks
173- for the sent message as the reply.
174-
175- :raises BuildDaemonError: When the slave is down.
176+ def failBuilder(reason):
177+ """Mark builder as failed for a given reason."""
178+
179+ def setSlaveForTesting(proxy):
180+ """Sets the RPC proxy through which to operate the build slave."""
181+
182+ def verifySlaveBuildCookie(slave_build_id):
183+ """Verify that a slave's build cookie is consistent.
184+
185+ This should delegate to the current `IBuildFarmJobBehavior`.
186+ """
187+
188+ def transferSlaveFileToLibrarian(file_sha1, filename, private):
189+ """Transfer a file from the slave to the librarian.
190+
191+ :param file_sha1: The file's sha1, which is how the file is addressed
192+ in the slave XMLRPC protocol. Specially, the file_sha1 'buildlog'
193+ will cause the build log to be retrieved and gzipped.
194+ :param filename: The name of the file to be given to the librarian file
195+ alias.
196+ :param private: True if the build is for a private archive.
197+ :return: A librarian file alias.
198+ """
199+
200+ def getBuildQueue():
201+ """Return a `BuildQueue` if there's an active job on this builder.
202+
203+ :return: A BuildQueue, or None.
204+ """
205+
206+ def getCurrentBuildFarmJob():
207+ """Return a `BuildFarmJob` for this builder."""
208+
209+ # All methods below here return Deferred.
210+
211+ def isAvailable():
212+ """Whether or not a builder is available for building new jobs.
213+
214+ :return: A Deferred that fires with True or False, depending on
215+ whether the builder is available or not.
216 """
217
218 def rescueIfLost(logger=None):
219 """Reset the slave if its job information doesn't match the DB.
220
221- If the builder is BUILDING or WAITING but has a build ID string
222- that doesn't match what is stored in the DB, we have to dismiss
223- its current actions and clean the slave for another job, assuming
224- the XMLRPC is working properly at this point.
225+ This checks the build ID reported in the slave status against the
226+ database. If it isn't building what we think it should be, the current
227+ build will be aborted and the slave cleaned in preparation for a new
228+ task. The decision about the slave's correctness is left up to
229+ `IBuildFarmJobBehavior.verifySlaveBuildCookie`.
230+
231+ :return: A Deferred that fires when the dialog with the slave is
232+ finished. It does not have a return value.
233 """
234
235 def updateStatus(logger=None):
236- """Update the builder's status by probing it."""
237+ """Update the builder's status by probing it.
238+
239+ :return: A Deferred that fires when the dialog with the slave is
240+ finished. It does not have a return value.
241+ """
242
243 def cleanSlave():
244- """Clean any temporary files from the slave."""
245-
246- def failBuilder(reason):
247- """Mark builder as failed for a given reason."""
248+ """Clean any temporary files from the slave.
249+
250+ :return: A Deferred that fires when the dialog with the slave is
251+ finished. It does not have a return value.
252+ """
253
254 def requestAbort():
255 """Ask that a build be aborted.
256@@ -206,6 +243,9 @@
257 This takes place asynchronously: Actually killing everything running
258 can take some time so the slave status should be queried again to
259 detect when the abort has taken effect. (Look for status ABORTED).
260+
261+ :return: A Deferred that fires when the dialog with the slave is
262+ finished. It does not have a return value.
263 """
264
265 def resumeSlaveHost():
266@@ -217,37 +257,35 @@
267 :raises: CannotResumeHost: if builder is not virtual or if the
268 configuration command has failed.
269
270- :return: command stdout and stderr buffers as a tuple.
271+ :return: A Deferred that fires when the resume operation finishes,
272+ whose value is a (stdout, stderr) tuple for success, or a Failure
273+ whose value is a CannotResumeHost exception.
274 """
275
276- def setSlaveForTesting(proxy):
277- """Sets the RPC proxy through which to operate the build slave."""
278-
279 def slaveStatus():
280 """Get the slave status for this builder.
281
282- :return: a dict containing at least builder_status, but potentially
283- other values included by the current build behavior.
284+ :return: A Deferred which fires when the slave dialog is complete.
285+ Its value is a dict containing at least builder_status, but
286+ potentially other values included by the current build
287+ behavior.
288 """
289
290 def slaveStatusSentence():
291 """Get the slave status sentence for this builder.
292
293- :return: A tuple with the first element containing the slave status,
294- build_id-queue-id and then optionally more elements depending on
295- the status.
296- """
297-
298- def verifySlaveBuildCookie(slave_build_id):
299- """Verify that a slave's build cookie is consistent.
300-
301- This should delegate to the current `IBuildFarmJobBehavior`.
302+ :return: A Deferred which fires when the slave dialog is complete.
303+ Its value is a tuple with the first element containing the
304+ slave status, build_id-queue-id and then optionally more
305+ elements depending on the status.
306 """
307
308 def updateBuild(queueItem):
309 """Verify the current build job status.
310
311 Perform the required actions for each state.
312+
313+ :return: A Deferred that fires when the slave dialog is finished.
314 """
315
316 def startBuild(build_queue_item, logger):
317@@ -255,21 +293,10 @@
318
319 :param build_queue_item: A BuildQueueItem to build.
320 :param logger: A logger to be used to log diagnostic information.
321- :raises BuildSlaveFailure: When the build slave fails.
322- :raises CannotBuild: When a build cannot be started for some reason
323- other than the build slave failing.
324- """
325-
326- def transferSlaveFileToLibrarian(file_sha1, filename, private):
327- """Transfer a file from the slave to the librarian.
328-
329- :param file_sha1: The file's sha1, which is how the file is addressed
330- in the slave XMLRPC protocol. Specially, the file_sha1 'buildlog'
331- will cause the build log to be retrieved and gzipped.
332- :param filename: The name of the file to be given to the librarian file
333- alias.
334- :param private: True if the build is for a private archive.
335- :return: A librarian file alias.
336+
337+ :return: A Deferred that fires after the dispatch has completed whose
338+ value is None, or a Failure that contains an exception
339+ explaining what went wrong.
340 """
341
342 def handleTimeout(logger, error_message):
343@@ -284,6 +311,8 @@
344
345 :param logger: The logger object to be used for logging.
346 :param error_message: The error message to be used for logging.
347+ :return: A Deferred that fires after the virtual slave was resumed
348+ or immediately if it's a non-virtual slave.
349 """
350
351 def findAndStartJob(buildd_slave=None):
352@@ -291,17 +320,9 @@
353
354 :param buildd_slave: An optional buildd slave that this builder should
355 talk to.
356- :return: the `IBuildQueue` instance found or None if no job was found.
357- """
358-
359- def getBuildQueue():
360- """Return a `BuildQueue` if there's an active job on this builder.
361-
362- :return: A BuildQueue, or None.
363- """
364-
365- def getCurrentBuildFarmJob():
366- """Return a `BuildFarmJob` for this builder."""
367+ :return: A Deferred whose value is the `IBuildQueue` instance
368+ found or None if no job was found.
369+ """
370
371
372 class IBuilderSet(Interface):
373
374=== modified file 'lib/lp/buildmaster/manager.py'
375--- lib/lp/buildmaster/manager.py 2010-09-24 15:40:49 +0000
376+++ lib/lp/buildmaster/manager.py 2010-10-25 19:14:01 +0000
377@@ -10,13 +10,10 @@
378 'BuilddManager',
379 'BUILDD_MANAGER_LOG_NAME',
380 'FailDispatchResult',
381- 'RecordingSlave',
382 'ResetDispatchResult',
383- 'buildd_success_result_map',
384 ]
385
386 import logging
387-import os
388
389 import transaction
390 from twisted.application import service
391@@ -24,129 +21,27 @@
392 defer,
393 reactor,
394 )
395-from twisted.protocols.policies import TimeoutMixin
396+from twisted.internet.task import LoopingCall
397 from twisted.python import log
398-from twisted.python.failure import Failure
399-from twisted.web import xmlrpc
400 from zope.component import getUtility
401
402-from canonical.config import config
403-from canonical.launchpad.webapp import urlappend
404-from lp.services.database import write_transaction
405 from lp.buildmaster.enums import BuildStatus
406-from lp.services.twistedsupport.processmonitor import ProcessWithTimeout
407+from lp.buildmaster.interfaces.buildfarmjobbehavior import (
408+ BuildBehaviorMismatch,
409+ )
410+from lp.buildmaster.model.builder import Builder
411+from lp.buildmaster.interfaces.builder import (
412+ BuildDaemonError,
413+ BuildSlaveFailure,
414+ CannotBuild,
415+ CannotFetchFile,
416+ CannotResumeHost,
417+ )
418
419
420 BUILDD_MANAGER_LOG_NAME = "slave-scanner"
421
422
423-buildd_success_result_map = {
424- 'ensurepresent': True,
425- 'build': 'BuilderStatus.BUILDING',
426- }
427-
428-
429-class QueryWithTimeoutProtocol(xmlrpc.QueryProtocol, TimeoutMixin):
430- """XMLRPC query protocol with a configurable timeout.
431-
432- XMLRPC queries using this protocol will be unconditionally closed
433- when the timeout is elapsed. The timeout is fetched from the context
434- Launchpad configuration file (`config.builddmaster.socket_timeout`).
435- """
436- def connectionMade(self):
437- xmlrpc.QueryProtocol.connectionMade(self)
438- self.setTimeout(config.builddmaster.socket_timeout)
439-
440-
441-class QueryFactoryWithTimeout(xmlrpc._QueryFactory):
442- """XMLRPC client factory with timeout support."""
443- # Make this factory quiet.
444- noisy = False
445- # Use the protocol with timeout support.
446- protocol = QueryWithTimeoutProtocol
447-
448-
449-class RecordingSlave:
450- """An RPC proxy for buildd slaves that records instructions to the latter.
451-
452- The idea here is to merely record the instructions that the slave-scanner
453- issues to the buildd slaves and "replay" them a bit later in asynchronous
454- and parallel fashion.
455-
456- By dealing with a number of buildd slaves in parallel we remove *the*
457- major slave-scanner throughput issue while avoiding large-scale changes to
458- its code base.
459- """
460-
461- def __init__(self, name, url, vm_host):
462- self.name = name
463- self.url = url
464- self.vm_host = vm_host
465-
466- self.resume_requested = False
467- self.calls = []
468-
469- def __repr__(self):
470- return '<%s:%s>' % (self.name, self.url)
471-
472- def cacheFile(self, logger, libraryfilealias):
473- """Cache the file on the server."""
474- self.ensurepresent(
475- libraryfilealias.content.sha1, libraryfilealias.http_url, '', '')
476-
477- def sendFileToSlave(self, *args):
478- """Helper to send a file to this builder."""
479- return self.ensurepresent(*args)
480-
481- def ensurepresent(self, *args):
482- """Download files needed for the build."""
483- self.calls.append(('ensurepresent', args))
484- result = buildd_success_result_map.get('ensurepresent')
485- return [result, 'Download']
486-
487- def build(self, *args):
488- """Perform the build."""
489- # XXX: This method does not appear to be used.
490- self.calls.append(('build', args))
491- result = buildd_success_result_map.get('build')
492- return [result, args[0]]
493-
494- def resume(self):
495- """Record the request to resume the builder..
496-
497- Always succeed.
498-
499- :return: a (stdout, stderr, subprocess exitcode) triple
500- """
501- self.resume_requested = True
502- return ['', '', 0]
503-
504- def resumeSlave(self, clock=None):
505- """Resume the builder in a asynchronous fashion.
506-
507- Used the configuration command-line in the same way
508- `BuilddSlave.resume` does.
509-
510- Also use the builddmaster configuration 'socket_timeout' as
511- the process timeout.
512-
513- :param clock: An optional twisted.internet.task.Clock to override
514- the default clock. For use in tests.
515-
516- :return: a Deferred
517- """
518- resume_command = config.builddmaster.vm_resume_command % {
519- 'vm_host': self.vm_host}
520- # Twisted API require string and the configuration provides unicode.
521- resume_argv = [str(term) for term in resume_command.split()]
522-
523- d = defer.Deferred()
524- p = ProcessWithTimeout(
525- d, config.builddmaster.socket_timeout, clock=clock)
526- p.spawnProcess(resume_argv[0], tuple(resume_argv))
527- return d
528-
529-
530 def get_builder(name):
531 """Helper to return the builder given the slave for this request."""
532 # Avoiding circular imports.
533@@ -159,9 +54,12 @@
534 # builder.currentjob hides a complicated query, don't run it twice.
535 # See bug 623281.
536 current_job = builder.currentjob
537- build_job = current_job.specific_job.build
538+ if current_job is None:
539+ job_failure_count = 0
540+ else:
541+ job_failure_count = current_job.specific_job.build.failure_count
542
543- if builder.failure_count == build_job.failure_count:
544+ if builder.failure_count == job_failure_count and current_job is not None:
545 # If the failure count for the builder is the same as the
546 # failure count for the job being built, then we cannot
547 # tell whether the job or the builder is at fault. The best
548@@ -170,17 +68,28 @@
549 current_job.reset()
550 return
551
552- if builder.failure_count > build_job.failure_count:
553+ if builder.failure_count > job_failure_count:
554 # The builder has failed more than the jobs it's been
555- # running, so let's disable it and re-schedule the build.
556- builder.failBuilder(fail_notes)
557- current_job.reset()
558+ # running.
559+
560+ # Re-schedule the build if there is one.
561+ if current_job is not None:
562+ current_job.reset()
563+
564+ # We are a little more tolerant with failing builders than
565+ # failing jobs because sometimes they get unresponsive due to
566+ # human error, flaky networks etc. We expect the builder to get
567+ # better, whereas jobs are very unlikely to get better.
568+ if builder.failure_count >= Builder.FAILURE_THRESHOLD:
569+ # It's also gone over the threshold so let's disable it.
570+ builder.failBuilder(fail_notes)
571 else:
572 # The job is the culprit! Override its status to 'failed'
573 # to make sure it won't get automatically dispatched again,
574 # and remove the buildqueue request. The failure should
575 # have already caused any relevant slave data to be stored
576 # on the build record so don't worry about that here.
577+ build_job = current_job.specific_job.build
578 build_job.status = BuildStatus.FAILEDTOBUILD
579 builder.currentjob.destroySelf()
580
581@@ -190,133 +99,108 @@
582 # next buildd scan.
583
584
585-class BaseDispatchResult:
586- """Base class for *DispatchResult variations.
587-
588- It will be extended to represent dispatching results and allow
589- homogeneous processing.
590- """
591-
592- def __init__(self, slave, info=None):
593- self.slave = slave
594- self.info = info
595-
596- def _cleanJob(self, job):
597- """Clean up in case of builder reset or dispatch failure."""
598- if job is not None:
599- job.reset()
600-
601- def assessFailureCounts(self):
602- """View builder/job failure_count and work out which needs to die.
603-
604- :return: True if we disabled something, False if we did not.
605- """
606- builder = get_builder(self.slave.name)
607- assessFailureCounts(builder, self.info)
608-
609- def ___call__(self):
610- raise NotImplementedError(
611- "Call sites must define an evaluation method.")
612-
613-
614-class FailDispatchResult(BaseDispatchResult):
615- """Represents a communication failure while dispatching a build job..
616-
617- When evaluated this object mark the corresponding `IBuilder` as
618- 'NOK' with the given text as 'failnotes'. It also cleans up the running
619- job (`IBuildQueue`).
620- """
621-
622- def __repr__(self):
623- return '%r failure (%s)' % (self.slave, self.info)
624-
625- @write_transaction
626- def __call__(self):
627- self.assessFailureCounts()
628-
629-
630-class ResetDispatchResult(BaseDispatchResult):
631- """Represents a failure to reset a builder.
632-
633- When evaluated this object simply cleans up the running job
634- (`IBuildQueue`) and marks the builder down.
635- """
636-
637- def __repr__(self):
638- return '%r reset failure' % self.slave
639-
640- @write_transaction
641- def __call__(self):
642- builder = get_builder(self.slave.name)
643- # Builders that fail to reset should be disabled as per bug
644- # 563353.
645- # XXX Julian bug=586362
646- # This is disabled until this code is not also used for dispatch
647- # failures where we *don't* want to disable the builder.
648- # builder.failBuilder(self.info)
649- self._cleanJob(builder.currentjob)
650-
651-
652 class SlaveScanner:
653 """A manager for a single builder."""
654
655+ # The interval between each poll cycle, in seconds. We'd ideally
656+ # like this to be lower but 5 seems a reasonable compromise between
657+ # responsiveness and load on the database server, since in each cycle
658+ # we can run quite a few queries.
659 SCAN_INTERVAL = 5
660
661- # These are for the benefit of tests; see `TestingSlaveScanner`.
662- # It pokes fake versions in here so that it can verify methods were
663- # called. The tests should really be using FakeMethod() though.
664- reset_result = ResetDispatchResult
665- fail_result = FailDispatchResult
666-
667 def __init__(self, builder_name, logger):
668 self.builder_name = builder_name
669 self.logger = logger
670- self._deferred_list = []
671-
672- def scheduleNextScanCycle(self):
673- """Schedule another scan of the builder some time in the future."""
674- self._deferred_list = []
675- # XXX: Change this to use LoopingCall.
676- reactor.callLater(self.SCAN_INTERVAL, self.startCycle)
677
678 def startCycle(self):
679 """Scan the builder and dispatch to it or deal with failures."""
680+ self.loop = LoopingCall(self.singleCycle)
681+ self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
682+ return self.stopping_deferred
683+
684+ def stopCycle(self):
685+ """Terminate the LoopingCall."""
686+ self.loop.stop()
687+
688+ def singleCycle(self):
689 self.logger.debug("Scanning builder: %s" % self.builder_name)
690-
691+ d = self.scan()
692+
693+ d.addErrback(self._scanFailed)
694+ return d
695+
696+ def _scanFailed(self, failure):
697+ """Deal with failures encountered during the scan cycle.
698+
699+ 1. Print the error in the log
700+ 2. Increment and assess failure counts on the builder and job.
701+ """
702+ # Make sure that pending database updates are discarded, as they
703+ # could leave the database in an inconsistent state (e.g. the
704+ # job says it's running but the buildqueue has no builder set).
705+ transaction.abort()
706+
707+ # If we don't recognise the exception include a stack trace with
708+ # the error.
709+ error_message = failure.getErrorMessage()
710+ if failure.check(
711+ BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch,
712+ CannotResumeHost, BuildDaemonError, CannotFetchFile):
713+ self.logger.info("Scanning failed with: %s" % error_message)
714+ else:
715+ self.logger.info("Scanning failed with: %s\n%s" %
716+ (failure.getErrorMessage(), failure.getTraceback()))
717+
718+ # Decide if we need to terminate the job or fail the
719+ # builder.
720 try:
721- slave = self.scan()
722- if slave is None:
723- self.scheduleNextScanCycle()
724+ builder = get_builder(self.builder_name)
725+ builder.gotFailure()
726+ if builder.currentjob is not None:
727+ build_farm_job = builder.getCurrentBuildFarmJob()
728+ build_farm_job.gotFailure()
729+ self.logger.info(
730+ "builder %s failure count: %s, "
731+ "job '%s' failure count: %s" % (
732+ self.builder_name,
733+ builder.failure_count,
734+ build_farm_job.title,
735+ build_farm_job.failure_count))
736 else:
737- # XXX: Ought to return Deferred.
738- self.resumeAndDispatch(slave)
739+ self.logger.info(
740+ "Builder %s failed a probe, count: %s" % (
741+ self.builder_name, builder.failure_count))
742+ assessFailureCounts(builder, failure.getErrorMessage())
743+ transaction.commit()
744 except:
745- error = Failure()
746- self.logger.info("Scanning failed with: %s\n%s" %
747- (error.getErrorMessage(), error.getTraceback()))
748-
749- builder = get_builder(self.builder_name)
750-
751- # Decide if we need to terminate the job or fail the
752- # builder.
753- self._incrementFailureCounts(builder)
754- self.logger.info(
755- "builder failure count: %s, job failure count: %s" % (
756- builder.failure_count,
757- builder.getCurrentBuildFarmJob().failure_count))
758- assessFailureCounts(builder, error.getErrorMessage())
759- transaction.commit()
760-
761- self.scheduleNextScanCycle()
762-
763- @write_transaction
764+ # Catastrophic code failure! Not much we can do.
765+ self.logger.error(
766+ "Miserable failure when trying to examine failure counts:\n",
767+ exc_info=True)
768+ transaction.abort()
769+
770 def scan(self):
771 """Probe the builder and update/dispatch/collect as appropriate.
772
773- The whole method is wrapped in a transaction, but we do partial
774- commits to avoid holding locks on tables.
775-
776- :return: A `RecordingSlave` if we dispatched a job to it, or None.
777+ There are several steps to scanning:
778+
779+ 1. If the builder is marked as "ok" then probe it to see what state
780+ it's in. This is where lost jobs are rescued if we think the
781+ builder is doing something that it later tells us it's not,
782+ and also where the multi-phase abort procedure happens.
783+ See IBuilder.rescueIfLost, which is called by
784+ IBuilder.updateStatus().
785+ 2. If the builder is still happy, we ask it if it has an active build
786+ and then either update the build in Launchpad or collect the
787+ completed build. (builder.updateBuild)
788+ 3. If the builder is not happy or it was marked as unavailable
789+ mid-build, we need to reset the job that we thought it had, so
790+ that the job is dispatched elsewhere.
791+ 4. If the builder is idle and we have another build ready, dispatch
792+ it.
793+
794+ :return: A Deferred that fires when the scan is complete, whose
795+ value is a `BuilderSlave` if we dispatched a job to it, or None.
796 """
797 # We need to re-fetch the builder object on each cycle as the
798 # Storm store is invalidated over transaction boundaries.
799@@ -324,240 +208,72 @@
800 self.builder = get_builder(self.builder_name)
801
802 if self.builder.builderok:
803- self.builder.updateStatus(self.logger)
804- transaction.commit()
805-
806- # See if we think there's an active build on the builder.
807- buildqueue = self.builder.getBuildQueue()
808-
809- # XXX Julian 2010-07-29 bug=611258
810- # We're not using the RecordingSlave until dispatching, which
811- # means that this part blocks until we've received a response
812- # from the builder. updateBuild() needs to be made
813- # asynchronous.
814-
815- # Scan the slave and get the logtail, or collect the build if
816- # it's ready. Yes, "updateBuild" is a bad name.
817- if buildqueue is not None:
818- self.builder.updateBuild(buildqueue)
819- transaction.commit()
820-
821- # If the builder is in manual mode, don't dispatch anything.
822- if self.builder.manual:
823- self.logger.debug(
824- '%s is in manual mode, not dispatching.' % self.builder.name)
825- return None
826-
827- # If the builder is marked unavailable, don't dispatch anything.
828- # Additionaly, because builders can be removed from the pool at
829- # any time, we need to see if we think there was a build running
830- # on it before it was marked unavailable. In this case we reset
831- # the build thusly forcing it to get re-dispatched to another
832- # builder.
833- if not self.builder.is_available:
834- job = self.builder.currentjob
835- if job is not None and not self.builder.builderok:
836- self.logger.info(
837- "%s was made unavailable, resetting attached "
838- "job" % self.builder.name)
839- job.reset()
840- transaction.commit()
841- return None
842-
843- # See if there is a job we can dispatch to the builder slave.
844-
845- # XXX: Rather than use the slave actually associated with the builder
846- # (which, incidentally, shouldn't be a property anyway), we make a new
847- # RecordingSlave so we can get access to its asynchronous
848- # "resumeSlave" method. Blech.
849- slave = RecordingSlave(
850- self.builder.name, self.builder.url, self.builder.vm_host)
851- # XXX: Passing buildd_slave=slave overwrites the 'slave' property of
852- # self.builder. Not sure why this is needed yet.
853- self.builder.findAndStartJob(buildd_slave=slave)
854- if self.builder.currentjob is not None:
855- # After a successful dispatch we can reset the
856- # failure_count.
857- self.builder.resetFailureCount()
858- transaction.commit()
859- return slave
860-
861- return None
862-
863- def resumeAndDispatch(self, slave):
864- """Chain the resume and dispatching Deferreds."""
865- # XXX: resumeAndDispatch makes Deferreds without returning them.
866- if slave.resume_requested:
867- # The slave needs to be reset before we can dispatch to
868- # it (e.g. a virtual slave)
869-
870- # XXX: Two problems here. The first is that 'resumeSlave' only
871- # exists on RecordingSlave (BuilderSlave calls it 'resume').
872- d = slave.resumeSlave()
873- d.addBoth(self.checkResume, slave)
874+ d = self.builder.updateStatus(self.logger)
875 else:
876- # No resume required, build dispatching can commence.
877 d = defer.succeed(None)
878
879- # Dispatch the build to the slave asynchronously.
880- d.addCallback(self.initiateDispatch, slave)
881- # Store this deferred so we can wait for it along with all
882- # the others that will be generated by RecordingSlave during
883- # the dispatch process, and chain a callback after they've
884- # all fired.
885- self._deferred_list.append(d)
886-
887- def initiateDispatch(self, resume_result, slave):
888- """Start dispatching a build to a slave.
889-
890- If the previous task in chain (slave resuming) has failed it will
891- receive a `ResetBuilderRequest` instance as 'resume_result' and
892- will immediately return that so the subsequent callback can collect
893- it.
894-
895- If the slave resuming succeeded, it starts the XMLRPC dialogue. The
896- dialogue may consist of many calls to the slave before the build
897- starts. Each call is done via a Deferred event, where slave calls
898- are sent in callSlave(), and checked in checkDispatch() which will
899- keep firing events via callSlave() until all the events are done or
900- an error occurs.
901- """
902- if resume_result is not None:
903- self.slaveConversationEnded()
904- return resume_result
905-
906- self.logger.info('Dispatching: %s' % slave)
907- self.callSlave(slave)
908-
909- def _getProxyForSlave(self, slave):
910- """Return a twisted.web.xmlrpc.Proxy for the buildd slave.
911-
912- Uses a protocol with timeout support, See QueryFactoryWithTimeout.
913- """
914- proxy = xmlrpc.Proxy(str(urlappend(slave.url, 'rpc')))
915- proxy.queryFactory = QueryFactoryWithTimeout
916- return proxy
917-
918- def callSlave(self, slave):
919- """Dispatch the next XMLRPC for the given slave."""
920- if len(slave.calls) == 0:
921- # That's the end of the dialogue with the slave.
922- self.slaveConversationEnded()
923- return
924-
925- # Get an XMLRPC proxy for the buildd slave.
926- proxy = self._getProxyForSlave(slave)
927- method, args = slave.calls.pop(0)
928- d = proxy.callRemote(method, *args)
929- d.addBoth(self.checkDispatch, method, slave)
930- self._deferred_list.append(d)
931- self.logger.debug('%s -> %s(%s)' % (slave, method, args))
932-
933- def slaveConversationEnded(self):
934- """After all the Deferreds are set up, chain a callback on them."""
935- dl = defer.DeferredList(self._deferred_list, consumeErrors=True)
936- dl.addBoth(self.evaluateDispatchResult)
937- return dl
938-
939- def evaluateDispatchResult(self, deferred_list_results):
940- """Process the DispatchResult for this dispatch chain.
941-
942- After waiting for the Deferred chain to finish, we'll have a
943- DispatchResult to evaluate, which deals with the result of
944- dispatching.
945- """
946- # The `deferred_list_results` is what we get when waiting on a
947- # DeferredList. It's a list of tuples of (status, result) where
948- # result is what the last callback in that chain returned.
949-
950- # If the result is an instance of BaseDispatchResult we need to
951- # evaluate it, as there's further action required at the end of
952- # the dispatch chain. None, resulting from successful chains,
953- # are discarded.
954-
955- dispatch_results = [
956- result for status, result in deferred_list_results
957- if isinstance(result, BaseDispatchResult)]
958-
959- for result in dispatch_results:
960- self.logger.info("%r" % result)
961- result()
962-
963- # At this point, we're done dispatching, so we can schedule the
964- # next scan cycle.
965- self.scheduleNextScanCycle()
966-
967- # For the test suite so that it can chain callback results.
968- return deferred_list_results
969-
970- def checkResume(self, response, slave):
971- """Check the result of resuming a slave.
972-
973- If there's a problem resuming, we return a ResetDispatchResult which
974- will get evaluated at the end of the scan, or None if the resume
975- was OK.
976-
977- :param response: the tuple that's constructed in
978- ProcessWithTimeout.processEnded(), or a Failure that
979- contains the tuple.
980- :param slave: the slave object we're talking to
981- """
982- if isinstance(response, Failure):
983- out, err, code = response.value
984- else:
985- out, err, code = response
986- if code == os.EX_OK:
987- return None
988-
989- error_text = '%s\n%s' % (out, err)
990- self.logger.error('%s resume failure: %s' % (slave, error_text))
991- return self.reset_result(slave, error_text)
992-
993- def _incrementFailureCounts(self, builder):
994- builder.gotFailure()
995- builder.getCurrentBuildFarmJob().gotFailure()
996-
997- def checkDispatch(self, response, method, slave):
998- """Verify the results of a slave xmlrpc call.
999-
1000- If it failed and it compromises the slave then return a corresponding
1001- `FailDispatchResult`, if it was a communication failure, simply
1002- reset the slave by returning a `ResetDispatchResult`.
1003- """
1004- from lp.buildmaster.interfaces.builder import IBuilderSet
1005- builder = getUtility(IBuilderSet)[slave.name]
1006-
1007- # XXX these DispatchResult classes are badly named and do the
1008- # same thing. We need to fix that.
1009- self.logger.debug(
1010- '%s response for "%s": %s' % (slave, method, response))
1011-
1012- if isinstance(response, Failure):
1013- self.logger.warn(
1014- '%s communication failed (%s)' %
1015- (slave, response.getErrorMessage()))
1016- self.slaveConversationEnded()
1017- self._incrementFailureCounts(builder)
1018- return self.fail_result(slave)
1019-
1020- if isinstance(response, list) and len(response) == 2:
1021- if method in buildd_success_result_map:
1022- expected_status = buildd_success_result_map.get(method)
1023- status, info = response
1024- if status == expected_status:
1025- self.callSlave(slave)
1026+ def status_updated(ignored):
1027+ # Commit the changes done while possibly rescuing jobs, to
1028+ # avoid holding table locks.
1029+ transaction.commit()
1030+
1031+ # See if we think there's an active build on the builder.
1032+ buildqueue = self.builder.getBuildQueue()
1033+
1034+ # Scan the slave and get the logtail, or collect the build if
1035+ # it's ready. Yes, "updateBuild" is a bad name.
1036+ if buildqueue is not None:
1037+ return self.builder.updateBuild(buildqueue)
1038+
1039+ def build_updated(ignored):
1040+ # Commit changes done while updating the build, to avoid
1041+ # holding table locks.
1042+ transaction.commit()
1043+
1044+ # If the builder is in manual mode, don't dispatch anything.
1045+ if self.builder.manual:
1046+ self.logger.debug(
1047+ '%s is in manual mode, not dispatching.' %
1048+ self.builder.name)
1049+ return
1050+
1051+ # If the builder is marked unavailable, don't dispatch anything.
1052+ # Additionally, because builders can be removed from the pool at
1053+ # any time, we need to see if we think there was a build running
1054+ # on it before it was marked unavailable. In this case we reset
1055+ # the build thusly forcing it to get re-dispatched to another
1056+ # builder.
1057+
1058+ return self.builder.isAvailable().addCallback(got_available)
1059+
1060+ def got_available(available):
1061+ if not available:
1062+ job = self.builder.currentjob
1063+ if job is not None and not self.builder.builderok:
1064+ self.logger.info(
1065+ "%s was made unavailable, resetting attached "
1066+ "job" % self.builder.name)
1067+ job.reset()
1068+ transaction.commit()
1069+ return
1070+
1071+ # See if there is a job we can dispatch to the builder slave.
1072+
1073+ d = self.builder.findAndStartJob()
1074+ def job_started(candidate):
1075+ if self.builder.currentjob is not None:
1076+ # After a successful dispatch we can reset the
1077+ # failure_count.
1078+ self.builder.resetFailureCount()
1079+ transaction.commit()
1080+ return self.builder.slave
1081+ else:
1082 return None
1083- else:
1084- info = 'Unknown slave method: %s' % method
1085- else:
1086- info = 'Unexpected response: %s' % repr(response)
1087-
1088- self.logger.error(
1089- '%s failed to dispatch (%s)' % (slave, info))
1090-
1091- self.slaveConversationEnded()
1092- self._incrementFailureCounts(builder)
1093- return self.fail_result(slave, info)
1094+ return d.addCallback(job_started)
1095+
1096+ d.addCallback(status_updated)
1097+ d.addCallback(build_updated)
1098+ return d
1099
1100
1101 class NewBuildersScanner:
1102@@ -578,15 +294,21 @@
1103 self.current_builders = [
1104 builder.name for builder in getUtility(IBuilderSet)]
1105
1106+ def stop(self):
1107+ """Terminate the LoopingCall."""
1108+ self.loop.stop()
1109+
1110 def scheduleScan(self):
1111 """Schedule a callback SCAN_INTERVAL seconds later."""
1112- return self._clock.callLater(self.SCAN_INTERVAL, self.scan)
1113+ self.loop = LoopingCall(self.scan)
1114+ self.loop.clock = self._clock
1115+ self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
1116+ return self.stopping_deferred
1117
1118 def scan(self):
1119 """If a new builder appears, create a SlaveScanner for it."""
1120 new_builders = self.checkForNewBuilders()
1121 self.manager.addScanForBuilders(new_builders)
1122- self.scheduleScan()
1123
1124 def checkForNewBuilders(self):
1125 """See if any new builders were added."""
1126@@ -609,10 +331,7 @@
1127 manager=self, clock=clock)
1128
1129 def _setupLogger(self):
1130- """Setup a 'slave-scanner' logger that redirects to twisted.
1131-
1132- It is going to be used locally and within the thread running
1133- the scan() method.
1134+ """Set up a 'slave-scanner' logger that redirects to twisted.
1135
1136 Make it less verbose to avoid messing too much with the old code.
1137 """
1138@@ -643,12 +362,29 @@
1139 # Events will now fire in the SlaveScanner objects to scan each
1140 # builder.
1141
1142+ def stopService(self):
1143+ """Callback for when we need to shut down."""
1144+ # XXX: lacks unit tests
1145+ # All the SlaveScanner objects need to be halted gracefully.
1146+ deferreds = [slave.stopping_deferred for slave in self.builder_slaves]
1147+ deferreds.append(self.new_builders_scanner.stopping_deferred)
1148+
1149+ self.new_builders_scanner.stop()
1150+ for slave in self.builder_slaves:
1151+ slave.stopCycle()
1152+
1153+ # The 'stopping_deferred's are called back when the loops are
1154+ # stopped, so we can wait on them all at once here before
1155+ # exiting.
1156+ d = defer.DeferredList(deferreds, consumeErrors=True)
1157+ return d
1158+
1159 def addScanForBuilders(self, builders):
1160 """Set up scanner objects for the builders specified."""
1161 for builder in builders:
1162 slave_scanner = SlaveScanner(builder, self.logger)
1163 self.builder_slaves.append(slave_scanner)
1164- slave_scanner.scheduleNextScanCycle()
1165+ slave_scanner.startCycle()
1166
1167 # Return the slave list for the benefit of tests.
1168 return self.builder_slaves
1169
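Reviewer's sketch (not part of the branch): the reworked assessFailureCounts logic in manager.py above can be summarised as a small pure function. This is a minimal illustration under stated assumptions: the function name, the string action labels and the threshold value of 5 are all hypothetical (the diff only references Builder.FAILURE_THRESHOLD without showing its value), and plain integers stand in for the builder and job objects.

```python
# Hypothetical stand-in for Builder.FAILURE_THRESHOLD; the real value is
# not shown in this part of the diff.
ASSUMED_FAILURE_THRESHOLD = 5


def assess_failure_counts(builder_failures, job_failures, has_job,
                          threshold=ASSUMED_FAILURE_THRESHOLD):
    """Return the actions the reworked logic would take, as labels.

    Labels are illustrative only: "reset_job" stands for
    current_job.reset(), "fail_builder" for builder.failBuilder() and
    "fail_job" for marking the build FAILEDTOBUILD.
    """
    if not has_job:
        # With no current job, the diff treats the job count as zero.
        job_failures = 0

    if has_job and builder_failures == job_failures:
        # Cannot tell whether the job or the builder is at fault,
        # so give the job another go.
        return ["reset_job"]

    if builder_failures > job_failures:
        actions = []
        if has_job:
            # Re-schedule the build if there is one.
            actions.append("reset_job")
        if builder_failures >= threshold:
            # Only disable the builder once it is over the threshold.
            actions.append("fail_builder")
        return actions

    if not has_job:
        # The diff's final branch dereferences current_job, so this
        # sketch simply does nothing when there is no job to blame.
        return []

    # The job has failed more often than the builder: blame the job.
    return ["fail_job"]
```

The point the sketch tries to make explicit is that a builder with no current job is now handled without touching the job, and that a builder is disabled only after reaching the threshold rather than on its first excess failure.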
1170=== modified file 'lib/lp/buildmaster/model/builder.py'
1171--- lib/lp/buildmaster/model/builder.py 2010-09-24 13:39:27 +0000
1172+++ lib/lp/buildmaster/model/builder.py 2010-10-25 19:14:01 +0000
1173@@ -13,12 +13,11 @@
1174 ]
1175
1176 import gzip
1177-import httplib
1178 import logging
1179 import os
1180 import socket
1181-import subprocess
1182 import tempfile
1183+import transaction
1184 import urllib2
1185 import xmlrpclib
1186
1187@@ -34,6 +33,13 @@
1188 Count,
1189 Sum,
1190 )
1191+
1192+from twisted.internet import (
1193+ defer,
1194+ reactor as default_reactor,
1195+ )
1196+from twisted.web import xmlrpc
1197+
1198 from zope.component import getUtility
1199 from zope.interface import implements
1200
1201@@ -58,7 +64,6 @@
1202 from lp.buildmaster.interfaces.builder import (
1203 BuildDaemonError,
1204 BuildSlaveFailure,
1205- CannotBuild,
1206 CannotFetchFile,
1207 CannotResumeHost,
1208 CorruptBuildCookie,
1209@@ -66,9 +71,6 @@
1210 IBuilderSet,
1211 )
1212 from lp.buildmaster.interfaces.buildfarmjob import IBuildFarmJobSet
1213-from lp.buildmaster.interfaces.buildfarmjobbehavior import (
1214- BuildBehaviorMismatch,
1215- )
1216 from lp.buildmaster.interfaces.buildqueue import IBuildQueueSet
1217 from lp.buildmaster.model.buildfarmjobbehavior import IdleBuildBehavior
1218 from lp.buildmaster.model.buildqueue import (
1219@@ -78,9 +80,9 @@
1220 from lp.registry.interfaces.person import validate_public_person
1221 from lp.services.job.interfaces.job import JobStatus
1222 from lp.services.job.model.job import Job
1223-from lp.services.osutils import until_no_eintr
1224 from lp.services.propertycache import cachedproperty
1225-from lp.services.twistedsupport.xmlrpc import BlockingProxy
1226+from lp.services.twistedsupport.processmonitor import ProcessWithTimeout
1227+from lp.services.twistedsupport import cancel_on_timeout
1228 # XXX Michael Nelson 2010-01-13 bug=491330
1229 # These dependencies on soyuz will be removed when getBuildRecords()
1230 # is moved.
1231@@ -92,25 +94,9 @@
1232 from lp.soyuz.model.processor import Processor
1233
1234
1235-class TimeoutHTTPConnection(httplib.HTTPConnection):
1236-
1237- def connect(self):
1238- """Override the standard connect() methods to set a timeout"""
1239- ret = httplib.HTTPConnection.connect(self)
1240- self.sock.settimeout(config.builddmaster.socket_timeout)
1241- return ret
1242-
1243-
1244-class TimeoutHTTP(httplib.HTTP):
1245- _connection_class = TimeoutHTTPConnection
1246-
1247-
1248-class TimeoutTransport(xmlrpclib.Transport):
1249- """XMLRPC Transport to setup a socket with defined timeout"""
1250-
1251- def make_connection(self, host):
1252- host, extra_headers, x509 = self.get_host_info(host)
1253- return TimeoutHTTP(host)
1254+class QuietQueryFactory(xmlrpc._QueryFactory):
1255+ """XMLRPC client factory that doesn't splatter the log with junk."""
1256+ noisy = False
1257
1258
1259 class BuilderSlave(object):
1260@@ -125,24 +111,7 @@
1261 # many false positives in your test run and will most likely break
1262 # production.
1263
1264- # XXX: This (BuilderSlave) should use composition, rather than
1265- # inheritance.
1266-
1267- # XXX: Have a documented interface for the XML-RPC server:
1268- # - what methods
1269- # - what return values expected
1270- # - what faults
1271- # (see XMLRPCBuildDSlave in lib/canonical/buildd/slave.py).
1272-
1273- # XXX: Arguably, this interface should be asynchronous
1274- # (i.e. Deferred-returning). This would mean that Builder (see below)
1275- # would have to expect Deferreds.
1276-
1277- # XXX: Once we have a client object with a defined, tested interface, we
1278- # should make a test double that doesn't do any XML-RPC and can be used to
1279- # make testing easier & tests faster.
1280-
1281- def __init__(self, proxy, builder_url, vm_host):
1282+ def __init__(self, proxy, builder_url, vm_host, reactor=None):
1283 """Initialize a BuilderSlave.
1284
1285 :param proxy: An XML-RPC proxy, implementing 'callRemote'. It must
1286@@ -155,63 +124,87 @@
1287 self._file_cache_url = urlappend(builder_url, 'filecache')
1288 self._server = proxy
1289
1290+ if reactor is None:
1291+ self.reactor = default_reactor
1292+ else:
1293+ self.reactor = reactor
1294+
1295 @classmethod
1296- def makeBlockingSlave(cls, builder_url, vm_host):
1297- rpc_url = urlappend(builder_url, 'rpc')
1298- server_proxy = xmlrpclib.ServerProxy(
1299- rpc_url, transport=TimeoutTransport(), allow_none=True)
1300- return cls(BlockingProxy(server_proxy), builder_url, vm_host)
1301+ def makeBuilderSlave(cls, builder_url, vm_host, reactor=None, proxy=None):
1302+ """Create and return a `BuilderSlave`.
1303+
1304+ :param builder_url: The URL of the slave buildd machine,
1305+ e.g. http://localhost:8221
1306+ :param vm_host: If the slave is virtual, specify its host machine here.
1307+ :param reactor: Used by tests to override the Twisted reactor.
1308+ :param proxy: Used by tests to override the xmlrpc.Proxy.
1309+ """
1310+ rpc_url = urlappend(builder_url.encode('utf-8'), 'rpc')
1311+ if proxy is None:
1312+ server_proxy = xmlrpc.Proxy(rpc_url, allowNone=True)
1313+ server_proxy.queryFactory = QuietQueryFactory
1314+ else:
1315+ server_proxy = proxy
1316+ return cls(server_proxy, builder_url, vm_host, reactor)
1317+
1318+ def _with_timeout(self, d):
1319+ TIMEOUT = config.builddmaster.socket_timeout
1320+ return cancel_on_timeout(d, TIMEOUT, self.reactor)
1321
1322 def abort(self):
1323 """Abort the current build."""
1324- return self._server.callRemote('abort')
1325+ return self._with_timeout(self._server.callRemote('abort'))
1326
1327 def clean(self):
1328 """Clean up the waiting files and reset the slave's internal state."""
1329- return self._server.callRemote('clean')
1330+ return self._with_timeout(self._server.callRemote('clean'))
1331
1332 def echo(self, *args):
1333 """Echo the arguments back."""
1334- return self._server.callRemote('echo', *args)
1335+ return self._with_timeout(self._server.callRemote('echo', *args))
1336
1337 def info(self):
1338 """Return the protocol version and the builder methods supported."""
1339- return self._server.callRemote('info')
1340+ return self._with_timeout(self._server.callRemote('info'))
1341
1342 def status(self):
1343 """Return the status of the build daemon."""
1344- return self._server.callRemote('status')
1345+ return self._with_timeout(self._server.callRemote('status'))
1346
1347 def ensurepresent(self, sha1sum, url, username, password):
1348+ # XXX: Nothing external calls this. Make it private.
1349 """Attempt to ensure the given file is present."""
1350- return self._server.callRemote(
1351- 'ensurepresent', sha1sum, url, username, password)
1352+ return self._with_timeout(self._server.callRemote(
1353+ 'ensurepresent', sha1sum, url, username, password))
1354
1355 def getFile(self, sha_sum):
1356 """Construct a file-like object to return the named file."""
1357+ # XXX 2010-10-18 bug=662631
1358+ # Change this to do non-blocking IO.
1359 file_url = urlappend(self._file_cache_url, sha_sum)
1360 return urllib2.urlopen(file_url)
1361
1362- def resume(self):
1363- """Resume a virtual builder.
1364-
1365- It uses the configuration command-line (replacing 'vm_host') and
1366- return its output.
1367-
1368- :return: a (stdout, stderr, subprocess exitcode) triple
1369+ def resume(self, clock=None):
1370+ """Resume the builder in an asynchronous fashion.
1371+
1372+ We use the builddmaster configuration 'socket_timeout' as
1373+ the process timeout.
1374+
1375+ :param clock: An optional twisted.internet.task.Clock to override
1376+ the default clock. For use in tests.
1377+
1378+ :return: a Deferred that fires with a
1379+ (stdout, stderr, subprocess exitcode) triple
1380 """
1381- # XXX: This executes the vm_resume_command
1382- # synchronously. RecordingSlave does so asynchronously. Since we
1383- # always want to do this asynchronously, there's no need for the
1384- # duplication.
1385 resume_command = config.builddmaster.vm_resume_command % {
1386 'vm_host': self._vm_host}
1387- resume_argv = resume_command.split()
1388- resume_process = subprocess.Popen(
1389- resume_argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
1390- stdout, stderr = resume_process.communicate()
1391-
1392- return (stdout, stderr, resume_process.returncode)
1393+ # Twisted needs byte strings but the configuration provides unicode.
1394+ resume_argv = [term.encode('utf-8') for term in resume_command.split()]
1395+ d = defer.Deferred()
1396+ p = ProcessWithTimeout(
1397+ d, config.builddmaster.socket_timeout, clock=clock)
1398+ p.spawnProcess(resume_argv[0], tuple(resume_argv))
1399+ return d
1400
1401 def cacheFile(self, logger, libraryfilealias):
1402 """Make sure that the file at 'libraryfilealias' is on the slave.
1403@@ -224,13 +217,15 @@
1404 "Asking builder on %s to ensure it has file %s (%s, %s)" % (
1405 self._file_cache_url, libraryfilealias.filename, url,
1406 libraryfilealias.content.sha1))
1407- self.sendFileToSlave(libraryfilealias.content.sha1, url)
1408+ return self.sendFileToSlave(libraryfilealias.content.sha1, url)
1409
1410 def sendFileToSlave(self, sha1, url, username="", password=""):
1411 """Helper to send the file at 'url' with 'sha1' to this builder."""
1412- present, info = self.ensurepresent(sha1, url, username, password)
1413- if not present:
1414- raise CannotFetchFile(url, info)
1415+ d = self.ensurepresent(sha1, url, username, password)
1416+ def check_present((present, info)):
1417+ if not present:
1418+ raise CannotFetchFile(url, info)
1419+ return d.addCallback(check_present)
1420
1421 def build(self, buildid, builder_type, chroot_sha1, filemap, args):
1422 """Build a thing on this build slave.
1423@@ -243,19 +238,18 @@
1424 :param args: A dictionary of extra arguments. The contents depend on
1425 the build job type.
1426 """
1427- try:
1428- return self._server.callRemote(
1429- 'build', buildid, builder_type, chroot_sha1, filemap, args)
1430- except xmlrpclib.Fault, info:
1431- raise BuildSlaveFailure(info)
1432+ d = self._with_timeout(self._server.callRemote(
1433+ 'build', buildid, builder_type, chroot_sha1, filemap, args))
1434+ def got_fault(failure):
1435+ failure.trap(xmlrpclib.Fault)
1436+ raise BuildSlaveFailure(failure.value)
1437+ return d.addErrback(got_fault)
1438
1439
1440 # This is a separate function since MockBuilder needs to use it too.
1441 # Do not use it -- (Mock)Builder.rescueIfLost should be used instead.
1442 def rescueBuilderIfLost(builder, logger=None):
1443 """See `IBuilder`."""
1444- status_sentence = builder.slaveStatusSentence()
1445-
1446 # 'ident_position' dict relates the position of the job identifier
1447 # token in the sentence received from status(), according the
1448 # two status we care about. See see lib/canonical/buildd/slave.py
1449@@ -265,61 +259,58 @@
1450 'BuilderStatus.WAITING': 2
1451 }
1452
1453- # Isolate the BuilderStatus string, always the first token in
1454- # see lib/canonical/buildd/slave.py and
1455- # IBuilder.slaveStatusSentence().
1456- status = status_sentence[0]
1457-
1458- # If the cookie test below fails, it will request an abort of the
1459- # builder. This will leave the builder in the aborted state and
1460- # with no assigned job, and we should now "clean" the slave which
1461- # will reset its state back to IDLE, ready to accept new builds.
1462- # This situation is usually caused by a temporary loss of
1463- # communications with the slave and the build manager had to reset
1464- # the job.
1465- if status == 'BuilderStatus.ABORTED' and builder.currentjob is None:
1466- builder.cleanSlave()
1467- if logger is not None:
1468- logger.info(
1469- "Builder '%s' cleaned up from ABORTED" % builder.name)
1470- return
1471-
1472- # If slave is not building nor waiting, it's not in need of rescuing.
1473- if status not in ident_position.keys():
1474- return
1475-
1476- slave_build_id = status_sentence[ident_position[status]]
1477-
1478- try:
1479- builder.verifySlaveBuildCookie(slave_build_id)
1480- except CorruptBuildCookie, reason:
1481- if status == 'BuilderStatus.WAITING':
1482- builder.cleanSlave()
1483+ d = builder.slaveStatusSentence()
1484+
1485+ def got_status(status_sentence):
1486+ """After we get the status, clean if we have to.
1487+
1488+ Always return status_sentence.
1489+ """
1490+ # Isolate the BuilderStatus string, always the first token in
1491+ # the status sentence; see lib/canonical/buildd/slave.py and
1492+ # IBuilder.slaveStatusSentence().
1493+ status = status_sentence[0]
1494+
1495+ # If the cookie test below fails, it will request an abort of the
1496+ # builder. This will leave the builder in the aborted state and
1497+ # with no assigned job, and we should now "clean" the slave which
1498+ # will reset its state back to IDLE, ready to accept new builds.
1499+ # This situation is usually caused by a temporary loss of
1500+ # communications with the slave and the build manager had to reset
1501+ # the job.
1502+ if status == 'BuilderStatus.ABORTED' and builder.currentjob is None:
1503+ if logger is not None:
1504+ logger.info(
1505+ "Builder '%s' being cleaned up from ABORTED" %
1506+ (builder.name,))
1507+ d = builder.cleanSlave()
1508+ return d.addCallback(lambda ignored: status_sentence)
1509 else:
1510- builder.requestAbort()
1511- if logger:
1512- logger.info(
1513- "Builder '%s' rescued from '%s': '%s'" %
1514- (builder.name, slave_build_id, reason))
1515-
1516-
1517-def _update_builder_status(builder, logger=None):
1518- """Really update the builder status."""
1519- try:
1520- builder.checkSlaveAlive()
1521- builder.rescueIfLost(logger)
1522- # Catch only known exceptions.
1523- # XXX cprov 2007-06-15 bug=120571: ValueError & TypeError catching is
1524- # disturbing in this context. We should spend sometime sanitizing the
1525- # exceptions raised in the Builder API since we already started the
1526- # main refactoring of this area.
1527- except (ValueError, TypeError, xmlrpclib.Fault,
1528- BuildDaemonError), reason:
1529- builder.failBuilder(str(reason))
1530- if logger:
1531- logger.warn(
1532- "%s (%s) marked as failed due to: %s",
1533- builder.name, builder.url, builder.failnotes, exc_info=True)
1534+ return status_sentence
1535+
1536+ def rescue_slave(status_sentence):
1537+ # If the slave is neither building nor waiting, it needs no rescuing.
1538+ status = status_sentence[0]
1539+ if status not in ident_position.keys():
1540+ return
1541+ slave_build_id = status_sentence[ident_position[status]]
1542+ try:
1543+ builder.verifySlaveBuildCookie(slave_build_id)
1544+ except CorruptBuildCookie, reason:
1545+ if status == 'BuilderStatus.WAITING':
1546+ d = builder.cleanSlave()
1547+ else:
1548+ d = builder.requestAbort()
1549+ def log_rescue(ignored):
1550+ if logger:
1551+ logger.info(
1552+ "Builder '%s' rescued from '%s': '%s'" %
1553+ (builder.name, slave_build_id, reason))
1554+ return d.addCallback(log_rescue)
1555+
1556+ d.addCallback(got_status)
1557+ d.addCallback(rescue_slave)
1558+ return d
1559
1560
1561 def updateBuilderStatus(builder, logger=None):
1562@@ -327,16 +318,7 @@
1563 if logger:
1564 logger.debug('Checking %s' % builder.name)
1565
1566- MAX_EINTR_RETRIES = 42 # pulling a number out of my a$$ here
1567- try:
1568- return until_no_eintr(
1569- MAX_EINTR_RETRIES, _update_builder_status, builder, logger=logger)
1570- except socket.error, reason:
1571- # In Python 2.6 we can use IOError instead. It also has
1572- # reason.errno but we might be using 2.5 here so use the
1573- # index hack.
1574- error_message = str(reason)
1575- builder.handleTimeout(logger, error_message)
1576+ return builder.rescueIfLost(logger)
1577
1578
1579 class Builder(SQLBase):
1580@@ -364,6 +346,10 @@
1581 active = BoolCol(dbName='active', notNull=True, default=True)
1582 failure_count = IntCol(dbName='failure_count', default=0, notNull=True)
1583
1584+ # The number of times a builder can consecutively fail before we
1585+ # give up and mark it builderok=False.
1586+ FAILURE_THRESHOLD = 5
1587+
1588 def _getCurrentBuildBehavior(self):
1589 """Return the current build behavior."""
1590 if not safe_hasattr(self, '_current_build_behavior'):
1591@@ -409,18 +395,13 @@
1592 """See `IBuilder`."""
1593 self.failure_count = 0
1594
1595- def checkSlaveAlive(self):
1596- """See IBuilder."""
1597- if self.slave.echo("Test")[0] != "Test":
1598- raise BuildDaemonError("Failed to echo OK")
1599-
1600 def rescueIfLost(self, logger=None):
1601 """See `IBuilder`."""
1602- rescueBuilderIfLost(self, logger)
1603+ return rescueBuilderIfLost(self, logger)
1604
1605 def updateStatus(self, logger=None):
1606 """See `IBuilder`."""
1607- updateBuilderStatus(self, logger)
1608+ return updateBuilderStatus(self, logger)
1609
1610 def cleanSlave(self):
1611 """See IBuilder."""
1612@@ -440,20 +421,23 @@
1613 def resumeSlaveHost(self):
1614 """See IBuilder."""
1615 if not self.virtualized:
1616- raise CannotResumeHost('Builder is not virtualized.')
1617+ return defer.fail(CannotResumeHost('Builder is not virtualized.'))
1618
1619 if not self.vm_host:
1620- raise CannotResumeHost('Undefined vm_host.')
1621+ return defer.fail(CannotResumeHost('Undefined vm_host.'))
1622
1623 logger = self._getSlaveScannerLogger()
1624 logger.debug("Resuming %s (%s)" % (self.name, self.url))
1625
1626- stdout, stderr, returncode = self.slave.resume()
1627- if returncode != 0:
1628+ d = self.slave.resume()
1629+ def got_resume_ok((stdout, stderr, returncode)):
1630+ return stdout, stderr
1631+ def got_resume_bad(failure):
1632+ stdout, stderr, code = failure.value
1633 raise CannotResumeHost(
1634 "Resuming failed:\nOUT:\n%s\nERR:\n%s\n" % (stdout, stderr))
1635
1636- return stdout, stderr
1637+ return d.addCallback(got_resume_ok).addErrback(got_resume_bad)
1638
1639 @cachedproperty
1640 def slave(self):
1641@@ -462,7 +446,7 @@
1642 # the slave object, which is usually an XMLRPC client, with a
1643 # stub object that removes the need to actually create a buildd
1644 # slave in various states - which can be hard to create.
1645- return BuilderSlave.makeBlockingSlave(self.url, self.vm_host)
1646+ return BuilderSlave.makeBuilderSlave(self.url, self.vm_host)
1647
1648 def setSlaveForTesting(self, proxy):
1649 """See IBuilder."""
1650@@ -483,18 +467,23 @@
1651
1652 # If we are building a virtual build, resume the virtual machine.
1653 if self.virtualized:
1654- self.resumeSlaveHost()
1655+ d = self.resumeSlaveHost()
1656+ else:
1657+ d = defer.succeed(None)
1658
1659- # Do it.
1660- build_queue_item.markAsBuilding(self)
1661- try:
1662- self.current_build_behavior.dispatchBuildToSlave(
1663+ def resume_done(ignored):
1664+ return self.current_build_behavior.dispatchBuildToSlave(
1665 build_queue_item.id, logger)
1666- except BuildSlaveFailure, e:
1667- logger.debug("Disabling builder: %s" % self.url, exc_info=1)
1668+
1669+ def eb_slave_failure(failure):
1670+ failure.trap(BuildSlaveFailure)
1671+ e = failure.value
1672 self.failBuilder(
1673 "Exception (%s) when setting up to new job" % (e,))
1674- except CannotFetchFile, e:
1675+
1676+ def eb_cannot_fetch_file(failure):
1677+ failure.trap(CannotFetchFile)
1678+ e = failure.value
1679 message = """Slave '%s' (%s) was unable to fetch file.
1680 ****** URL ********
1681 %s
1682@@ -503,10 +492,19 @@
1683 *******************
1684 """ % (self.name, self.url, e.file_url, e.error_information)
1685 raise BuildDaemonError(message)
1686- except socket.error, e:
1687+
1688+ def eb_socket_error(failure):
1689+ failure.trap(socket.error)
1690+ e = failure.value
1691 error_message = "Exception (%s) when setting up new job" % (e,)
1692- self.handleTimeout(logger, error_message)
1693- raise BuildSlaveFailure
1694+ d = self.handleTimeout(logger, error_message)
1695+ return d.addBoth(lambda ignored: failure)
1696+
1697+ d.addCallback(resume_done)
1698+ d.addErrback(eb_slave_failure)
1699+ d.addErrback(eb_cannot_fetch_file)
1700+ d.addErrback(eb_socket_error)
1701+ return d
1702
1703 def failBuilder(self, reason):
1704 """See IBuilder"""
1705@@ -534,22 +532,24 @@
1706
1707 def slaveStatus(self):
1708 """See IBuilder."""
1709- builder_version, builder_arch, mechanisms = self.slave.info()
1710- status_sentence = self.slave.status()
1711-
1712- status = {'builder_status': status_sentence[0]}
1713-
1714- # Extract detailed status and log information if present.
1715- # Although build_id is also easily extractable here, there is no
1716- # valid reason for anything to use it, so we exclude it.
1717- if status['builder_status'] == 'BuilderStatus.WAITING':
1718- status['build_status'] = status_sentence[1]
1719- else:
1720- if status['builder_status'] == 'BuilderStatus.BUILDING':
1721- status['logtail'] = status_sentence[2]
1722-
1723- self.current_build_behavior.updateSlaveStatus(status_sentence, status)
1724- return status
1725+ d = self.slave.status()
1726+ def got_status(status_sentence):
1727+ status = {'builder_status': status_sentence[0]}
1728+
1729+ # Extract detailed status and log information if present.
1730+ # Although build_id is also easily extractable here, there is no
1731+ # valid reason for anything to use it, so we exclude it.
1732+ if status['builder_status'] == 'BuilderStatus.WAITING':
1733+ status['build_status'] = status_sentence[1]
1734+ else:
1735+ if status['builder_status'] == 'BuilderStatus.BUILDING':
1736+ status['logtail'] = status_sentence[2]
1737+
1738+ self.current_build_behavior.updateSlaveStatus(
1739+ status_sentence, status)
1740+ return status
1741+
1742+ return d.addCallback(got_status)
1743
1744 def slaveStatusSentence(self):
1745 """See IBuilder."""
1746@@ -562,13 +562,15 @@
1747
1748 def updateBuild(self, queueItem):
1749 """See `IBuilder`."""
1750- self.current_build_behavior.updateBuild(queueItem)
1751+ return self.current_build_behavior.updateBuild(queueItem)
1752
1753 def transferSlaveFileToLibrarian(self, file_sha1, filename, private):
1754 """See IBuilder."""
1755 out_file_fd, out_file_name = tempfile.mkstemp(suffix=".buildlog")
1756 out_file = os.fdopen(out_file_fd, "r+")
1757 try:
1758+ # XXX 2010-10-18 bug=662631
1759+ # Change this to do non-blocking IO.
1760 slave_file = self.slave.getFile(file_sha1)
1761 copy_and_close(slave_file, out_file)
1762 # If the requested file is the 'buildlog' compress it using gzip
1763@@ -599,18 +601,17 @@
1764
1765 return library_file.id
1766
1767- @property
1768- def is_available(self):
1769+ def isAvailable(self):
1770 """See `IBuilder`."""
1771 if not self.builderok:
1772- return False
1773- try:
1774- slavestatus = self.slaveStatusSentence()
1775- except (xmlrpclib.Fault, socket.error):
1776- return False
1777- if slavestatus[0] != BuilderStatus.IDLE:
1778- return False
1779- return True
1780+ return defer.succeed(False)
1781+ d = self.slaveStatusSentence()
1782+ def catch_fault(failure):
1783+ failure.trap(xmlrpclib.Fault, socket.error)
1784+ return False
1785+ def check_available(status):
1786+ return status[0] == BuilderStatus.IDLE
1787+ return d.addCallbacks(check_available, catch_fault)
1788
1789 def _getSlaveScannerLogger(self):
1790 """Return the logger instance from buildd-slave-scanner.py."""
1791@@ -621,6 +622,27 @@
1792 logger = logging.getLogger('slave-scanner')
1793 return logger
1794
1795+ def acquireBuildCandidate(self):
1796+ """Acquire a build candidate in an atomic fashion.
1797+
1798+ When retrieving a candidate we need to mark it as building
1799+ immediately so that it is not dispatched by another builder in the
1800+ build manager.
1801+
1802+ We can consider this to be atomic because although the build manager
1803+ is a Twisted app and gives the appearance of doing lots of things at
1804+ once, it's still single-threaded so no more than one builder scan
1805+ can be in this code at the same time.
1806+
1807+ If there's ever more than one build manager running at once, then
1808+ this code will need some sort of mutex.
1809+ """
1810+ candidate = self._findBuildCandidate()
1811+ if candidate is not None:
1812+ candidate.markAsBuilding(self)
1813+ transaction.commit()
1814+ return candidate
1815+
1816 def _findBuildCandidate(self):
1817 """Find a candidate job for dispatch to an idle buildd slave.
1818
1819@@ -700,52 +722,46 @@
1820 :param candidate: The job to dispatch.
1821 """
1822 logger = self._getSlaveScannerLogger()
1823- try:
1824- self.startBuild(candidate, logger)
1825- except (BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch), err:
1826- logger.warn('Could not build: %s' % err)
1827+ # Using maybeDeferred ensures that any exceptions are also
1828+ # wrapped up and caught later.
1829+ d = defer.maybeDeferred(self.startBuild, candidate, logger)
1830+ return d
1831
1832 def handleTimeout(self, logger, error_message):
1833 """See IBuilder."""
1834- builder_should_be_failed = True
1835-
1836 if self.virtualized:
1837 # Virtualized/PPA builder: attempt a reset.
1838 logger.warn(
1839 "Resetting builder: %s -- %s" % (self.url, error_message),
1840 exc_info=True)
1841- try:
1842- self.resumeSlaveHost()
1843- except CannotResumeHost, err:
1844- # Failed to reset builder.
1845- logger.warn(
1846- "Failed to reset builder: %s -- %s" %
1847- (self.url, str(err)), exc_info=True)
1848- else:
1849- # Builder was reset, do *not* mark it as failed.
1850- builder_should_be_failed = False
1851-
1852- if builder_should_be_failed:
1853+ d = self.resumeSlaveHost()
1854+ return d
1855+ else:
1856+ # XXX: This should really let the failure bubble up to the
1857+ # scan() method that does the failure counting.
1858 # Mark builder as 'failed'.
1859 logger.warn(
1860- "Disabling builder: %s -- %s" % (self.url, error_message),
1861- exc_info=True)
1862+ "Disabling builder: %s -- %s" % (self.url, error_message))
1863 self.failBuilder(error_message)
1864+ return defer.succeed(None)
1865
1866 def findAndStartJob(self, buildd_slave=None):
1867 """See IBuilder."""
1868+ # XXX This method should be removed in favour of two separately
1869+ # called methods that find and dispatch the job. It will
1870+ # require a lot of test fixing.
1871 logger = self._getSlaveScannerLogger()
1872- candidate = self._findBuildCandidate()
1873+ candidate = self.acquireBuildCandidate()
1874
1875 if candidate is None:
1876 logger.debug("No build candidates available for builder.")
1877- return None
1878+ return defer.succeed(None)
1879
1880 if buildd_slave is not None:
1881 self.setSlaveForTesting(buildd_slave)
1882
1883- self._dispatchBuildCandidate(candidate)
1884- return candidate
1885+ d = self._dispatchBuildCandidate(candidate)
1886+ return d.addCallback(lambda ignored: candidate)
1887
1888 def getBuildQueue(self):
1889 """See `IBuilder`."""
1890
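Note on `slaveStatus` in the builder.py diff above: the `got_status` callback maps the slave's status sentence to a dict. A minimal standalone sketch of that mapping follows, assuming the sentence layouts implied by the diff (build status at index 1 when WAITING, log tail at index 2 when BUILDING). The name `parse_status_sentence` is hypothetical, and the call to `current_build_behavior.updateSlaveStatus` is left out.

```python
def parse_status_sentence(status_sentence):
    """Map a slave status sentence (a sequence) to a status dict.

    Hypothetical helper mirroring the got_status callback in the diff;
    it is not part of the branch itself.
    """
    # The BuilderStatus string is always the first token.
    status = {'builder_status': status_sentence[0]}
    if status['builder_status'] == 'BuilderStatus.WAITING':
        # A waiting slave reports the build status in position 1.
        status['build_status'] = status_sentence[1]
    elif status['builder_status'] == 'BuilderStatus.BUILDING':
        # A building slave reports the tail of its log in position 2.
        status['logtail'] = status_sentence[2]
    return status
```

Because this is a plain function with no Deferred involved, it can be checked without a reactor; in the branch the same logic runs as a callback on the Deferred returned by `self.slave.status()`.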
1891=== modified file 'lib/lp/buildmaster/model/buildfarmjobbehavior.py'
1892--- lib/lp/buildmaster/model/buildfarmjobbehavior.py 2010-08-20 20:31:18 +0000
1893+++ lib/lp/buildmaster/model/buildfarmjobbehavior.py 2010-10-25 19:14:01 +0000
1894@@ -16,13 +16,18 @@
1895 import socket
1896 import xmlrpclib
1897
1898+from twisted.internet import defer
1899+
1900 from zope.component import getUtility
1901 from zope.interface import implements
1902 from zope.security.proxy import removeSecurityProxy
1903
1904 from canonical import encoding
1905 from canonical.librarian.interfaces import ILibrarianClient
1906-from lp.buildmaster.interfaces.builder import CorruptBuildCookie
1907+from lp.buildmaster.interfaces.builder import (
1908+ BuildSlaveFailure,
1909+ CorruptBuildCookie,
1910+ )
1911 from lp.buildmaster.interfaces.buildfarmjobbehavior import (
1912 BuildBehaviorMismatch,
1913 IBuildFarmJobBehavior,
1914@@ -69,54 +74,53 @@
1915 """See `IBuildFarmJobBehavior`."""
1916 logger = logging.getLogger('slave-scanner')
1917
1918- try:
1919- slave_status = self._builder.slaveStatus()
1920- except (xmlrpclib.Fault, socket.error), info:
1921- # XXX cprov 2005-06-29:
1922- # Hmm, a problem with the xmlrpc interface,
1923- # disable the builder ?? or simple notice the failure
1924- # with a timestamp.
1925+ d = self._builder.slaveStatus()
1926+
1927+ def got_failure(failure):
1928+ failure.trap(xmlrpclib.Fault, socket.error)
1929+ info = failure.value
1930 info = ("Could not contact the builder %s, caught a (%s)"
1931 % (queueItem.builder.url, info))
1932- logger.debug(info, exc_info=True)
1933- # keep the job for scan
1934- return
1935-
1936- builder_status_handlers = {
1937- 'BuilderStatus.IDLE': self.updateBuild_IDLE,
1938- 'BuilderStatus.BUILDING': self.updateBuild_BUILDING,
1939- 'BuilderStatus.ABORTING': self.updateBuild_ABORTING,
1940- 'BuilderStatus.ABORTED': self.updateBuild_ABORTED,
1941- 'BuilderStatus.WAITING': self.updateBuild_WAITING,
1942- }
1943-
1944- builder_status = slave_status['builder_status']
1945- if builder_status not in builder_status_handlers:
1946- logger.critical(
1947- "Builder on %s returned unknown status %s, failing it"
1948- % (self._builder.url, builder_status))
1949- self._builder.failBuilder(
1950- "Unknown status code (%s) returned from status() probe."
1951- % builder_status)
1952- # XXX: This will leave the build and job in a bad state, but
1953- # should never be possible, since our builder statuses are
1954- # known.
1955- queueItem._builder = None
1956- queueItem.setDateStarted(None)
1957- return
1958-
1959- # Since logtail is a xmlrpclib.Binary container and it is returned
1960- # from the IBuilder content class, it arrives protected by a Zope
1961- # Security Proxy, which is not declared, thus empty. Before passing
1962- # it to the status handlers we will simply remove the proxy.
1963- logtail = removeSecurityProxy(slave_status.get('logtail'))
1964-
1965- method = builder_status_handlers[builder_status]
1966- try:
1967- method(queueItem, slave_status, logtail, logger)
1968- except TypeError, e:
1969- logger.critical("Received wrong number of args in response.")
1970- logger.exception(e)
1971+ raise BuildSlaveFailure(info)
1972+
1973+ def got_status(slave_status):
1974+ builder_status_handlers = {
1975+ 'BuilderStatus.IDLE': self.updateBuild_IDLE,
1976+ 'BuilderStatus.BUILDING': self.updateBuild_BUILDING,
1977+ 'BuilderStatus.ABORTING': self.updateBuild_ABORTING,
1978+ 'BuilderStatus.ABORTED': self.updateBuild_ABORTED,
1979+ 'BuilderStatus.WAITING': self.updateBuild_WAITING,
1980+ }
1981+
1982+ builder_status = slave_status['builder_status']
1983+ if builder_status not in builder_status_handlers:
1984+ logger.critical(
1985+ "Builder on %s returned unknown status %s, failing it"
1986+ % (self._builder.url, builder_status))
1987+ self._builder.failBuilder(
1988+ "Unknown status code (%s) returned from status() probe."
1989+ % builder_status)
1990+ # XXX: This will leave the build and job in a bad state, but
1991+ # should never be possible, since our builder statuses are
1992+ # known.
1993+ queueItem._builder = None
1994+ queueItem.setDateStarted(None)
1995+ return
1996+
1997+ # Since logtail is a xmlrpclib.Binary container and it is
1998+ # returned from the IBuilder content class, it arrives
1999+ # protected by a Zope Security Proxy, which is not declared,
2000+ # thus empty. Before passing it to the status handlers we
2001+ # will simply remove the proxy.
2002+ logtail = removeSecurityProxy(slave_status.get('logtail'))
2003+
2004+ method = builder_status_handlers[builder_status]
2005+ return defer.maybeDeferred(
2006+ method, queueItem, slave_status, logtail, logger)
2007+
2008+ d.addErrback(got_failure)
2009+ d.addCallback(got_status)
2010+ return d
2011
2012 def updateBuild_IDLE(self, queueItem, slave_status, logtail, logger):
2013 """Somehow the builder forgot about the build job.
2014@@ -146,11 +150,13 @@
2015
2016 Clean the builder for another jobs.
2017 """
2018- queueItem.builder.cleanSlave()
2019- queueItem.builder = None
2020- if queueItem.job.status != JobStatus.FAILED:
2021- queueItem.job.fail()
2022- queueItem.specific_job.jobAborted()
2023+ d = queueItem.builder.cleanSlave()
2024+ def got_cleaned(ignored):
2025+ queueItem.builder = None
2026+ if queueItem.job.status != JobStatus.FAILED:
2027+ queueItem.job.fail()
2028+ queueItem.specific_job.jobAborted()
2029+ return d.addCallback(got_cleaned)
2030
2031 def extractBuildStatus(self, slave_status):
2032 """Read build status name.
2033@@ -185,6 +191,8 @@
2034 # XXX: dsilvers 2005-03-02: Confirm the builder has the right build?
2035
2036 build = queueItem.specific_job.build
2037+ # XXX 2010-10-18 bug=662631
2038+ # Change this to do non-blocking IO.
2039 build.handleStatus(build_status, librarian, slave_status)
2040
2041
2042
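Note on `updateBuild` in the buildfarmjobbehavior.py diff above: `got_status` looks up a handler by the `builder_status` string and takes a separate path when the status is unknown. A minimal sketch of that table-driven dispatch follows. The names `dispatch_status` and `on_unknown` are hypothetical, and the lambdas stand in for the real `updateBuild_*` methods, which take more arguments.

```python
def dispatch_status(handlers, slave_status, on_unknown):
    """Call the handler registered for slave_status['builder_status'].

    `handlers` maps status strings to callables taking the status dict.
    `on_unknown` is called with the status string when nothing matches.
    Both parameters are illustrative, not the branch's API.
    """
    builder_status = slave_status['builder_status']
    handler = handlers.get(builder_status)
    if handler is None:
        # The branch logs a critical error and fails the builder here.
        return on_unknown(builder_status)
    return handler(slave_status)


# Stand-in handlers; the real ones are the updateBuild_* methods.
example_handlers = {
    'BuilderStatus.IDLE': lambda status: 'idle',
    'BuilderStatus.BUILDING': lambda status: 'building',
}
```

In the branch the chosen handler is invoked through `defer.maybeDeferred`, so a handler may return either a plain value or a Deferred, and any exception it raises becomes a failure on the returned Deferred.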
2043=== modified file 'lib/lp/buildmaster/model/packagebuild.py'
2044--- lib/lp/buildmaster/model/packagebuild.py 2010-10-02 11:41:43 +0000
2045+++ lib/lp/buildmaster/model/packagebuild.py 2010-10-25 19:14:01 +0000
2046@@ -165,6 +165,8 @@
2047 def getLogFromSlave(package_build):
2048 """See `IPackageBuild`."""
2049 builder = package_build.buildqueue_record.builder
2050+ # XXX 2010-10-18 bug=662631
2051+ # Change this to do non-blocking IO.
2052 return builder.transferSlaveFileToLibrarian(
2053 SLAVE_LOG_FILENAME,
2054 package_build.buildqueue_record.getLogFileName(),
2055@@ -180,6 +182,8 @@
2056 # log, builder and date_finished are read-only, so we must
2057 # currently remove the security proxy to set them.
2058 naked_build = removeSecurityProxy(build)
2059+ # XXX 2010-10-18 bug=662631
2060+ # Change this to do non-blocking IO.
2061 naked_build.log = build.getLogFromSlave(build)
2062 naked_build.builder = build.buildqueue_record.builder
2063 # XXX cprov 20060615 bug=120584: Currently buildduration includes
2064@@ -276,6 +280,8 @@
2065 logger.critical("Unknown BuildStatus '%s' for builder '%s'"
2066 % (status, self.buildqueue_record.builder.url))
2067 return
2068+ # XXX 2010-10-18 bug=662631
2069+ # Change this to do non-blocking IO.
2070 method(librarian, slave_status, logger)
2071
2072 def _handleStatus_OK(self, librarian, slave_status, logger):
2073
2074=== modified file 'lib/lp/buildmaster/tests/mock_slaves.py'
2075--- lib/lp/buildmaster/tests/mock_slaves.py 2010-09-23 12:35:21 +0000
2076+++ lib/lp/buildmaster/tests/mock_slaves.py 2010-10-25 19:14:01 +0000
2077@@ -6,21 +6,40 @@
2078 __metaclass__ = type
2079
2080 __all__ = [
2081+ 'AbortedSlave',
2082+ 'AbortingSlave',
2083+ 'BrokenSlave',
2084+ 'BuildingSlave',
2085+ 'CorruptBehavior',
2086+ 'DeadProxy',
2087+ 'LostBuildingBrokenSlave',
2088 'MockBuilder',
2089- 'LostBuildingBrokenSlave',
2090- 'BrokenSlave',
2091 'OkSlave',
2092- 'BuildingSlave',
2093- 'AbortedSlave',
2094+ 'SlaveTestHelpers',
2095+ 'TrivialBehavior',
2096 'WaitingSlave',
2097- 'AbortingSlave',
2098 ]
2099
2100+import fixtures
2101+import os
2102+
2103 from StringIO import StringIO
2104 import xmlrpclib
2105
2106-from lp.buildmaster.interfaces.builder import CannotFetchFile
2107+from testtools.content import Content
2108+from testtools.content_type import UTF8_TEXT
2109+
2110+from twisted.internet import defer
2111+from twisted.web import xmlrpc
2112+
2113+from canonical.buildd.tests.harness import BuilddSlaveTestSetup
2114+
2115+from lp.buildmaster.interfaces.builder import (
2116+ CannotFetchFile,
2117+ CorruptBuildCookie,
2118+ )
2119 from lp.buildmaster.model.builder import (
2120+ BuilderSlave,
2121 rescueBuilderIfLost,
2122 updateBuilderStatus,
2123 )
2124@@ -59,15 +78,9 @@
2125 slave_build_id)
2126
2127 def cleanSlave(self):
2128- # XXX: This should not print anything. The print is only here to make
2129- # doc/builder.txt a meaningful test.
2130- print 'Cleaning slave'
2131 return self.slave.clean()
2132
2133 def requestAbort(self):
2134- # XXX: This should not print anything. The print is only here to make
2135- # doc/builder.txt a meaningful test.
2136- print 'Aborting slave'
2137 return self.slave.abort()
2138
2139 def resumeSlave(self, logger):
2140@@ -77,10 +90,10 @@
2141 pass
2142
2143 def rescueIfLost(self, logger=None):
2144- rescueBuilderIfLost(self, logger)
2145+ return rescueBuilderIfLost(self, logger)
2146
2147 def updateStatus(self, logger=None):
2148- updateBuilderStatus(self, logger)
2149+ return defer.maybeDeferred(updateBuilderStatus, self, logger)
2150
2151
2152 # XXX: It would be *really* nice to run some set of tests against the real
2153@@ -95,36 +108,44 @@
2154 self.arch_tag = arch_tag
2155
2156 def status(self):
2157- return ('BuilderStatus.IDLE', '')
2158+ return defer.succeed(('BuilderStatus.IDLE', ''))
2159
2160 def ensurepresent(self, sha1, url, user=None, password=None):
2161 self.call_log.append(('ensurepresent', url, user, password))
2162- return True, None
2163+ return defer.succeed((True, None))
2164
2165 def build(self, buildid, buildtype, chroot, filemap, args):
2166 self.call_log.append(
2167 ('build', buildid, buildtype, chroot, filemap.keys(), args))
2168 info = 'OkSlave BUILDING'
2169- return ('BuildStatus.Building', info)
2170+ return defer.succeed(('BuildStatus.Building', info))
2171
2172 def echo(self, *args):
2173 self.call_log.append(('echo',) + args)
2174- return args
2175+ return defer.succeed(args)
2176
2177 def clean(self):
2178 self.call_log.append('clean')
2179+ return defer.succeed(None)
2180
2181 def abort(self):
2182 self.call_log.append('abort')
2183+ return defer.succeed(None)
2184
2185 def info(self):
2186 self.call_log.append('info')
2187- return ('1.0', self.arch_tag, 'debian')
2188+ return defer.succeed(('1.0', self.arch_tag, 'debian'))
2189+
2190+ def resume(self):
2191+ self.call_log.append('resume')
2192+ return defer.succeed(("", "", 0))
2193
2194 def sendFileToSlave(self, sha1, url, username="", password=""):
2195- present, info = self.ensurepresent(sha1, url, username, password)
2196- if not present:
2197- raise CannotFetchFile(url, info)
2198+ d = self.ensurepresent(sha1, url, username, password)
2199+ def check_present((present, info)):
2200+ if not present:
2201+ raise CannotFetchFile(url, info)
2202+ return d.addCallback(check_present)
2203
2204 def cacheFile(self, logger, libraryfilealias):
2205 return self.sendFileToSlave(
2206@@ -141,9 +162,11 @@
2207 def status(self):
2208 self.call_log.append('status')
2209 buildlog = xmlrpclib.Binary("This is a build log")
2210- return ('BuilderStatus.BUILDING', self.build_id, buildlog)
2211+ return defer.succeed(
2212+ ('BuilderStatus.BUILDING', self.build_id, buildlog))
2213
2214 def getFile(self, sum):
2215+ # XXX: This needs to be updated to return a Deferred.
2216 self.call_log.append('getFile')
2217 if sum == "buildlog":
2218 s = StringIO("This is a build log")
2219@@ -155,11 +178,15 @@
2220 """A mock slave that looks like it's currently waiting."""
2221
2222 def __init__(self, state='BuildStatus.OK', dependencies=None,
2223- build_id='1-1'):
2224+ build_id='1-1', filemap=None):
2225 super(WaitingSlave, self).__init__()
2226 self.state = state
2227 self.dependencies = dependencies
2228 self.build_id = build_id
2229+ if filemap is None:
2230+ self.filemap = {}
2231+ else:
2232+ self.filemap = filemap
2233
2234 # By default, the slave only has a buildlog, but callsites
2235 # can update this list as needed.
2236@@ -167,10 +194,12 @@
2237
2238 def status(self):
2239 self.call_log.append('status')
2240- return ('BuilderStatus.WAITING', self.state, self.build_id, {},
2241- self.dependencies)
2242+ return defer.succeed((
2243+ 'BuilderStatus.WAITING', self.state, self.build_id, self.filemap,
2244+ self.dependencies))
2245
2246 def getFile(self, hash):
2247+ # XXX: This needs to be updated to return a Deferred.
2248 self.call_log.append('getFile')
2249 if hash in self.valid_file_hashes:
2250 content = "This is a %s" % hash
2251@@ -184,15 +213,19 @@
2252
2253 def status(self):
2254 self.call_log.append('status')
2255- return ('BuilderStatus.ABORTING', '1-1')
2256+ return defer.succeed(('BuilderStatus.ABORTING', '1-1'))
2257
2258
2259 class AbortedSlave(OkSlave):
2260 """A mock slave that looks like it's aborted."""
2261
2262+ def clean(self):
2263+ self.call_log.append('clean')
2264+ return defer.succeed(None)
2265+
2266 def status(self):
2267- self.call_log.append('status')
2268- return ('BuilderStatus.ABORTED', '1-1')
2269+ self.call_log.append('status')
2270+ return defer.succeed(('BuilderStatus.ABORTED', '1-1'))
2271
2272
2273 class LostBuildingBrokenSlave:
2274@@ -206,16 +239,108 @@
2275
2276 def status(self):
2277 self.call_log.append('status')
2278- return ('BuilderStatus.BUILDING', '1000-10000')
2279+ return defer.succeed(('BuilderStatus.BUILDING', '1000-10000'))
2280
2281 def abort(self):
2282 self.call_log.append('abort')
2283- raise xmlrpclib.Fault(8002, "Could not abort")
2284+ return defer.fail(xmlrpclib.Fault(8002, "Could not abort"))
2285
2286
2287 class BrokenSlave:
2288 """A mock slave that reports that it is broken."""
2289
2290+ def __init__(self):
2291+ self.call_log = []
2292+
2293 def status(self):
2294 self.call_log.append('status')
2295- raise xmlrpclib.Fault(8001, "Broken slave")
2296+ return defer.fail(xmlrpclib.Fault(8001, "Broken slave"))
2297+
2298+
2299+class CorruptBehavior:
2300+
2301+ def verifySlaveBuildCookie(self, cookie):
2302+ raise CorruptBuildCookie("Bad value: %r" % (cookie,))
2303+
2304+
2305+class TrivialBehavior:
2306+
2307+ def verifySlaveBuildCookie(self, cookie):
2308+ pass
2309+
2310+
2311+class DeadProxy(xmlrpc.Proxy):
2312+ """An xmlrpc.Proxy that doesn't actually send any messages.
2313+
2314+ Used when you want to test timeouts, for example.
2315+ """
2316+
2317+ def callRemote(self, *args, **kwargs):
2318+ return defer.Deferred()
2319+
2320+
2321+class SlaveTestHelpers(fixtures.Fixture):
2322+
2323+ # The URL for the XML-RPC service set up by `BuilddSlaveTestSetup`.
2324+ BASE_URL = 'http://localhost:8221'
2325+ TEST_URL = '%s/rpc/' % (BASE_URL,)
2326+
2327+ def getServerSlave(self):
2328+ """Set up a test build slave server.
2329+
2330+ :return: A `BuilddSlaveTestSetup` object.
2331+ """
2332+ tachandler = BuilddSlaveTestSetup()
2333+ tachandler.setUp()
2334+ # Basically impossible to do this w/ TrialTestCase. But it would be
2335+ # really nice to keep it.
2336+ #
2337+ # def addLogFile(exc_info):
2338+ # self.addDetail(
2339+ # 'xmlrpc-log-file',
2340+ # Content(UTF8_TEXT, lambda: open(tachandler.logfile, 'r').read()))
2341+ # self.addOnException(addLogFile)
2342+ self.addCleanup(tachandler.tearDown)
2343+ return tachandler
2344+
2345+ def getClientSlave(self, reactor=None, proxy=None):
2346+ """Return a `BuilderSlave` for use in testing.
2347+
2348+ Points to a fixed URL that is also used by `BuilddSlaveTestSetup`.
2349+ """
2350+ return BuilderSlave.makeBuilderSlave(
2351+ self.TEST_URL, 'vmhost', reactor, proxy)
2352+
2353+ def makeCacheFile(self, tachandler, filename):
2354+ """Make a cache file available on the remote slave.
2355+
2356+ :param tachandler: The TacTestSetup object used to start the remote
2357+ slave.
2358+ :param filename: The name of the file to create in the file cache
2359+ area.
2360+ """
2361+ path = os.path.join(tachandler.root, 'filecache', filename)
2362+ fd = open(path, 'w')
2363+ fd.write('something')
2364+ fd.close()
2365+ self.addCleanup(os.unlink, path)
2366+
2367+ def triggerGoodBuild(self, slave, build_id=None):
2368+ """Trigger a good build on 'slave'.
2369+
2370+ :param slave: A `BuilderSlave` instance to trigger the build on.
2371+ :param build_id: The build identifier. If not specified, defaults to
2372+ an arbitrary string.
2373+ :type build_id: str
2374+ :return: The build id returned by the slave.
2375+ """
2376+ if build_id is None:
2377+ build_id = 'random-build-id'
2378+ tachandler = self.getServerSlave()
2379+ chroot_file = 'fake-chroot'
2380+ dsc_file = 'thing'
2381+ self.makeCacheFile(tachandler, chroot_file)
2382+ self.makeCacheFile(tachandler, dsc_file)
2383+ return slave.build(
2384+ build_id, 'debian', chroot_file, {'.dsc': dsc_file},
2385+ {'ogrecomponent': 'main'})
2386
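Reviewer's sketch (not part of the branch): the mock slaves in mock_slaves.py all follow one pattern: record the call in call_log, then hand back a result that is already available. The sketch below tries that shape without Twisted or a Launchpad tree, using asyncio coroutines as a stand-in for Deferreds. FakeOkSlave, FakeRefusingSlave, send_file_to_slave, run_demo and the local CannotFetchFile are all hypothetical names that only mirror mock_slaves.py; this is not the Launchpad implementation.

```python
import asyncio


class CannotFetchFile(Exception):
    """Local stand-in for the CannotFetchFile imported in mock_slaves.py."""

    def __init__(self, file_url, error_information):
        super().__init__(file_url, error_information)
        self.file_url = file_url
        self.error_information = error_information


class FakeOkSlave:
    """Hypothetical mock: records each call, then answers immediately."""

    def __init__(self):
        self.call_log = []

    async def ensurepresent(self, sha1, url, user=None, password=None):
        self.call_log.append(('ensurepresent', url, user, password))
        return True, None


class FakeRefusingSlave(FakeOkSlave):
    """Hypothetical mock whose ensurepresent always reports failure."""

    async def ensurepresent(self, sha1, url, user=None, password=None):
        self.call_log.append(('ensurepresent', url, user, password))
        return False, 'no such file'


async def send_file_to_slave(slave, sha1, url, username="", password=""):
    # Unpack the (present, info) pair in the body. Python 3 no longer
    # accepts tuple parameters such as `def check_present((present, info))`.
    present, info = await slave.ensurepresent(sha1, url, username, password)
    if not present:
        raise CannotFetchFile(url, info)


def run_demo():
    """Drive both mocks and return their call logs plus any error raised."""
    ok = FakeOkSlave()
    asyncio.run(send_file_to_slave(ok, 'abc', 'http://example.com/f'))

    refusing = FakeRefusingSlave()
    try:
        asyncio.run(
            send_file_to_slave(refusing, 'abc', 'http://example.com/f'))
    except CannotFetchFile as e:
        error = e
    else:
        error = None
    return ok.call_log, refusing.call_log, error
```

Calling run_demo() returns both call logs and the CannotFetchFile raised by the refusing mock, so a test can assert on what was called without a running slave. This is the same idea the call_log assertions in test_builder.py rely on.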
2387=== modified file 'lib/lp/buildmaster/tests/test_builder.py'
2388--- lib/lp/buildmaster/tests/test_builder.py 2010-10-06 09:06:30 +0000
2389+++ lib/lp/buildmaster/tests/test_builder.py 2010-10-25 19:14:01 +0000
2390@@ -3,20 +3,24 @@
2391
2392 """Test Builder features."""
2393
2394-import errno
2395 import os
2396-import socket
2397+import signal
2398 import xmlrpclib
2399
2400-from testtools.content import Content
2401-from testtools.content_type import UTF8_TEXT
2402+from twisted.web.client import getPage
2403+
2404+from twisted.internet.defer import CancelledError
2405+from twisted.internet.task import Clock
2406+from twisted.python.failure import Failure
2407+from twisted.trial.unittest import TestCase as TrialTestCase
2408
2409 from zope.component import getUtility
2410 from zope.security.proxy import removeSecurityProxy
2411
2412 from canonical.buildd.slave import BuilderStatus
2413-from canonical.buildd.tests.harness import BuilddSlaveTestSetup
2414+from canonical.config import config
2415 from canonical.database.sqlbase import flush_database_updates
2416+from canonical.launchpad.scripts import QuietFakeLogger
2417 from canonical.launchpad.webapp.interfaces import (
2418 DEFAULT_FLAVOR,
2419 IStoreSelector,
2420@@ -24,21 +28,38 @@
2421 )
2422 from canonical.testing.layers import (
2423 DatabaseFunctionalLayer,
2424- LaunchpadZopelessLayer
2425+ LaunchpadZopelessLayer,
2426+ TwistedLaunchpadZopelessLayer,
2427+ TwistedLayer,
2428 )
2429 from lp.buildmaster.enums import BuildStatus
2430-from lp.buildmaster.interfaces.builder import IBuilder, IBuilderSet
2431+from lp.buildmaster.interfaces.builder import (
2432+ CannotFetchFile,
2433+ IBuilder,
2434+ IBuilderSet,
2435+ )
2436 from lp.buildmaster.interfaces.buildfarmjobbehavior import (
2437 IBuildFarmJobBehavior,
2438 )
2439 from lp.buildmaster.interfaces.buildqueue import IBuildQueueSet
2440-from lp.buildmaster.model.builder import BuilderSlave
2441+from lp.buildmaster.interfaces.builder import CannotResumeHost
2442 from lp.buildmaster.model.buildfarmjobbehavior import IdleBuildBehavior
2443 from lp.buildmaster.model.buildqueue import BuildQueue
2444 from lp.buildmaster.tests.mock_slaves import (
2445 AbortedSlave,
2446+ AbortingSlave,
2447+ BrokenSlave,
2448+ BuildingSlave,
2449+ CorruptBehavior,
2450+ DeadProxy,
2451+ LostBuildingBrokenSlave,
2452 MockBuilder,
2453+ OkSlave,
2454+ SlaveTestHelpers,
2455+ TrivialBehavior,
2456+ WaitingSlave,
2457 )
2458+from lp.services.job.interfaces.job import JobStatus
2459 from lp.soyuz.enums import (
2460 ArchivePurpose,
2461 PackagePublishingStatus,
2462@@ -49,9 +70,12 @@
2463 )
2464 from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
2465 from lp.testing import (
2466- TestCase,
2467+ ANONYMOUS,
2468+ login_as,
2469+ logout,
2470 TestCaseWithFactory,
2471 )
2472+from lp.testing.factory import LaunchpadObjectFactory
2473 from lp.testing.fakemethod import FakeMethod
2474
2475
2476@@ -92,42 +116,121 @@
2477 bq = builder.getBuildQueue()
2478 self.assertIs(None, bq)
2479
2480- def test_updateBuilderStatus_catches_repeated_EINTR(self):
2481- # A single EINTR return from a socket operation should cause the
2482- # operation to be retried, not fail/reset the builder.
2483- builder = removeSecurityProxy(self.factory.makeBuilder())
2484- builder.handleTimeout = FakeMethod()
2485- builder.rescueIfLost = FakeMethod()
2486-
2487- def _fake_checkSlaveAlive():
2488- # Raise an EINTR error for all invocations.
2489- raise socket.error(errno.EINTR, "fake eintr")
2490-
2491- builder.checkSlaveAlive = _fake_checkSlaveAlive
2492- builder.updateStatus()
2493-
2494- # builder.updateStatus should eventually have called
2495- # handleTimeout()
2496- self.assertEqual(1, builder.handleTimeout.call_count)
2497-
2498- def test_updateBuilderStatus_catches_single_EINTR(self):
2499- builder = removeSecurityProxy(self.factory.makeBuilder())
2500- builder.handleTimeout = FakeMethod()
2501- builder.rescueIfLost = FakeMethod()
2502- self.eintr_returned = False
2503-
2504- def _fake_checkSlaveAlive():
2505- # raise an EINTR error for the first invocation only.
2506- if not self.eintr_returned:
2507- self.eintr_returned = True
2508- raise socket.error(errno.EINTR, "fake eintr")
2509-
2510- builder.checkSlaveAlive = _fake_checkSlaveAlive
2511- builder.updateStatus()
2512-
2513- # builder.updateStatus should never call handleTimeout() for a
2514- # single EINTR.
2515- self.assertEqual(0, builder.handleTimeout.call_count)
2516+
2517+class TestBuilderWithTrial(TrialTestCase):
2518+
2519+ layer = TwistedLaunchpadZopelessLayer
2520+
2521+ def setUp(self):
2522+ super(TestBuilderWithTrial, self).setUp()
2523+ self.slave_helper = SlaveTestHelpers()
2524+ self.slave_helper.setUp()
2525+ self.addCleanup(self.slave_helper.cleanUp)
2526+ self.factory = LaunchpadObjectFactory()
2527+ login_as(ANONYMOUS)
2528+ self.addCleanup(logout)
2529+
2530+ def test_updateStatus_aborts_lost_and_broken_slave(self):
2531+ # A slave that's 'lost' should be aborted; when the slave is
2532+ # broken then abort() should also throw a fault.
2533+ slave = LostBuildingBrokenSlave()
2534+ lostbuilding_builder = MockBuilder(
2535+ 'Lost Building Broken Slave', slave, behavior=CorruptBehavior())
2536+ d = lostbuilding_builder.updateStatus(QuietFakeLogger())
2537+ def check_slave_status(failure):
2538+ self.assertIn('abort', slave.call_log)
2539+ # 'Fault' comes from the LostBuildingBrokenSlave, this is
2540+ # just testing that the value is passed through.
2541+ self.assertIsInstance(failure.value, xmlrpclib.Fault)
2542+ return d.addBoth(check_slave_status)
2543+
2544+ def test_resumeSlaveHost_nonvirtual(self):
2545+ builder = self.factory.makeBuilder(virtualized=False)
2546+ d = builder.resumeSlaveHost()
2547+ return self.assertFailure(d, CannotResumeHost)
2548+
2549+ def test_resumeSlaveHost_no_vmhost(self):
2550+ builder = self.factory.makeBuilder(virtualized=True, vm_host=None)
2551+ d = builder.resumeSlaveHost()
2552+ return self.assertFailure(d, CannotResumeHost)
2553+
2554+ def test_resumeSlaveHost_success(self):
2555+ reset_config = """
2556+ [builddmaster]
2557+ vm_resume_command: /bin/echo -n parp"""
2558+ config.push('reset', reset_config)
2559+ self.addCleanup(config.pop, 'reset')
2560+
2561+ builder = self.factory.makeBuilder(virtualized=True, vm_host="pop")
2562+ d = builder.resumeSlaveHost()
2563+ def got_resume(output):
2564+ self.assertEqual(('parp', ''), output)
2565+ return d.addCallback(got_resume)
2566+
2567+ def test_resumeSlaveHost_command_failed(self):
2568+ reset_fail_config = """
2569+ [builddmaster]
2570+ vm_resume_command: /bin/false"""
2571+ config.push('reset fail', reset_fail_config)
2572+ self.addCleanup(config.pop, 'reset fail')
2573+ builder = self.factory.makeBuilder(virtualized=True, vm_host="pop")
2574+ d = builder.resumeSlaveHost()
2575+ return self.assertFailure(d, CannotResumeHost)
2576+
2577+ def test_handleTimeout_resume_failure(self):
2578+ reset_fail_config = """
2579+ [builddmaster]
2580+ vm_resume_command: /bin/false"""
2581+ config.push('reset fail', reset_fail_config)
2582+ self.addCleanup(config.pop, 'reset fail')
2583+ builder = self.factory.makeBuilder(virtualized=True, vm_host="pop")
2584+ builder.builderok = True
2585+ d = builder.handleTimeout(QuietFakeLogger(), 'blah')
2586+ return self.assertFailure(d, CannotResumeHost)
2587+
2588+ def _setupRecipeBuildAndBuilder(self):
2589+ # Helper function to make a builder capable of building a
2590+ # recipe, returning both.
2591+ processor = self.factory.makeProcessor(name="i386")
2592+ builder = self.factory.makeBuilder(
2593+ processor=processor, virtualized=True, vm_host="bladh")
2594+ builder.setSlaveForTesting(OkSlave())
2595+ distroseries = self.factory.makeDistroSeries()
2596+ das = self.factory.makeDistroArchSeries(
2597+ distroseries=distroseries, architecturetag="i386",
2598+ processorfamily=processor.family)
2599+ chroot = self.factory.makeLibraryFileAlias()
2600+ das.addOrUpdateChroot(chroot)
2601+ distroseries.nominatedarchindep = das
2602+ build = self.factory.makeSourcePackageRecipeBuild(
2603+ distroseries=distroseries)
2604+ return builder, build
2605+
2606+ def test_findAndStartJob_returns_candidate(self):
2607+ # findAndStartJob finds the next queued job using _findBuildCandidate.
2608+ # We don't care about the type of build at all.
2609+ builder, build = self._setupRecipeBuildAndBuilder()
2610+ candidate = build.queueBuild()
2611+ # _findBuildCandidate is tested elsewhere, we just make sure that
2612+ # findAndStartJob delegates to it.
2613+ removeSecurityProxy(builder)._findBuildCandidate = FakeMethod(
2614+ result=candidate)
2615+ d = builder.findAndStartJob()
2616+ return d.addCallback(self.assertEqual, candidate)
2617+
2618+ def test_findAndStartJob_starts_job(self):
2619+ # findAndStartJob finds the next queued job using _findBuildCandidate
2620+ # and then starts it.
2621+ # We don't care about the type of build at all.
2622+ builder, build = self._setupRecipeBuildAndBuilder()
2623+ candidate = build.queueBuild()
2624+ removeSecurityProxy(builder)._findBuildCandidate = FakeMethod(
2625+ result=candidate)
2626+ d = builder.findAndStartJob()
2627+ def check_build_started(candidate):
2628+ self.assertEqual(candidate.builder, builder)
2629+ self.assertEqual(BuildStatus.BUILDING, build.status)
2630+ return d.addCallback(check_build_started)
2631
2632 def test_slave(self):
2633 # Builder.slave is a BuilderSlave that points at the actual Builder.
2634@@ -136,25 +239,147 @@
2635 builder = removeSecurityProxy(self.factory.makeBuilder())
2636 self.assertEqual(builder.url, builder.slave.url)
2637
2638-
2639-class Test_rescueBuilderIfLost(TestCaseWithFactory):
2640- """Tests for lp.buildmaster.model.builder.rescueBuilderIfLost."""
2641-
2642- layer = LaunchpadZopelessLayer
2643-
2644 def test_recovery_of_aborted_slave(self):
2645 # If a slave is in the ABORTED state, rescueBuilderIfLost should
2646 # clean it if we don't think it's currently building anything.
2647 # See bug 463046.
2648 aborted_slave = AbortedSlave()
2649- # The slave's clean() method is normally an XMLRPC call, so we
2650- # can just stub it out and check that it got called.
2651- aborted_slave.clean = FakeMethod()
2652 builder = MockBuilder("mock_builder", aborted_slave)
2653 builder.currentjob = None
2654- builder.rescueIfLost()
2655-
2656- self.assertEqual(1, aborted_slave.clean.call_count)
2657+ d = builder.rescueIfLost()
2658+ def check_slave_calls(ignored):
2659+ self.assertIn('clean', aborted_slave.call_log)
2660+ return d.addCallback(check_slave_calls)
2661+
2662+ def test_recover_ok_slave(self):
2663+ # An idle slave is not rescued.
2664+ slave = OkSlave()
2665+ builder = MockBuilder("mock_builder", slave, TrivialBehavior())
2666+ d = builder.rescueIfLost()
2667+ def check_slave_calls(ignored):
2668+ self.assertNotIn('abort', slave.call_log)
2669+ self.assertNotIn('clean', slave.call_log)
2670+ return d.addCallback(check_slave_calls)
2671+
2672+ def test_recover_waiting_slave_with_good_id(self):
2673+ # rescueIfLost does not attempt to abort or clean a builder that is
2674+ # WAITING.
2675+ waiting_slave = WaitingSlave()
2676+ builder = MockBuilder("mock_builder", waiting_slave, TrivialBehavior())
2677+ d = builder.rescueIfLost()
2678+ def check_slave_calls(ignored):
2679+ self.assertNotIn('abort', waiting_slave.call_log)
2680+ self.assertNotIn('clean', waiting_slave.call_log)
2681+ return d.addCallback(check_slave_calls)
2682+
2683+ def test_recover_waiting_slave_with_bad_id(self):
2684+ # If a slave is WAITING with a build for us to get, and the build
2685+ # cookie cannot be verified, which means we don't recognize the build,
2686+ # then rescueBuilderIfLost should clean the slave, so that the
2687+ # builder is reset for a new build, and the corrupt build is
2688+ # discarded.
2689+ waiting_slave = WaitingSlave()
2690+ builder = MockBuilder("mock_builder", waiting_slave, CorruptBehavior())
2691+ d = builder.rescueIfLost()
2692+ def check_slave_calls(ignored):
2693+ self.assertNotIn('abort', waiting_slave.call_log)
2694+ self.assertIn('clean', waiting_slave.call_log)
2695+ return d.addCallback(check_slave_calls)
2696+
2697+ def test_recover_building_slave_with_good_id(self):
2698+ # rescueIfLost does not attempt to abort or clean a builder that is
2699+ # BUILDING.
2700+ building_slave = BuildingSlave()
2701+ builder = MockBuilder("mock_builder", building_slave, TrivialBehavior())
2702+ d = builder.rescueIfLost()
2703+ def check_slave_calls(ignored):
2704+ self.assertNotIn('abort', building_slave.call_log)
2705+ self.assertNotIn('clean', building_slave.call_log)
2706+ return d.addCallback(check_slave_calls)
2707+
2708+ def test_recover_building_slave_with_bad_id(self):
2709+ # If a slave is BUILDING with a build id we don't recognize, then we
2710+ # abort the build, thus stopping it in its tracks.
2711+ building_slave = BuildingSlave()
2712+ builder = MockBuilder("mock_builder", building_slave, CorruptBehavior())
2713+ d = builder.rescueIfLost()
2714+ def check_slave_calls(ignored):
2715+ self.assertIn('abort', building_slave.call_log)
2716+ self.assertNotIn('clean', building_slave.call_log)
2717+ return d.addCallback(check_slave_calls)
2718+
2719+
2720+class TestBuilderSlaveStatus(TestBuilderWithTrial):
2721+
2722+ # Verify what IBuilder.slaveStatus returns with slaves in different
2723+ # states.
2724+
2725+ def assertStatus(self, slave, builder_status=None,
2726+ build_status=None, logtail=False, filemap=None,
2727+ dependencies=None):
2728+ builder = self.factory.makeBuilder()
2729+ builder.setSlaveForTesting(slave)
2730+ d = builder.slaveStatus()
2731+
2732+ def got_status(status_dict):
2733+ expected = {}
2734+ if builder_status is not None:
2735+ expected["builder_status"] = builder_status
2736+ if build_status is not None:
2737+ expected["build_status"] = build_status
2738+ if dependencies is not None:
2739+ expected["dependencies"] = dependencies
2740+
2741+ # We don't care so much about the content of the logtail,
2742+ # just that it's there.
2743+ if logtail:
2744+ tail = status_dict.pop("logtail")
2745+ self.assertIsInstance(tail, xmlrpclib.Binary)
2746+
2747+ self.assertEqual(expected, status_dict)
2748+
2749+ return d.addCallback(got_status)
2750+
2751+ def test_slaveStatus_idle_slave(self):
2752+ return self.assertStatus(
2753+ OkSlave(), builder_status='BuilderStatus.IDLE')
2754+
2755+ def test_slaveStatus_building_slave(self):
2756+ return self.assertStatus(
2757+ BuildingSlave(), builder_status='BuilderStatus.BUILDING',
2758+ logtail=True)
2759+
2760+ def test_slaveStatus_waiting_slave(self):
2761+ return self.assertStatus(
2762+ WaitingSlave(), builder_status='BuilderStatus.WAITING',
2763+ build_status='BuildStatus.OK', filemap={})
2764+
2765+ def test_slaveStatus_aborting_slave(self):
2766+ return self.assertStatus(
2767+ AbortingSlave(), builder_status='BuilderStatus.ABORTING')
2768+
2769+ def test_slaveStatus_aborted_slave(self):
2770+ return self.assertStatus(
2771+ AbortedSlave(), builder_status='BuilderStatus.ABORTED')
2772+
2773+ def test_isAvailable_with_not_builderok(self):
2774+ # isAvailable() is a wrapper around slaveStatusSentence()
2775+ builder = self.factory.makeBuilder()
2776+ builder.builderok = False
2777+ d = builder.isAvailable()
2778+ return d.addCallback(self.assertFalse)
2779+
2780+ def test_isAvailable_with_slave_fault(self):
2781+ builder = self.factory.makeBuilder()
2782+ builder.setSlaveForTesting(BrokenSlave())
2783+ d = builder.isAvailable()
2784+ return d.addCallback(self.assertFalse)
2785+
2786+ def test_isAvailable_with_slave_idle(self):
2787+ builder = self.factory.makeBuilder()
2788+ builder.setSlaveForTesting(OkSlave())
2789+ d = builder.isAvailable()
2790+ return d.addCallback(self.assertTrue)
2791
2792
2793 class TestFindBuildCandidateBase(TestCaseWithFactory):
2794@@ -188,6 +413,49 @@
2795 builder.manual = False
2796
2797
2798+class TestFindBuildCandidateGeneralCases(TestFindBuildCandidateBase):
2799+ # Test usage of findBuildCandidate not specific to any archive type.
2800+
2801+ def test_findBuildCandidate_supersedes_builds(self):
2802+ # IBuilder._findBuildCandidate identifies if there are builds
2803+ # for superseded source package releases in the queue and marks
2804+ # the corresponding build record as SUPERSEDED.
2805+ archive = self.factory.makeArchive()
2806+ self.publisher.getPubSource(
2807+ sourcename="gedit", status=PackagePublishingStatus.PUBLISHED,
2808+ archive=archive).createMissingBuilds()
2809+ old_candidate = removeSecurityProxy(
2810+ self.frog_builder)._findBuildCandidate()
2811+
2812+ # The candidate starts off as NEEDSBUILD:
2813+ build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(
2814+ old_candidate)
2815+ self.assertEqual(BuildStatus.NEEDSBUILD, build.status)
2816+
2817+ # Now supersede the source package:
2818+ publication = build.current_source_publication
2819+ publication.status = PackagePublishingStatus.SUPERSEDED
2820+
2821+ # The candidate returned is now a different one:
2822+ new_candidate = removeSecurityProxy(
2823+ self.frog_builder)._findBuildCandidate()
2824+ self.assertNotEqual(new_candidate, old_candidate)
2825+
2826+ # And the old_candidate is superseded:
2827+ self.assertEqual(BuildStatus.SUPERSEDED, build.status)
2828+
2829+ def test_acquireBuildCandidate_marks_building(self):
2830+ # acquireBuildCandidate() should call _findBuildCandidate and
2831+ # mark the build as building.
2832+ archive = self.factory.makeArchive()
2833+ self.publisher.getPubSource(
2834+ sourcename="gedit", status=PackagePublishingStatus.PUBLISHED,
2835+ archive=archive).createMissingBuilds()
2836+ candidate = removeSecurityProxy(
2837+ self.frog_builder).acquireBuildCandidate()
2838+ self.assertEqual(JobStatus.RUNNING, candidate.job.status)
2839+
2840+
2841 class TestFindBuildCandidatePPAWithSingleBuilder(TestCaseWithFactory):
2842
2843 layer = LaunchpadZopelessLayer
2844@@ -320,6 +588,16 @@
2845 build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(next_job)
2846 self.failUnlessEqual('joesppa', build.archive.name)
2847
2848+ def test_findBuildCandidate_with_disabled_archive(self):
2849+ # Disabled archives should not be considered for dispatching
2850+ # builds.
2851+ disabled_job = removeSecurityProxy(self.builder4)._findBuildCandidate()
2852+ build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(
2853+ disabled_job)
2854+ build.archive.disable()
2855+ next_job = removeSecurityProxy(self.builder4)._findBuildCandidate()
2856+ self.assertNotEqual(disabled_job, next_job)
2857+
2858
2859 class TestFindBuildCandidatePrivatePPA(TestFindBuildCandidatePPABase):
2860
2861@@ -332,6 +610,14 @@
2862 build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(next_job)
2863 self.failUnlessEqual('joesppa', build.archive.name)
2864
2865+ # If the source for the build is still pending, it won't be
2866+ # dispatched because the builder has to fetch the source files
2867+ # from the (password protected) repo area, not the librarian.
2868+ pub = build.current_source_publication
2869+ pub.status = PackagePublishingStatus.PENDING
2870+ candidate = removeSecurityProxy(self.builder4)._findBuildCandidate()
2871+ self.assertNotEqual(next_job.id, candidate.id)
2872+
2873
2874 class TestFindBuildCandidateDistroArchive(TestFindBuildCandidateBase):
2875
2876@@ -474,97 +760,48 @@
2877 self.builder.current_build_behavior, BinaryPackageBuildBehavior)
2878
2879
2880-class TestSlave(TestCase):
2881+class TestSlave(TrialTestCase):
2882 """
2883 Integration tests for BuilderSlave that verify how it works against a
2884 real slave server.
2885 """
2886
2887+ layer = TwistedLayer
2888+
2889+ def setUp(self):
2890+ super(TestSlave, self).setUp()
2891+ self.slave_helper = SlaveTestHelpers()
2892+ self.slave_helper.setUp()
2893+ self.addCleanup(self.slave_helper.cleanUp)
2894+
2895 # XXX: JonathanLange 2010-09-20 bug=643521: There are also tests for
2896 # BuilderSlave in buildd-slave.txt and in other places. The tests here
2897 # ought to become the canonical tests for BuilderSlave vs running buildd
2898 # XML-RPC server interaction.
2899
2900- # The URL for the XML-RPC service set up by `BuilddSlaveTestSetup`.
2901- TEST_URL = 'http://localhost:8221/rpc/'
2902-
2903- def getServerSlave(self):
2904- """Set up a test build slave server.
2905-
2906- :return: A `BuilddSlaveTestSetup` object.
2907- """
2908- tachandler = BuilddSlaveTestSetup()
2909- tachandler.setUp()
2910- self.addCleanup(tachandler.tearDown)
2911- def addLogFile(exc_info):
2912- self.addDetail(
2913- 'xmlrpc-log-file',
2914- Content(UTF8_TEXT, lambda: open(tachandler.logfile, 'r').read()))
2915- self.addOnException(addLogFile)
2916- return tachandler
2917-
2918- def getClientSlave(self):
2919- """Return a `BuilderSlave` for use in testing.
2920-
2921- Points to a fixed URL that is also used by `BuilddSlaveTestSetup`.
2922- """
2923- return BuilderSlave.makeBlockingSlave(self.TEST_URL, 'vmhost')
2924-
2925- def makeCacheFile(self, tachandler, filename):
2926- """Make a cache file available on the remote slave.
2927-
2928- :param tachandler: The TacTestSetup object used to start the remote
2929- slave.
2930- :param filename: The name of the file to create in the file cache
2931- area.
2932- """
2933- path = os.path.join(tachandler.root, 'filecache', filename)
2934- fd = open(path, 'w')
2935- fd.write('something')
2936- fd.close()
2937- self.addCleanup(os.unlink, path)
2938-
2939- def triggerGoodBuild(self, slave, build_id=None):
2940- """Trigger a good build on 'slave'.
2941-
2942- :param slave: A `BuilderSlave` instance to trigger the build on.
2943- :param build_id: The build identifier. If not specified, defaults to
2944- an arbitrary string.
2945- :type build_id: str
2946- :return: The build id returned by the slave.
2947- """
2948- if build_id is None:
2949- build_id = self.getUniqueString()
2950- tachandler = self.getServerSlave()
2951- chroot_file = 'fake-chroot'
2952- dsc_file = 'thing'
2953- self.makeCacheFile(tachandler, chroot_file)
2954- self.makeCacheFile(tachandler, dsc_file)
2955- return slave.build(
2956- build_id, 'debian', chroot_file, {'.dsc': dsc_file},
2957- {'ogrecomponent': 'main'})
2958-
2959 # XXX 2010-10-06 Julian bug=655559
2960 # This is failing on buildbot but not locally; it's trying to abort
2961 # before the build has started.
2962 def disabled_test_abort(self):
2963- slave = self.getClientSlave()
2964+ slave = self.slave_helper.getClientSlave()
2965 # We need to be in a BUILDING state before we can abort.
2966- self.triggerGoodBuild(slave)
2967- result = slave.abort()
2968- self.assertEqual(result, BuilderStatus.ABORTING)
2969+ d = self.slave_helper.triggerGoodBuild(slave)
2970+ d.addCallback(lambda ignored: slave.abort())
2971+ d.addCallback(self.assertEqual, BuilderStatus.ABORTING)
2972+ return d
2973
2974 def test_build(self):
2975 # Calling 'build' with an expected builder type, a good build id,
2976 # valid chroot & filemaps works and returns a BuilderStatus of
2977 # BUILDING.
2978 build_id = 'some-id'
2979- slave = self.getClientSlave()
2980- result = self.triggerGoodBuild(slave, build_id)
2981- self.assertEqual([BuilderStatus.BUILDING, build_id], result)
2982+ slave = self.slave_helper.getClientSlave()
2983+ d = self.slave_helper.triggerGoodBuild(slave, build_id)
2984+ return d.addCallback(
2985+ self.assertEqual, [BuilderStatus.BUILDING, build_id])
2986
2987 def test_clean(self):
2988- slave = self.getClientSlave()
2989+ slave = self.slave_helper.getClientSlave()
2990 # XXX: JonathanLange 2010-09-21: Calling clean() on the slave requires
2991 # it to be in either the WAITING or ABORTED states, and both of these
2992 # states are very difficult to achieve in a test environment. For the
2993@@ -574,57 +811,248 @@
2994 def test_echo(self):
2995 # Calling 'echo' contacts the server which returns the arguments we
2996 # gave it.
2997- self.getServerSlave()
2998- slave = self.getClientSlave()
2999- result = slave.echo('foo', 'bar', 42)
3000- self.assertEqual(['foo', 'bar', 42], result)
3001+ self.slave_helper.getServerSlave()
3002+ slave = self.slave_helper.getClientSlave()
3003+ d = slave.echo('foo', 'bar', 42)
3004+ return d.addCallback(self.assertEqual, ['foo', 'bar', 42])
3005
3006 def test_info(self):
3007 # Calling 'info' gets some information about the slave.
3008- self.getServerSlave()
3009- slave = self.getClientSlave()
3010- result = slave.info()
3011+ self.slave_helper.getServerSlave()
3012+ slave = self.slave_helper.getClientSlave()
3013+ d = slave.info()
3014 # We're testing the hard-coded values, since the version is hard-coded
3015 # into the remote slave, the supported build managers are hard-coded
3016 # into the tac file for the remote slave and config is returned from
3017 # the configuration file.
3018- self.assertEqual(
3019+ return d.addCallback(
3020+ self.assertEqual,
3021 ['1.0',
3022 'i386',
3023 ['sourcepackagerecipe',
3024- 'translation-templates', 'binarypackage', 'debian']],
3025- result)
3026+ 'translation-templates', 'binarypackage', 'debian']])
3027
3028 def test_initial_status(self):
3029 # Calling 'status' returns the current status of the slave. The
3030 # initial status is IDLE.
3031- self.getServerSlave()
3032- slave = self.getClientSlave()
3033- status = slave.status()
3034- self.assertEqual([BuilderStatus.IDLE, ''], status)
3035+ self.slave_helper.getServerSlave()
3036+ slave = self.slave_helper.getClientSlave()
3037+ d = slave.status()
3038+ return d.addCallback(self.assertEqual, [BuilderStatus.IDLE, ''])
3039
3040 def test_status_after_build(self):
3041 # Calling 'status' returns the current status of the slave. After a
3042 # build has been triggered, the status is BUILDING.
3043- slave = self.getClientSlave()
3044+ slave = self.slave_helper.getClientSlave()
3045 build_id = 'status-build-id'
3046- self.triggerGoodBuild(slave, build_id)
3047- status = slave.status()
3048- self.assertEqual([BuilderStatus.BUILDING, build_id], status[:2])
3049- [log_file] = status[2:]
3050- self.assertIsInstance(log_file, xmlrpclib.Binary)
3051+ d = self.slave_helper.triggerGoodBuild(slave, build_id)
3052+ d.addCallback(lambda ignored: slave.status())
3053+ def check_status(status):
3054+ self.assertEqual([BuilderStatus.BUILDING, build_id], status[:2])
3055+ [log_file] = status[2:]
3056+ self.assertIsInstance(log_file, xmlrpclib.Binary)
3057+ return d.addCallback(check_status)
3058
3059 def test_ensurepresent_not_there(self):
3060 # ensurepresent checks to see if a file is there.
3061- self.getServerSlave()
3062- slave = self.getClientSlave()
3063- result = slave.ensurepresent('blahblah', None, None, None)
3064- self.assertEqual([False, 'No URL'], result)
3065+ self.slave_helper.getServerSlave()
3066+ slave = self.slave_helper.getClientSlave()
3067+ d = slave.ensurepresent('blahblah', None, None, None)
3068+ d.addCallback(self.assertEqual, [False, 'No URL'])
3069+ return d
3070
3071 def test_ensurepresent_actually_there(self):
3072 # ensurepresent checks to see if a file is there.
3073- tachandler = self.getServerSlave()
3074- slave = self.getClientSlave()
3075- self.makeCacheFile(tachandler, 'blahblah')
3076- result = slave.ensurepresent('blahblah', None, None, None)
3077- self.assertEqual([True, 'No URL'], result)
3078+ tachandler = self.slave_helper.getServerSlave()
3079+ slave = self.slave_helper.getClientSlave()
3080+ self.slave_helper.makeCacheFile(tachandler, 'blahblah')
3081+ d = slave.ensurepresent('blahblah', None, None, None)
3082+ d.addCallback(self.assertEqual, [True, 'No URL'])
3083+ return d
3084+
3085+ def test_sendFileToSlave_not_there(self):
3086+ self.slave_helper.getServerSlave()
3087+ slave = self.slave_helper.getClientSlave()
3088+ d = slave.sendFileToSlave('blahblah', None, None, None)
3089+ return self.assertFailure(d, CannotFetchFile)
3090+
3091+ def test_sendFileToSlave_actually_there(self):
3092+ tachandler = self.slave_helper.getServerSlave()
3093+ slave = self.slave_helper.getClientSlave()
3094+ self.slave_helper.makeCacheFile(tachandler, 'blahblah')
3095+ d = slave.sendFileToSlave('blahblah', None, None, None)
3096+ def check_present(ignored):
3097+ d = slave.ensurepresent('blahblah', None, None, None)
3098+ return d.addCallback(self.assertEqual, [True, 'No URL'])
3099+ d.addCallback(check_present)
3100+ return d
3101+
3102+ def test_resumeHost_success(self):
3103+ # On a successful resume, resume() fires the returned deferred's
3104+ # callback with an (out, err, code) tuple.
3105+ self.slave_helper.getServerSlave()
3106+ slave = self.slave_helper.getClientSlave()
3107+
3108+ # The configuration testing command-line.
3109+ self.assertEqual(
3110+ 'echo %(vm_host)s', config.builddmaster.vm_resume_command)
3111+
3112+ # On success the response is an (out, err, code) tuple.
3113+ def check_resume_success(response):
3114+ out, err, code = response
3115+ self.assertEqual(os.EX_OK, code)
3116+ # XXX: JonathanLange 2010-09-23: We should instead pass the
3117+ # expected vm_host into the client slave. Not doing this now,
3118+ # since the SlaveHelper is being moved around.
3119+ self.assertEqual("%s\n" % slave._vm_host, out)
3120+ d = slave.resume()
3121+ d.addBoth(check_resume_success)
3122+ return d
3123+
3124+ def test_resumeHost_failure(self):
3125+ # On a failed resume, 'resumeHost' fires the returned deferred's
3126+ # errback with the `ProcessTerminated` failure.
3127+ self.slave_helper.getServerSlave()
3128+ slave = self.slave_helper.getClientSlave()
3129+
3130+ # Override the configuration command-line with one that will fail.
3131+ failed_config = """
3132+ [builddmaster]
3133+ vm_resume_command: test "%(vm_host)s = 'no-sir'"
3134+ """
3135+ config.push('failed_resume_command', failed_config)
3136+ self.addCleanup(config.pop, 'failed_resume_command')
3137+
3138+ # On failures, the response is a twisted `Failure` object containing
3139+ # a tuple.
3140+ def check_resume_failure(failure):
3141+ out, err, code = failure.value
3142+ # The process will exit with a return code of "1".
3143+ self.assertEqual(code, 1)
3144+ d = slave.resume()
3145+ d.addBoth(check_resume_failure)
3146+ return d
3147+
3148+ def test_resumeHost_timeout(self):
3149+ # When a resume times out, 'resumeHost' fires the returned deferred's
3150+ # errback with the `TimeoutError` failure.
3151+ self.slave_helper.getServerSlave()
3152+ slave = self.slave_helper.getClientSlave()
3153+
3154+ # Override the configuration command-line with one that will timeout.
3155+ timeout_config = """
3156+ [builddmaster]
3157+ vm_resume_command: sleep 5
3158+ socket_timeout: 1
3159+ """
3160+ config.push('timeout_resume_command', timeout_config)
3161+ self.addCleanup(config.pop, 'timeout_resume_command')
3162+
3163+ # On timeouts, the response is a twisted `Failure` object containing
3164+ # a `TimeoutError` error.
3165+ def check_resume_timeout(failure):
3166+ self.assertIsInstance(failure, Failure)
3167+ out, err, code = failure.value
3168+ self.assertEqual(code, signal.SIGKILL)
3169+ clock = Clock()
3170+ d = slave.resume(clock=clock)
3171+ # Move the clock beyond the socket_timeout but earlier than the
3172+ # sleep 5. This stops the test having to wait for the timeout.
3173+ # Fast tests FTW!
3174+ clock.advance(2)
3175+ d.addBoth(check_resume_timeout)
3176+ return d
3177+
3178+
3179+class TestSlaveTimeouts(TrialTestCase):
3180+ # Testing that the methods that call callRemote() all time out
3181+ # as required.
3182+
3183+ layer = TwistedLayer
3184+
3185+ def setUp(self):
3186+ super(TestSlaveTimeouts, self).setUp()
3187+ self.slave_helper = SlaveTestHelpers()
3188+ self.slave_helper.setUp()
3189+ self.addCleanup(self.slave_helper.cleanUp)
3190+ self.clock = Clock()
3191+ self.proxy = DeadProxy("url")
3192+ self.slave = self.slave_helper.getClientSlave(
3193+ reactor=self.clock, proxy=self.proxy)
3194+
3195+ def assertCancelled(self, d):
3196+ self.clock.advance(config.builddmaster.socket_timeout + 1)
3197+ return self.assertFailure(d, CancelledError)
3198+
3199+ def test_timeout_abort(self):
3200+ return self.assertCancelled(self.slave.abort())
3201+
3202+ def test_timeout_clean(self):
3203+ return self.assertCancelled(self.slave.clean())
3204+
3205+ def test_timeout_echo(self):
3206+ return self.assertCancelled(self.slave.echo())
3207+
3208+ def test_timeout_info(self):
3209+ return self.assertCancelled(self.slave.info())
3210+
3211+ def test_timeout_status(self):
3212+ return self.assertCancelled(self.slave.status())
3213+
3214+ def test_timeout_ensurepresent(self):
3215+ return self.assertCancelled(
3216+ self.slave.ensurepresent(None, None, None, None))
3217+
3218+ def test_timeout_build(self):
3219+ return self.assertCancelled(
3220+ self.slave.build(None, None, None, None, None))
3221+
3222+
3223+class TestSlaveWithLibrarian(TrialTestCase):
3224+ """Tests that need more of Launchpad to run."""
3225+
3226+ layer = TwistedLaunchpadZopelessLayer
3227+
3228+ def setUp(self):
3229+ super(TestSlaveWithLibrarian, self).setUp()
3230+ self.slave_helper = SlaveTestHelpers()
3231+ self.slave_helper.setUp()
3232+ self.addCleanup(self.slave_helper.cleanUp)
3233+ self.factory = LaunchpadObjectFactory()
3234+ login_as(ANONYMOUS)
3235+ self.addCleanup(logout)
3236+
3237+ def test_ensurepresent_librarian(self):
3238+ # ensurepresent, when given an http URL for a file, will download the
3239+ # file from that URL and report that the file is present and that it
3240+ # was downloaded.
3241+
3242+ # Use the Librarian because it's a "convenient" web server.
3243+ lf = self.factory.makeLibraryFileAlias(
3244+ 'HelloWorld.txt', content="Hello World")
3245+ self.layer.txn.commit()
3246+ self.slave_helper.getServerSlave()
3247+ slave = self.slave_helper.getClientSlave()
3248+ d = slave.ensurepresent(
3249+ lf.content.sha1, lf.http_url, "", "")
3250+ d.addCallback(self.assertEqual, [True, 'Download'])
3251+ return d
3252+
3253+ def test_retrieve_files_from_filecache(self):
3254+ # Files that are present on the slave can be downloaded with a
3255+ # filename made from the sha1 of the content underneath the
3256+ # 'filecache' directory.
3257+ content = "Hello World"
3258+ lf = self.factory.makeLibraryFileAlias(
3259+ 'HelloWorld.txt', content=content)
3260+ self.layer.txn.commit()
3261+ expected_url = '%s/filecache/%s' % (
3262+ self.slave_helper.BASE_URL, lf.content.sha1)
3263+ self.slave_helper.getServerSlave()
3264+ slave = self.slave_helper.getClientSlave()
3265+ d = slave.ensurepresent(
3266+ lf.content.sha1, lf.http_url, "", "")
3267+ def check_file(ignored):
3268+ d = getPage(expected_url.encode('utf8'))
3269+ return d.addCallback(self.assertEqual, content)
3270+ return d.addCallback(check_file)
3271
3272=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
3273--- lib/lp/buildmaster/tests/test_manager.py 2010-09-28 11:05:14 +0000
3274+++ lib/lp/buildmaster/tests/test_manager.py 2010-10-25 19:14:01 +0000
3275@@ -6,6 +6,7 @@
3276 import os
3277 import signal
3278 import time
3279+import xmlrpclib
3280
3281 import transaction
3282
3283@@ -14,9 +15,7 @@
3284 reactor,
3285 task,
3286 )
3287-from twisted.internet.error import ConnectionClosed
3288 from twisted.internet.task import (
3289- Clock,
3290 deferLater,
3291 )
3292 from twisted.python.failure import Failure
3293@@ -30,577 +29,45 @@
3294 ANONYMOUS,
3295 login,
3296 )
3297-from canonical.launchpad.scripts.logger import BufferLogger
3298+from canonical.launchpad.scripts.logger import (
3299+ QuietFakeLogger,
3300+ )
3301 from canonical.testing.layers import (
3302 LaunchpadScriptLayer,
3303- LaunchpadZopelessLayer,
3304+ TwistedLaunchpadZopelessLayer,
3305 TwistedLayer,
3306+ ZopelessDatabaseLayer,
3307 )
3308 from lp.buildmaster.enums import BuildStatus
3309 from lp.buildmaster.interfaces.builder import IBuilderSet
3310 from lp.buildmaster.interfaces.buildqueue import IBuildQueueSet
3311 from lp.buildmaster.manager import (
3312- BaseDispatchResult,
3313- buildd_success_result_map,
3314+ assessFailureCounts,
3315 BuilddManager,
3316- FailDispatchResult,
3317 NewBuildersScanner,
3318- RecordingSlave,
3319- ResetDispatchResult,
3320 SlaveScanner,
3321 )
3322+from lp.buildmaster.model.builder import Builder
3323 from lp.buildmaster.tests.harness import BuilddManagerTestSetup
3324-from lp.buildmaster.tests.mock_slaves import BuildingSlave
3325+from lp.buildmaster.tests.mock_slaves import (
3326+ BrokenSlave,
3327+ BuildingSlave,
3328+ OkSlave,
3329+ )
3330 from lp.registry.interfaces.distribution import IDistributionSet
3331 from lp.soyuz.interfaces.binarypackagebuild import IBinaryPackageBuildSet
3332-from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
3333-from lp.testing import TestCase as LaunchpadTestCase
3334+from lp.testing import TestCaseWithFactory
3335 from lp.testing.factory import LaunchpadObjectFactory
3336 from lp.testing.fakemethod import FakeMethod
3337 from lp.testing.sampledata import BOB_THE_BUILDER_NAME
3338
3339
3340-class TestRecordingSlaves(TrialTestCase):
3341- """Tests for the recording slave class."""
3342- layer = TwistedLayer
3343-
3344- def setUp(self):
3345- """Setup a fresh `RecordingSlave` for tests."""
3346- TrialTestCase.setUp(self)
3347- self.slave = RecordingSlave(
3348- 'foo', 'http://foo:8221/rpc', 'foo.host')
3349-
3350- def test_representation(self):
3351- """`RecordingSlave` has a custom representation.
3352-
3353- It encloses builder name and xmlrpc url for debug purposes.
3354- """
3355- self.assertEqual('<foo:http://foo:8221/rpc>', repr(self.slave))
3356-
3357- def assert_ensurepresent(self, func):
3358- """Helper function to test results from calling ensurepresent."""
3359- self.assertEqual(
3360- [True, 'Download'],
3361- func('boing', 'bar', 'baz'))
3362- self.assertEqual(
3363- [('ensurepresent', ('boing', 'bar', 'baz'))],
3364- self.slave.calls)
3365-
3366- def test_ensurepresent(self):
3367- """`RecordingSlave.ensurepresent` always succeeds.
3368-
3369- It returns the expected succeed code and records the interaction
3370- information for later use.
3371- """
3372- self.assert_ensurepresent(self.slave.ensurepresent)
3373-
3374- def test_sendFileToSlave(self):
3375- """RecordingSlave.sendFileToSlave always succeeds.
3376-
3377- It calls ensurepresent() and hence returns the same results.
3378- """
3379- self.assert_ensurepresent(self.slave.sendFileToSlave)
3380-
3381- def test_build(self):
3382- """`RecordingSlave.build` always succeeds.
3383-
3384- It returns the expected succeed code and records the interaction
3385- information for later use.
3386- """
3387- self.assertEqual(
3388- ['BuilderStatus.BUILDING', 'boing'],
3389- self.slave.build('boing', 'bar', 'baz'))
3390- self.assertEqual(
3391- [('build', ('boing', 'bar', 'baz'))],
3392- self.slave.calls)
3393-
3394- def test_resume(self):
3395- """`RecordingSlave.resume` always returns success."""
3396- # Resume isn't requested in a just-instantiated RecordingSlave.
3397- self.assertFalse(self.slave.resume_requested)
3398-
3399- # When resume is called, it returns the success list and marks
3400- # the slave for resuming.
3401- self.assertEqual(['', '', os.EX_OK], self.slave.resume())
3402- self.assertTrue(self.slave.resume_requested)
3403-
3404- def test_resumeHost_success(self):
3405- # On a successful resume resumeHost() fires the returned deferred
3406- # callback with 'None'.
3407-
3408- # The configuration testing command-line.
3409- self.assertEqual(
3410- 'echo %(vm_host)s', config.builddmaster.vm_resume_command)
3411-
3412- # On success the response is None.
3413- def check_resume_success(response):
3414- out, err, code = response
3415- self.assertEqual(os.EX_OK, code)
3416- self.assertEqual("%s\n" % self.slave.vm_host, out)
3417- d = self.slave.resumeSlave()
3418- d.addBoth(check_resume_success)
3419- return d
3420-
3421- def test_resumeHost_failure(self):
3422- # On a failed resume, 'resumeHost' fires the returned deferred
3423- # errorback with the `ProcessTerminated` failure.
3424-
3425- # Override the configuration command-line with one that will fail.
3426- failed_config = """
3427- [builddmaster]
3428- vm_resume_command: test "%(vm_host)s = 'no-sir'"
3429- """
3430- config.push('failed_resume_command', failed_config)
3431- self.addCleanup(config.pop, 'failed_resume_command')
3432-
3433- # On failures, the response is a twisted `Failure` object containing
3434- # a tuple.
3435- def check_resume_failure(failure):
3436- out, err, code = failure.value
3437- # The process will exit with a return code of "1".
3438- self.assertEqual(code, 1)
3439- d = self.slave.resumeSlave()
3440- d.addBoth(check_resume_failure)
3441- return d
3442-
3443- def test_resumeHost_timeout(self):
3444- # On a resume timeouts, 'resumeHost' fires the returned deferred
3445- # errorback with the `TimeoutError` failure.
3446-
3447- # Override the configuration command-line with one that will timeout.
3448- timeout_config = """
3449- [builddmaster]
3450- vm_resume_command: sleep 5
3451- socket_timeout: 1
3452- """
3453- config.push('timeout_resume_command', timeout_config)
3454- self.addCleanup(config.pop, 'timeout_resume_command')
3455-
3456- # On timeouts, the response is a twisted `Failure` object containing
3457- # a `TimeoutError` error.
3458- def check_resume_timeout(failure):
3459- self.assertIsInstance(failure, Failure)
3460- out, err, code = failure.value
3461- self.assertEqual(code, signal.SIGKILL)
3462- clock = Clock()
3463- d = self.slave.resumeSlave(clock=clock)
3464- # Move the clock beyond the socket_timeout but earlier than the
3465- # sleep 5. This stops the test having to wait for the timeout.
3466- # Fast tests FTW!
3467- clock.advance(2)
3468- d.addBoth(check_resume_timeout)
3469- return d
3470-
3471-
3472-class TestingXMLRPCProxy:
3473- """This class mimics a twisted XMLRPC Proxy class."""
3474-
3475- def __init__(self, failure_info=None):
3476- self.calls = []
3477- self.failure_info = failure_info
3478- self.works = failure_info is None
3479-
3480- def callRemote(self, *args):
3481- self.calls.append(args)
3482- if self.works:
3483- result = buildd_success_result_map.get(args[0])
3484- else:
3485- result = 'boing'
3486- return defer.succeed([result, self.failure_info])
3487-
3488-
3489-class TestingResetDispatchResult(ResetDispatchResult):
3490- """Override the evaluation method to simply annotate the call."""
3491-
3492- def __init__(self, slave, info=None):
3493- ResetDispatchResult.__init__(self, slave, info)
3494- self.processed = False
3495-
3496- def __call__(self):
3497- self.processed = True
3498-
3499-
3500-class TestingFailDispatchResult(FailDispatchResult):
3501- """Override the evaluation method to simply annotate the call."""
3502-
3503- def __init__(self, slave, info=None):
3504- FailDispatchResult.__init__(self, slave, info)
3505- self.processed = False
3506-
3507- def __call__(self):
3508- self.processed = True
3509-
3510-
3511-class TestingSlaveScanner(SlaveScanner):
3512- """Override the dispatch result factories """
3513-
3514- reset_result = TestingResetDispatchResult
3515- fail_result = TestingFailDispatchResult
3516-
3517-
3518-class TestSlaveScanner(TrialTestCase):
3519- """Tests for the actual build slave manager."""
3520- layer = LaunchpadZopelessLayer
3521-
3522- def setUp(self):
3523- TrialTestCase.setUp(self)
3524- self.manager = TestingSlaveScanner(
3525- BOB_THE_BUILDER_NAME, BufferLogger())
3526-
3527- self.fake_builder_url = 'http://bob.buildd:8221/'
3528- self.fake_builder_host = 'bob.host'
3529-
3530- # We will use an instrumented SlaveScanner instance for tests in
3531- # this context.
3532-
3533- # Stop cyclic execution and record the end of the cycle.
3534- self.stopped = False
3535-
3536- def testNextCycle():
3537- self.stopped = True
3538-
3539- self.manager.scheduleNextScanCycle = testNextCycle
3540-
3541- # Return the testing Proxy version.
3542- self.test_proxy = TestingXMLRPCProxy()
3543-
3544- def testGetProxyForSlave(slave):
3545- return self.test_proxy
3546- self.manager._getProxyForSlave = testGetProxyForSlave
3547-
3548- # Deactivate the 'scan' method.
3549- def testScan():
3550- pass
3551- self.manager.scan = testScan
3552-
3553- # Stop automatic collection of dispatching results.
3554- def testslaveConversationEnded():
3555- pass
3556- self._realslaveConversationEnded = self.manager.slaveConversationEnded
3557- self.manager.slaveConversationEnded = testslaveConversationEnded
3558-
3559- def assertIsDispatchReset(self, result):
3560- self.assertTrue(
3561- isinstance(result, TestingResetDispatchResult),
3562- 'Dispatch failure did not result in a ResetBuildResult object')
3563-
3564- def assertIsDispatchFail(self, result):
3565- self.assertTrue(
3566- isinstance(result, TestingFailDispatchResult),
3567- 'Dispatch failure did not result in a FailBuildResult object')
3568-
3569- def test_checkResume(self):
3570- """`SlaveScanner.checkResume` is chained after resume requests.
3571-
3572- If the resume request succeeds it returns None, otherwise it returns
3573- a `ResetBuildResult` (the one in the test context) that will be
3574- collected and evaluated later.
3575-
3576- See `RecordingSlave.resumeHost` for more information about the resume
3577- result contents.
3578- """
3579- slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
3580-
3581- successful_response = ['', '', os.EX_OK]
3582- result = self.manager.checkResume(successful_response, slave)
3583- self.assertEqual(
3584- None, result, 'Successful resume checks should return None')
3585-
3586- failed_response = ['stdout', 'stderr', 1]
3587- result = self.manager.checkResume(failed_response, slave)
3588- self.assertIsDispatchReset(result)
3589- self.assertEqual(
3590- '<foo:http://foo.buildd:8221/> reset failure', repr(result))
3591- self.assertEqual(
3592- result.info, "stdout\nstderr")
3593-
3594- def test_fail_to_resume_slave_resets_slave(self):
3595- # If an attempt to resume and dispatch a slave fails, we reset the
3596- # slave by calling self.reset_result(slave)().
3597-
3598- reset_result_calls = []
3599-
3600- class LoggingResetResult(BaseDispatchResult):
3601- """A DispatchResult that logs calls to itself.
3602-
3603- This *must* subclass BaseDispatchResult, otherwise finishCycle()
3604- won't treat it like a dispatch result.
3605- """
3606-
3607- def __init__(self, slave, info=None):
3608- self.slave = slave
3609-
3610- def __call__(self):
3611- reset_result_calls.append(self.slave)
3612-
3613- # Make a failing slave that is requesting a resume.
3614- slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
3615- slave.resume_requested = True
3616- slave.resumeSlave = lambda: deferLater(
3617- reactor, 0, defer.fail, Failure(('out', 'err', 1)))
3618-
3619- # Make the manager log the reset result calls.
3620- self.manager.reset_result = LoggingResetResult
3621-
3622- # We only care about this one slave. Reset the list of manager
3623- # deferreds in case setUp did something unexpected.
3624- self.manager._deferred_list = []
3625-
3626- # Here, we're patching the slaveConversationEnded method so we can
3627- # get an extra callback at the end of it, so we can
3628- # verify that the reset_result was really called.
3629- def _slaveConversationEnded():
3630- d = self._realslaveConversationEnded()
3631- return d.addCallback(
3632- lambda ignored: self.assertEqual([slave], reset_result_calls))
3633- self.manager.slaveConversationEnded = _slaveConversationEnded
3634-
3635- self.manager.resumeAndDispatch(slave)
3636-
3637- def test_failed_to_resume_slave_ready_for_reset(self):
3638- # When a slave fails to resume, the manager has a Deferred in its
3639- # Deferred list that is ready to fire with a ResetDispatchResult.
3640-
3641- # Make a failing slave that is requesting a resume.
3642- slave = RecordingSlave('foo', 'http://foo.buildd:8221/', 'foo.host')
3643- slave.resume_requested = True
3644- slave.resumeSlave = lambda: defer.fail(Failure(('out', 'err', 1)))
3645-
3646- # We only care about this one slave. Reset the list of manager
3647- # deferreds in case setUp did something unexpected.
3648- self.manager._deferred_list = []
3649- # Restore the slaveConversationEnded method. It's very relevant to
3650- # this test.
3651- self.manager.slaveConversationEnded = self._realslaveConversationEnded
3652- self.manager.resumeAndDispatch(slave)
3653- [d] = self.manager._deferred_list
3654-
3655- # The Deferred for our failing slave should be ready to fire
3656- # successfully with a ResetDispatchResult.
3657- def check_result(result):
3658- self.assertIsInstance(result, ResetDispatchResult)
3659- self.assertEqual(slave, result.slave)
3660- self.assertFalse(result.processed)
3661- return d.addCallback(check_result)
3662-
3663- def _setUpSlaveAndBuilder(self, builder_failure_count=None,
3664- job_failure_count=None):
3665- # Helper function to set up a builder and its recording slave.
3666- if builder_failure_count is None:
3667- builder_failure_count = 0
3668- if job_failure_count is None:
3669- job_failure_count = 0
3670- slave = RecordingSlave(
3671- BOB_THE_BUILDER_NAME, self.fake_builder_url,
3672- self.fake_builder_host)
3673- bob_builder = getUtility(IBuilderSet)[slave.name]
3674- bob_builder.failure_count = builder_failure_count
3675- bob_builder.getCurrentBuildFarmJob().failure_count = job_failure_count
3676- return slave, bob_builder
3677-
3678- def test_checkDispatch_success(self):
3679- # SlaveScanner.checkDispatch returns None for a successful
3680- # dispatch.
3681-
3682- """
3683- If the dispatch request fails or a unknown method is given, it
3684- returns a `FailDispatchResult` (in the test context) that will
3685- be evaluated later.
3686-
3687- Builders will be marked as failed if any of the following response
3688- categories is received.
3689-
3690- * Legitimate slave failures: when the response is a list with 2
3691- elements but the first element ('status') does not correspond to
3692- the expected 'success' result. See `buildd_success_result_map`.
3693-
3694- * Unexpected (code) failures: when the given 'method' is unknown
3695- or the response isn't a 2-element list or Failure instance.
3696-
3697- Communication failures (a twisted `Failure` instance) will simply
3698- cause the builder to be reset, a `ResetDispatchResult` object is
3699- returned. In other words, network failures are ignored in this
3700- stage, broken builders will be identified and marked as so
3701- during 'scan()' stage.
3702-
3703- On success dispatching it returns None.
3704- """
3705- slave, bob_builder = self._setUpSlaveAndBuilder(
3706- builder_failure_count=0, job_failure_count=0)
3707-
3708- # Successful legitimate response, None is returned.
3709- successful_response = [
3710- buildd_success_result_map.get('ensurepresent'), 'cool builder']
3711- result = self.manager.checkDispatch(
3712- successful_response, 'ensurepresent', slave)
3713- self.assertEqual(
3714- None, result, 'Successful dispatch checks should return None')
3715-
3716- def test_checkDispatch_first_fail(self):
3717- # Failed legitimate response, results in FailDispatchResult and
3718- # failure_count on the job and the builder are both incremented.
3719- slave, bob_builder = self._setUpSlaveAndBuilder(
3720- builder_failure_count=0, job_failure_count=0)
3721-
3722- failed_response = [False, 'uncool builder']
3723- result = self.manager.checkDispatch(
3724- failed_response, 'ensurepresent', slave)
3725- self.assertIsDispatchFail(result)
3726- self.assertEqual(
3727- repr(result),
3728- '<bob:%s> failure (uncool builder)' % self.fake_builder_url)
3729- self.assertEqual(1, bob_builder.failure_count)
3730- self.assertEqual(
3731- 1, bob_builder.getCurrentBuildFarmJob().failure_count)
3732-
3733- def test_checkDispatch_second_reset_fail_by_builder(self):
3734- # Twisted Failure response, results in a `FailDispatchResult`.
3735- slave, bob_builder = self._setUpSlaveAndBuilder(
3736- builder_failure_count=1, job_failure_count=0)
3737-
3738- twisted_failure = Failure(ConnectionClosed('Boom!'))
3739- result = self.manager.checkDispatch(
3740- twisted_failure, 'ensurepresent', slave)
3741- self.assertIsDispatchFail(result)
3742- self.assertEqual(
3743- '<bob:%s> failure (None)' % self.fake_builder_url, repr(result))
3744- self.assertEqual(2, bob_builder.failure_count)
3745- self.assertEqual(
3746- 1, bob_builder.getCurrentBuildFarmJob().failure_count)
3747-
3748- def test_checkDispatch_second_comms_fail_by_builder(self):
3749- # Unexpected response, results in a `FailDispatchResult`.
3750- slave, bob_builder = self._setUpSlaveAndBuilder(
3751- builder_failure_count=1, job_failure_count=0)
3752-
3753- unexpected_response = [1, 2, 3]
3754- result = self.manager.checkDispatch(
3755- unexpected_response, 'build', slave)
3756- self.assertIsDispatchFail(result)
3757- self.assertEqual(
3758- '<bob:%s> failure '
3759- '(Unexpected response: [1, 2, 3])' % self.fake_builder_url,
3760- repr(result))
3761- self.assertEqual(2, bob_builder.failure_count)
3762- self.assertEqual(
3763- 1, bob_builder.getCurrentBuildFarmJob().failure_count)
3764-
3765- def test_checkDispatch_second_comms_fail_by_job(self):
3766- # Unknown method was given, results in a `FailDispatchResult`.
3767- # This could be caused by a faulty job which would fail the job.
3768- slave, bob_builder = self._setUpSlaveAndBuilder(
3769- builder_failure_count=0, job_failure_count=1)
3770-
3771- successful_response = [
3772- buildd_success_result_map.get('ensurepresent'), 'cool builder']
3773- result = self.manager.checkDispatch(
3774- successful_response, 'unknown-method', slave)
3775- self.assertIsDispatchFail(result)
3776- self.assertEqual(
3777- '<bob:%s> failure '
3778- '(Unknown slave method: unknown-method)' % self.fake_builder_url,
3779- repr(result))
3780- self.assertEqual(1, bob_builder.failure_count)
3781- self.assertEqual(
3782- 2, bob_builder.getCurrentBuildFarmJob().failure_count)
3783-
3784- def test_initiateDispatch(self):
3785- """Check `dispatchBuild` in various scenarios.
3786-
3787- When there are no recording slaves (i.e. no build got dispatched
3788- in scan()) it simply finishes the cycle.
3789-
3790- When there is a recording slave with pending slave calls, they are
3791- performed and if they all succeed the cycle is finished with no
3792- errors.
3793-
3794- On slave call failure the chain is stopped immediately and a
3795- FailDispatchResult is collected while finishing the cycle.
3796- """
3797- def check_no_events(results):
3798- errors = [
3799- r for s, r in results if isinstance(r, BaseDispatchResult)]
3800- self.assertEqual(0, len(errors))
3801-
3802- def check_events(results):
3803- [error] = [r for s, r in results if r is not None]
3804- self.assertEqual(
3805- '<bob:%s> failure (very broken slave)'
3806- % self.fake_builder_url,
3807- repr(error))
3808- self.assertTrue(error.processed)
3809-
3810- def _wait_on_deferreds_then_check_no_events():
3811- dl = self._realslaveConversationEnded()
3812- dl.addCallback(check_no_events)
3813-
3814- def _wait_on_deferreds_then_check_events():
3815- dl = self._realslaveConversationEnded()
3816- dl.addCallback(check_events)
3817-
3818- # A functional slave charged with some interactions.
3819- slave = RecordingSlave(
3820- BOB_THE_BUILDER_NAME, self.fake_builder_url,
3821- self.fake_builder_host)
3822- slave.ensurepresent('arg1', 'arg2', 'arg3')
3823- slave.build('arg1', 'arg2', 'arg3')
3824-
3825- # If the previous step (resuming) has failed nothing gets dispatched.
3826- reset_result = ResetDispatchResult(slave)
3827- result = self.manager.initiateDispatch(reset_result, slave)
3828- self.assertTrue(result is reset_result)
3829- self.assertFalse(slave.resume_requested)
3830- self.assertEqual(0, len(self.manager._deferred_list))
3831-
3832- # Operation with the default (funcional slave), no resets or
3833- # failures results are triggered.
3834- slave.resume()
3835- result = self.manager.initiateDispatch(None, slave)
3836- self.assertEqual(None, result)
3837- self.assertTrue(slave.resume_requested)
3838- self.assertEqual(
3839- [('ensurepresent', 'arg1', 'arg2', 'arg3'),
3840- ('build', 'arg1', 'arg2', 'arg3')],
3841- self.test_proxy.calls)
3842- self.assertEqual(2, len(self.manager._deferred_list))
3843-
3844- # Monkey patch the slaveConversationEnded method so we can chain a
3845- # callback to check the end of the result chain.
3846- self.manager.slaveConversationEnded = \
3847- _wait_on_deferreds_then_check_no_events
3848- events = self.manager.slaveConversationEnded()
3849-
3850- # Create a broken slave and insert interaction that will
3851- # cause the builder to be marked as fail.
3852- self.test_proxy = TestingXMLRPCProxy('very broken slave')
3853- slave = RecordingSlave(
3854- BOB_THE_BUILDER_NAME, self.fake_builder_url,
3855- self.fake_builder_host)
3856- slave.ensurepresent('arg1', 'arg2', 'arg3')
3857- slave.build('arg1', 'arg2', 'arg3')
3858-
3859- result = self.manager.initiateDispatch(None, slave)
3860- self.assertEqual(None, result)
3861- self.assertEqual(3, len(self.manager._deferred_list))
3862- self.assertEqual(
3863- [('ensurepresent', 'arg1', 'arg2', 'arg3')],
3864- self.test_proxy.calls)
3865-
3866- # Monkey patch the slaveConversationEnded method so we can chain a
3867- # callback to check the end of the result chain.
3868- self.manager.slaveConversationEnded = \
3869- _wait_on_deferreds_then_check_events
3870- events = self.manager.slaveConversationEnded()
3871-
3872- return events
3873-
3874-
3875 class TestSlaveScannerScan(TrialTestCase):
3876 """Tests `SlaveScanner.scan` method.
3877
3878 This method uses the old framework for scanning and dispatching builds.
3879 """
3880- layer = LaunchpadZopelessLayer
3881+ layer = TwistedLaunchpadZopelessLayer
3882
3883 def setUp(self):
3884 """Setup TwistedLayer, TrialTestCase and BuilddSlaveTest.
3885@@ -608,19 +75,18 @@
3886 Also adjust the sampledata in a way a build can be dispatched to
3887 'bob' builder.
3888 """
3889+ from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
3890 TwistedLayer.testSetUp()
3891 TrialTestCase.setUp(self)
3892 self.slave = BuilddSlaveTestSetup()
3893 self.slave.setUp()
3894
3895 # Creating the required chroots needed for dispatching.
3896- login('foo.bar@canonical.com')
3897 test_publisher = SoyuzTestPublisher()
3898 ubuntu = getUtility(IDistributionSet).getByName('ubuntu')
3899 hoary = ubuntu.getSeries('hoary')
3900 test_publisher.setUpDefaultDistroSeries(hoary)
3901 test_publisher.addFakeChroots()
3902- login(ANONYMOUS)
3903
3904 def tearDown(self):
3905 self.slave.tearDown()
3906@@ -628,8 +94,7 @@
3907 TwistedLayer.testTearDown()
3908
3909 def _resetBuilder(self, builder):
3910- """Reset the given builder and it's job."""
3911- login('foo.bar@canonical.com')
3912+ """Reset the given builder and its job."""
3913
3914 builder.builderok = True
3915 job = builder.currentjob
3916@@ -637,7 +102,6 @@
3917 job.reset()
3918
3919 transaction.commit()
3920- login(ANONYMOUS)
3921
3922 def assertBuildingJob(self, job, builder, logtail=None):
3923 """Assert the given job is building on the given builder."""
3924@@ -653,55 +117,25 @@
3925 self.assertEqual(build.status, BuildStatus.BUILDING)
3926 self.assertEqual(job.logtail, logtail)
3927
3928- def _getManager(self):
3929+ def _getScanner(self, builder_name=None):
3930 """Instantiate a SlaveScanner object.
3931
3932 Replace its default logging handler by a testing version.
3933 """
3934- manager = SlaveScanner(BOB_THE_BUILDER_NAME, BufferLogger())
3935- manager.logger.name = 'slave-scanner'
3936+ if builder_name is None:
3937+ builder_name = BOB_THE_BUILDER_NAME
3938+ scanner = SlaveScanner(builder_name, QuietFakeLogger())
3939+ scanner.logger.name = 'slave-scanner'
3940
3941- return manager
3942+ return scanner
3943
3944 def _checkDispatch(self, slave, builder):
3945- """`SlaveScanner.scan` returns a `RecordingSlave`.
3946-
3947- The single slave returned should match the given builder and
3948- contain interactions that should be performed asynchronously for
3949- properly dispatching the sampledata job.
3950- """
3951- self.assertFalse(
3952- slave is None, "Unexpected recording_slaves.")
3953-
3954- self.assertEqual(slave.name, builder.name)
3955- self.assertEqual(slave.url, builder.url)
3956- self.assertEqual(slave.vm_host, builder.vm_host)
3957+ # SlaveScanner.scan returns a slave when a dispatch was
3958+ # successful. We also check that the builder has a job on it.
3959+
3960+ self.assertTrue(slave is not None, "Expected a slave.")
3961 self.assertEqual(0, builder.failure_count)
3962-
3963- self.assertEqual(
3964- [('ensurepresent',
3965- ('0feca720e2c29dafb2c900713ba560e03b758711',
3966- 'http://localhost:58000/93/fake_chroot.tar.gz',
3967- '', '')),
3968- ('ensurepresent',
3969- ('4e3961baf4f56fdbc95d0dd47f3c5bc275da8a33',
3970- 'http://localhost:58000/43/alsa-utils_1.0.9a-4ubuntu1.dsc',
3971- '', '')),
3972- ('build',
3973- ('6358a89e2215e19b02bf91e2e4d009640fae5cf8',
3974- 'binarypackage', '0feca720e2c29dafb2c900713ba560e03b758711',
3975- {'alsa-utils_1.0.9a-4ubuntu1.dsc':
3976- '4e3961baf4f56fdbc95d0dd47f3c5bc275da8a33'},
3977- {'arch_indep': True,
3978- 'arch_tag': 'i386',
3979- 'archive_private': False,
3980- 'archive_purpose': 'PRIMARY',
3981- 'archives':
3982- ['deb http://ftpmaster.internal/ubuntu hoary main'],
3983- 'build_debug_symbols': False,
3984- 'ogrecomponent': 'main',
3985- 'suite': u'hoary'}))],
3986- slave.calls, "Job was not properly dispatched.")
3987+ self.assertTrue(builder.currentjob is not None)
3988
3989 def testScanDispatchForResetBuilder(self):
3990 # A job gets dispatched to the sampledata builder after it's reset.
3991@@ -709,26 +143,27 @@
3992 # Reset sampledata builder.
3993 builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
3994 self._resetBuilder(builder)
3995+ builder.setSlaveForTesting(OkSlave())
3996 # Set this to 1 here so that _checkDispatch can make sure it's
3997 # reset to 0 after a successful dispatch.
3998 builder.failure_count = 1
3999
4000 # Run 'scan' and check its result.
4001- LaunchpadZopelessLayer.switchDbUser(config.builddmaster.dbuser)
4002- manager = self._getManager()
4003- d = defer.maybeDeferred(manager.scan)
4004+ self.layer.txn.commit()
4005+ self.layer.switchDbUser(config.builddmaster.dbuser)
4006+ scanner = self._getScanner()
4007+ d = defer.maybeDeferred(scanner.scan)
4008 d.addCallback(self._checkDispatch, builder)
4009 return d
4010
4011- def _checkNoDispatch(self, recording_slave, builder):
4012+ def _checkNoDispatch(self, slave, builder):
4013 """Assert that no dispatch has occurred.
4014
4015- 'recording_slave' is None, so no interactions would be passed
4016+ 'slave' is None, so no interactions would be passed
4017 to the asynchronous dispatcher and the builder remained active
4018 and IDLE.
4019 """
4020- self.assertTrue(
4021- recording_slave is None, "Unexpected recording_slave.")
4022+ self.assertTrue(slave is None, "Unexpected slave.")
4023
4024 builder = getUtility(IBuilderSet).get(builder.id)
4025 self.assertTrue(builder.builderok)
4026@@ -753,9 +188,9 @@
4027 login(ANONYMOUS)
4028
4029 # Run 'scan' and check its result.
4030- LaunchpadZopelessLayer.switchDbUser(config.builddmaster.dbuser)
4031- manager = self._getManager()
4032- d = defer.maybeDeferred(manager.scan)
4033+ self.layer.switchDbUser(config.builddmaster.dbuser)
4034+ scanner = self._getScanner()
4035+ d = defer.maybeDeferred(scanner.singleCycle)
4036 d.addCallback(self._checkNoDispatch, builder)
4037 return d
4038
4039@@ -793,9 +228,9 @@
4040 login(ANONYMOUS)
4041
4042 # Run 'scan' and check its result.
4043- LaunchpadZopelessLayer.switchDbUser(config.builddmaster.dbuser)
4044- manager = self._getManager()
4045- d = defer.maybeDeferred(manager.scan)
4046+ self.layer.switchDbUser(config.builddmaster.dbuser)
4047+ scanner = self._getScanner()
4048+ d = defer.maybeDeferred(scanner.scan)
4049 d.addCallback(self._checkJobRescued, builder, job)
4050 return d
4051
4052@@ -814,8 +249,6 @@
4053 self.assertBuildingJob(job, builder, logtail='This is a build log')
4054
4055 def testScanUpdatesBuildingJobs(self):
4056- # The job assigned to a broken builder is rescued.
4057-
4058 # Enable sampledata builder attached to an appropriate testing
4059 # slave. It will respond as if it was building the sampledata job.
4060 builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
4061@@ -830,188 +263,174 @@
4062 self.assertBuildingJob(job, builder)
4063
4064 # Run 'scan' and check its result.
4065- LaunchpadZopelessLayer.switchDbUser(config.builddmaster.dbuser)
4066- manager = self._getManager()
4067- d = defer.maybeDeferred(manager.scan)
4068+ self.layer.switchDbUser(config.builddmaster.dbuser)
4069+ scanner = self._getScanner()
4070+ d = defer.maybeDeferred(scanner.scan)
4071 d.addCallback(self._checkJobUpdated, builder, job)
4072 return d
4073
4074- def test_scan_assesses_failure_exceptions(self):
4075+ def test_scan_with_nothing_to_dispatch(self):
4076+ factory = LaunchpadObjectFactory()
4077+ builder = factory.makeBuilder()
4078+ builder.setSlaveForTesting(OkSlave())
4079+ scanner = self._getScanner(builder_name=builder.name)
4080+ d = scanner.scan()
4081+ return d.addCallback(self._checkNoDispatch, builder)
4082+
4083+ def test_scan_with_manual_builder(self):
4084+ # Reset sampledata builder.
4085+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
4086+ self._resetBuilder(builder)
4087+ builder.setSlaveForTesting(OkSlave())
4088+ builder.manual = True
4089+ scanner = self._getScanner()
4090+ d = scanner.scan()
4091+ d.addCallback(self._checkNoDispatch, builder)
4092+ return d
4093+
4094+ def test_scan_with_not_ok_builder(self):
4095+ # Reset sampledata builder.
4096+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
4097+ self._resetBuilder(builder)
4098+ builder.setSlaveForTesting(OkSlave())
4099+ builder.builderok = False
4100+ scanner = self._getScanner()
4101+ d = scanner.scan()
4102+ # Because the builder is not ok, we can't use _checkNoDispatch.
4103+ d.addCallback(
4104+ lambda ignored: self.assertIdentical(None, builder.currentjob))
4105+ return d
4106+
4107+ def test_scan_of_broken_slave(self):
4108+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
4109+ self._resetBuilder(builder)
4110+ builder.setSlaveForTesting(BrokenSlave())
4111+ builder.failure_count = 0
4112+ scanner = self._getScanner(builder_name=builder.name)
4113+ d = scanner.scan()
4114+ return self.assertFailure(d, xmlrpclib.Fault)
4115+
4116+ def _assertFailureCounting(self, builder_count, job_count,
4117+ expected_builder_count, expected_job_count):
4118 # If scan() fails with an exception, failure_counts should be
4119- # incremented and tested.
4120+ # incremented. What we do with the results of the failure
4121+ # counts is tested below separately; this test just makes sure that
4122+ # scan() is setting the counts.
4123 def failing_scan():
4124- raise Exception("fake exception")
4125- manager = self._getManager()
4126- manager.scan = failing_scan
4127- manager.scheduleNextScanCycle = FakeMethod()
4128+ return defer.fail(Exception("fake exception"))
4129+ scanner = self._getScanner()
4130+ scanner.scan = failing_scan
4131 from lp.buildmaster import manager as manager_module
4132 self.patch(manager_module, 'assessFailureCounts', FakeMethod())
4133- builder = getUtility(IBuilderSet)[manager.builder_name]
4134-
4135- # Failure counts start at zero.
4136- self.assertEqual(0, builder.failure_count)
4137- self.assertEqual(
4138- 0, builder.currentjob.specific_job.build.failure_count)
4139-
4140- # startCycle() calls scan() which is our fake one that throws an
4141+ builder = getUtility(IBuilderSet)[scanner.builder_name]
4142+
4143+ builder.failure_count = builder_count
4144+ builder.currentjob.specific_job.build.failure_count = job_count
4145+ # The _scanFailed() calls abort, so make sure our existing
4146+ # failure counts are persisted.
4147+ self.layer.txn.commit()
4148+
4149+ # singleCycle() calls scan() which is our fake one that throws an
4150 # exception.
4151- manager.startCycle()
4152+ d = scanner.singleCycle()
4153
4154 # Failure counts should be updated, and the assessment method
4155- # should have been called.
4156- self.assertEqual(1, builder.failure_count)
4157- self.assertEqual(
4158- 1, builder.currentjob.specific_job.build.failure_count)
4159-
4160- self.assertEqual(
4161- 1, manager_module.assessFailureCounts.call_count)
4162-
4163-
4164-class TestDispatchResult(LaunchpadTestCase):
4165- """Tests `BaseDispatchResult` variations.
4166-
4167- Variations of `BaseDispatchResult` when evaluated update the database
4168- information according to their purpose.
4169- """
4170-
4171- layer = LaunchpadZopelessLayer
4172-
4173- def _getBuilder(self, name):
4174- """Return a fixed `IBuilder` instance from the sampledata.
4175-
4176- Ensure it's active (builderok=True) and it has a in-progress job.
4177- """
4178- login('foo.bar@canonical.com')
4179-
4180- builder = getUtility(IBuilderSet)[name]
4181- builder.builderok = True
4182-
4183- job = builder.currentjob
4184- build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(job)
4185- self.assertEqual(
4186- 'i386 build of mozilla-firefox 0.9 in ubuntu hoary RELEASE',
4187- build.title)
4188-
4189- self.assertEqual('BUILDING', build.status.name)
4190- self.assertNotEqual(None, job.builder)
4191- self.assertNotEqual(None, job.date_started)
4192- self.assertNotEqual(None, job.logtail)
4193-
4194- transaction.commit()
4195-
4196- return builder, job.id
4197-
4198- def assertBuildqueueIsClean(self, buildqueue):
4199- # Check that the buildqueue is reset.
4200- self.assertEqual(None, buildqueue.builder)
4201- self.assertEqual(None, buildqueue.date_started)
4202- self.assertEqual(None, buildqueue.logtail)
4203-
4204- def assertBuilderIsClean(self, builder):
4205- # Check that the builder is ready for a new build.
4206- self.assertTrue(builder.builderok)
4207- self.assertIs(None, builder.failnotes)
4208- self.assertIs(None, builder.currentjob)
4209-
4210- def testResetDispatchResult(self):
4211- # Test that `ResetDispatchResult` resets the builder and job.
4212- builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
4213- buildqueue_id = builder.currentjob.id
4214- builder.builderok = True
4215- builder.failure_count = 1
4216-
4217- # Setup a interaction to satisfy 'write_transaction' decorator.
4218- login(ANONYMOUS)
4219- slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
4220- result = ResetDispatchResult(slave)
4221- result()
4222-
4223- buildqueue = getUtility(IBuildQueueSet).get(buildqueue_id)
4224- self.assertBuildqueueIsClean(buildqueue)
4225-
4226- # XXX Julian
4227- # Disabled test until bug 586362 is fixed.
4228- #self.assertFalse(builder.builderok)
4229- self.assertBuilderIsClean(builder)
4230-
4231- def testFailDispatchResult(self):
4232- # Test that `FailDispatchResult` calls assessFailureCounts() so
4233- # that we know the builders and jobs are failed as necessary
4234- # when a FailDispatchResult is called at the end of the dispatch
4235- # chain.
4236- builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
4237-
4238- # Setup a interaction to satisfy 'write_transaction' decorator.
4239- login(ANONYMOUS)
4240- slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
4241- result = FailDispatchResult(slave, 'does not work!')
4242- result.assessFailureCounts = FakeMethod()
4243- self.assertEqual(0, result.assessFailureCounts.call_count)
4244- result()
4245- self.assertEqual(1, result.assessFailureCounts.call_count)
4246-
4247- def _setup_failing_dispatch_result(self):
4248- # assessFailureCounts should fail jobs or builders depending on
4249- # whether it sees the failure_counts on each increasing.
4250- builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
4251- slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
4252- result = FailDispatchResult(slave, 'does not work!')
4253- return builder, result
4254-
4255- def test_assessFailureCounts_equal_failures(self):
4256- # Basic case where the failure counts are equal and the job is
4257- # reset to try again & the builder is not failed.
4258- builder, result = self._setup_failing_dispatch_result()
4259- buildqueue = builder.currentjob
4260- build = buildqueue.specific_job.build
4261- builder.failure_count = 2
4262- build.failure_count = 2
4263- result.assessFailureCounts()
4264-
4265- self.assertBuilderIsClean(builder)
4266- self.assertEqual('NEEDSBUILD', build.status.name)
4267- self.assertBuildqueueIsClean(buildqueue)
4268-
4269- def test_assessFailureCounts_job_failed(self):
4270- # Case where the job has failed more than the builder.
4271- builder, result = self._setup_failing_dispatch_result()
4272- buildqueue = builder.currentjob
4273- build = buildqueue.specific_job.build
4274- build.failure_count = 2
4275- builder.failure_count = 1
4276- result.assessFailureCounts()
4277-
4278- self.assertBuilderIsClean(builder)
4279- self.assertEqual('FAILEDTOBUILD', build.status.name)
4280- # The buildqueue should have been removed entirely.
4281- self.assertEqual(
4282- None, getUtility(IBuildQueueSet).getByBuilder(builder),
4283- "Buildqueue was not removed when it should be.")
4284-
4285- def test_assessFailureCounts_builder_failed(self):
4286- # Case where the builder has failed more than the job.
4287- builder, result = self._setup_failing_dispatch_result()
4288- buildqueue = builder.currentjob
4289- build = buildqueue.specific_job.build
4290- build.failure_count = 2
4291- builder.failure_count = 3
4292- result.assessFailureCounts()
4293-
4294- self.assertFalse(builder.builderok)
4295- self.assertEqual('does not work!', builder.failnotes)
4296- self.assertTrue(builder.currentjob is None)
4297- self.assertEqual('NEEDSBUILD', build.status.name)
4298- self.assertBuildqueueIsClean(buildqueue)
4299+ # should have been called. The actual behaviour is tested below
4300+ # in TestFailureAssessments.
4301+ def got_scan(ignored):
4302+ self.assertEqual(expected_builder_count, builder.failure_count)
4303+ self.assertEqual(
4304+ expected_job_count,
4305+ builder.currentjob.specific_job.build.failure_count)
4306+ self.assertEqual(
4307+ 1, manager_module.assessFailureCounts.call_count)
4308+
4309+ return d.addCallback(got_scan)
4310+
4311+ def test_scan_first_fail(self):
4312+ # The first failure of a job should result in the failure_count
4313+ # on the job and the builder both being incremented.
4314+ self._assertFailureCounting(
4315+ builder_count=0, job_count=0, expected_builder_count=1,
4316+ expected_job_count=1)
4317+
4318+ def test_scan_second_builder_fail(self):
4319+ # A job's first failure on a builder that has already failed once
4320+ # should increment the failure_count on both the job and the builder.
4321+ self._assertFailureCounting(
4322+ builder_count=1, job_count=0, expected_builder_count=2,
4323+ expected_job_count=1)
4324+
4325+ def test_scan_second_job_fail(self):
4326+ # A job's second failure on a builder that has not failed before
4327+ # should increment the failure_count on both the job and the builder.
4328+ self._assertFailureCounting(
4329+ builder_count=0, job_count=1, expected_builder_count=1,
4330+ expected_job_count=2)
4331+
4332+ def test_scanFailed_handles_lack_of_a_job_on_the_builder(self):
4333+ def failing_scan():
4334+ return defer.fail(Exception("fake exception"))
4335+ scanner = self._getScanner()
4336+ scanner.scan = failing_scan
4337+ builder = getUtility(IBuilderSet)[scanner.builder_name]
4338+ builder.failure_count = Builder.FAILURE_THRESHOLD
4339+ builder.currentjob.reset()
4340+ self.layer.txn.commit()
4341+
4342+ d = scanner.singleCycle()
4343+
4344+ def scan_finished(ignored):
4345+ self.assertFalse(builder.builderok)
4346+
4347+ return d.addCallback(scan_finished)
4348+
4349+ def test_fail_to_resume_slave_resets_job(self):
4350+ # If an attempt to resume and dispatch a slave fails, it should
4351+ # reset the job via job.reset()
4352+
4353+ # Make a slave with a failing resume() method.
4354+ slave = OkSlave()
4355+ slave.resume = lambda: deferLater(
4356+ reactor, 0, defer.fail, Failure(('out', 'err', 1)))
4357+
4358+ # Reset sampledata builder.
4359+ builder = removeSecurityProxy(
4360+ getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME])
4361+ self._resetBuilder(builder)
4362+ self.assertEqual(0, builder.failure_count)
4363+ builder.setSlaveForTesting(slave)
4364+ builder.vm_host = "fake_vm_host"
4365+
4366+ scanner = self._getScanner()
4367+
4368+ # Get the next job that will be dispatched.
4369+ job = removeSecurityProxy(builder._findBuildCandidate())
4370+ job.virtualized = True
4371+ builder.virtualized = True
4372+ d = scanner.singleCycle()
4373+
4374+ def check(ignored):
4375+ # The failure_count will have been incremented on the
4376+ # builder, we can check that to see that a dispatch attempt
4377+ # did indeed occur.
4378+ self.assertEqual(1, builder.failure_count)
4379+ # There should also be no builder set on the job.
4380+ self.assertTrue(job.builder is None)
4381+ build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(job)
4382+ self.assertEqual(build.status, BuildStatus.NEEDSBUILD)
4383+
4384+ return d.addCallback(check)
4385
4386
4387 class TestBuilddManager(TrialTestCase):
4388
4389- layer = LaunchpadZopelessLayer
4390+ layer = TwistedLaunchpadZopelessLayer
4391
4392 def _stub_out_scheduleNextScanCycle(self):
4393 # stub out the code that adds a callLater, so that later tests
4394 # don't get surprises.
4395- self.patch(SlaveScanner, 'scheduleNextScanCycle', FakeMethod())
4396+ self.patch(SlaveScanner, 'startCycle', FakeMethod())
4397
4398 def test_addScanForBuilders(self):
4399 # Test that addScanForBuilders generates NewBuildersScanner objects.
4400@@ -1040,10 +459,62 @@
4401 self.assertNotEqual(0, manager.new_builders_scanner.scan.call_count)
4402
4403
4404+class TestFailureAssessments(TestCaseWithFactory):
4405+
4406+ layer = ZopelessDatabaseLayer
4407+
4408+ def setUp(self):
4409+ TestCaseWithFactory.setUp(self)
4410+ self.builder = self.factory.makeBuilder()
4411+ self.build = self.factory.makeSourcePackageRecipeBuild()
4412+ self.buildqueue = self.build.queueBuild()
4413+ self.buildqueue.markAsBuilding(self.builder)
4414+
4415+ def test_equal_failures_reset_job(self):
4416+ self.builder.gotFailure()
4417+ self.builder.getCurrentBuildFarmJob().gotFailure()
4418+
4419+ assessFailureCounts(self.builder, "failnotes")
4420+ self.assertIs(None, self.builder.currentjob)
4421+ self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
4422+
4423+ def test_job_failing_more_than_builder_fails_job(self):
4424+ self.builder.getCurrentBuildFarmJob().gotFailure()
4425+
4426+ assessFailureCounts(self.builder, "failnotes")
4427+ self.assertIs(None, self.builder.currentjob)
4428+ self.assertEqual(self.build.status, BuildStatus.FAILEDTOBUILD)
4429+
4430+ def test_builder_failing_more_than_job_but_under_fail_threshold(self):
4431+ self.builder.failure_count = Builder.FAILURE_THRESHOLD - 1
4432+
4433+ assessFailureCounts(self.builder, "failnotes")
4434+ self.assertIs(None, self.builder.currentjob)
4435+ self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
4436+ self.assertTrue(self.builder.builderok)
4437+
4438+ def test_builder_failing_more_than_job_but_over_fail_threshold(self):
4439+ self.builder.failure_count = Builder.FAILURE_THRESHOLD
4440+
4441+ assessFailureCounts(self.builder, "failnotes")
4442+ self.assertIs(None, self.builder.currentjob)
4443+ self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
4444+ self.assertFalse(self.builder.builderok)
4445+ self.assertEqual("failnotes", self.builder.failnotes)
4446+
4447+ def test_builder_failing_with_no_attached_job(self):
4448+ self.buildqueue.reset()
4449+ self.builder.failure_count = Builder.FAILURE_THRESHOLD
4450+
4451+ assessFailureCounts(self.builder, "failnotes")
4452+ self.assertFalse(self.builder.builderok)
4453+ self.assertEqual("failnotes", self.builder.failnotes)
4454+
4455+
4456 class TestNewBuilders(TrialTestCase):
4457 """Test detecting of new builders."""
4458
4459- layer = LaunchpadZopelessLayer
4460+ layer = TwistedLaunchpadZopelessLayer
4461
4462 def _getScanner(self, manager=None, clock=None):
4463 return NewBuildersScanner(manager=manager, clock=clock)
4464@@ -1084,11 +555,8 @@
4465 new_builders, builder_scanner.checkForNewBuilders())
4466
4467 def test_scan(self):
4468- # See if scan detects new builders and schedules the next scan.
4469+ # See if scan detects new builders.
4470
4471- # stub out the addScanForBuilders and scheduleScan methods since
4472- # they use callLater; we only want to assert that they get
4473- # called.
4474 def fake_checkForNewBuilders():
4475 return "new_builders"
4476
4477@@ -1104,9 +572,6 @@
4478 builder_scanner.scan()
4479 advance = NewBuildersScanner.SCAN_INTERVAL + 1
4480 clock.advance(advance)
4481- self.assertNotEqual(
4482- 0, builder_scanner.scheduleScan.call_count,
4483- "scheduleScan did not get called")
4484
4485
4486 def is_file_growing(filepath, poll_interval=1, poll_repeat=10):
4487@@ -1147,7 +612,7 @@
4488 return False
4489
4490
4491-class TestBuilddManagerScript(LaunchpadTestCase):
4492+class TestBuilddManagerScript(TestCaseWithFactory):
4493
4494 layer = LaunchpadScriptLayer
4495
4496@@ -1156,6 +621,7 @@
4497 fixture = BuilddManagerTestSetup()
4498 fixture.setUp()
4499 fixture.tearDown()
4500+ self.layer.force_dirty_database()
4501
4502 # XXX Julian 2010-08-06 bug=614275
4503 # These next 2 tests are in the wrong place, they should be near the
4504
4505=== modified file 'lib/lp/buildmaster/tests/test_packagebuild.py'
4506--- lib/lp/buildmaster/tests/test_packagebuild.py 2010-10-02 11:41:43 +0000
4507+++ lib/lp/buildmaster/tests/test_packagebuild.py 2010-10-25 19:14:01 +0000
4508@@ -99,6 +99,8 @@
4509 self.assertRaises(
4510 NotImplementedError, self.package_build.verifySuccessfulUpload)
4511 self.assertRaises(NotImplementedError, self.package_build.notify)
4512+ # XXX 2010-10-18 bug=662631
4513+ # Change this to do non-blocking IO.
4514 self.assertRaises(
4515 NotImplementedError, self.package_build.handleStatus,
4516 None, None, None)
4517@@ -311,6 +313,8 @@
4518 # A filemap with plain filenames should not cause a problem.
4519 # The call to handleStatus will attempt to get the file from
4520 # the slave resulting in a URL error in this test case.
4521+ # XXX 2010-10-18 bug=662631
4522+ # Change this to do non-blocking IO.
4523 self.build.handleStatus('OK', None, {
4524 'filemap': {'myfile.py': 'test_file_hash'},
4525 })
4526@@ -321,6 +325,8 @@
4527 def test_handleStatus_OK_absolute_filepath(self):
4528 # A filemap that tries to write to files outside of
4529 # the upload directory will result in a failed upload.
4530+ # XXX 2010-10-18 bug=662631
4531+ # Change this to do non-blocking IO.
4532 self.build.handleStatus('OK', None, {
4533 'filemap': {'/tmp/myfile.py': 'test_file_hash'},
4534 })
4535@@ -331,6 +337,8 @@
4536 def test_handleStatus_OK_relative_filepath(self):
4537 # A filemap that tries to write to files outside of
4538 # the upload directory will result in a failed upload.
4539+ # XXX 2010-10-18 bug=662631
4540+ # Change this to do non-blocking IO.
4541 self.build.handleStatus('OK', None, {
4542 'filemap': {'../myfile.py': 'test_file_hash'},
4543 })
4544@@ -341,6 +349,8 @@
4545 # The build log is set during handleStatus.
4546 removeSecurityProxy(self.build).log = None
4547 self.assertEqual(None, self.build.log)
4548+ # XXX 2010-10-18 bug=662631
4549+ # Change this to do non-blocking IO.
4550 self.build.handleStatus('OK', None, {
4551 'filemap': {'myfile.py': 'test_file_hash'},
4552 })
4553@@ -350,6 +360,8 @@
4554 # The date finished is updated during handleStatus_OK.
4555 removeSecurityProxy(self.build).date_finished = None
4556 self.assertEqual(None, self.build.date_finished)
4557+ # XXX 2010-10-18 bug=662631
4558+ # Change this to do non-blocking IO.
4559 self.build.handleStatus('OK', None, {
4560 'filemap': {'myfile.py': 'test_file_hash'},
4561 })
4562
4563=== modified file 'lib/lp/code/model/recipebuilder.py'
4564--- lib/lp/code/model/recipebuilder.py 2010-08-20 20:31:18 +0000
4565+++ lib/lp/code/model/recipebuilder.py 2010-10-25 19:14:01 +0000
4566@@ -117,38 +117,42 @@
4567 raise CannotBuild("Unable to find distroarchseries for %s in %s" %
4568 (self._builder.processor.name,
4569 self.build.distroseries.displayname))
4570-
4571+ args = self._extraBuildArgs(distroarchseries, logger)
4572 chroot = distroarchseries.getChroot()
4573 if chroot is None:
4574 raise CannotBuild("Unable to find a chroot for %s" %
4575 distroarchseries.displayname)
4576- self._builder.slave.cacheFile(logger, chroot)
4577-
4578- # Generate a string which can be used to cross-check when obtaining
4579- # results so we know we are referring to the right database object in
4580- # subsequent runs.
4581- buildid = "%s-%s" % (self.build.id, build_queue_id)
4582- cookie = self.buildfarmjob.generateSlaveBuildCookie()
4583- chroot_sha1 = chroot.content.sha1
4584- logger.debug(
4585- "Initiating build %s on %s" % (buildid, self._builder.url))
4586-
4587- args = self._extraBuildArgs(distroarchseries, logger)
4588- status, info = self._builder.slave.build(
4589- cookie, "sourcepackagerecipe", chroot_sha1, {}, args)
4590- message = """%s (%s):
4591- ***** RESULT *****
4592- %s
4593- %s: %s
4594- ******************
4595- """ % (
4596- self._builder.name,
4597- self._builder.url,
4598- args,
4599- status,
4600- info,
4601- )
4602- logger.info(message)
4603+ d = self._builder.slave.cacheFile(logger, chroot)
4604+
4605+ def got_cache_file(ignored):
4606+ # Generate a string which can be used to cross-check when obtaining
4607+ # results so we know we are referring to the right database object in
4608+ # subsequent runs.
4609+ buildid = "%s-%s" % (self.build.id, build_queue_id)
4610+ cookie = self.buildfarmjob.generateSlaveBuildCookie()
4611+ chroot_sha1 = chroot.content.sha1
4612+ logger.debug(
4613+ "Initiating build %s on %s" % (buildid, self._builder.url))
4614+
4615+ return self._builder.slave.build(
4616+ cookie, "sourcepackagerecipe", chroot_sha1, {}, args)
4617+
4618+ def log_build_result((status, info)):
4619+ message = """%s (%s):
4620+ ***** RESULT *****
4621+ %s
4622+ %s: %s
4623+ ******************
4624+ """ % (
4625+ self._builder.name,
4626+ self._builder.url,
4627+ args,
4628+ status,
4629+ info,
4630+ )
4631+ logger.info(message)
4632+
4633+ return d.addCallback(got_cache_file).addCallback(log_build_result)
4634
4635 def verifyBuildRequest(self, logger):
4636 """Assert some pre-build checks.
4637
4638=== modified file 'lib/lp/soyuz/browser/tests/test_builder_views.py'
4639--- lib/lp/soyuz/browser/tests/test_builder_views.py 2010-10-04 19:50:45 +0000
4640+++ lib/lp/soyuz/browser/tests/test_builder_views.py 2010-10-25 19:14:01 +0000
4641@@ -34,7 +34,7 @@
4642 return view
4643
4644 def test_posting_form_doesnt_call_slave_xmlrpc(self):
4645- # Posting the +edit for should not call is_available, which
4646+ # Posting the +edit for should not call isAvailable, which
4647 # would do xmlrpc to a slave builder and is explicitly forbidden
4648 # in a webapp process.
4649 view = self.initialize_view()
4650
4651=== removed file 'lib/lp/soyuz/doc/buildd-dispatching.txt'
4652--- lib/lp/soyuz/doc/buildd-dispatching.txt 2010-10-18 22:24:59 +0000
4653+++ lib/lp/soyuz/doc/buildd-dispatching.txt 1970-01-01 00:00:00 +0000
4654@@ -1,371 +0,0 @@
4655-= Buildd Dispatching =
4656-
4657- >>> import transaction
4658- >>> import logging
4659- >>> logger = logging.getLogger()
4660- >>> logger.setLevel(logging.DEBUG)
4661-
4662-The buildd dispatching basically consists of finding a available
4663-slave in IDLE state, pushing any required files to it, then requesting
4664-that it starts the build procedure. These tasks are implemented by the
4665-BuilderSet and Builder classes.
4666-
4667-Setup the test builder:
4668-
4669- >>> from canonical.buildd.tests import BuilddSlaveTestSetup
4670- >>> fixture = BuilddSlaveTestSetup()
4671- >>> fixture.setUp()
4672-
4673-Setup a suitable chroot for Hoary i386:
4674-
4675- >>> from StringIO import StringIO
4676- >>> from canonical.librarian.interfaces import ILibrarianClient
4677- >>> librarian_client = getUtility(ILibrarianClient)
4678-
4679- >>> content = 'anything'
4680- >>> alias_id = librarian_client.addFile(
4681- ... 'foo.tar.gz', len(content), StringIO(content), 'text/plain')
4682-
4683- >>> from canonical.launchpad.interfaces.librarian import ILibraryFileAliasSet
4684- >>> from lp.registry.interfaces.distribution import IDistributionSet
4685- >>> from lp.registry.interfaces.pocket import PackagePublishingPocket
4686-
4687- >>> hoary = getUtility(IDistributionSet)['ubuntu']['hoary']
4688- >>> hoary_i386 = hoary['i386']
4689-
4690- >>> chroot = getUtility(ILibraryFileAliasSet)[alias_id]
4691- >>> pc = hoary_i386.addOrUpdateChroot(chroot=chroot)
4692-
4693-Activate builders present in sampledata, we need to be logged in as a
4694-member of launchpad-buildd-admin:
4695-
4696- >>> from canonical.launchpad.ftests import login
4697- >>> login('celso.providelo@canonical.com')
4698-
4699-Set IBuilder.builderok of all present builders:
4700-
4701- >>> from lp.buildmaster.interfaces.builder import IBuilderSet
4702- >>> builder_set = getUtility(IBuilderSet)
4703-
4704- >>> builder_set.count()
4705- 2
4706-
4707- >>> from canonical.launchpad.ftests import syncUpdate
4708- >>> for b in builder_set:
4709- ... b.builderok = True
4710- ... syncUpdate(b)
4711-
4712-Clean up previous BuildQueue results from sampledata:
4713-
4714- >>> from lp.buildmaster.interfaces.buildqueue import IBuildQueueSet
4715- >>> lost_job = getUtility(IBuildQueueSet).get(1)
4716- >>> lost_job.builder.name
4717- u'bob'
4718- >>> lost_job.destroySelf()
4719- >>> transaction.commit()
4720-
4721-If the specified buildd slave reset command (used inside resumeSlaveHost())
4722-fails, the slave will still be marked as failed.
4723-
4724- >>> from canonical.config import config
4725- >>> reset_fail_config = '''
4726- ... [builddmaster]
4727- ... vm_resume_command: /bin/false'''
4728- >>> config.push('reset fail', reset_fail_config)
4729- >>> frog_builder = builder_set['frog']
4730- >>> frog_builder.handleTimeout(logger, 'The universe just collapsed')
4731- WARNING:root:Resetting builder: http://localhost:9221/ -- The universe just collapsed
4732- ...
4733- WARNING:root:Failed to reset builder: http://localhost:9221/ -- Resuming failed:
4734- ...
4735- WARNING:root:Disabling builder: http://localhost:9221/ -- The universe just collapsed
4736- ...
4737- <BLANKLINE>
4738-
4739-Since we were unable to reset the 'frog' builder it was marked as 'failed'.
4740-
4741- >>> frog_builder.builderok
4742- False
4743-
4744-Restore default value for resume command.
4745-
4746- >>> ignored_config = config.pop('reset fail')
4747-
4748-The 'bob' builder is available for build jobs.
4749-
4750- >>> bob_builder = builder_set['bob']
4751- >>> bob_builder.name
4752- u'bob'
4753- >>> bob_builder.virtualized
4754- False
4755- >>> bob_builder.is_available
4756- True
4757- >>> bob_builder.builderok
4758- True
4759-
4760-
4761-== Builder dispatching API ==
4762-
4763-Now let's check the build candidates which will be considered for the
4764-builder 'bob':
4765-
4766- >>> from zope.security.proxy import removeSecurityProxy
4767- >>> job = removeSecurityProxy(bob_builder)._findBuildCandidate()
4768-
4769-The single BuildQueue found is a non-virtual pending build:
4770-
4771- >>> job.id
4772- 2
4773- >>> from lp.soyuz.interfaces.binarypackagebuild import (
4774- ... IBinaryPackageBuildSet)
4775- >>> build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(job)
4776- >>> build.status.name
4777- 'NEEDSBUILD'
4778- >>> job.builder is None
4779- True
4780- >>> job.date_started is None
4781- True
4782- >>> build.is_virtualized
4783- False
4784-
4785-The build start time is not set yet either.
4786-
4787- >>> print build.date_first_dispatched
4788- None
4789-
4790-Update the SourcePackageReleaseFile corresponding to this job:
4791-
4792- >>> content = 'anything'
4793- >>> alias_id = librarian_client.addFile(
4794- ... 'foo.dsc', len(content), StringIO(content), 'application/dsc')
4795-
4796- >>> sprf = build.source_package_release.files[0]
4797- >>> naked_sprf = removeSecurityProxy(sprf)
4798- >>> naked_sprf.libraryfile = getUtility(ILibraryFileAliasSet)[alias_id]
4799- >>> flush_database_updates()
4800-
4801-Check the dispatching method itself:
4802-
4803- >>> dispatched_job = bob_builder.findAndStartJob()
4804- >>> job == dispatched_job
4805- True
4806- >>> bob_builder.builderok = True
4807-
4808- >>> flush_database_updates()
4809-
4810-Verify if the job (BuildQueue) was updated appropriately:
4811-
4812- >>> job.builder.id == bob_builder.id
4813- True
4814-
4815- >>> dispatched_build = getUtility(
4816- ... IBinaryPackageBuildSet).getByQueueEntry(job)
4817- >>> dispatched_build == build
4818- True
4819-
4820- >>> build.status.name
4821- 'BUILDING'
4822-
4823-Shutdown builder, mark the build record as failed and remove the
4824-buildqueue record, so the build was eliminated:
4825-
4826- >>> fixture.tearDown()
4827-
4828- >>> from lp.buildmaster.enums import BuildStatus
4829- >>> build.status = BuildStatus.FAILEDTOBUILD
4830- >>> job.destroySelf()
4831- >>> flush_database_updates()
4832-
4833-
4834-== PPA build dispatching ==
4835-
4836-Create a new Build record of the same source targeted for a PPA archive:
4837-
4838- >>> from lp.registry.interfaces.person import IPersonSet
4839- >>> cprov = getUtility(IPersonSet).getByName('cprov')
4840-
4841- >>> ppa_build = sprf.sourcepackagerelease.createBuild(
4842- ... hoary_i386, PackagePublishingPocket.RELEASE, cprov.archive)
4843-
4844-Create BuildQueue record and inspect some parameters:
4845-
4846- >>> ppa_job = ppa_build.queueBuild()
4847- >>> ppa_job.id
4848- 3
4849- >>> ppa_job.builder == None
4850- True
4851- >>> ppa_job.date_started == None
4852- True
4853-
4854-The build job's archive requires virtualized builds.
4855-
4856- >>> build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(ppa_job)
4857- >>> build.archive.require_virtualized
4858- True
4859-
4860-But the builder is not virtualized.
4861-
4862- >>> bob_builder.virtualized
4863- False
4864-
4865-Hence, the builder will not be able to pick up the PPA build job created
4866-above.
4867-
4868- >>> bob_builder.vm_host = 'localhost.ppa'
4869- >>> syncUpdate(bob_builder)
4870-
4871- >>> job = removeSecurityProxy(bob_builder)._findBuildCandidate()
4872- >>> print job
4873- None
4874-
4875-In order to enable 'bob' to find and build the PPA job, we have to
4876-change it to virtualized. This is because PPA builds will only build
4877-on virtualized builders. We also need to make sure this build's source
4878-is published, or it will also be ignored (by superseding it). We can
4879-do this by copying the existing publication in Ubuntu.
4880-
4881- >>> from lp.soyuz.model.publishing import (
4882- ... SourcePackagePublishingHistory)
4883- >>> [old_pub] = SourcePackagePublishingHistory.selectBy(
4884- ... distroseries=build.distro_series,
4885- ... sourcepackagerelease=build.source_package_release)
4886- >>> new_pub = old_pub.copyTo(
4887- ... old_pub.distroseries, old_pub.pocket, build.archive)
4888-
4889- >>> bob_builder.virtualized = True
4890- >>> syncUpdate(bob_builder)
4891-
4892- >>> job = removeSecurityProxy(bob_builder)._findBuildCandidate()
4893- >>> ppa_job.id == job.id
4894- True
4895-
4896-For further details regarding IBuilder._findBuildCandidate() please see
4897-lib/lp/soyuz/tests/test_builder.py.
4898-
4899-Start buildd-slave to be able to dispatch jobs.
4900-
4901- >>> fixture = BuilddSlaveTestSetup()
4902- >>> fixture.setUp()
4903-
4904-Before dispatching we can check if the builder is protected against
4905-mistakes in code that results in a attempt to build a virtual job in
4906-a non-virtual build.
4907-
4908- >>> bob_builder.virtualized = False
4909- >>> flush_database_updates()
4910- >>> removeSecurityProxy(bob_builder)._dispatchBuildCandidate(ppa_job)
4911- Traceback (most recent call last):
4912- ...
4913- AssertionError: Attempt to build non-virtual item on a virtual builder.
4914-
4915-Mark the builder as virtual again, so we can dispatch the ppa job
4916-successfully.
4917-
4918- >>> bob_builder.virtualized = True
4919- >>> flush_database_updates()
4920-
4921- >>> dispatched_job = bob_builder.findAndStartJob()
4922- >>> ppa_job == dispatched_job
4923- True
4924-
4925- >>> flush_database_updates()
4926-
4927-PPA job is building.
4928-
4929- >>> ppa_job.builder.name
4930- u'bob'
4931-
4932- >>> build.status.name
4933- 'BUILDING'
4934-
4935-Shutdown builder slave, mark the ppa build record as failed, remove the
4936-buildqueue record and make 'bob' builder non-virtual again, so the
4937-environment is back to the initial state.
4938-
4939- >>> fixture.tearDown()
4940-
4941- >>> build.status = BuildStatus.FAILEDTOBUILD
4942- >>> ppa_job.destroySelf()
4943- >>> bob_builder.virtualized = False
4944- >>> flush_database_updates()
4945-
4946-
4947-== Security build dispatching ==
4948-
4949-Setup chroot for warty/i386.
4950-
4951- >>> warty = getUtility(IDistributionSet)['ubuntu']['warty']
4952- >>> warty_i386 = warty['i386']
4953- >>> pc = warty_i386.addOrUpdateChroot(chroot=chroot)
4954-
4955-Create a new Build record for test source targeted to warty/i386
4956-architecture and SECURITY pocket:
4957-
4958- >>> sec_build = sprf.sourcepackagerelease.createBuild(
4959- ... warty_i386, PackagePublishingPocket.SECURITY, hoary.main_archive)
4960-
4961-Create BuildQueue record and inspect some parameters:
4962-
4963- >>> sec_job = sec_build.queueBuild()
4964- >>> sec_job.id
4965- 4
4966- >>> print sec_job.builder
4967- None
4968- >>> print sec_job.date_started
4969- None
4970- >>> sec_build.is_virtualized
4971- False
4972-
4973-In normal conditions the next available candidate would be the job
4974-targeted to SECURITY pocket. However, the builders are forbidden to
4975-accept such jobs until we have finished the EMBARGOED archive
4976-implementation.
4977-
4978- >>> fixture = BuilddSlaveTestSetup()
4979- >>> fixture.setUp()
4980- >>> removeSecurityProxy(bob_builder)._dispatchBuildCandidate(sec_job)
4981- Traceback (most recent call last):
4982- ...
4983- AssertionError: Soyuz is not yet capable of building SECURITY uploads.
4984- >>> fixture.tearDown()
4985-
4986-To solve this problem temporarily until we start building security
4987-uploads, we will mark builds targeted to the SECURITY pocket as
4988-FAILEDTOBUILD during the _findBuildCandidate look-up.
4989-
4990-We will also create another build candidate in breezy-autotest/i386 to
4991-check if legitimate pending candidates will remain valid.
4992-
4993- >>> breezy = getUtility(IDistributionSet)['ubuntu']['breezy-autotest']
4994- >>> breezy_i386 = breezy['i386']
4995- >>> pc = breezy_i386.addOrUpdateChroot(chroot=chroot)
4996-
4997- >>> pending_build = sprf.sourcepackagerelease.createBuild(
4998- ... breezy_i386, PackagePublishingPocket.UPDATES, hoary.main_archive)
4999- >>> pending_job = pending_build.queueBuild()
5000-
The diff has been truncated for viewing.
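
Note on the recipebuilder.py hunk above: the dispatch path no longer blocks on slave.build(). It chains two callbacks onto the Deferred returned by slave.cacheFile(). The sketch below illustrates only that chaining pattern and is not the branch's code. MiniDeferred, FakeSlave and dispatch are hypothetical names, and MiniDeferred is a small pure-Python stand-in for a Twisted Deferred that handles the success path only (no errbacks). It is written for Python 3, so the tuple-unpacking parameter used by log_build_result((status, info)) in the diff becomes an explicit unpack.

```python
class MiniDeferred:
    """Hypothetical stand-in for a Twisted Deferred (success path only)."""

    def __init__(self):
        self._callbacks = []
        self._fired = False
        self._paused = False
        self._result = None

    def addCallback(self, fn):
        # Queue the callback; run it right away if a result is already here.
        self._callbacks.append(fn)
        if self._fired:
            self._run()
        return self

    def callback(self, result):
        # Fire the deferred with its initial result.
        self._fired = True
        self._result = result
        self._run()

    def _run(self):
        while self._callbacks and not self._paused:
            fn = self._callbacks.pop(0)
            result = fn(self._result)
            if isinstance(result, MiniDeferred):
                # A callback returned another deferred: wait for it, then
                # carry on with its result (this is the chaining step).
                self._paused = True
                result.addCallback(self._resume)
            else:
                self._result = result

    def _resume(self, result):
        self._result = result
        self._paused = False
        self._run()
        return result


class FakeSlave:
    """Hypothetical slave whose calls return already-fired deferreds."""

    def __init__(self):
        self.calls = []

    def cacheFile(self, chroot):
        self.calls.append(("cacheFile", chroot))
        d = MiniDeferred()
        d.callback(None)
        return d

    def build(self, cookie, build_type, chroot_sha1, args):
        self.calls.append(("build", cookie))
        d = MiniDeferred()
        d.callback(("BUILDING", cookie))
        return d


def dispatch(slave, chroot_sha1, cookie, args, log):
    """Cache the chroot, start the build, then log the (status, info) pair."""
    d = slave.cacheFile(chroot_sha1)

    def got_cache_file(ignored):
        # Returning a deferred here delays the next callback until it fires.
        return slave.build(cookie, "sourcepackagerecipe", chroot_sha1, args)

    def log_build_result(result):
        status, info = result
        log.append("%s: %s" % (status, info))
        return result

    return d.addCallback(got_cache_file).addCallback(log_build_result)


if __name__ == "__main__":
    messages = []
    dispatch(FakeSlave(), "chroot-sha1", "cookie-1", {}, messages)
    print(messages)
```

The point of the pattern is ordering without blocking: build() is only requested once cacheFile() has completed, and the result is only logged once build() has answered, yet the caller gets the Deferred back immediately and can go on to scan other builders.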