On Thursday 19 August 2010 13:37:40 you wrote: > Review: Needs Fixing > Hey Julian, > > This looks like a big step toward build farm reliability -- thank you. > > Some of my comments are asking for further explanation, a few make > suggestions for moving some code around, and the rest are the normal > stylistic gotchas. > > Out of curiosity, did you discuss this change in detail with anyone before > landing? It's been so long since I started it I honestly can't remember. I have vague recollections of mentioning failure counts to various people. > > === modified file 'lib/lp/buildmaster/configure.zcml' > > --- lib/lp/buildmaster/configure.zcml 2010-06-15 14:34:50 +0000 > > +++ lib/lp/buildmaster/configure.zcml 2010-08-18 17:01:23 +0000 > > @@ -50,6 +50,9 @@ > > > > > > > permission="launchpad.Edit" > > set_attributes="status"/> > > > > + > + permission="launchpad.Admin" > > + set_attributes="failure_count"/> > > > > > > > > > component="lp.buildmaster.model.buildfarmjob.BuildFarmJob" > > I guess this means that whatever increments this runs with admin-level > privileges. Is that really such a good idea? For now yes, because it's only ever changed in "zopeless" scripts. That is, Zope requres this to work, but doesn't actually apply any level of security to it. :/ I made it admin to discourage changing it in appservers. There are other script-only-changing declarations like this, but zope.Public, with warnings in the zcml about exposing it externally! I prefer the extra protection up front... > > === modified file 'lib/lp/buildmaster/manager.py' > > --- lib/lp/buildmaster/manager.py 2010-08-02 16:00:50 +0000 > > +++ lib/lp/buildmaster/manager.py 2010-08-18 17:01:23 +0000 > > ... > > > @@ -157,6 +158,52 @@ > > > > if job is not None: > > job.reset() > > > > + def _getBuilder(self): > > + # Helper to return the builder given the slave for this request. > > + # Avoiding circular imports. > > + from lp.buildmaster.interfaces.builder import IBuilderSet > > + return getUtility(IBuilderSet)[self.slave.name] > > + > > I notice that there's very similar code below. > > I'd suggest making a top-level function like this: > > def get_builder(slave_name): > # Helper to return the builder given the slave for this request. > # Avoiding circular imports. > from lp.buildmaster.interfaces.builder import IBuilderSet > return getUtility(IBuilderSet)[self.slave.name] > > And using that here instead, as well as in SlaveScanner. Meh, I meant to do that and forgot, thanks. > > > + def assessFailureCounts(self, builder=None): > > + """View builder/job failure_count and work out which needs to > > die. + > > There's spurious whitespace on this line. Grar, my vim setup has gone weird on me, it always used to tell me about trailing spaces. > > > + :return: True if we disabled something, False if we did not. > > This seems like an odd thing to return. It doesn't seem to be used in the > code. What's it for? For the tests. > > > + """ > > + # Avoiding circular imports. > > Not avoiding circular imports here any more :) I suck at cleaning up after myself :) > > > + if builder is None: > > + builder = self._getBuilder() > > + build_job = builder.currentjob.specific_job.build > > + > > + if builder.failure_count == build_job.failure_count: > > + # This is either the first failure for this job on this > > + # builder, or by some chance the job was re-dispatched to > > + # the same builder. This make it impossible to determine > > "makes" > > > + # whether the job or the builder is at fault, so don't fail > > + # either. We reset the builder and job to try again. > > It's unclear to me why the two counts being equal could imply that this is > the first failure for the job -- surely a count of 1 would signify that? My intention was that it would be read as "1 and 1" but since you failed to understand that then I suck and I need to re-write it. > > Perhaps the comment should say something like: > # If the failure count for the builder is the same as the > # failure count for the job being built, then we cannot > # tell whether the job or the builder is at fault. The best > # we can do is try them both again, and hope that the job > # runs against a different builder. > That'll do. > ... > > > @@ -243,8 +282,21 @@ > > > > error = Failure() > > self.logger.info("Scanning failed with: %s\n%s" % > > > > (error.getErrorMessage(), error.getTraceback())) > > > > - # We should probably detect continuous failures here and > > mark - # the builder down. > > + > > + # Avoid circular import. > > + from lp.buildmaster.interfaces.builder import IBuilderSet > > + builder = getUtility(IBuilderSet)[self.builder_name] > > + > > The get_builder helper would be useful here. Fixed. > > > + # Decide if we need to terminate the job or fail the > > + # builder. > > + self._incrementFailureCounts(builder) > > + self.logger.info( > > + "builder failure count: %s, job failure count: %s" % ( > > + builder.failure_count, > > + > > builder.currentjob.specific_job.build.failure_count)) + > > BaseDispatchResult(slave=None).assessFailureCounts(builder) > > This line makes me wonder why assessFailureCounts is on BaseDispatchResult > at all. Perhaps it should be a method on Builder that takes an optional > 'info' parameter? > > The above code would then read: > > builder.gotFailure() > self.logger.info(...) > builder.assessFailureCounts() I don't think we should continue to bloat model classes with code that is only used in one place. This is a buildd-manager-specific function. Anyway, bugger, I had intended to factor it out to a standalone function as I also realised that that line is ridiculous and again forgot :/ > > Also, perhaps there should be a current_build property on Builder. > > @property > def current_build(self): > return self.currentjob.specific_job.build I don't really like that either I'm afraid! It's masking three complicated queries here and dangerous to have as a property as it lulls you into a false sense of it really being a cheap operation. I don't mind making it a regular method though, getCurrentBuildFarmJob() > Since the buildd-manager seems to do builder.currentjob.specific_job.build > an awful lot. A few :) > > > @@ -440,6 +495,11 @@ > > > > self.logger.error('%s resume failure: %s' % (slave, error_text)) > > return self.reset_result(slave, error_text) > > > > + def _incrementFailureCounts(self, builder): > > + # Avoid circular import. > > I don't think you are avoiding a circular import here. Jeez :/ > > > + builder.failure_count += 1 > > + builder.currentjob.specific_job.build.failure_count += 1 > > + > > Perhaps this should be a method on Builder. e.g. > > class Builder: > > # ... > > def gotFailure(self): > self.failure_count += 1 > self.currentjob.specific_job.build.failure_count += 1 > > Likewise, there could also be a gotSuccess() that resets to 0. For the same reasons as before, I don't think this belongs on IBuilder. Not to mention it's not just manipulating builder properties, it's changing the job's properties. > > > @@ -447,6 +507,9 @@ > > > > `FailDispatchResult`, if it was a communication failure, simply > > reset the slave by returning a `ResetDispatchResult`. > > """ > > > > + from lp.buildmaster.interfaces.builder import IBuilderSet > > + builder = getUtility(IBuilderSet)[slave.name] > > + > > As mentioned above: builder = get_builder(slave.name) Yup, fixed. > > > === modified file 'lib/lp/buildmaster/tests/test_builder.py' > > --- lib/lp/buildmaster/tests/test_builder.py 2010-08-12 17:33:08 +0000 > > +++ lib/lp/buildmaster/tests/test_builder.py 2010-08-18 17:01:23 +0000 > > ... > > > @@ -32,7 +34,17 @@ > > > > class TestBuilder(TestCaseWithFactory): > > """Basic unit tests for `Builder`.""" > > > > - layer = LaunchpadZopelessLayer > > + layer = DatabaseFunctionalLayer > > + > > + def test_providesInterface(self): > > + # Builder provides IBuilder > > + builder = self.factory.makeBuilder() > > + self.assertProvides(builder, IBuilder) > > + > > + def test_default_values(self): > > + builder = self.factory.makeBuilder() > > + flush_database_updates() > > + self.assertEqual(0, builder.failure_count) > > Why is flush_database_updates() necessary? Could you please put the answer > in a comment. It makes sure the Storm cache gets the values that the database initialises for new objects. > > === modified file 'lib/lp/buildmaster/tests/test_manager.py' > > --- lib/lp/buildmaster/tests/test_manager.py 2010-08-06 10:48:49 +0000 > > +++ lib/lp/buildmaster/tests/test_manager.py 2010-08-18 17:01:23 +0000 > > ... > > > @@ -360,9 +360,17 @@ > > > > self.assertFalse(result.processed) > > > > return d.addCallback(check_result) > > > > - def testCheckDispatch(self): > > - """`SlaveScanner.checkDispatch` is chained after dispatch > > requests. - > > + def _setUpSlaveAndBuilder(self): > > + # Helper function to set up a builder and its recording slave. > > + slave = RecordingSlave('bob', 'http://foo.buildd:8221/', > > 'foo.host') + bob_builder = getUtility(IBuilderSet)[slave.name] > > + return slave, bob_builder > > + > > There's already a slave called 'bob' in the sampledata? There's a builder called 'bob'. There are no slaves. There's also another called 'frog'. > Could you please > add a constant to lp.testing.sampledata like: > > BUILD_SLAVE = RecordingSlave('bob', 'http://foo.buildd:8221/', > 'foo.host') > > or > > STANDARD_BUILD_SLAVE_NAME = 'bob' > > or whatever is most appropriate. > > Also, it might be a good idea to> have this method guarantee that the > failure counts are zero. I considered that, but the methods that call this one set the counts to varying values. > > > + def test_scan_assesses_failure_exceptions(self): > > + # If scan() fails with an exception, failure_counts should be > > + # incremented and tested. > > + def fake_scan(): > > + raise Exception("fake exception") > > + manager = self._getManager() > > + manager.scan = fake_scan > > Consider calling the function "failing_scan" rather than "fake_scan". Right, done. > > > + manager.scheduleNextScanCycle = FakeMethod() > > + self.patch(BaseDispatchResult, 'assessFailureCounts', > > FakeMethod()) > > Why are you stubbing out assessFailureCounts here? Because at the bottom of the method you'll see that it's checking the call_count. > > > + def assertBuilderIsClean(self, builder): > > + # Check that the builder is ready for a new build. > > + self.assertTrue(builder.builderok) > > + self.assertTrue(builder.failnotes is None) > > + self.assertTrue(builder.currentjob is None) > > Please use assertIs(None, builder.failnotes) instead of assertTrue. Same > for currentjob. It gets you more helpful error messages when something > breaks. Agh, my bad. I knew that. > > > @@ -809,32 +890,82 @@ > > > > result = ResetDispatchResult(slave) > > result() > > > > - self.assertJobIsClean(job_id) > > + buildqueue = getUtility(IBuildQueueSet).get(buildqueue_id) > > + self.assertBuildqueueIsClean(buildqueue) > > > > # XXX Julian > > # Disabled test until bug 586362 is fixed. > > #self.assertFalse(builder.builderok) > > > > - self.assertEqual(None, builder.currentjob) > > + self.assertBuilderIsClean(builder) > > > > def testFailDispatchResult(self): > > - """`FailDispatchResult` excludes the builder from pool. > > - > > - It marks the build as failed (builderok=False) and clean any > > - existing jobs. > > - """ > > + # Test that `FailDispatchResult` calls assessFailureCounts(). > > This comment would be more helpful if it had: > > # `FailDispatchResult` calls `assessFailureCounts()` so that ... > > I don't really know what the "so that" is. Done. > > > builder, job_id = self._getBuilder('bob') > > > > # Setup a interaction to satisfy 'write_transaction' decorator. > > login(ANONYMOUS) > > slave = RecordingSlave(builder.name, builder.url, > > builder.vm_host) result = FailDispatchResult(slave, 'does not > > work!') > > > > + result.assessFailureCounts = FakeMethod() > > Why are you stubbing it out here? So that I can see if it got called. > > > + self.assertEqual(0, result.assessFailureCounts.call_count) > > This assertion seems a little redundant. Well, it's fairly standard testing practice, no? How else do I know that calling result() has changed the count? > > > === modified file 'lib/lp/soyuz/configure.zcml' > > --- lib/lp/soyuz/configure.zcml 2010-08-16 21:34:11 +0000 > > +++ lib/lp/soyuz/configure.zcml 2010-08-18 17:01:23 +0000 > > @@ -511,6 +511,15 @@ > > > > permission="launchpad.Edit" > > set_attributes="log date_finished date_started builder > > > > status dependencies upload_log"/> > > > > + > > + > > + > + permission="launchpad.Admin" > > + set_attributes="failure_count"/> > > + > > Thanks for including the comment. It would be good to say why it is > required now. That's what the bug reference is for ;) But I changed it anyway. Phew, big diff attached. Thanks!