On Thursday 19 August 2010 13:37:40 you wrote:
> Review: Needs Fixing
> Hey Julian,
> 
> This looks like a big step toward build farm reliability -- thank you.
> 
> Some of my comments are asking for further explanation, a few make
> suggestions for moving some code around, and the rest are the normal
> stylistic gotchas.
> 
> Out of curiosity, did you discuss this change in detail with anyone before
> landing?

It's been so long since I started it I honestly can't remember.  I have vague 
recollections of mentioning failure counts to various people.

> > === modified file 'lib/lp/buildmaster/configure.zcml'
> > --- lib/lp/buildmaster/configure.zcml	2010-06-15 14:34:50 +0000
> > +++ lib/lp/buildmaster/configure.zcml	2010-08-18 17:01:23 +0000
> > @@ -50,6 +50,9 @@
> > 
> >          <require
> >          
> >              permission="launchpad.Edit"
> >              set_attributes="status"/>
> > 
> > +        <require
> > +            permission="launchpad.Admin"
> > +            set_attributes="failure_count"/>
> > 
> >      </class>
> >      <securedutility
> >      
> >          component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
> 
> I guess this means that whatever increments this runs with admin-level
> privileges.  Is that really such a good idea?

For now yes, because it's only ever changed in "zopeless" scripts.  That is, 
Zope requres this to work, but doesn't actually apply any level of security to 
it.  :/

I made it admin to discourage changing it in appservers.  There are other 
script-only-changing declarations like this, but zope.Public, with warnings in 
the zcml about exposing it externally!  I prefer the extra protection up 
front...

> > === modified file 'lib/lp/buildmaster/manager.py'
> > --- lib/lp/buildmaster/manager.py	2010-08-02 16:00:50 +0000
> > +++ lib/lp/buildmaster/manager.py	2010-08-18 17:01:23 +0000
> 
> ...
> 
> > @@ -157,6 +158,52 @@
> > 
> >          if job is not None:
> >              job.reset()
> > 
> > +    def _getBuilder(self):
> > +        # Helper to return the builder given the slave for this request.
> > +        # Avoiding circular imports.
> > +        from lp.buildmaster.interfaces.builder import IBuilderSet
> > +        return getUtility(IBuilderSet)[self.slave.name]
> > +
> 
> I notice that there's very similar code below.
> 
> I'd suggest making a top-level function like this:
> 
>   def get_builder(slave_name):
>       # Helper to return the builder given the slave for this request.
>       # Avoiding circular imports.
>       from lp.buildmaster.interfaces.builder import IBuilderSet
>       return getUtility(IBuilderSet)[self.slave.name]
> 
> And using that here instead, as well as in SlaveScanner.

Meh, I meant to do that and forgot, thanks.

> 
> > +    def assessFailureCounts(self, builder=None):
> > +        """View builder/job failure_count and work out which needs to
> > die. +
> 
> There's spurious whitespace on this line.

Grar, my vim setup has gone weird on me, it always used to tell me about 
trailing spaces.

> 
> > +        :return: True if we disabled something, False if we did not.
> 
> This seems like an odd thing to return. It doesn't seem to be used in the
> code. What's it for?

For the tests.

> 
> > +        """
> > +        # Avoiding circular imports.
> 
> Not avoiding circular imports here any more :)

I suck at cleaning up after myself :)

> 
> > +        if builder is None:
> > +            builder = self._getBuilder()
> > +        build_job = builder.currentjob.specific_job.build
> > +
> > +        if builder.failure_count == build_job.failure_count:
> > +            # This is either the first failure for this job on this
> > +            # builder, or by some chance the job was re-dispatched to
> > +            # the same builder.  This make it impossible to determine
> 
> "makes"
> 
> > +            # whether the job or the builder is at fault, so don't fail
> > +            # either.  We reset the builder and job to try again.
> 
> It's unclear to me why the two counts being equal could imply that this is
> the first failure for the job -- surely a count of 1 would signify that?

My intention was that it would be read as "1 and 1" but since you failed to 
understand that then I suck and I need to re-write it.

> 
> Perhaps the comment should say something like:
>   # If the failure count for the builder is the same as the
>   # failure count for the job being built, then we cannot
>   # tell whether the job or the builder is at fault. The best
>   # we can do is try them both again, and hope that the job
>   # runs against a different builder.
> 

That'll do.

> ...
> 
> > @@ -243,8 +282,21 @@
> > 
> >              error = Failure()
> >              self.logger.info("Scanning failed with: %s\n%s" %
> >              
> >                  (error.getErrorMessage(), error.getTraceback()))
> > 
> > -            # We should probably detect continuous failures here and
> > mark -            # the builder down.
> > +
> > +            # Avoid circular import.
> > +            from lp.buildmaster.interfaces.builder import IBuilderSet
> > +            builder = getUtility(IBuilderSet)[self.builder_name]
> > +
> 
> The get_builder helper would be useful here.

Fixed.

> 
> > +            # Decide if we need to terminate the job or fail the
> > +            # builder.
> > +            self._incrementFailureCounts(builder)
> > +            self.logger.info(
> > +                "builder failure count: %s, job failure count: %s" % (
> > +                    builder.failure_count,
> > +                   
> > builder.currentjob.specific_job.build.failure_count)) +           
> > BaseDispatchResult(slave=None).assessFailureCounts(builder)
> 
> This line makes me wonder why assessFailureCounts is on BaseDispatchResult
> at all.  Perhaps it should be a method on Builder that takes an optional
> 'info' parameter?
> 
> The above code would then read:
> 
>   builder.gotFailure()
>   self.logger.info(...)
>   builder.assessFailureCounts()

I don't think we should continue to bloat model classes with code that is only 
used in one place.  This is a buildd-manager-specific function.

Anyway, bugger, I had intended to factor it out to a standalone function as I 
also realised that that line is ridiculous and again forgot :/ 

> 
> Also, perhaps there should be a current_build property on Builder.
> 
>   @property
>   def current_build(self):
>       return self.currentjob.specific_job.build

I don't really like that either I'm afraid!  It's masking three complicated 
queries here and dangerous to have as a property as it lulls you into a false 
sense of it really being a cheap operation.

I don't mind making it a regular method though, getCurrentBuildFarmJob()

> Since the buildd-manager seems to do builder.currentjob.specific_job.build
> an awful lot.

A few :)

> 
> > @@ -440,6 +495,11 @@
> > 
> >          self.logger.error('%s resume failure: %s' % (slave, error_text))
> >          return self.reset_result(slave, error_text)
> > 
> > +    def _incrementFailureCounts(self, builder):
> > +        # Avoid circular import.
> 
> I don't think you are avoiding a circular import here.

Jeez :/

> 
> > +        builder.failure_count += 1
> > +        builder.currentjob.specific_job.build.failure_count += 1
> > +
> 
> Perhaps this should be a method on Builder. e.g.
> 
>   class Builder:
> 
>      # ...
> 
>      def gotFailure(self):
>          self.failure_count += 1
>          self.currentjob.specific_job.build.failure_count += 1
> 
> Likewise, there could also be a gotSuccess() that resets to 0.

For the same reasons as before, I don't think this belongs on IBuilder.  Not 
to mention it's not just manipulating builder properties, it's changing the 
job's properties.

> 
> > @@ -447,6 +507,9 @@
> > 
> >          `FailDispatchResult`, if it was a communication failure, simply
> >          reset the slave by returning a `ResetDispatchResult`.
> >          """
> > 
> > +        from lp.buildmaster.interfaces.builder import IBuilderSet
> > +        builder = getUtility(IBuilderSet)[slave.name]
> > +
> 
> As mentioned above: builder = get_builder(slave.name)

Yup, fixed.

> 
> > === modified file 'lib/lp/buildmaster/tests/test_builder.py'
> > --- lib/lp/buildmaster/tests/test_builder.py	2010-08-12 17:33:08 +0000
> > +++ lib/lp/buildmaster/tests/test_builder.py	2010-08-18 17:01:23 +0000
> 
> ...
> 
> > @@ -32,7 +34,17 @@
> > 
> >  class TestBuilder(TestCaseWithFactory):
> >      """Basic unit tests for `Builder`."""
> > 
> > -    layer = LaunchpadZopelessLayer
> > +    layer = DatabaseFunctionalLayer
> > +
> > +    def test_providesInterface(self):
> > +        # Builder provides IBuilder
> > +        builder = self.factory.makeBuilder()
> > +        self.assertProvides(builder, IBuilder)
> > +
> > +    def test_default_values(self):
> > +        builder = self.factory.makeBuilder()
> > +        flush_database_updates()
> > +        self.assertEqual(0, builder.failure_count)
> 
> Why is flush_database_updates() necessary? Could you please put the answer
> in a comment.

It makes sure the Storm cache gets the values that the database initialises 
for new objects.

> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> > --- lib/lp/buildmaster/tests/test_manager.py	2010-08-06 10:48:49 +0000
> > +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-18 17:01:23 +0000
> 
> ...
> 
> > @@ -360,9 +360,17 @@
> > 
> >              self.assertFalse(result.processed)
> >          
> >          return d.addCallback(check_result)
> > 
> > -    def testCheckDispatch(self):
> > -        """`SlaveScanner.checkDispatch` is chained after dispatch
> > requests. -
> > +    def _setUpSlaveAndBuilder(self):
> > +        # Helper function to set up a builder and its recording slave.
> > +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/',
> > 'foo.host') +        bob_builder = getUtility(IBuilderSet)[slave.name]
> > +        return slave, bob_builder
> > +
> 
> There's already a slave called 'bob' in the sampledata? 

There's a builder called 'bob'.  There are no slaves.

There's also another called 'frog'.

> Could you please
> add a constant to lp.testing.sampledata like:
> 
>   BUILD_SLAVE = RecordingSlave('bob', 'http://foo.buildd:8221/',
> 'foo.host')
> 
> or
> 
>   STANDARD_BUILD_SLAVE_NAME = 'bob'
> 
> or whatever is most appropriate.
> 
> Also, it might be a good idea to> have this method guarantee that the
> failure counts are zero.

I considered that, but the methods that call this one set the counts to 
varying values.

> 
> > +    def test_scan_assesses_failure_exceptions(self):
> > +        # If scan() fails with an exception, failure_counts should be
> > +        # incremented and tested.
> > +        def fake_scan():
> > +            raise Exception("fake exception")
> > +        manager = self._getManager()
> > +        manager.scan = fake_scan
> 
> Consider calling the function "failing_scan" rather than "fake_scan".

Right, done.

> 
> > +        manager.scheduleNextScanCycle = FakeMethod()
> > +        self.patch(BaseDispatchResult, 'assessFailureCounts',
> > FakeMethod())
> 
> Why are you stubbing out assessFailureCounts here?

Because at the bottom of the method you'll see that it's checking the 
call_count.

> 
> > +    def assertBuilderIsClean(self, builder):
> > +        # Check that the builder is ready for a new build.
> > +        self.assertTrue(builder.builderok)
> > +        self.assertTrue(builder.failnotes is None)
> > +        self.assertTrue(builder.currentjob is None)
> 
> Please use assertIs(None, builder.failnotes) instead of assertTrue. Same
> for currentjob.  It gets you more helpful error messages when something
> breaks.

Agh, my bad.  I knew that.

> 
> > @@ -809,32 +890,82 @@
> > 
> >          result = ResetDispatchResult(slave)
> >          result()
> > 
> > -        self.assertJobIsClean(job_id)
> > +        buildqueue = getUtility(IBuildQueueSet).get(buildqueue_id)
> > +        self.assertBuildqueueIsClean(buildqueue)
> > 
> >          # XXX Julian
> >          # Disabled test until bug 586362 is fixed.
> >          #self.assertFalse(builder.builderok)
> > 
> > -        self.assertEqual(None, builder.currentjob)
> > +        self.assertBuilderIsClean(builder)
> > 
> >      def testFailDispatchResult(self):
> > -        """`FailDispatchResult` excludes the builder from pool.
> > -
> > -        It marks the build as failed (builderok=False) and clean any
> > -        existing jobs.
> > -        """
> > +        # Test that `FailDispatchResult` calls assessFailureCounts().
> 
> This comment would be more helpful if it had:
> 
>   # `FailDispatchResult` calls `assessFailureCounts()` so that ...
> 
> I don't really know what the "so that" is.

Done.

> 
> >          builder, job_id = self._getBuilder('bob')
> >          
> >          # Setup a interaction to satisfy 'write_transaction' decorator.
> >          login(ANONYMOUS)
> >          slave = RecordingSlave(builder.name, builder.url,
> >          builder.vm_host) result = FailDispatchResult(slave, 'does not
> >          work!')
> > 
> > +        result.assessFailureCounts = FakeMethod()
> 
> Why are you stubbing it out here?

So that I can see if it got called.

> 
> > +        self.assertEqual(0, result.assessFailureCounts.call_count)
> 
> This assertion seems a little redundant.

Well, it's fairly standard testing practice, no?  How else do I know that 
calling result() has changed the count?

> 
> > === modified file 'lib/lp/soyuz/configure.zcml'
> > --- lib/lp/soyuz/configure.zcml	2010-08-16 21:34:11 +0000
> > +++ lib/lp/soyuz/configure.zcml	2010-08-18 17:01:23 +0000
> > @@ -511,6 +511,15 @@
> > 
> >              permission="launchpad.Edit"
> >              set_attributes="log date_finished date_started builder
> >              
> >                              status dependencies upload_log"/>
> > 
> > +
> > +        <!-- XXX bigjools 2010-07-27 bug=570939
> > +             This should not be required once the old BuildFarmJob stuff
> > is +             removed when the Translation Template jobs and Recipe
> > jobs +             use the new infrastructure -->
> > +        <require
> > +            permission="launchpad.Admin"
> > +            set_attributes="failure_count"/>
> > +
> 
> Thanks for including the comment.  It would be good to say why it is
> required now.

That's what the bug reference is for ;)  But I changed it anyway.

Phew, big diff attached.  Thanks!