Hey Julian,

The new manager is far, far more readable than before.  The hard work has paid off.

I mention a lot of small things in the comments below.  I'd really appreciate it if you could address them all, since now is the best opportunity we'll have in a while to make this code comprehensible to others.

> # Copyright 2009 Canonical Ltd.  This software is licensed under the
> # GNU Affero General Public License version 3 (see the file LICENSE).
> 
> """Soyuz buildd slave manager logic."""
> 
> __metaclass__ = type
> 
> __all__ = [
>     'BuilddManager',
>     'BUILDD_MANAGER_LOG_NAME',
>     'buildd_success_result_map',
>     ]
> 
> import logging
> 
> import transaction
> from twisted.application import service
> from twisted.internet import (
>     defer,
>     reactor,
>     )
> from twisted.internet.task import LoopingCall
> from twisted.python import log
> from zope.component import getUtility
> 
> from lp.buildmaster.enums import BuildStatus
> from lp.buildmaster.interfaces.buildfarmjobbehavior import (
>     BuildBehaviorMismatch,
>     )
> from lp.buildmaster.interfaces.builder import (
>     BuildDaemonError,
>     BuildSlaveFailure,
>     CannotBuild,
>     CannotResumeHost,
>     )
> 
> 
> BUILDD_MANAGER_LOG_NAME = "slave-scanner"
> 
> 
> buildd_success_result_map = {
>     'ensurepresent': True,
>     'build': 'BuilderStatus.BUILDING',
>     }
>

You can delete this now. Yay. (Don't forget the __all__ too).

> class SlaveScanner:
>     """A manager for a single builder."""
> 
>     SCAN_INTERVAL = 5
>

Can you please add a comment explaining what this means, what unit it's in,
and hinting at why 5 is a good number for it?

>     def startCycle(self):
>         """Scan the builder and dispatch to it or deal with failures."""
>         self.loop = LoopingCall(self._startCycle)
>         self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
>         return self.stopping_deferred
> 
>     def _startCycle(self):
>         # Same as _startCycle but the next cycle is not scheduled.  This
>         # is so tests can initiate a single scan.

This comment is obsolete.  Also, there's probably a better name than
"_startCycle", since this is pretty much doing the scan.  Perhaps 'oneCycle'.

>     def _scanFailed(self, failure):
>         # Trap known exceptions and print a message without a
>         # stack trace in that case, or if we don't know about it,
>         # include the trace.
>

This comment is also obsolete.  Although the traceback/no-traceback logic is
still here, it's hardly the point of the method.

>         # Paranoia.
>         transaction.abort()
>

Can you please explain in the comment exactly what you are being paranoid
about?

>         error_message = failure.getErrorMessage()
>         if failure.check(
>             BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch,
>             CannotResumeHost, BuildDaemonError):
>             self.logger.info("Scanning failed with: %s" % error_message)
>         else:
>             self.logger.info("Scanning failed with: %s\n%s" %
>                 (failure.getErrorMessage(), failure.getTraceback()))
> 
>         builder = get_builder(self.builder_name)
>

Shouldn't this line be in the try: block below?  After all, if the database
disappears...

>         # Decide if we need to terminate the job or fail the
>         # builder.
>         try:
>             builder.gotFailure()
>             builder.getCurrentBuildFarmJob().gotFailure()
>             self.logger.info(
>                 "builder failure count: %s, job failure count: %s" % (
>                     builder.failure_count,
>                     builder.getCurrentBuildFarmJob().failure_count))
>             assessFailureCounts(builder, failure.getErrorMessage())
>             transaction.commit()
>         except:
>             # Catastrophic code failure! Not much we can do.
>             self.logger.error(
>                 "Miserable failure when trying to examine failure counts:\n",
>                 exc_info=True)
>             transaction.abort()
>

>     def scan(self):
>         """Probe the builder and update/dispatch/collect as appropriate.
> 
>         The whole method is wrapped in a transaction, but we do partial
>         commits to avoid holding locks on tables.
> 
>         :return: A `BuilderSlave` if we dispatched a job to it, or None.

Can you please change this to document that it returns a Deferred.

>         """
>         # We need to re-fetch the builder object on each cycle as the
>         # Storm store is invalidated over transaction boundaries.
>

This method is complicated enough that I think it would benefit from a prose
summary of what it does.

For example:

  If the builder is OK [XXX - I don't know what this actually means - jml],
  then update its status [XXX - this seems redundant, didn't we just check
  that it was OK? - jml].  If we think there's an active build on the
  builder, then check to see if it's done (builder.updateBuild).  If it
  is done, then check that it's available and not in manual mode, then
  dispatch the job to the build.

Well, that's my best guess.  I think it's only worth describing the happy
case, since the unusual cases will be easy enough to follow in the code.

>         self.builder = get_builder(self.builder_name)
> 
>         if self.builder.builderok:
>             d = self.builder.updateStatus(self.logger)
>         else:
>             d = defer.succeed(None)
> 
>         def got_update_status(ignored):

This would be better named 'status_updated', since you haven't actually got
anything from anywhere.

>             transaction.commit()
>

Could you please add a comment explaining why you commit here.

>             # See if we think there's an active build on the builder.
>             buildqueue = self.builder.getBuildQueue()
> 
>             # Scan the slave and get the logtail, or collect the build if
>             # it's ready.  Yes, "updateBuild" is a bad name.
>             if buildqueue is not None:
>                 return self.builder.updateBuild(buildqueue)
> 
>         def got_available(available):
>             if not available:
>                 job = self.builder.currentjob
>                 if job is not None and not self.builder.builderok:
>                     self.logger.info(
>                         "%s was made unavailable, resetting attached "
>                         "job" % self.builder.name)
>                     job.reset()
>                     transaction.commit()
>                 return
> 
>             # See if there is a job we can dispatch to the builder slave.
> 
>             d = self.builder.findAndStartJob()
>             def job_started(candidate):
>                 if self.builder.currentjob is not None:
>                     # After a successful dispatch we can reset the
>                     # failure_count.
>                     self.builder.resetFailureCount()
>                     transaction.commit()
>                     return self.builder.slave
>                 else:
>                     return None
>             return d.addCallback(job_started)
>

Since got_available is called after 'got_update_build', it would make this
function read more clearly to move it below got_update_build.

>         def got_update_build(ignored):

build_updated would be a better name, since you don't really "got" anything
here.

>             transaction.commit()
>

Can you please add a comment explaining why you commit here?

>             # If the builder is in manual mode, don't dispatch anything.
>             if self.builder.manual:
>                 self.logger.debug(
>                     '%s is in manual mode, not dispatching.' %
>                     self.builder.name)
>                 return
> 
>             # If the builder is marked unavailable, don't dispatch anything.
>             # Additionaly, because builders can be removed from the pool at
>             # any time, we need to see if we think there was a build running
>             # on it before it was marked unavailable. In this case we reset
>             # the build thusly forcing it to get re-dispatched to another
>             # builder.
> 
>             return self.builder.isAvailable().addCallback(got_available)
> 
>         d.addCallback(got_update_status)
>         d.addCallback(got_update_build)
>         return d
> 

All in all, that's vastly more readable than the version in trunk.
Congratulations.


> class BuilddManager(service.Service):
>     """Main Buildd Manager service class."""
> 
>     def __init__(self, clock=None):
>         self.builder_slaves = []
>         self.logger = self._setupLogger()

Given that _setupLogger changes global state, it's better to put it in
startService.

>         self.new_builders_scanner = NewBuildersScanner(
>             manager=self, clock=clock)
>
>     def _setupLogger(self):
>         """Setup a 'slave-scanner' logger that redirects to twisted.
>

FWIW, "Setup" is a noun. "Set up" is a verb.

>         It is going to be used locally and within the thread running
>         the scan() method.
>

There's no threading here, is there?

>     def startService(self):
>         """Service entry point, called when the application starts."""
> 
>         # Get a list of builders and set up scanners on each one.
> 
>         # Avoiding circular imports.
>         from lp.buildmaster.interfaces.builder import IBuilderSet
>         builder_set = getUtility(IBuilderSet)
>         builders = [builder.name for builder in builder_set]
>         self.addScanForBuilders(builders)
>         self.new_builders_scanner.scheduleScan()
>

If you change the constructor of NewBuildersScanner to set current_builders to
[], rather than doing a query, then you could delete everything but the last
line of this method without sacrificing either correctness or performance.

>         # Events will now fire in the SlaveScanner objects to scan each
>         # builder.
> 
>     def stopService(self):
>         """Callback for when we need to shut down."""
>         # All the SlaveScanner objects need to be halted gracefully.
>         deferreds = [slave.stopping_deferred for slave in self.builder_slaves]
>         deferreds.append(self.new_builders_scanner.stopping_deferred)
> 
>         self.new_builders_scanner.loop.stop()
>         for slave in self.builder_slaves:
>             slave.loop.stop()
>

There's an asymmetry here.  Given that 'slave' and 'new_builders' have
startCycle methods, they should probably have stopCycle methods too, rather
than having their insides poked.

> 
> class TestSlaveScannerScan(TrialTestCase):
>     """Tests `SlaveScanner.scan` method.
> 
>     This method uses the old framework for scanning and dispatching builds.
>     """
>     layer = LaunchpadZopelessLayer
> 
>     def setUp(self):
>         """Setup TwistedLayer, TrialTestCase and BuilddSlaveTest.
> 
>         Also adjust the sampledata in a way a build can be dispatched to
>         'bob' builder.
>         """
>         from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
>         TwistedLayer.testSetUp()

Why not use TwistedLaunchpadZopelessLayer?

>     def _resetBuilder(self, builder):
>         """Reset the given builder and its job."""
> 
>         builder.builderok = True
>         job = builder.currentjob
>         if job is not None:
>             job.reset()
> 
>         transaction.commit()
>

Why isn't this a method on builder?

>     def _assertFailureCounting(self, builder_count, job_count,
>                                expected_builder_count, expected_job_count):
>         # If scan() fails with an exception, failure_counts should be
>         # incremented.  What we do with the results of the failure
>         # counts is tested below separately, this test just makes sure that
>         # scan() is setting the counts.
>         def failing_scan():
>             return defer.fail(Exception("fake exception"))
>         scanner = self._getScanner()
>         scanner.scan = failing_scan
>         scanner.scheduleNextScanCycle = FakeMethod()

Why assign a FakeMethod here?  There's no scanner.scheduleNextScanCycle any
more.  (This is one reason I prefer state verification to behaviour
verification).

>     def test_fail_to_resume_slave_resets_job(self):
>         # If an attempt to resume and dispatch a slave fails, it should
>         # reset the job via job.reset()
> 
>         # Make a slave with a failing resume() method.
>         slave = OkSlave()
>         slave.resume = lambda: deferLater(
>             reactor, 0, defer.fail, Failure(('out', 'err', 1)))
>

Why is deferLater being used here?

> class TestBuilddManager(TrialTestCase):
> 
>     layer = LaunchpadZopelessLayer
> 
>     def _stub_out_scheduleNextScanCycle(self):
>         # stub out the code that adds a callLater, so that later tests
>         # don't get surprises.
>         self.patch(SlaveScanner, 'startCycle', FakeMethod())
>

Rather than stubbing out startCycle, you could instead call
manager.stopService() in the tearDown.

> 
> class TestNewBuilders(TrialTestCase):
>     """Test detecting of new builders."""
> 
>     layer = LaunchpadZopelessLayer
>

This needs to use TwistedLaunchpadZopelessLayer, otherwise it will screw up
signal handlers.

>     def _getScanner(self, manager=None, clock=None):
>         return NewBuildersScanner(manager=manager, clock=clock)
> 
>     def test_init_stores_existing_builders(self):
>         # Make sure that NewBuildersScanner initialises itself properly
>         # by storing a list of existing builders.
>         all_builders = [builder.name for builder in getUtility(IBuilderSet)]
>         builder_scanner = self._getScanner()
>         self.assertEqual(all_builders, builder_scanner.current_builders)
>

As mentioned above, if you change the behaviour of the constructor to not do a
query, then you can delete duplicated code.

>     def test_scheduleScan(self):
>         # Test that scheduleScan calls the "scan" method.
>         clock = task.Clock()
>         builder_scanner = self._getScanner(clock=clock)
>         builder_scanner.scan = FakeMethod()
>         builder_scanner.scheduleScan()
> 
>         advance = NewBuildersScanner.SCAN_INTERVAL + 1
>         clock.advance(advance)
>         self.assertNotEqual(
>             0, builder_scanner.scan.call_count,
>             "scheduleScan did not schedule anything")
>

The mocking in this test doesn't help much.  It's not that much harder to test
that if new builders are added to the database and the clock is advanced then
those builders are registered with the manager.  That's also the behaviour you
are actually interested in.

> 
>     def test_scan(self):
>         # See if scan detects new builders.
> 
>         # stub out the addScanForBuilders and scheduleScan methods since
>         # they use callLater; we only want to assert that they get
>         # called.

This comment is slightly incorrect, since it doesn't stub out scheduleScan.
Also, callLater is only called very indirectly by addScanForBuilders.

>         def fake_checkForNewBuilders():
>             return "new_builders"
> 
>         def fake_addScanForBuilders(new_builders):
>             self.assertEqual("new_builders", new_builders)
> 
>         clock = task.Clock()
>         builder_scanner = self._getScanner(BuilddManager(), clock=clock)
>         builder_scanner.checkForNewBuilders = fake_checkForNewBuilders
>         builder_scanner.manager.addScanForBuilders = fake_addScanForBuilders
>         builder_scanner.scheduleScan = FakeMethod()
> 
>         builder_scanner.scan()
>         advance = NewBuildersScanner.SCAN_INTERVAL + 1
>         clock.advance(advance)

This never actually checks that checkForNewBuilders or addScanForBuilders is
called.

There's missing test coverage here:
  * There are no tests that demonstrate that the slave scanners'
    cycles are started when addScanForBuilders is called.
  * There are no tests that demonstrate that new builders found in
    scan() will no longer be new in the next run of scan()
  * assessFailureCounts is completely untested
  * stopService has no tests.

I guess there's also a general theme of using fewer FakeMethods.  FakeMethod is a useful testing tool, but that every time we use it we incur a risk, because we are changing the behaviour of the System Under Test in order to test its behaviour.  Better to test the output, rather than method calls, since the output is where the value is.

I look forward to your reply.

jml