Code review comment for lp:~julian-edwards/launchpad/builderslave-resume

Revision history for this message
Jonathan Lange (jml) wrote :

On Mon, Oct 18, 2010 at 4:49 PM, Julian Edwards
<email address hidden> wrote:
> On Monday 18 October 2010 13:21:45 you wrote:
>> Review: Needs Fixing
>> Hey Julian,
>>
>> The new manager is far, far more readable than before.  The hard work has
>> paid off.
>
> \o/
>
>> I mention a lot of small things in the comments below.  I'd really
>> appreciate it if you could address them all, since now is the best
>> opportunity we'll have in a while to make this code comprehensible to
>> others.
>
> I shall endeavour to do so.
>
...
>>
>> > class SlaveScanner:
>> >     """A manager for a single builder."""
>> >
>> >     SCAN_INTERVAL = 5
>>
>> Can you please add a comment explaining what this means, what unit it's in,
>> and hinting at why 5 is a good number for it?
>
> Done.
>

...
>>
>> >         """
>> >         # We need to re-fetch the builder object on each cycle as the
>> >         # Storm store is invalidated over transaction boundaries.
>>
>> This method is complicated enough that I think it would benefit from a
>> prose summary of what it does.
>>
>> For example:
>>
>>   If the builder is OK [XXX - I don't know what this actually means - jml],
>>   then update its status [XXX - this seems redundant, didn't we just check
>>   that it was OK? - jml].  If we think there's an active build on the
>>   builder, then check to see if it's done (builder.updateBuild).  If it
>>   is done, then check that it's available and not in manual mode, then
>>   dispatch the job to the build.
>>
>> Well, that's my best guess.  I think it's only worth describing the happy
>> case, since the unusual cases will be easy enough to follow in the code.
>
> I've written something - let me know if you think it's easy enough to follow.
> It does require some knowledge of how the farm works and I don't think code
> comments are the best place for explaining that.
>

It's good. Much more helpful, thanks.

>> > class BuilddManager(service.Service):
>> >     """Main Buildd Manager service class."""
>> >
>> >     def __init__(self, clock=None):
>> >         self.builder_slaves = []
>> >         self.logger = self._setupLogger()
>>
>> Given that _setupLogger changes global state, it's better to put it in
>> startService.
>
> Nocando (which is a bit near Katmandu) - the NewBuildersScanner needs it
> there.  (More below on this section of code)
>

>>
>> >         self.new_builders_scanner = NewBuildersScanner(
>> >
>> >             manager=self, clock=clock)
>> >
>> >     def _setupLogger(self):
>> >         """Setup a 'slave-scanner' logger that redirects to twisted.
>>
>> FWIW, "Setup" is a noun. "Set up" is a verb.
>
> Setup is not a word, it's a typo.
>

>> > class TestSlaveScannerScan(TrialTestCase):
>> >     """Tests `SlaveScanner.scan` method.
>> >
>> >     This method uses the old framework for scanning and dispatching
>> >     builds. """
>> >     layer = LaunchpadZopelessLayer
>> >
>> >     def setUp(self):
>> >         """Setup TwistedLayer, TrialTestCase and BuilddSlaveTest.
>> >
>> >         Also adjust the sampledata in a way a build can be dispatched to
>> >         'bob' builder.
>> >         """
>> >         from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
>> >         TwistedLayer.testSetUp()
>>
>> Why not use TwistedLaunchpadZopelessLayer?
>
> Why not indeed.
>

Note that once you use TwistedLaunchpadZopelessLayer, you don't need
to call testSetUp. Layers do that for you.

>>
>> >     def test_fail_to_resume_slave_resets_job(self):
>> >         # If an attempt to resume and dispatch a slave fails, it should
>> >         # reset the job via job.reset()
>> >
>> >         # Make a slave with a failing resume() method.
>> >         slave = OkSlave()
>> >         slave.resume = lambda: deferLater(
>> >
>> >             reactor, 0, defer.fail, Failure(('out', 'err', 1)))
>>
>> Why is deferLater being used here?
>
> I've explained this before :-)
>
> I don't recall the exact details.  I just remember getting thoroughly pissed
> off in Prague for 2 whole days trying to work out why the test was failing.  I
> narrowed it down to defer.fail firing immediately rather than later.
>

I just tried changing it to
       slave.resume = lambda: defer.fail(Failure(('out', 'err', 1)))
and it works fine.

If you either change it or add a comment, you'll never have to explain
it again. :)

>>
>> > class TestBuilddManager(TrialTestCase):
>> >     layer = LaunchpadZopelessLayer
>> >
>> >     def _stub_out_scheduleNextScanCycle(self):
>> >         # stub out the code that adds a callLater, so that later tests
>> >         # don't get surprises.
>> >         self.patch(SlaveScanner, 'startCycle', FakeMethod())
>>
>> Rather than stubbing out startCycle, you could instead call
>> manager.stopService() in the tearDown.
>
> I tried that but got a million* failures.  I'd like to fix this but I don't
> have the energy to do it now :(
>

Fair enough. Please mark with a XXX.

>>
>> > class TestNewBuilders(TrialTestCase):
>> >     """Test detecting of new builders."""
>> >
>> >     layer = LaunchpadZopelessLayer
>>
>> This needs to use TwistedLaunchpadZopelessLayer, otherwise it will screw up
>> signal handlers.
>
> Fixed.
>
> What's the difference anyway?
>

The Twisted* layers do some work to save and restore signal &
threadpool state that Twisted's own test wrappers don't do.
Technically, trial should do that, but no one has fixed the bug and
we've got a work-around.

>> >     def test_scheduleScan(self):
>> >         # Test that scheduleScan calls the "scan" method.
>> >         clock = task.Clock()
>> >         builder_scanner = self._getScanner(clock=clock)
>> >         builder_scanner.scan = FakeMethod()
>> >         builder_scanner.scheduleScan()
>> >
>> >         advance = NewBuildersScanner.SCAN_INTERVAL + 1
>> >         clock.advance(advance)
>> >         self.assertNotEqual(
>> >
>> >             0, builder_scanner.scan.call_count,
>> >             "scheduleScan did not schedule anything")
>>
>> The mocking in this test doesn't help much.  It's not that much harder to
>> test that if new builders are added to the database and the clock is
>> advanced then those builders are registered with the manager.  That's also
>> the behaviour you are actually interested in.
>
> The behaviour is tested in test_scan; this test makes sure that test_scan is
> actually called.  I don't understand your point. :(
>

It's not a big deal, but essentially, no one cares if scan() is
called. We only care about the results.

For the effort required to set up the mock correctly you could also
just test the results of the scan.

>>
>> >         def fake_checkForNewBuilders():
>> >             return "new_builders"
>> >
>> >         def fake_addScanForBuilders(new_builders):
>> >             self.assertEqual("new_builders", new_builders)
>> >
>> >         clock = task.Clock()
>> >         builder_scanner = self._getScanner(BuilddManager(), clock=clock)
>> >         builder_scanner.checkForNewBuilders = fake_checkForNewBuilders
>> >         builder_scanner.manager.addScanForBuilders =
>> >         fake_addScanForBuilders builder_scanner.scheduleScan =
>> >         FakeMethod()
>> >
>> >         builder_scanner.scan()
>> >         advance = NewBuildersScanner.SCAN_INTERVAL + 1
>> >         clock.advance(advance)
>>
>> This never actually checks that checkForNewBuilders or addScanForBuilders
>> is called.
>>
>> There's missing test coverage here:
>>   * There are no tests that demonstrate that the slave scanners'
>>     cycles are started when addScanForBuilders is called.
>>   * There are no tests that demonstrate that new builders found in
>>     scan() will no longer be new in the next run of scan()
>
> It was brand new code that you previously reviewed in the last "multi-scan"
> branch.  I appreciate your thoroughness, but I've got zero energy to fix this
> code right now, sorry. :(
>

Heh. I understand. Please mark it with a XXX.

>>   * assessFailureCounts is completely untested
>
> The counts are very thoroughly tested in test_manager, but there's no test for
> actions based on the counts.  I intend to change the behaviour anyway so I'll
> sort that out at the same time.
>

Seems fair enough.

>>   * stopService has no tests.
>
> We've already discussed that this is hard / impossible.  I'd be very happy if
> you could help though.
>

Testing the sigterm stuff is hard, testing the stopService stuff less
so. We can probably sort something out. Again, please XXX.

>> I guess there's also a general theme of using fewer FakeMethods.
>> FakeMethod is a useful testing tool, but that every time we use it we
>> incur a risk, because we are changing the behaviour of the System Under
>> Test in order to test its behaviour.  Better to test the output, rather
>> than method calls, since the output is where the value is.
>>
>> I look forward to your reply.
>
>
> Stuff that's left to do in this branch:
>  1. Change assessFailureCounts() to cope with lack of a job, more builder
> leniency, and add a test.
>  2. Fix the remaining places that catch errors, to send them up to
> _scanFailed() instead.
>
> Finally - sorry about some of the negativity above, I'm coming down with
> something and feeling a bit crap which has made my brain turn to jelly.  I'm
> also getting a bit fed up with this branch now.
>

That's ok, it's not too negative, and I can sympathize with being sick
of this branch. :)

Every XXX here is fresh technical debt. On the whole, it's way less
than previous, but it's still there. It would be good to keep up
momentum and land lots of small branches after this dealing with all
of the XXXs, rather than letting them lie.

jml

« Back to merge proposal