Merge lp:~julian-edwards/launchpad/buildd-failure-counting into lp:launchpad/db-devel
| Status: | Merged |
|---|---|
| Merged at revision: | 9702 |
| Proposed branch: | lp:~julian-edwards/launchpad/buildd-failure-counting |
| Merge into: | lp:launchpad/db-devel |
| Prerequisite: | lp:~julian-edwards/launchpad/buildd-manager-parallel-scan |
| Diff against target: | 0 lines |
| To merge this branch: | bzr merge lp:~julian-edwards/launchpad/buildd-failure-counting |
| Related bugs: |
|
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Jonathan Lange (community) | 2010-08-02 | Approve on 2010-08-24 | |
| Stuart Bishop | db | 2010-08-02 | Approve on 2010-08-03 |
| Robert Collins | db | 2010-08-02 | Pending |
|
Review via email:
|
|||
Description of the Change
Add failure counting on builder and jobs in the build farm, to work out which one is the nasty one when Stuff Goes Wrong. The builder failure count is reset when we know the dispatch was successul, so it's only there to catch actual dispatch problems, not problems while building.
| Jonathan Lange (jml) wrote : | # |
Hey Julian,
This looks like a big step toward build farm reliability -- thank you.
Some of my comments are asking for further explanation, a few make suggestions
for moving some code around, and the rest are the normal stylistic gotchas.
Out of curiosity, did you discuss this change in detail with anyone before
landing?
cheers,
jml
> === modified file 'lib/lp/
> --- lib/lp/
> +++ lib/lp/
> @@ -50,6 +50,9 @@
> <require
> permission=
> set_attributes=
> + <require
> + permission=
> + set_attributes=
> </class>
> <securedutility
> component=
>
I guess this means that whatever increments this runs with admin-level
privileges. Is that really such a good idea?
> === modified file 'lib/lp/
> --- lib/lp/
> +++ lib/lp/
...
> @@ -157,6 +158,52 @@
> if job is not None:
> job.reset()
>
> + def _getBuilder(self):
> + # Helper to return the builder given the slave for this request.
> + # Avoiding circular imports.
> + from lp.buildmaster.
> + return getUtility(
> +
I notice that there's very similar code below.
I'd suggest making a top-level function like this:
def get_builder(
# Helper to return the builder given the slave for this request.
# Avoiding circular imports.
from lp.buildmaster.
return getUtility(
And using that here instead, as well as in SlaveScanner.
> + def assessFailureCo
> + """View builder/job failure_count and work out which needs to die.
> +
There's spurious whitespace on this line.
> + :return: True if we disabled something, False if we did not.
This seems like an odd thing to return. It doesn't seem to be used in the
code. What's it for?
> + """
> + # Avoiding circular imports.
Not avoiding circular imports here any more :)
> + if builder is None:
> + builder = self._getBuilder()
> + build_job = builder.
> +
> + if builder.
> + # This is either the first failure for this job on this
> + # builder, or by some chance the job was re-dispatched to
> + # the same builder. This make it impossible to determine
"makes"
> + # whether the job or the builder is at fault, so don't fail
> + # either. We reset the builder and job to try again.
It's unclear to me why the two counts being equal could imply that this is the
first failure for the job -- surely a count of 1 would signify that?
Perhaps the comment should say something like:
# If the failure count ...
| Julian Edwards (julian-edwards) wrote : | # |
On Thursday 19 August 2010 13:37:40 you wrote:
> Review: Needs Fixing
> Hey Julian,
>
> This looks like a big step toward build farm reliability -- thank you.
>
> Some of my comments are asking for further explanation, a few make
> suggestions for moving some code around, and the rest are the normal
> stylistic gotchas.
>
> Out of curiosity, did you discuss this change in detail with anyone before
> landing?
It's been so long since I started it I honestly can't remember. I have vague
recollections of mentioning failure counts to various people.
> > === modified file 'lib/lp/
> > --- lib/lp/
> > +++ lib/lp/
> > @@ -50,6 +50,9 @@
> >
> > <require
> >
> > permission=
> > set_attributes=
> >
> > + <require
> > + permission=
> > + set_attributes=
> >
> > </class>
> > <securedutility
> >
> > component=
>
> I guess this means that whatever increments this runs with admin-level
> privileges. Is that really such a good idea?
For now yes, because it's only ever changed in "zopeless" scripts. That is,
Zope requres this to work, but doesn't actually apply any level of security to
it. :/
I made it admin to discourage changing it in appservers. There are other
script-
the zcml about exposing it externally! I prefer the extra protection up
front...
> > === modified file 'lib/lp/
> > --- lib/lp/
> > +++ lib/lp/
>
> ...
>
> > @@ -157,6 +158,52 @@
> >
> > if job is not None:
> > job.reset()
> >
> > + def _getBuilder(self):
> > + # Helper to return the builder given the slave for this request.
> > + # Avoiding circular imports.
> > + from lp.buildmaster.
> > + return getUtility(
> > +
>
> I notice that there's very similar code below.
>
> I'd suggest making a top-level function like this:
>
> def get_builder(
> # Helper to return the builder given the slave for this request.
> # Avoiding circular imports.
> from lp.buildmaster.
> return getUtility(
>
> And using that here instead, as well as in SlaveScanner.
Meh, I meant to do that and forgot, thanks.
>
> > + def assessFailureCo
> > + """View builder/job failure_count and work out which needs to
> > die. +
>
> There's spurious whitespace on this line.
Grar, my vim setup has gone weird on me, it always used to tell me about
trailing spaces.
>
> > + :return: True if we disabled something, False if we did not.
>
> This seems like an odd thing to return. It doesn't seem to be used in the
> code. W...
| Stuart Bishop (stub) wrote : | # |
Allocated new DB patch number patch-2208-04-0.sql
| Jonathan Lange (jml) wrote : | # |
On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
<email address hidden> wrote:
> On Thursday 19 August 2010 13:37:40 you wrote:
>> Review: Needs Fixing
>> Hey Julian,
>>
>> This looks like a big step toward build farm reliability -- thank you.
>>
Thanks for the changes. There are a few questions below, but I reckon
this'll be the last round.
>> Some of my comments are asking for further explanation, a few make
>> suggestions for moving some code around, and the rest are the normal
>> stylistic gotchas.
>>
>> Out of curiosity, did you discuss this change in detail with anyone before
>> landing?
>
> It's been so long since I started it I honestly can't remember. I have vague
> recollections of mentioning failure counts to various people.
>
No worries. I ask because we have failure counting mechanisms in other
places in Launchpad (e.g. the branch puller) and it might have been
interesting to share patterns, variable names and the like.
>> > === modified file 'lib/lp/
>> > --- lib/lp/
>> > +++ lib/lp/
>> > @@ -50,6 +50,9 @@
>> >
>> > <require
>> >
>> > permission=
>> > set_attributes
>> >
>> > + <require
>> > + permission=
>> > + set_attributes
>> >
>> > </class>
>> > <securedutility
>> >
>> > component=
>>
>> I guess this means that whatever increments this runs with admin-level
>> privileges. Is that really such a good idea?
>
> For now yes, because it's only ever changed in "zopeless" scripts. That is,
> Zope requres this to work, but doesn't actually apply any level of security to
> it. :/
>
> I made it admin to discourage changing it in appservers. There are other
> script-
> the zcml about exposing it externally! I prefer the extra protection up
> front...
>
Makes sense. I mean, a better permission would be some kind of
"Forbidden" permission, I guess.
>> > === modified file 'lib/lp/
>> > --- lib/lp/
>> > +++ lib/lp/
...
>> > + :return: True if we disabled something, False if we did not.
>>
>> This seems like an odd thing to return. It doesn't seem to be used in the
>> code. What's it for?
>
> For the tests.
>
Yeah, but what about the tests? Or rather, why do the tests care about
something that the code doesn't?
>> > + # Decide if we need to terminate the job or fail the
>> > + # builder.
>> > + self._
>> > + self.logger.info(
>> > + "builder failure count: %s, job failure count: %s" % (
>> > + builder.
>> > +
>> > builder.
>> > BaseDispatchRes
>>
>> This line makes me wonder why ass...
| Julian Edwards (julian-edwards) wrote : | # |
On Friday 20 August 2010 11:43:40 you wrote:
> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>
> <email address hidden> wrote:
> > On Thursday 19 August 2010 13:37:40 you wrote:
> >> Review: Needs Fixing
> >> Hey Julian,
> >>
> >> This looks like a big step toward build farm reliability -- thank you.
>
> Thanks for the changes. There are a few questions below, but I reckon
> this'll be the last round.
I'm not so sure :)
>
> >> Some of my comments are asking for further explanation, a few make
> >> suggestions for moving some code around, and the rest are the normal
> >> stylistic gotchas.
> >>
> >> Out of curiosity, did you discuss this change in detail with anyone
> >> before landing?
> >
> > It's been so long since I started it I honestly can't remember. I have
> > vague recollections of mentioning failure counts to various people.
>
> No worries. I ask because we have failure counting mechanisms in other
> places in Launchpad (e.g. the branch puller) and it might have been
> interesting to share patterns, variable names and the like.
I had no idea about that, that's a shame :(
> >> > === modified file 'lib/lp/
> >> > --- lib/lp/
> >> > +++ lib/lp/
>
> ...
>
> >> > + :return: True if we disabled something, False if we did not.
> >>
> >> This seems like an odd thing to return. It doesn't seem to be used in
> >> the code. What's it for?
> >
> > For the tests.
>
> Yeah, but what about the tests? Or rather, why do the tests care about
> something that the code doesn't?
Well lots of our tests return stuff that the code doesn't need, like Deferreds
for example. But fine, I've got rid of it.
> >> > + # Decide if we need to terminate the job or fail the
> >> > + # builder.
> >> > + self._increment
> >> > + self.logger.info(
> >> > + "builder failure count: %s, job failure count: %s" %
> >> > ( + builder.
> >> > +
> >> > builder.
> >> > BaseDispatchRes
> >>
> >> This line makes me wonder why assessFailureCounts is on
> >> BaseDispatchResult at all. Perhaps it should be a method on Builder
> >> that takes an optional 'info' parameter?
> >>
> >> The above code would then read:
> >>
> >> builder.
> >> self.logger.
> >> builder.
> >
> > I don't think we should continue to bloat model classes with code that is
> > only used in one place. This is a buildd-
> >
> > Anyway, bugger, I had intended to factor it out to a standalone function
> > as I also realised that that line is ridiculous and again forgot :/
>
> A standalone function would be a definite improvement, and fine by me.
> I personally don't think it's bloat to say "a builder knows how to
> handle builder failures", but I'm not going to push it.
By the same argument, anything that deals with object X should be a method on
that object X.
The difference he...
| Jonathan Lange (jml) wrote : | # |
On Fri, Aug 20, 2010 at 1:16 PM, Julian Edwards
<email address hidden> wrote:
> On Friday 20 August 2010 11:43:40 you wrote:
>> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>> <email address hidden> wrote:
>> > On Thursday 19 August 2010 13:37:40 you wrote:
...
>> >> > === modified file 'lib/lp/
>> >> > --- lib/lp/
>> >> > +++ lib/lp/
>>
>> ...
>>
>> >> > + :return: True if we disabled something, False if we did not.
>> >>
>> >> This seems like an odd thing to return. It doesn't seem to be used in
>> >> the code. What's it for?
>> >
>> > For the tests.
>>
>> Yeah, but what about the tests? Or rather, why do the tests care about
>> something that the code doesn't?
>
> Well lots of our tests return stuff that the code doesn't need, like Deferreds
> for example. But fine, I've got rid of it.
>
Thanks.
>> >> > + # Decide if we need to terminate the job or fail the
>> >> > + # builder.
>> >> > + self._
>> >> > + self.logger.info(
>> >> > + "builder failure count: %s, job failure count: %s" %
>> >> > ( + builder.
>> >> > +
>> >> > builder.
>> >> > BaseDispatchRes
>> >>
>> >> This line makes me wonder why assessFailureCounts is on
>> >> BaseDispatchResult at all. Perhaps it should be a method on Builder
>> >> that takes an optional 'info' parameter?
>> >>
>> >> The above code would then read:
>> >>
>> >> builder.
>> >> self.logger.
>> >> builder.
>> >
>> > I don't think we should continue to bloat model classes with code that is
>> > only used in one place. This is a buildd-
>> >
>> > Anyway, bugger, I had intended to factor it out to a standalone function
>> > as I also realised that that line is ridiculous and again forgot :/
>>
>> A standalone function would be a definite improvement, and fine by me.
>> I personally don't think it's bloat to say "a builder knows how to
>> handle builder failures", but I'm not going to push it.
>
> By the same argument, anything that deals with object X should be a method on
> that object X.
>
I would argue that most things that manipulate state on object X
should be methods of object X. I would not say it's a universal rule.
> The difference here is that it's not dealing with builder failures
> exclusively, it's dealing with a combination of builder and job failures, and
> I would personally find it odd that a builder was trying to modify job
> properties.
>
> Generally, I think the boundaries of responsibility in our code are poor and
> should be less like this.
>
> Did I convince you yet? Probably not :)
>
I think you have, in this case, particularly given the concessions
below to have methods on Job (however it's spelled) and Builder that
do the failure incrementing. That seems like a clean division of
responsibility.
>>
>> >> > + builder.
>> >> ...
| Julian Edwards (julian-edwards) wrote : | # |
On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
> I would argue that most things that manipulate state on object X
> should be methods of object X. I would not say it's a universal rule.
If we boil it down to the basic manipulation required, I completely agree.
> I think you have, in this case, particularly given the concessions
> below to have methods on Job (however it's spelled) and Builder that
> do the failure incrementing. That seems like a clean division of
> responsibility.
Woot! :)
> > I'd prefer to have a gotFailure() on each of Builder and BuildFarmJob and
> > have the build manager call each in turn.
> >
> > Let me know if you think that is acceptable and I'll do that change (it's
> > not in the attached diff).
>
> I think that's a great idea.
Ok it's done.
> Re robustness, I think you lose one kind of robustness and gain another.
I guess there are different ways of looking at it...
> "Was this method called?" tests are robust against behaviour change.
> If you change what assessFailureCounts is supposed to do, then you
> won't have to change these tests, only the assessFailureCounts ones.
> (Some people call this "Mockist")
>
> Tests that check state are robust against implementation change. If
> you change the details of how you've implemented the code (as you've
> done during this review process), you won't have to change any tests.
> (Some people call this "Classic")
>
> As a rule, I lean heavily toward the latter, since I change
> implementation details much more frequently than behaviour. I use
> mocks in situations where there's a very clearly defined, fairly
> stable interface and when using some other kind of double would be
> prohibitively difficult.
>
> See http://
> discussion on the whole topic.
>
> As far as this diff goes, I think using a mock is less preferable but
> still sound, and I'm happy to see it merged.
I think I'm in total agreement with the sentiment above. In my case here I am
trying to abstract away from the "how failures are counted" as much as
possible since I forsee some changes in how we make it work. I suspect that
this branch will not be the final word on the matter :)
> It seems to me that Builder.currentjob ought to be replaced with a
> method asap, since it's doing a query, and naive readers might well
> miss that fact.
Agreed, I've filed a bug though as after grepping around, it will be a big
diff and I don't want to pollute this branch.
> If it's doing dupe queries and you aren't going to change currentjob
> to a method in this branch, then please add a comment saying something
> like:
>
> # builder.currentjob runs a query. Don't run it twice.
Done!
See partial diff.
Thanks for sticking with me :)
| Jonathan Lange (jml) wrote : | # |
On Tue, Aug 24, 2010 at 11:34 AM, Julian Edwards
<email address hidden> wrote:
> On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
>> I would argue that most things that manipulate state on object X
>> should be methods of object X. I would not say it's a universal rule.
>
> If we boil it down to the basic manipulation required, I completely agree.
>
>> I think you have, in this case, particularly given the concessions
>> below to have methods on Job (however it's spelled) and Builder that
>> do the failure incrementing. That seems like a clean division of
>> responsibility.
>
> Woot! :)
...
> See partial diff.
>
>
Rockin. I'd spell resetFailureCount() as gotSuccessfulDi
ok to land as is.
jml

Fine. patch-2207- 78-0.sql.