Code review comment for lp:~abentley/launchpad/builder-limits

Данило Шеган (danilo) wrote:

For reference: it'd be good to keep an eye on https://lpstats.canonical.com/graphs/CodeRecipeBuildsDailyStatusCounts after deployment.

<danilos> abentley, isn't addressable memory used for other things than just actual memory? for instance, mmap files can take up a lot of AS
 abentley, (I don't know much about RLIMIT_AS, so I am just wondering)
<abentley> danilos: that is interesting, but we don't generally mmap things in bzr, and 1 GB is still huge. Our example https://code.dogfood.launchpad.net/~abentley/+recipe/test/+build/4803 took 1 hour 28 minutes to fail, so I think there is lots of breathing room.
<danilos> abentley, sure, I can see this is only a limit for recipe builders, but I wonder what'd happen if somebody tried to do things like language pack builds, where the source package itself is a few hundred megs (probably just like qtwebkit)
<abentley> danilos: Also, our python doesn't provide RLIMIT_VMEM, so we don't have a lot of choice.
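(Editor's note on the mechanism under discussion: RLIMIT_AS caps a process's total address space, which, as danilos points out, includes mmap'd regions, not just resident memory. The sketch below shows how such a cap could be applied to a build child process using Python's standard `resource` module; the helper name and the way it's wired into `subprocess` are illustrative, not Launchpad's actual builder code.)

```python
import resource
import subprocess

ONE_GIB = 1024 * 1024 * 1024  # the 1 GB cut-off discussed above

def cap_address_space(limit=ONE_GIB):
    """preexec_fn: cap the child's total address space via RLIMIT_AS.

    RLIMIT_AS counts all mappings, including mmap'd files, so a build
    that maps large files can hit the cap without using that much
    physical RAM. Allocations past the limit fail (malloc returns
    NULL / Python raises MemoryError), so a runaway build dies early
    instead of taking the builder down with it.
    """
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

# Hypothetical usage: run a build step in a child capped at 1 GiB.
proc = subprocess.Popen(["true"], preexec_fn=cap_address_space)
proc.wait()
```

Note that the limit is set in the child after fork (via `preexec_fn`), so the parent buildd-manager process itself is unaffected.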
<danilos> abentley, (though, this is unrelated to my first comment: I guess we want to fail early, so perhaps it's good anyway)
 abentley, ok, it sounds good, but I wonder if we have a way to find out if we have been too aggressive
 abentley, would we just track 'too many builds are getting killed because of this' or should we have something more specific in place?
<abentley> danilos: We have lotsa logs.
<danilos> abentley, I know, but I am sure we don't have a way to track these easily, and that's one thing I suggest: i.e. figure out a way to track these, especially right after it's rolled out
 abentley, unless you count something like "grep SIGKILL buildd-manager.log" as "easy" :)
<abentley> danilos: I view this as a necessary evil. The current behaviour is catastrophic: https://wiki.canonical.com/IncidentReports/2010-11-17-buildd-manager-disabling-builders
<danilos> abentley, yes, I agree, I am just thinking a bit more forward into "what if we killed too many builds that would have succeeded"
 abentley, basically, I'm giving you my r=danilo, as long as we have some strategy in place to figure out that we were not too aggressive
 abentley, i.e. something that will tell us later that 1GB was the right cut-off point (I trust your judgement in choosing it, it's just that it'd be nice to have a way to confirm it as a good choice later, when we can't do it now)
<abentley> danilos: what would you consider an adequate strategy?
<danilos> abentley, I don't know, a graph tracking the number of builds failed because of this particular reason, for instance, and a promise to look at it in, say, a week's or two weeks' time
<abentley> danilos: I don't know how to generate a graph of that.
 danilos: You'd have to scrape the builder logs.
<danilos> abentley, right, so is there a way to have this fail in a more specific way?
<abentley> danilos: It's conceivable that there might be.
<danilos> abentley, or, alternatively, at least a promise to do a one-time scraping of the logs so we know we haven't cocked up in a significant way (if it's too serious we'll know it anyway, but what if we kill something like 15% of the builds that would have worked in the past - how will we know?)
<danilos> abentley, it doesn't have to be too formal, except that imo, it needs to happen
<abentley> danilos: Well, a jump in "failed to build" here is a good hint: https://lpstats.canonical.com/graphs/CodeRecipeBuildsDailyStatusCounts/
<danilos> abentley, ok, that's probably good enough if it's regularly tracked, even though it has quite some variation
<abentley> danilos: We can certainly keep our eyes on it after the deployment.
<danilos> abentley, that'd be great then, thanks, r=danilo
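(Editor's note: the one-time log scrape danilos asks for — essentially `grep SIGKILL buildd-manager.log`, bucketed by day — could look like the sketch below. The log filename and line format are assumptions for illustration; the real buildd-manager.log format may differ.)

```python
import re
from collections import Counter

# Assumed log line shape: "YYYY-MM-DD HH:MM:SS ... SIGKILL ..."
# (the actual buildd-manager.log format may differ).
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}).*SIGKILL")

def sigkill_counts(lines):
    """Count SIGKILL'd builds per day from an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = DATE_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Illustrative sample; in practice: sigkill_counts(open("buildd-manager.log"))
sample = [
    "2010-11-17 03:12:45 build 4803 terminated by SIGKILL (RLIMIT_AS)",
    "2010-11-17 09:01:02 build 4810 terminated by SIGKILL (RLIMIT_AS)",
    "2010-11-18 11:30:00 build 4821 completed OK",
]
print(sigkill_counts(sample))  # Counter({'2010-11-17': 2})
```

A sudden spike in this per-day count after rollout, compared against the "failed to build" line on the CodeRecipeBuildsDailyStatusCounts graph, would indicate the 1 GB cut-off is killing builds that used to succeed.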

review: Approve
