Merge lp:~abentley/launchpad/builder-limits into lp:launchpad

Proposed by Aaron Bentley
Status: Merged
Approved by: Данило Шеган
Approved revision: no longer in the source branch.
Merged at revision: 11943
Proposed branch: lp:~abentley/launchpad/builder-limits
Merge into: lp:launchpad
Diff against target: 19 lines (+2/-0)
1 file modified
lib/canonical/buildd/buildrecipe (+2/-0)
To merge this branch: bzr merge lp:~abentley/launchpad/builder-limits
Reviewer: Данило Шеган (community)
Status: Approve
Review via email: mp+41211@code.launchpad.net

Commit message

Memory-limit recipe builds.

Description of the change

= Summary =
Fix bug #676657: recipe builds can use too much memory

== Proposed fix ==
Restrict virtual memory use by a recipe build to 1 GB. This will allow some
swapping, but not excessive swapping.
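The mechanism is the standard POSIX resource-limit API: `setrlimit(RLIMIT_AS, ...)` caps a process's total address space, and the limit is inherited by child processes. A minimal sketch of the approach (the `run_limited` helper is illustrative only; the branch itself simply calls `setrlimit` once at the top of buildrecipe's `__main__`):

```python
import subprocess
import sys
from resource import RLIMIT_AS, setrlimit

# The soft limit used by the branch; -1 (RLIM_INFINITY) leaves the
# hard limit unlimited, so the cap could be raised again if needed.
ONE_GB = 1000000000


def run_limited(argv):
    """Run argv with its virtual address space capped at 1 GB.

    preexec_fn runs in the child after fork() and before exec(), so
    only the build command and its descendants see the limit.
    """
    return subprocess.call(
        argv, preexec_fn=lambda: setrlimit(RLIMIT_AS, (ONE_GB, -1)))


if __name__ == '__main__':
    # A 2 GB allocation trips the cap: the child's malloc fails and it
    # dies with MemoryError (nonzero exit) instead of thrashing swap.
    print(run_limited(
        [sys.executable, "-c", "x = bytearray(2 * 1000 * 1000 * 1000)"]))
```

Once the soft limit is exceeded, allocations simply fail in the build process, so the build errors out quickly rather than dragging the whole builder into swap.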

== Pre-implementation notes ==
None

== Implementation details ==
There are really two problems:
1. Recipe builds use too much memory.
2. The build farm behaves badly when builds use too much memory.

Both issues should be addressed. This change addresses problem 2 by killing
builds that use excessive amounts of memory before they can cause real harm.
(The builders have only 1 GB of memory on average.)

== Tests ==
None

== Demo and Q/A ==
Create a recipe using qtwebkit.
See https://code.launchpad.net/~rohangarg/+recipe/qtwebkit

Request a build of the recipe. It should die with a memory error.

= Launchpad lint =

Checking for conflicts and issues in changed files.

Linting changed files:
  lib/canonical/buildd/buildrecipe

Revision history for this message
Данило Шеган (danilo) wrote :

For reference. It'd be good to keep an eye on https://lpstats.canonical.com/graphs/CodeRecipeBuildsDailyStatusCounts after deployment.

<danilos> abentley, isn't addressable memory used for other things than just actual memory? for instance, mmap files can take up a lot of AS
 abentley, (I don't know much about RLIMIT_AS, so I am just wondering)
<abentley> danilos: that is interesting, but we don't generally mmap things in bzr, and 1 GB is still huge. Our example https://code.dogfood.launchpad.net/~abentley/+recipe/test/+build/4803 took 1 hour 28 minutes to fail, so I think there is lots of breathing room.
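(danilos' question above can be checked empirically: RLIMIT_AS caps total address space, so a large mapping is refused at mmap time even though almost none of it would be resident. A small sketch, reusing the branch's 1 GB figure; the child process here is purely illustrative:)

```python
import subprocess
import sys

# Child: install the same 1 GB RLIMIT_AS cap as the branch, then try to
# create a 2 GB anonymous mapping.  The mmap(2) call itself fails with
# ENOMEM, because address space -- not resident memory -- is what
# RLIMIT_AS limits; Python surfaces that as OSError.
CHILD = """\
import mmap
from resource import RLIMIT_AS, setrlimit
setrlimit(RLIMIT_AS, (1000000000, -1))
try:
    mmap.mmap(-1, 2 * 1000 * 1000 * 1000)
    print("mapped")
except OSError:
    print("refused")
"""

result = subprocess.run([sys.executable, "-c", CHILD],
                        capture_output=True, text=True)
print(result.stdout.strip())  # prints "refused"
```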
<danilos> abentley, sure, I can see this is only a limit for recipe builders, but I wonder what'd happen if somebody tried to do things like language pack builds where source package itself is a few hundred megs (probably just like qtwebkit)
<abentley> danilos: Also, our python doesn't provide RLIMIT_VMEM, so we don't have a lot of choice.
<danilos> abentley, (though, this is unrelated to my first comment: I guess we want to fail early, so perhaps it's good anyway)
 abentley, ok, it sounds good, but I wonder if we have a way to find out if we have been too aggressive
 abentley, would we just track 'too many builds are getting killed because of this' or should we have something more specific in place?
<abentley> danilos: We have lotsa logs.
<danilos> abentley, I know, but I am sure we don't have a way to track these easily, and that's one thing I suggest: i.e. figure out a way to track these, especially right after it's rolled out
 abentley, unless you count something like "grep SIGKILL buildd-manager.log" as "easy" :)
<abentley> danilos: I view this as a necessary evil. The current behaviour is catastrophic: https://wiki.canonical.com/IncidentReports/2010-11-17-buildd-manager-disabling-builders
<danilos> abentley, yes, I agree, I am just thinking a bit more forward into "what if we killed too many builds that would have succeeded"
 abentley, basically, I'm giving you my r=danilo, as long as we have some strategy in place to figure out that we were not too aggressive
 abentley, i.e. something that will tell us later that 1GB was the right cut-off point (I trust your judgement in choosing it, it's just that it'd be nice to have a way to confirm it as a good choice later, when we can't do it now)
<abentley> danilos: what would you consider an adequate strategy?
<danilos> abentley, I don't know, a graph tracking number of builds failed because of this particular reason for instance, and a promise to look at it in say week's or two-weeks' time
<abentley> danilos: I don't know how to generate a graph of that.
 danilos: You'd have to scrape the builder logs.
<danilos> abentley, right, so is there a way to have this fail in a more specific way?
<abentley> danilos: It's conceivable that there might be.
<danilos> abentley, or, alternatively, at least a promise to do a one-time scraping of the logs so we know we haven't cocked up in significant way (if it's too serious we'll know it anyway, but what if we kill something like 15% of the builds that have worked in the past - how will we know?)
<danilos> abentley, it doesn't have to be too formal, ex...
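(The one-time scraping danilos asks for could be as simple as counting matching log lines after rollout. A hedged sketch: the pattern and the sample log format below are hypothetical, not the real buildd-manager log format, which would need to be inspected first:)

```python
import re

# Hypothetical pattern for memory-limit kills; the real buildd-manager
# log lines would need to be checked to pick the right one.
KILL_PATTERN = re.compile(r"SIGKILL|MemoryError")


def count_memory_kills(log_lines):
    """Count log lines that look like memory-limit kills."""
    return sum(1 for line in log_lines if KILL_PATTERN.search(line))


# Invented sample lines, for illustration only:
sample = [
    "2010-11-18 12:00:01 build 4803 terminated by SIGKILL",
    "2010-11-18 12:05:12 build 4804 completed OK",
    "2010-11-18 12:09:44 build 4805 died: MemoryError",
]
print(count_memory_kills(sample))  # prints 2
```

Comparing that count against the overall failure rate on the CodeRecipeBuildsDailyStatusCounts graph would show whether the 1 GB cut-off is killing builds that used to succeed.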


review: Approve

Preview Diff

=== modified file 'lib/canonical/buildd/buildrecipe'
--- lib/canonical/buildd/buildrecipe	2010-09-30 20:22:15 +0000
+++ lib/canonical/buildd/buildrecipe	2010-11-18 18:23:36 +0000
@@ -11,6 +11,7 @@
 import os
 import pwd
 import re
+from resource import RLIMIT_AS, setrlimit
 import socket
 from subprocess import call, Popen, PIPE
 import sys
@@ -206,6 +207,7 @@
 
 
 if __name__ == '__main__':
+    setrlimit(RLIMIT_AS, (1000000000, -1))
     builder = RecipeBuilder(*sys.argv[1:])
     if builder.buildTree() != 0:
         sys.exit(RETCODE_FAILURE_BUILD_TREE)