Merge lp:~mbp/launchpad/690021-rlimit into lp:launchpad
| Status: | Merged |
|---|---|
| Approved by: | Robert Collins on 2011-04-06 |
| Approved revision: | no longer in the source branch |
| Merged at revision: | 12991 |
| Proposed branch: | lp:~mbp/launchpad/690021-rlimit |
| Merge into: | lp:launchpad |
| Diff against target: | 21 lines (+5/-0), 1 file modified: cronscripts/scan_branches.py (+5/-0) |
| To merge this branch: | bzr merge lp:~mbp/launchpad/690021-rlimit |
| Related bugs: | |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Robert Collins (community) | | | Approve on 2011-04-06 |
| Martin Pool (community) | | | Disapprove on 2010-12-22 |
| Tim Penhey (community) | | 2010-12-16 | Needs Information on 2010-12-20 |
| Michael Hudson-Doyle | | 2010-12-15 | Approve on 2010-12-15 |
Review via email:
Commit Message
[r=lifeless,
Description of the Change
This change limits the address space of scan_branches.py to 2GB. If it tries to allocate more, it will crash. See <https:/
After some discussion with spm and mwh on irc this seemed like a good idea. It is consistent with what we do in other scripts.
2GB is a compromise: I hope it is small enough to stop the machine bogging down, but not so small that scans which would otherwise succeed now fail.
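For reference, a limit like this is normally set with resource.setrlimit; the following is only a minimal sketch of that pattern, not the actual five-line diff in this branch:

```python
# Sketch only: cap the process address space, as other Launchpad cron
# scripts do. The constant mirrors the 2GB figure discussed above.
import resource

TWO_GB = 2 * 1024 * 1024 * 1024

# RLIMIT_AS limits total address space; an allocation beyond the cap fails,
# which Python normally surfaces as a MemoryError rather than a segfault.
resource.setrlimit(resource.RLIMIT_AS, (TWO_GB, TWO_GB))
```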
It would be nice if the limit was not hard coded, but this seems like the most pragmatic thing for now. I added a note on https:/
Obviously this will not actually fix whatever the bad scaling behaviour is in bug 690021; we should probably split that to a separate bug.
I would welcome opinions on how to test this. Do these jobs run on staging or qastaging? Perhaps we should just let it get to production and then watch for it to fail cleanly.
| Martin Pool (mbp) wrote : | # |
I get an error message:
PQMException: 'Failed to verify signature: gpgv exited with error code 2'
Could somebody else please sponsor landing this for me?
| Aaron Bentley (abentley) wrote : | # |
What's the impact of this on the job being run? Does it get left in a running state? Does it get re-run on the next script invocation?
| Martin Pool (mbp) wrote : | # |
On 16 December 2010 02:37, Aaron Bentley <email address hidden> wrote:
> What's the impact of this on the job being run? Does it get left in a running state? Does it get re-run on the next script invocation?
Great question. I had assumed that jobs that fail would get
restarted, but I don't know that to be true, and it may well not be.
If they don't, presumably that problem is happening anyhow now when
spm kills them? But automatic killing may make this more common.
--
Martin
| Martin Pool (mbp) wrote : | # |
One data point here is that spm manually killed the scanner the other day (see bug 690021), and to judge from the logs shown on <https:/
To judge from the code and from <https:/
Ideally there would be some infrastructure that made sure that jobs that were repeatedly attempted and never completed got marked as stuck/abandoned. Apparently that does happen with imports. I don't know if branch scanners will get the same behaviour; spm says that Tim has fixed things previously such that they will.
I think I'll wait for a review from Tim next week.
| Tim Penhey (thumper) wrote : | # |
If we moved the rlimit setting to the RunScanBranches class constructor (or somewhere), we could tweak the config value and actually have tests that confirm the behaviour of jobs that fail.
Care to add a test?
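Something along these lines might look roughly like the sketch below; the constructor shape and the config key name are invented for illustration, not the real Launchpad config schema:

```python
# Hypothetical sketch: take the cap from config so tests can lower it and
# observe how a job behaves when it runs out of memory.
import resource

class RunScanBranches:
    def __init__(self, config):
        # config.branchscanner.memory_limit is an assumed key (in bytes).
        limit = getattr(config.branchscanner, "memory_limit", None)
        if limit:
            resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```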
| Martin Pool (mbp) wrote : | # |
On 20 December 2010 12:16, Tim Penhey <email address hidden> wrote:
> Review: Needs Information
> If we moved the rlimit setting to the RunScanBranches class constructor (or somewhere), we could tweak the config value and actually have tests that confirm the behaviour of jobs that fail.
We could move it there; putting it at a lower level would perhaps make
it easier to call from a test suite.
On the other hand, other code that sets rlimits does it from the cron
script; and I don't have a specific test in mind to run. In some ways
this should actually be at a higher level, outside of the cron script,
within whatever runs it.
> Care to add a test?
Sure. What do you think we should test? That the rlimit gets set?
I think we could do some general work across all scripts so that
* jobs cope well with being killed off
* limits can be controlled by losas
* limits are set in generic job framework
but I don't want to get into that now. This was more of a drive-by
fix to get it to the same level as other jobs.
--
Martin
| Martin Pool (mbp) wrote : | # |
Robert suggests, on irc, monkey-patching out setrlimit to verify that it's called. That's one of those tests I find pretty unsatisfying: the risk here is not that straight-line code will fail to call it, but rather that we're misunderstanding the effects of calling it.
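For what it's worth, the kind of test being described would look something like this (using the mock library; the set_memory_limit helper is hypothetical, not the code in this branch):

```python
# Illustrative only: this checks that setrlimit is called, not that the
# limit actually behaves the way we expect.
import resource
import unittest
from unittest import mock  # the external 'mock' package on older Pythons

TWO_GB = 2 * 1024 * 1024 * 1024

def set_memory_limit(limit=TWO_GB):
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

class TestSetMemoryLimit(unittest.TestCase):
    def test_setrlimit_called(self):
        with mock.patch("resource.setrlimit") as setrlimit:
            set_memory_limit()
        setrlimit.assert_called_once_with(
            resource.RLIMIT_AS, (TWO_GB, TWO_GB))
```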
| Tim Penhey (thumper) wrote : | # |
On Tue, 21 Dec 2010 21:58:52 you wrote:
> Robert suggests, on irc, monkey-patching out setrlimit to verify that it's
> called. That's one of those tests I find pretty unsatisfying: the risk
> here is not that straight-line code will fail to call it, but rather that
> we're misunderstanding the effects of calling it.
My fear is that by adding the rlimit, we aren't solving the problem at all as
the branch will continue to fail every minute and not get marked as failing.
Until we are sure that this is not the case, I'm hesitant to land this.
| Martin Pool (mbp) wrote : | # |
> On Tue, 21 Dec 2010 21:58:52 you wrote:
> > Robert suggests, on irc, monkey-patching out setrlimit to verify that it's
> > called. That's one of those tests I find pretty unsatisfying: the risk
> > here is not that straight-line code will fail to call it, but rather that
> > we're misunderstanding the effects of calling it.
>
> My fear is that by adding the rlimit, we aren't solving the problem at all as
> the branch will continue to fail every minute and not get marked as failing.
> Until we are sure that this is not the case, I'm hesitant to land this.
I filed bug 693241 and bug 693243 for (some of) the larger work here. Given Tim's comments I'm just going to reject this.
| Robert Collins (lifeless) wrote : | # |
I'd like to see something like this; limits are good for preventing
runaway processes.
AIUI python is pretty good about converting malloc failures into
exceptions. Maybe not perfect, but certainly not terrible.
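To make that concrete (with a hypothetical job-runner shape, not the actual Launchpad job framework), hitting the cap typically raises MemoryError, which a runner can treat as an ordinary job failure:

```python
# Sketch only: run_job and mark_failed are hypothetical names.
def run_job(job):
    try:
        job.run()
    except MemoryError:
        # Exceeding RLIMIT_AS usually surfaces here rather than as a
        # segfault, so the job can be recorded as failed instead of
        # bogging down the machine.
        job.mark_failed("exceeded the address-space limit")
```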
As far as repeated failure detection goes, we can and should use
opaque job failures to halt things automatically, and set a trap that
it needs looking at.
| Robert Collins (lifeless) wrote : | # |
The ayes have it, with the caveat that we need to watch for potential fail-retry-forever loops (but I think we'll be fine, because we already have to deal with arbitrary errors).

I tested this locally, and it didn't explode:
mbp@lp-lucid% ./cronscripts/scan_branches.py
2010-12-15 07:53:37 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2010-12-15 07:53:43 INFO    Running synchronously.
./cronscripts/scan_branches.py  7.02s user 0.55s system 71% cpu 10.547 total
I think it's ok to merge. Apparently we now have pqm permission, but I have not yet tried it.