Merge lp:~mars/launchpad/fix-test_on_merge-578886 into lp:launchpad
| Status: | Superseded |
|---|---|
| Proposed branch: | lp:~mars/launchpad/fix-test_on_merge-578886 |
| Merge into: | lp:launchpad |
| Diff against target: |
321 lines (+175/-64) 2 files modified
buildout-templates/bin/test.in (+0/-9) test_on_merge.py (+175/-55) |
| To merge this branch: | bzr merge lp:~mars/launchpad/fix-test_on_merge-578886 |
| Related bugs: |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Māris Fogels (community) | Resubmit on 2010-06-20 | ||
| Michael Hudson-Doyle | 2010-05-25 | Approve on 2010-05-26 | |
|
Review via email:
|
|||
This proposal has been superseded by a proposal from 2010-07-28.
Description of the Change
Hi,
This branch fixes test_on_merge.py so that it now terminates the test suite properly after 10 minutes. This means EC2 instances will no longer hang when they kill the test suite.
This branch changes how the testrunner watchdog handles the process groups of itself and its children. The diff explains why the process group fiddling takes place. In order for the testrunner watchdog to control the process groups I had to remove process group handling from bin/test.
I changed the code a fair bit by introducing many comments and some new functions to encapsulate the existing code. I found these changes necessary to understand what the code was doing, as I only had basic knowledge of Unix process handling before this change. This code should now be easier for others to understand.
I dropped the testrunner timeout to 10 minutes. No test should run that long without output.
I cleaned up the print statements a bit, personal preference.
I added some detailed output to the testrunner kill procedure. This is failing loudly, with the hope that the script's behaviour will be less opaque when things go wrong in the code that handles things going wrong.
This branch was tested with a hand-crafted windmill testrunner harness. The harness forces an hour-long sleep() in one of the windmill tests. I used this harness to test both a local test suite timeout and an ec2 test suite timeout, and to ensure that they can kill off an entire process group.
I will complete a full "make check" run and a full "ec2 test" run with this branch before landing.
Pre-implementation call with: no one
Test command: see above
Maris
| Māris Fogels (mars) wrote : | # |
As discussed on IRC, I added in a call to os.fork() that will get us a fresh PID to use as the process group leader. This enables the group-swapping trick to work when the script is run directly.
To make this new bit of complexity clearer I ended up splitting the major actions of main() out into their own functions. This revealed to yet more fixes that were hidden beneath the main() beast's bulk:
* main()'s docstring was dead wrong. I fixed it.
* The 'here' variable was being reused and redefined throughout the function. I made it global.
* The tabnanny code is just plain wrong. It catches tab problems, but misses the rest of the errors that the script may generate. I marked this with an XXX.
The code also makes heavy use of function return codes, which are ugly but effective. I left them.
Everything is being re-tested before submission.
| Michael Hudson-Doyle (mwhudson) wrote : | # |
134 + pid = os.fork()
135 + if pid != 0:
136 + # We are the parent process, so we'll wait for our child process to
137 + # do the heavy lifting for us.
138 + pid, status = os.wait()
Why call wait() and not waitpid() here? Alternatively, is it worth checking the pids match? Clearly they _should_ match, but if they didn't it could be very confusing :-)
I like the new approach overall btw.
| Māris Fogels (mars) wrote : | # |
I have had to re-examine a number of assumptions about this code, such as the need to aggressively kill the entire process tree, and the use of os.fork() instead of using Popen's preexec_fn argument.
I'm withdrawing this proposal until the code can be revisited.

Gary asked me to have a look at this.
I think it broadly looks reasonable -- certainly it terms of readability it's way better than what went before -- and it's great that you've tested that it works. The proof is in the pudding here!
Just to be clear, the reason this stopped working is because there's now a "xvfb-run" process 'between' the test_on_merge.py script and the bin/test script?
I have a little apprehension about simply dying if we're the process leader. For a start, it means you can't just run "./test_ on_merge. py" from the shell! It also might not work on buildbot -- I think it actually will, unless we change our configuration to specify usePTY=True, but that seems rather fragile. Did you consider re-execing the process if we find we're the process group leader? I've pushed an implementation of this idea to lp:~mwhudson/launchpad/fix-test_on_merge-578886 and it seems to work.
If you merge my change in or do something similar, I think this is good to land.