Merge lp:~mars/launchpad/fix-test_on_merge-578886 into lp:launchpad
| Status: | Merged |
|---|---|
| Approved by: | Māris Fogels on 2010-08-13 |
| Approved revision: | no longer in the source branch. |
| Merged at revision: | 11372 |
| Proposed branch: | lp:~mars/launchpad/fix-test_on_merge-578886 |
| Merge into: | lp:launchpad |
| Diff against target: |
254 lines (+108/-73) 2 files modified
buildout-templates/bin/test.in (+0/-9) test_on_merge.py (+108/-64) |
| To merge this branch: | bzr merge lp:~mars/launchpad/fix-test_on_merge-578886 |
| Related bugs: |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Michael Hudson-Doyle | 2010-07-28 | Approve on 2010-08-04 | |
| Robert Collins | 2010-07-28 | Pending | |
| Māris Fogels | Pending | ||
|
Review via email:
|
|||
This proposal supersedes a proposal from 2010-05-25.
Commit Message
Clean up the test_on_merge.py code to make it handle suite errors correctly. Removed the tabnanny checker.
Description of the Change
Hello,
This branch contains simplified code for playing process group shenanigans, and it simplifies the nice_killpg() function. After discussion with Robert, this is a best effort approach.
I am particularly keen to see that the cleanup_
I tested this code as best I could with harnesses that reproduce test suite hangs. However, I think the process group shenanigans alone should be enough to keep the watchdog from dying, and therefore to keep the suite sane.
I have included the original proposal text below. It still holds true for this patch.
Maris
=====
Hi,
This branch fixes test_on_merge.py so that it now terminates the test suite properly after 10 minutes. This means EC2 instances will no longer hang when they kill the test suite.
This branch changes how the testrunner watchdog handles the process groups of itself and its children. The diff explains why the process group fiddling takes place. In order for the testrunner watchdog to control the process groups I had to remove process group handling from bin/test.
I changed the code a fair bit by introducing many comments and some new functions to encapsulate the existing code. I found these changes necessary to understand what the code was doing, as I only had basic knowledge of Unix process handling before this change. This code should now be easier for others to understand.
I dropped the testrunner timeout to 10 minutes. No test should run that long without output.
I cleaned up the print statements a bit, personal preference.
I added some detailed output to the testrunner kill procedure. This is failing loudly, with the hope that the script's behaviour will be less opaque when things go wrong in the code that handles things going wrong.
This branch was tested with a hand-crafted windmill testrunner harness. The harness forces an hour-long sleep() in one of the windmill tests. I used this harness to test both a local test suite timeout and an ec2 test suite timeout, and to ensure that they can kill off an entire process group.
I will complete a full "make check" run and a full "ec2 test" run with this branch before landing.
Pre-implementation call with: no one
Test command: see above
Maris
| Māris Fogels (mars) wrote : | # |
As discussed on IRC, I added in a call to os.fork() that will get us a fresh PID to use as the process group leader. This enables the group-swapping trick to work when the script is run directly.
To make this new bit of complexity clearer I ended up splitting the major actions of main() out into their own functions. This revealed to yet more fixes that were hidden beneath the main() beast's bulk:
* main()'s docstring was dead wrong. I fixed it.
* The 'here' variable was being reused and redefined throughout the function. I made it global.
* The tabnanny code is just plain wrong. It catches tab problems, but misses the rest of the errors that the script may generate. I marked this with an XXX.
The code also makes heavy use of function return codes, which are ugly but effective. I left them.
Everything is being re-tested before submission.
| Michael Hudson-Doyle (mwhudson) wrote : | # |
134 + pid = os.fork()
135 + if pid != 0:
136 + # We are the parent process, so we'll wait for our child process to
137 + # do the heavy lifting for us.
138 + pid, status = os.wait()
Why call wait() and not waitpid() here? Alternatively, is it worth checking the pids match? Clearly they _should_ match, but if they didn't it could be very confusing :-)
I like the new approach overall btw.
| Māris Fogels (mars) wrote : | # |
I have had to re-examine a number of assumptions about this code, such as the need to aggressively kill the entire process tree, and the use of os.fork() instead of using Popen's preexec_fn argument.
I'm withdrawing this proposal until the code can be revisited.
| Michael Hudson-Doyle (mwhudson) wrote : | # |
I think this looks fine. I assume you've tested it -- I haven't, this time around.

Gary asked me to have a look at this.
I think it broadly looks reasonable -- certainly it terms of readability it's way better than what went before -- and it's great that you've tested that it works. The proof is in the pudding here!
Just to be clear, the reason this stopped working is because there's now a "xvfb-run" process 'between' the test_on_merge.py script and the bin/test script?
I have a little apprehension about simply dying if we're the process leader. For a start, it means you can't just run "./test_ on_merge. py" from the shell! It also might not work on buildbot -- I think it actually will, unless we change our configuration to specify usePTY=True, but that seems rather fragile. Did you consider re-execing the process if we find we're the process group leader? I've pushed an implementation of this idea to lp:~mwhudson/launchpad/fix-test_on_merge-578886 and it seems to work.
If you merge my change in or do something similar, I think this is good to land.