Merge lp:~jml/launchpad/test-timeout-505913 into lp:launchpad
Status: | Merged | ||||
---|---|---|---|---|---|
Approved by: | Aaron Bentley | ||||
Approved revision: | no longer in the source branch. | ||||
Merged at revision: | 12660 | ||||
Proposed branch: | lp:~jml/launchpad/test-timeout-505913 | ||||
Merge into: | lp:launchpad | ||||
Diff against target: |
373 lines (+143/-62) 4 files modified
lib/lp/services/job/runner.py (+45/-11) lib/lp/services/job/tests/test_job.py (+4/-9) lib/lp/services/job/tests/test_runner.py (+93/-40) lib/lp/testing/__init__.py (+1/-2) |
||||
To merge this branch: | bzr merge lp:~jml/launchpad/test-timeout-505913 | ||||
Related bugs: |
|
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Aaron Bentley (community) | Approve | ||
j.c.sackett (community) | Approve | ||
Review via email: mp+54598@code.launchpad.net |
Commit message
[r=abentley,
Description of the change
This branch, I believe, fixes one of the intermittently failing tests. The linked bug has most of the relevant details. Essentially, the timeout was happening at random times. When it happened while we were actually running the job, everything was good. When it happened during some of the surrounding code (e.g. while the child process was unmarshalling input to turn it into a potential request to do a job), then it was ignored.
This is because the timeout mechanism was to raise an exception. In Twisted, when a SIGHUP signal handler raises an exception, it just raises in the code that's currently running. If we're in the reactor loop at the time, then it is logged as an unhandled error.
Instead of raising an exception, this branch just exits the worker process with a particular exit code (TwistedJobRunn
The testing strategy for this was to duplicate the test_timeout test that was failing intermittently. One of them would have the timeout bumped up a little (from 1 to 5) to make it more likely to exercise the case where the timeout happens during an actual job and not in surrounding machinery. The other has a much, much shorter timeout (0.05), which is guaranteed to hit the machinery.
In the branch, you'll see lots of other comments and cleanups. It's been a meandering experience, and I've tried to leave what I've learned marked on the walls of the labyrinth.
This looks good. I'm curious about one piece of code though:
+ def _logTimeout(self, job): (oops.id)
+ try:
+ raise TimeoutError
+ except TimeoutError:
+ oops = self._doOops(job, sys.exc_info())
+ self._logOopsId
Given all that's happening is an exception being raised in the try block, which immediately goes to the exception, couldn't this all be replaced with the following?
+ def _logTimeout(self, job): (oops.id)
+ oops = self._doOops(job, sys.exc_info())
+ self._logOopsId
Or is there some sort of async thing here I'm missing?
The code as written certainly works, so this isn't a blocker to approval; I'm just assuming I'm missing something, and want to know what.