Merge lp:~roadmr/checkbox/fix-backend-protocol into lp:checkbox
| Status | Merged |
|---|---|
| Merged at revision | 954 |
| Proposed branch | lp:~roadmr/checkbox/fix-backend-protocol |
| Merge into | lp:checkbox |
| Diff against target | 146 lines (+58/-14), 3 files modified: backend (+3/-0), debian/changelog (+6/-1), plugins/backend_info.py (+49/-13) |
| To merge this branch | bzr merge lp:~roadmr/checkbox/fix-backend-protocol |
| Related bugs | |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Marc Tardif (community) | | | Approve |

Review via email: mp+68091@code.launchpad.net
Description of the change
So my previous "improvements" to the frontend/backend were not so good: reads from the backend timed out after 15 seconds. This meant the frontend wouldn't get stuck waiting for the backend to start, but also that if the backend took longer than 15 seconds to complete a job, the frontend would just leave it behind and go on its merry way. At the end of execution the frontend exits, but the backend keeps working on whatever it was doing, so we'd have "lingering backends" again.
This code fixes that. I built packages with these fixes rolled in, and tested them with a default.whitelist run on Oneiric.
I'd welcome input on the code itself, and particularly on the logging strings I used, especially the FAIL result and comment we return when there's no backend to run the jobs. Maybe this should be "unsupported" instead?
Even though this is all expressed in the code, here's a small explanation of what it does.
We try to spawn the backend. We fork it, then the frontend issues a "ping" over the FIFO. If it gets a "pong" reply within one minute, the backend has started, so we continue execution.
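To make the handshake concrete, here is a minimal sketch in Python. The FIFO paths, the `ping`/`pong` wire strings, and the function name are illustrative assumptions, not the actual checkbox protocol:

```python
import os
import select

# Hypothetical FIFO paths; the real ones are set up by the frontend.
REQUEST_FIFO = "/tmp/checkbox-request"
RESPONSE_FIFO = "/tmp/checkbox-response"

def ping_backend(timeout=60):
    """Send "ping" and return True if "pong" arrives within `timeout` seconds."""
    try:
        # O_NONBLOCK makes the open fail immediately (ENXIO) when no
        # backend has the FIFO open for reading, instead of blocking.
        fd = os.open(REQUEST_FIFO, os.O_WRONLY | os.O_NONBLOCK)
    except OSError:
        return False
    try:
        os.write(fd, b"ping\n")
    finally:
        os.close(fd)
    # Poll the response FIFO with a timeout so a silent backend
    # cannot hang the frontend.
    fd = os.open(RESPONSE_FIFO, os.O_RDONLY | os.O_NONBLOCK)
    try:
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            return False
        # Simplification: if the backend has not yet opened the write
        # end, this read sees EOF and we report failure.
        return os.read(fd, 64).strip() == b"pong"
    finally:
        os.close(fd)
```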
If there's no "pong" reply, we attempt to spawn again. We try this 3 times. If after 3 tries (3 minutes) there's still no backend, we assume a problem with it and continue without it. In that case, all backend jobs will return Fail with an error message indicating the absence of the backend.
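Building on the ping sketch above, the retry policy might look like this; `fork_backend()` is a hypothetical stand-in for however the real plugin forks the backend process:

```python
import logging

SPAWN_ATTEMPTS = 3  # three tries, one minute each: three minutes at worst

def start_backend():
    """Try to spawn the backend, verifying each attempt with a ping."""
    for attempt in range(1, SPAWN_ATTEMPTS + 1):
        fork_backend()  # hypothetical: fork/exec the privileged backend
        if ping_backend(timeout=60):
            return True
        logging.warning("No pong from backend (attempt %d of %d)",
                        attempt, SPAWN_ATTEMPTS)
    # Give up; jobs that need the backend will be failed, not run.
    return False
```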
Prior to sending a job to the backend, we re-ping it. If we get a reply, we go ahead with job execution. If not, we do NOT try to respawn it; we just fail the job, and a flag is set to record that the backend died between jobs. Further backend jobs will then fail quickly, rather than taking a minute each to realize the backend has died, since no attempt is made to revive it.
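The pre-job check and fail-fast flag could be sketched as follows, continuing the snippets above; the result shape and `send_to_backend()` are invented for illustration:

```python
backend_is_dead = False  # set once a mid-run ping fails; never cleared

def run_backend_job(job):
    """Run one job via the backend, failing fast if it has died."""
    global backend_is_dead
    if backend_is_dead or not ping_backend(timeout=60):
        backend_is_dead = True  # later jobs skip the one-minute wait
        return {"status": "fail",
                "comment": "No backend available to run this job."}
    return send_to_backend(job)  # hypothetical: write job, read result
```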
When we send a job to the backend, the read will time out after one minute. In this case, since we assume the backend is still chugging away on the job, we just print to the logfile and keep waiting indefinitely. The log file at least shows we're waiting on something.
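The "log and keep waiting" behavior amounts to a timed poll that never gives up; a sketch, again reusing the assumed names from the snippets above:

```python
def wait_for_result(fd):
    """Block until the backend writes a result, logging once a minute."""
    while True:
        readable, _, _ = select.select([fd], [], [], 60)
        if readable:
            return os.read(fd, 4096)
        # Timeout: assume the backend is still working on the job.
        logging.info("Still waiting for the backend to finish the job...")
```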
With this design, pretty much the only situation where things can hang indefinitely is if the backend dies during the execution of a job, in which case the frontend will just spin, waiting forever. However, since the backend is really simple, this is an unlikely case.
Can't wait to try it!