Merge lp:~cprov/launchpad/bug-404693-resuming-builder-with-timeout into lp:launchpad/db-devel

Proposed by Celso Providelo
Status: Merged
Merged at revision: not available
Proposed branch: lp:~cprov/launchpad/bug-404693-resuming-builder-with-timeout
Merge into: lp:launchpad/db-devel
Diff against target: None lines
To merge this branch: bzr merge lp:~cprov/launchpad/bug-404693-resuming-builder-with-timeout
Reviewer Review Type Date Requested Status
Michael Hudson-Doyle Approve
Christian Reis release-critical Pending
Review via email: mp+9291@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Celso Providelo (cprov) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

= Summary =

This branch fixes https://bugs.edge.launchpad.net/soyuz/+bug/404693. Resuming broken slaves will not block the buildd-manager anymore.

== Proposed fix ==

I've created `run_process_with_timeout` entry point in twistedsupport which uses the existing and tested `ProcessMonitorProtocolWithTimeout` and allows us to safely abort (SIGINT then SIGKILL) processes taking longer than expected.

== Implementation details ==

Originally we were using the script output when logging problems, I've removed it. Mainly because it doesn't add much and implementing the same behavior with the existing protocol would consume extra time that we don't have.

When integrating `run_process_with_timeout` in RecordingSlave.resumeHost, I've re-used the current 'socket_timeout' value from the buildmaster configuration.

I've done this simply because I didn't want to add another mysterious value to the already confusing configuration section. I'm happy to create one if it'd make things clearer.

== Tests ==

./bin/test -vv -t TestRecordingSlaves -t TestRunProcessWithTimeout

= Launchpad lint =

Checking for conflicts. and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.

Linting changed files:
  lib/lp/buildmaster/tests/test_manager.py
  lib/canonical/twistedsupport/processmonitor.py
  lib/canonical/twistedsupport/tests/test_processmonitor.py
  lib/lp/buildmaster/manager.py
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkpsafEACgkQ7KBXuXyZSjDD3wCeOb3/lVEYnXW0N7jlEfjVo30I
rfQAnivz65cpO81BfOp0cCezptpBwnfB
=EJRw
-----END PGP SIGNATURE-----

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Looks great.

review: Approve
Revision history for this message
Christian Reis (kiko) wrote :

On Mon, Jul 27, 2009 at 06:11:22PM -0000, Celso Providelo wrote:
> This branch fixes https://bugs.edge.launchpad.net/soyuz/+bug/404693.
> Resuming broken slaves will not block the buildd-manager anymore.

I like this patch, but it looks a bit dodgy for a last-minute RC. Can we
get some serious QA done on dogfood and then request a CP onto the right
box for proper deployment?
--
Christian Robottom Reis | [+55 16] 3376 0125 | http://launchpad.net/~kiko
                        | [+55 16] 9112 6430 | http://async.com.br/~kiko

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/twistedsupport/processmonitor.py'
2--- lib/canonical/twistedsupport/processmonitor.py 2009-06-25 05:30:52 +0000
3+++ lib/canonical/twistedsupport/processmonitor.py 2009-07-26 14:19:49 +0000
4@@ -4,7 +4,11 @@
5 """Helpers for running a child process and communicating things about it."""
6
7 __metaclass__ = type
8-__all__ = ['ProcessMonitorProtocol', 'ProcessMonitorProtocolWithTimeout']
9+__all__ = [
10+ 'ProcessMonitorProtocol',
11+ 'ProcessMonitorProtocolWithTimeout',
12+ 'run_process_with_timeout',
13+ ]
14
15
16 from twisted.internet import defer, error, reactor
17@@ -232,3 +236,20 @@
18 self.setTimeout(None)
19 ProcessMonitorProtocol.processEnded(self, reason)
20
21+
22+def run_process_with_timeout(args, timeout=5, clock=None):
23+ """Run the given process with the specificed timeout.
24+
25+ :param args: tuple with the command-line arguments.
26+ :param timeout: command timeout in seconds, defaults to 5.
27+ :param clock: Passed to `ProcessMonitorProtocolWithTimeout.__init__`.
28+
29+ :return: a `Deferred` of the spawed process using
30+ `ProcessMonitorProtocolWithTimeout`
31+ """
32+ assert isinstance(args, tuple), "'args' must be a tuple."
33+ d = defer.Deferred()
34+ p = ProcessMonitorProtocolWithTimeout(d, timeout, clock)
35+ executable = args[0]
36+ reactor.spawnProcess(p, executable, args)
37+ return d
38
39=== modified file 'lib/canonical/twistedsupport/tests/test_processmonitor.py'
40--- lib/canonical/twistedsupport/tests/test_processmonitor.py 2009-06-25 05:30:52 +0000
41+++ lib/canonical/twistedsupport/tests/test_processmonitor.py 2009-07-26 14:19:49 +0000
42@@ -18,7 +18,7 @@
43 from canonical.twistedsupport import suppress_stderr
44 from canonical.twistedsupport.processmonitor import (
45 ProcessMonitorProtocol, ProcessMonitorProtocolWithTimeout,
46- ProcessProtocolWithTwoStageKill)
47+ ProcessProtocolWithTwoStageKill, run_process_with_timeout)
48
49
50 def makeFailure(exception_factory, *args, **kwargs):
51@@ -312,5 +312,48 @@
52 deferred.callback(None)
53 return self.termination_deferred
54
55+
56+class TestRunProcessWithTimeout(TrialTestCase):
57+ """Tests for `run_process_with_timeout`."""
58+
59+ layer = TwistedLayer
60+
61+ def test_run_process_with_timeout_invalid_args(self):
62+ # `run_process_with_timeout` expects the process 'args' to be a
63+ # tuple.
64+ self.assertRaises(
65+ AssertionError, run_process_with_timeout, 'true')
66+
67+ def test_run_proces_with_timeout_success(self):
68+ # On success, i.e process succeeded before the specified timeout,
69+ # callback is fired with 'None'.
70+ d = run_process_with_timeout(('true',))
71+ def check_success_result(result):
72+ self.assertEquals(result, None, "Success result is not None.")
73+ d.addCallback(check_success_result)
74+ return d
75+
76+ def test_run_process_with_timeout_failure(self):
77+ # On failed process, the errback is fired with a `ProcessTerminated`
78+ # failure.
79+ d = run_process_with_timeout(('false',))
80+ return self.assertFailure(d, error.ProcessTerminated)
81+
82+ def test_run_process_with_timeout_broken(self):
83+ # On broken process, the errback is fired with a `ProcessTerminated`
84+ # failure.
85+ d = run_process_with_timeout(('does-not-exist',))
86+ return self.assertFailure(d, error.ProcessTerminated)
87+
88+ def test_run_process_with_timeout_timeout(self):
89+ # On process timeout, the errback is fired with `TimeoutError`
90+ # failure.
91+ clock = task.Clock()
92+ d = run_process_with_timeout(
93+ ('sleep', '2'), timeout=1, clock=clock)
94+ clock.advance(2)
95+ return self.assertFailure(d, error.TimeoutError)
96+
97+
98 def test_suite():
99 return unittest.TestLoader().loadTestsFromName(__name__)
100
101=== modified file 'lib/lp/buildmaster/manager.py'
102--- lib/lp/buildmaster/manager.py 2009-06-25 00:46:29 +0000
103+++ lib/lp/buildmaster/manager.py 2009-07-26 14:19:49 +0000
104@@ -15,11 +15,10 @@
105 ]
106
107 import logging
108-import os
109 import transaction
110
111 from twisted.application import service
112-from twisted.internet import reactor, utils, defer
113+from twisted.internet import reactor, defer
114 from twisted.protocols.policies import TimeoutMixin
115 from twisted.python import log
116 from twisted.python.failure import Failure
117@@ -31,6 +30,7 @@
118 from canonical.config import config
119 from canonical.launchpad.webapp import urlappend
120 from canonical.librarian.db import write_transaction
121+from canonical.twistedsupport.processmonitor import run_process_with_timeout
122
123
124 buildd_success_result_map = {
125@@ -109,13 +109,17 @@
126 Used the configuration command-line in the same way
127 `BuilddSlave.resume` does.
128
129+ Also use the builddmaster configuration 'socket_timeout' as
130+ the process timeout.
131+
132 :return: a Deferred
133 """
134 resume_command = config.builddmaster.vm_resume_command % {
135 'vm_host': self.vm_host}
136 # Twisted API require string and the configuration provides unicode.
137 resume_argv = [str(term) for term in resume_command.split()]
138- d = utils.getProcessOutputAndValue(resume_argv[0], resume_argv[1:])
139+ d = run_process_with_timeout(
140+ tuple(resume_argv), timeout=config.builddmaster.socket_timeout)
141 return d
142
143
144@@ -358,12 +362,11 @@
145 If it failed, it returns a corresponding `ResetDispatchResult`
146 dispatch result.
147 """
148- out, err, code = response
149- if code == os.EX_OK:
150+ if not isinstance(response, Failure):
151 return None
152
153 self.logger.error(
154- '%s resume failure:\nOUT: %s\nErr: %s' % (slave, out, err))
155+ '%s resume failure: %s' % (slave, response.getErrorMessage()))
156 self.slaveDone(slave)
157 return self.reset_result(slave)
158
159
160=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
161--- lib/lp/buildmaster/tests/test_manager.py 2009-06-25 00:46:29 +0000
162+++ lib/lp/buildmaster/tests/test_manager.py 2009-07-26 14:19:49 +0000
163@@ -8,7 +8,8 @@
164 import unittest
165
166 from twisted.internet import defer
167-from twisted.internet.error import ConnectionClosed
168+from twisted.internet.error import (
169+ ConnectionClosed, ProcessTerminated, TimeoutError)
170 from twisted.python.failure import Failure
171 from twisted.trial.unittest import TestCase as TrialTestCase
172
173@@ -43,14 +44,14 @@
174 self.slave = RecordingSlave(
175 'foo', 'http://foo:8221/rpc', 'foo.host')
176
177- def testInstantiation(self):
178+ def test_representation(self):
179 """`RecordingSlave` has a custom representation.
180
181 It encloses builder name and xmlrpc url for debug purposes.
182 """
183 self.assertEqual('<foo:http://foo:8221/rpc>', repr(self.slave))
184
185- def testEnsurePresent(self):
186+ def test_ensurepresent(self):
187 """`RecordingSlave.ensurepresent` always succeeds.
188
189 It returns the expected succeed code and records the interaction
190@@ -63,7 +64,7 @@
191 [('ensurepresent', ('boing', 'bar', 'baz'))],
192 self.slave.calls)
193
194- def testBuild(self):
195+ def test_build(self):
196 """`RecordingSlave.build` always succeeds.
197
198 It returns the expected succeed code and records the interaction
199@@ -76,8 +77,8 @@
200 [('build', ('boing', 'bar', 'baz'))],
201 self.slave.calls)
202
203- def testResume(self):
204- """`RecordingSlave.resumeHost` returns a deferred resume request."""
205+ def test_resume(self):
206+ """`RecordingSlave.resume` always returns successs."""
207 # Resume isn't requested in a just-instantiated RecordingSlave.
208 self.assertFalse(self.slave.resume_requested)
209
210@@ -86,38 +87,63 @@
211 self.assertEqual(['', '', os.EX_OK], self.slave.resume())
212 self.assertTrue(self.slave.resume_requested)
213
214+ def test_resumeHost_success(self):
215+ # On a successful resume resumeHost() fires the returned deferred
216+ # callback with 'None'.
217+
218 # The configuration testing command-line.
219 self.assertEqual(
220 'echo %(vm_host)s', config.builddmaster.vm_resume_command)
221
222- # When executed it returns the expected output.
223+ # On success the response is None.
224 def check_resume_success(response):
225- out, err, code = response
226- self.assertEqual(os.EX_OK, code)
227- self.assertEqual('', err)
228- self.assertEqual('foo.host', out.strip())
229+ self.assertEquals(None, response)
230+ d = self.slave.resumeSlave()
231+ d.addCallback(check_resume_success)
232+ return d
233
234- d1 = self.slave.resumeSlave()
235- d1.addCallback(check_resume_success)
236+ def test_resumeHost_failure(self):
237+ # On a failed resume, 'resumeHost' fires the returned deferred
238+ # errorback with the `ProcessTerminated` failure.
239
240 # Override the configuration command-line with one that will fail.
241 failed_config = """
242 [builddmaster]
243 vm_resume_command: test "%(vm_host)s = 'no-sir'"
244 """
245- config.push('vm_resume_command', failed_config)
246-
247- def check_resume_failure(response):
248- out, err, code = response
249- self.assertNotEqual(os.EX_OK, code)
250- self.assertEqual('', err)
251- self.assertEqual('', out)
252- config.pop('vm_resume_command')
253-
254- d2 = self.slave.resumeSlave()
255- d2.addCallback(check_resume_failure)
256-
257- return defer.DeferredList([d1, d2])
258+ config.push('failed_resume_command', failed_config)
259+
260+ # On failures, the response is a twisted `Failure` object containing
261+ # a `ProcessTerminated` error.
262+ def check_resume_failure(failure):
263+ self.assertIsInstance(failure, Failure)
264+ self.assertIsInstance(failure.value, ProcessTerminated)
265+ config.pop('failed_resume_command')
266+ d = self.slave.resumeSlave()
267+ d.addErrback(check_resume_failure)
268+ return d
269+
270+ def test_resumeHost_timeout(self):
271+ # On a resume timeouts, 'resumeHost' fires the returned deferred
272+ # errorback with the `TimeoutError` failure.
273+
274+ # Override the configuration command-line with one that will timeout.
275+ timeout_config = """
276+ [builddmaster]
277+ vm_resume_command: sleep 5
278+ socket_timeout: 1
279+ """
280+ config.push('timeout_resume_command', timeout_config)
281+
282+ # On timeouts, the response is a twisted `Failure` object containing
283+ # a `TimeoutError` error.
284+ def check_resume_timeout(failure):
285+ self.assertIsInstance(failure, Failure)
286+ self.assertIsInstance(failure.value, TimeoutError)
287+ config.pop('timeout_resume_command')
288+ d = self.slave.resumeSlave()
289+ d.addErrback(check_resume_timeout)
290+ return d
291
292
293 class TestingXMLRPCProxy:

Subscribers

People subscribed via source and target branches

to status/vote changes: