Merge lp:~jameinel/launchpad/lp-serve-child-hangup into lp:launchpad

Proposed by John A Meinel
Status: Superseded
Proposed branch: lp:~jameinel/launchpad/lp-serve-child-hangup
Merge into: lp:launchpad
Prerequisite: lp:~jameinel/launchpad/lp-forking-serve-cleaner-childre
Diff against target: 312 lines (+180/-22)
2 files modified
bzrplugins/lpserve/__init__.py (+94/-19)
bzrplugins/lpserve/test_lpserve.py (+86/-3)
To merge this branch: bzr merge lp:~jameinel/launchpad/lp-serve-child-hangup
Reviewer Review Type Date Requested Status
Launchpad code reviewers Pending
Review via email: mp+50055@code.launchpad.net

This proposal has been superseded by a proposal from 2011-03-02.

Commit message

Children spawned from launchpad-forking-service should hang up if nobody has talked to them within a couple of minutes.

Description of the change

This branch builds on my earlier branch, which has the children spawned by LPForkingService clean themselves up more readily.

This adds code so that the blocking calls are unblocked after 2 minutes. At the moment, if the Conch server requests a fork but is then unable to connect to the child, that child hangs forever in an open() call that will never complete.

Originally, I got this working using a loop and O_NONBLOCK. But after doing some testing, fcntl.fcntl(...) was unable to actually change the file descriptor back to blocking. It looks good in the man pages, and it does unset the flag, but I still get EAGAIN failures in the smart server code.

I don't really like spawning a thread to send SIGUSR1 back to myself, but it seemed the best tradeoff. If we want, we could even make it SIGTERM or something, since we know we are going to kill the process if the connection fails.
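In sketch form, the Timer-plus-signal approach described above might look like this (a minimal illustration, not the actual lpserve code; the handler, names, and timeout value are placeholders):

```python
import os
import signal
import threading

# Illustrative sketch: arm a timer that sends SIGUSR1 back to this
# process if nobody connects in time, and cancel it once the blocking
# opens complete.  Names and the timeout are placeholders.

def _hangup(signum, frame):
    raise SystemExit('no connection within timeout; exiting')

signal.signal(signal.SIGUSR1, _hangup)

hangup_timer = threading.Timer(
    120.0, os.kill, args=(os.getpid(), signal.SIGUSR1))
hangup_timer.start()
try:
    # ... the blocking os.open() calls on the FIFOs would go here ...
    pass
finally:
    # Connection made (or we failed some other way): disarm the timer.
    hangup_timer.cancel()
```

The appeal of threading.Timer over a raw alarm is visible in the last line: cancel() gives a clean way to disarm the wakeup once the connection succeeds.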

I figured 2 minutes was reasonable. This is the time from a successful ssh handshake until we actually connect to a newly spawned child. Even the worst-case time that I've seen was about 30s for a child to be spawned. So this gives us a 4x margin of error.

This is cleanup related to bug #717345, but it doesn't directly fix the problem. That will be in my next branch.

To post a comment you must log in.
Revision history for this message
Martin Pool (mbp) wrote :

On 17 February 2011 08:44, John A Meinel <email address hidden> wrote:
...
> I don't really like spawning a thread to send SIGUSR1 back to myself, but it seemed the best tradeoff. If we want, we could even make it SIGTERM or something, since we know we are going to kill the process if the connection fails.

I don't really like that either, especially since this is already getting
to be a fairly complex set of processes.

Why can't you make the file handles nonblocking and just leave them
so? Is it because they're later inherited by some other process?

If you do need a signal to interrupt things, how about just using alarm(2)?

Revision history for this message
John A Meinel (jameinel) wrote :

...

>> I don't really like spawning a thread to send SIGUSR1 back to myself, but it seemed the best tradeoff. If we want, we could even make it SIGTERM or something, since we know we are going to kill the process if the connection fails.
>
> I don't really like that either, especially this is already getting to
> be a fairly complex set of processes.
>
> Why can't you make the file handles nonblocking and just leave them
> so? Is it because they're later inherited by some other process?

Because the Smart server code expects blocking file handles.
SmartServerPipeStreamMedium IIRC.

>
> If you do need a signal to interrupt things, how about just using alarm(2)?

Because alarm doesn't seem to be exposed to Python, and I want the
ability to cancel the callback. threading.Timer() gives me that exact
functionality.

I would be fine using SIGALRM if you prefer that to SIGUSR1.

John
=:->

Revision history for this message
Robert Collins (lifeless) wrote :

What about using dup2 to duplicate the fd, make one copy nonblocking
and use that until we're up and running?

Revision history for this message
John A Meinel (jameinel) wrote :

On 2/16/2011 4:28 PM, Robert Collins wrote:
> What about using dup2 to duplicate the fd, make one copy nonblocking
> and use that until we're up and running?
>

The issue is that we want non-blocking behaviour for the os.open() call;
once everything is open, we want to reset the descriptors to blocking.

You can use tricks with fcntl.fcntl(... F_SETFD) according to the man
pages. But testing showed it doesn't actually set the file back to
blocking mode.

I would expect something similar from dup2.

John
=:->
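For reference, O_NONBLOCK is a file *status* flag: it is toggled with F_GETFL/F_SETFL, while F_GETFD/F_SETFD only cover descriptor flags such as FD_CLOEXEC, so setting them leaves O_NONBLOCK untouched. A minimal sketch of clearing it via the status flags (illustrative, not the branch's code):

```python
import fcntl
import os

def set_blocking(fd):
    # O_NONBLOCK lives in the file *status* flags, so use F_GETFL/F_SETFL;
    # F_GETFD/F_SETFD only manage descriptor flags such as FD_CLOEXEC.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)

# Demonstrate on a pipe: mark the read end non-blocking, then restore it.
r, w = os.pipe()
fcntl.fcntl(r, fcntl.F_SETFL, fcntl.fcntl(r, fcntl.F_GETFL) | os.O_NONBLOCK)
set_blocking(r)
still_nonblocking = fcntl.fcntl(r, fcntl.F_GETFL) & os.O_NONBLOCK
os.close(r)
os.close(w)
```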


Revision history for this message
Robert Collins (lifeless) wrote :

On Thu, Feb 17, 2011 at 11:38 AM, John A Meinel <email address hidden> wrote:
> I would expect something similar from dup2.

I would expect dup2 to work around those bugs :) - it's why I suggested it.

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

On 17 February 2011 10:01, Robert Collins <email address hidden> wrote:
> On Thu, Feb 17, 2011 at 11:38 AM, John A Meinel <email address hidden> wrote:
>> I would expect something similar from dup2.
>
> I would expect dup2 to work around those bugs :) - it's why I suggested it.

We want to do a nonblocking open, so we don't have an fd to dup until
we've done that.

After opening it, maybe we could dup it, then make the new copy
blocking, then close the original. It's possible that would work
around it. On the other hand if in general we can't make open fds
blocking, it may not fix it.

Revision history for this message
John A Meinel (jameinel) wrote :

On 2/16/2011 5:43 PM, Martin Pool wrote:
> On 17 February 2011 10:01, Robert Collins <email address hidden> wrote:
>> On Thu, Feb 17, 2011 at 11:38 AM, John A Meinel <email address hidden> wrote:
>>> I would expect something similar from dup2.
>>
>> I would expect dup2 to work around those bugs :) - it's why I suggested it.
>
> We want to do a nonblocking open, so we don't have an fd to dup until
> we've done that.
>
> After opening it, maybe we could dup it, then make the new copy
> blocking, then close the original. It's possible that would work
> around it. On the other hand if in general we can't make open fds
> blocking, it may not fix it.
>

Right, it *might* work to open non-blocking, set it to blocking, then
dup2 it and use the new duplicate. I can try it, but honestly the code
is much clearer with the Timer than with the blocking/non-blocking
dance, where I have to write a loop that times out, etc., rather than
just getting a SIG* after 120s of inactivity.

If you prefer SIGALRM, or alarm(2), then we can use ctypes/pyrex/whatever.

John
=:->
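The open-non-blocking/restore/dup dance being discussed might be sketched roughly like so (a hypothetical helper, not the branch's code; the demo uses a regular file, which never blocks on open(), purely to show the mechanics):

```python
import fcntl
import os
import tempfile

def open_without_hanging(path, flags):
    # Hypothetical helper: open with O_NONBLOCK so open() itself cannot
    # hang, clear the flag, then hand back a dup'd descriptor.  Note that
    # dup() shares the underlying open file description, so the cleared
    # O_NONBLOCK status applies to the duplicate as well.
    fd = os.open(path, flags | os.O_NONBLOCK)
    status = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, status & ~os.O_NONBLOCK)
    dup_fd = os.dup(fd)
    os.close(fd)
    return dup_fd

handle, path = tempfile.mkstemp()
os.write(handle, b'hello')
os.close(handle)
fd = open_without_hanging(path, os.O_RDONLY)
data = os.read(fd, 5)
os.close(fd)
os.unlink(path)
```

One caveat for this service's FIFOs: a non-blocking open for *writing* fails with ENXIO when no reader exists yet, so this trick only straightforwardly covers the read side, which is part of why an alarm around the blocking opens is attractive.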


Revision history for this message
John A Meinel (jameinel) wrote :

...

>
> If you do need a signal to interrupt things, how about just using alarm(2)?
>

It turns out it is exposed, but only conditionally, and not where I
expected it.

import signal
signal.alarm()

If you prefer, I'd be happy to use that instead. Should I be catching
SIGALRM, or should I just let it propagate to kill the child?

John
=:->

Revision history for this message
Martin Pool (mbp) wrote :

> It turns out it is exposed, but only conditionally, and not where I
> expected it.
>
> import signal
> signal.alarm()
>
> If you prefer, I'd be happy to use that instead. Should I be catching
> SIGALRM, or should I just let it propagate to kill the child?

Since it's supposed to be an unusual case that the connection fails
halfway through, I think just letting it kill the child should be
fine. The parent should wait on them and log their exit status.
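The shape the branch settles on (see the preview diff) is signal.alarm() around the blocking open() calls. A condensed sketch of that pattern follows; the timeout and paths are illustrative, and note that on Python 3.5+ PEP 475 retries system calls interrupted by a signal unless the handler raises, so this sketch raises from the handler, where the 2011 code could simply catch OSError with errno EINTR:

```python
import os
import signal
import tempfile

class ConnectTimeout(Exception):
    """Raised when no peer opens the FIFO within the timeout."""

def _on_alarm(signum, frame):
    # Raising here interrupts the blocking os.open(); on Python 3.5+
    # (PEP 475) a handler that merely returns would cause the open to be
    # retried, whereas on Python 2 it surfaced as OSError(EINTR).
    raise ConnectTimeout('no peer connected in time')

signal.signal(signal.SIGALRM, _on_alarm)

fifo_dir = tempfile.mkdtemp()
fifo_path = os.path.join(fifo_dir, 'stdin')
os.mkfifo(fifo_path)

signal.alarm(1)  # alarm(2) only has whole-second granularity
try:
    # Blocks: nobody ever opens the write end of this FIFO.
    fd = os.open(fifo_path, os.O_RDONLY)
    os.close(fd)
    timed_out = False
except ConnectTimeout:
    timed_out = True
finally:
    signal.alarm(0)  # disarm whether we connected or failed
    os.unlink(fifo_path)
    os.rmdir(fifo_dir)
```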

Preview Diff

=== modified file 'bzrplugins/lpserve/__init__.py'
--- bzrplugins/lpserve/__init__.py	2011-02-17 21:47:07 +0000
+++ bzrplugins/lpserve/__init__.py	2011-03-02 13:07:04 +0000
@@ -15,6 +15,7 @@


 import errno
+import fcntl
 import logging
 import os
 import resource
@@ -31,6 +32,7 @@
 from bzrlib.option import Option
 from bzrlib import (
     commands,
+    errors,
     lockdir,
     osutils,
     trace,
@@ -309,6 +311,11 @@
     SLEEP_FOR_CHILDREN_TIMEOUT = 1.0
     WAIT_FOR_REQUEST_TIMEOUT = 1.0 # No request should take longer than this to
                                    # be read
+    CHILD_CONNECT_TIMEOUT = 120 # If we get a fork() request, but nobody
+                                # connects just exit
+                                # On a heavily loaded server, it could take a
+                                # couple secs, but it should never take
+                                # minutes

     _fork_function = os.fork

@@ -324,6 +331,7 @@
         # Map from pid => (temp_path_for_handles, request_socket)
         self._child_processes = {}
         self._children_spawned = 0
+        self._child_connect_timeout = self.CHILD_CONNECT_TIMEOUT

     def _create_master_socket(self):
         self._server_socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
@@ -372,28 +380,94 @@
         signal.signal(signal.SIGCHLD, signal.SIG_DFL)
         signal.signal(signal.SIGTERM, signal.SIG_DFL)

+    def _compute_paths(self, base_path):
+        stdin_path = os.path.join(base_path, 'stdin')
+        stdout_path = os.path.join(base_path, 'stdout')
+        stderr_path = os.path.join(base_path, 'stderr')
+        return (stdin_path, stdout_path, stderr_path)
+
     def _create_child_file_descriptors(self, base_path):
-        stdin_path = os.path.join(base_path, 'stdin')
-        stdout_path = os.path.join(base_path, 'stdout')
-        stderr_path = os.path.join(base_path, 'stderr')
+        stdin_path, stdout_path, stderr_path = self._compute_paths(base_path)
         os.mkfifo(stdin_path)
         os.mkfifo(stdout_path)
         os.mkfifo(stderr_path)

-    def _bind_child_file_descriptors(self, base_path):
-        stdin_path = os.path.join(base_path, 'stdin')
-        stdout_path = os.path.join(base_path, 'stdout')
-        stderr_path = os.path.join(base_path, 'stderr')
+    def _set_blocking(self, fd):
+        """Change the file descriptor to unset the O_NONBLOCK flag."""
+        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
+        flags = flags & (~os.O_NONBLOCK)
+        fcntl.fcntl(fd, fcntl.F_SETFD, flags)
+
+    def _get_child_connect_timeout(self):
+        """signal.alarm only supports 1s granularity.
+
+        We have to make sure we don't ever send 0, which would not generate an
+        alarm.
+        """
+        timeout = int(self._child_connect_timeout)
+        if timeout <= 0:
+            timeout = 1
+        return timeout
+
+    def _open_handles(self, base_path):
+        """Open the given file handles.
+
+        This will attempt to open all of these file handles, but will not block
+        while opening them, timing out after self._child_connect_timeout
+        seconds.
+
+        :param base_path: The directory where all FIFOs are located
+        :return: (stdin_fid, stdout_fid, stderr_fid)
+        """
+        stdin_path, stdout_path, stderr_path = self._compute_paths(base_path)
         # These open calls will block until another process connects (which
         # must connect in the same order)
-        stdin_fid = os.open(stdin_path, os.O_RDONLY)
-        stdout_fid = os.open(stdout_path, os.O_WRONLY)
-        stderr_fid = os.open(stderr_path, os.O_WRONLY)
+        fids = []
+        to_open = [(stdin_path, os.O_RDONLY), (stdout_path, os.O_WRONLY),
+                   (stderr_path, os.O_WRONLY)]
+        signal.alarm(self._get_child_connect_timeout())
+        tstart = time.time()
+        for path, flags in to_open:
+            try:
+                fids.append(os.open(path, flags))
+            except OSError, e:
+                if e.errno == errno.EINTR:
+                    error = ('After %.3fs we failed to open %s, exiting'
+                             % (time.time() - tstart, path,))
+                    trace.warning(error)
+                    for fid in fids:
+                        try:
+                            os.close(fid)
+                        except OSError:
+                            pass
+                    self._cleanup_fifos(base_path)
+                    raise errors.BzrError(error)
+                raise
+        # If we get to here, that means all the handles were opened
+        # successfully, so cancel the wakeup call.
+        signal.alarm(0)
+        return fids
+
+    def _cleanup_fifos(self, base_path):
+        """Remove the FIFO objects and directory from disk."""
+        stdin_path, stdout_path, stderr_path = self._compute_paths(base_path)
+        # Now that we've opened the handles, delete everything so that we don't
+        # leave garbage around. Because the open() is done in blocking mode, we
+        # know that someone has already connected to them, and we don't want
+        # anyone else getting confused and connecting.
+        # See [Decision #5]
+        os.remove(stdin_path)
+        os.remove(stdout_path)
+        os.remove(stderr_path)
+        os.rmdir(base_path)
+
+    def _bind_child_file_descriptors(self, base_path):
         # Note: by this point bzrlib has opened stderr for logging
         # (as part of starting the service process in the first place).
         # As such, it has a stream handler that writes to stderr. logging
         # tries to flush and close that, but the file is already closed.
         # This just supresses that exception
+        stdin_fid, stdout_fid, stderr_fid = self._open_handles(base_path)
         logging.raiseExceptions = False
         sys.stdin.close()
         sys.stdout.close()
@@ -407,15 +481,7 @@
         ui.ui_factory.stdin = sys.stdin
         ui.ui_factory.stdout = sys.stdout
         ui.ui_factory.stderr = sys.stderr
-        # Now that we've opened the handles, delete everything so that we don't
-        # leave garbage around. Because the open() is done in blocking mode, we
-        # know that someone has already connected to them, and we don't want
-        # anyone else getting confused and connecting.
-        # See [Decision #5]
-        os.remove(stderr_path)
-        os.remove(stdout_path)
-        os.remove(stdin_path)
-        os.rmdir(base_path)
+        self._cleanup_fifos(base_path)

     def _close_child_file_descriptors(self):
         sys.stdin.close()
@@ -724,6 +790,15 @@
             self._should_terminate.set()
             conn.sendall('ok\nquit command requested... exiting\n')
             conn.close()
+        elif request.startswith('child_connect_timeout '):
+            try:
+                value = int(request.split(' ', 1)[1])
+            except ValueError, e:
+                conn.sendall('FAILURE: %r\n' % (e,))
+            else:
+                self._child_connect_timeout = value
+                conn.sendall('ok\n')
+            conn.close()
         elif request.startswith('fork ') or request.startswith('fork-env '):
             command_argv, env = self._parse_fork_request(conn, client_addr,
                                                          request)

=== modified file 'bzrplugins/lpserve/test_lpserve.py'
--- bzrplugins/lpserve/test_lpserve.py	2010-11-08 22:43:58 +0000
+++ bzrplugins/lpserve/test_lpserve.py	2011-03-02 13:07:04 +0000
@@ -3,6 +3,7 @@

 import errno
 import os
+import shutil
 import signal
 import socket
 import subprocess
@@ -13,6 +14,7 @@
 from testtools import content

 from bzrlib import (
+    errors,
     osutils,
     tests,
     trace,
@@ -263,6 +265,46 @@
                                             one_byte_at_a_time=True)
         self.assertStartsWith(response, 'FAILURE\n')

+    def test_child_connection_timeout(self):
+        self.assertEqual(self.service.CHILD_CONNECT_TIMEOUT,
+                         self.service._child_connect_timeout)
+        response = self.send_message_to_service('child_connect_timeout 1\n')
+        self.assertEqual('ok\n', response)
+        self.assertEqual(1, self.service._child_connect_timeout)
+
+    def test_child_connection_timeout_bad_float(self):
+        self.assertEqual(self.service.CHILD_CONNECT_TIMEOUT,
+                         self.service._child_connect_timeout)
+        response = self.send_message_to_service('child_connect_timeout 1.2\n')
+        self.assertStartsWith(response, 'FAILURE:')
+
+    def test_child_connection_timeout_no_val(self):
+        response = self.send_message_to_service('child_connect_timeout \n')
+        self.assertStartsWith(response, 'FAILURE:')
+
+    def test_child_connection_timeout_bad_val(self):
+        response = self.send_message_to_service('child_connect_timeout b\n')
+        self.assertStartsWith(response, 'FAILURE:')
+
+    def test__open_handles_will_timeout(self):
+        # signal.alarm() has only 1-second granularity. :(
+        self.service._child_connect_timeout = 1
+        tempdir = tempfile.mkdtemp(prefix='testlpserve-')
+        self.addCleanup(shutil.rmtree, tempdir, ignore_errors=True)
+        os.mkfifo(os.path.join(tempdir, 'stdin'))
+        os.mkfifo(os.path.join(tempdir, 'stdout'))
+        os.mkfifo(os.path.join(tempdir, 'stderr'))
+        def noop_on_alarm(signal, frame):
+            return
+        signal.signal(signal.SIGALRM, noop_on_alarm)
+        self.addCleanup(signal.signal, signal.SIGALRM, signal.SIG_DFL)
+        e = self.assertRaises(errors.BzrError,
+                              self.service._open_handles, tempdir)
+        self.assertContainsRe(str(e), r'After \d+.\d+s we failed to open.*')
+        # Even though it timed out, we still cleanup the temp dir
+        self.assertFalse(os.path.exists(tempdir))
+
+

 class TestCaseWithSubprocess(tests.TestCaseWithTransport):
     """Override the bzr start_bzr_subprocess command.
@@ -452,9 +494,9 @@
         stderr_path = os.path.join(path, 'stderr')
         # The ordering must match the ordering of the service or we get a
         # deadlock.
-        child_stdin = open(stdin_path, 'wb')
-        child_stdout = open(stdout_path, 'rb')
-        child_stderr = open(stderr_path, 'rb')
+        child_stdin = open(stdin_path, 'wb', 0)
+        child_stdout = open(stdout_path, 'rb', 0)
+        child_stderr = open(stderr_path, 'rb', 0)
         return child_stdin, child_stdout, child_stderr

     def communicate_with_fork(self, path, stdin=None):
@@ -484,6 +526,24 @@
         self.assertEqual('', stderr_content)
         self.assertReturnCode(0, sock)

+    def DONT_test_fork_lp_serve_multiple_hello(self):
+        # This ensures that the fifos are all set to blocking mode
+        # We can't actually run this test, because by default 'bzr serve
+        # --inet' does not flush after each message. So we end up blocking
+        # forever waiting for the server to finish responding to the first
+        # request.
+        path, _, sock = self.send_fork_request('lp-serve --inet 2')
+        child_stdin, child_stdout, child_stderr = self._get_fork_handles(path)
+        child_stdin.write('hello\n')
+        child_stdin.flush()
+        self.assertEqual('ok\x012\n', child_stdout.read())
+        child_stdin.write('hello\n')
+        self.assertEqual('ok\x012\n', child_stdout.read())
+        child_stdin.close()
+        self.assertEqual('', child_stderr.read())
+        child_stdout.close()
+        child_stderr.close()
+
     def test_fork_replay(self):
         path, _, sock = self.send_fork_request('launchpad-replay')
         stdout_content, stderr_content = self.communicate_with_fork(path,
@@ -540,6 +600,29 @@
     def test_sigint_exits_nicely(self):
         self._check_exits_nicely(signal.SIGINT)

+    def test_child_exits_eventually(self):
+        # We won't ever bind to the socket the child wants, and after some
+        # time, the child should exit cleanly.
+        # First, tell the subprocess that we want children to exit quickly.
+        # *sigh* signal.alarm only has 1s resolution, so this test is slow.
+        response = self.send_message_to_service('child_connect_timeout 1\n')
+        self.assertEqual('ok\n', response)
+        # Now request a fork
+        path, pid, sock = self.send_fork_request('rocks')
+        # # Open one handle, but not all of them
+        stdin_path = os.path.join(path, 'stdin')
+        stdout_path = os.path.join(path, 'stdout')
+        stderr_path = os.path.join(path, 'stderr')
+        child_stdin = open(stdin_path, 'wb')
+        # We started opening the child, but stop before we get all handles
+        # open. After 1 second, the child should get signaled and die.
+        # The master process should notice, and tell us the status of the
+        # exited child.
+        val = sock.recv(4096)
+        self.assertEqual('exited\n%s\n' % (signal.SIGALRM,), val)
+        # The master process should clean up after the now deceased child.
+        self.failIfExists(path)
+

class TestCaseWithLPForkingServiceDaemon(
    TestCaseWithLPForkingServiceSubprocess):