Merge lp:~jameinel/launchpad/lp-serve-child-hangup into lp:launchpad

Proposed by John A Meinel
Status: Merged
Merged at revision: 12534
Proposed branch: lp:~jameinel/launchpad/lp-serve-child-hangup
Merge into: lp:launchpad
Diff against target: 328 lines (+177/-27)
2 files modified
bzrplugins/lpserve/__init__.py (+91/-24)
bzrplugins/lpserve/test_lpserve.py (+86/-3)
To merge this branch: bzr merge lp:~jameinel/launchpad/lp-serve-child-hangup
Reviewer Review Type Date Requested Status
Benji York (community) release-critical Approve
Gavin Panella (community) Approve
Review via email: mp+51893@code.launchpad.net

This proposal supersedes a proposal from 2011-02-16.

Commit message

[release-critical=benji][r=allenap][no-qa] Forking service children will now die if they are not connected to within two minutes.

Description of the change

I'm resubmitting, since it has been a while since the previous conversation. I now use 'signal.alarm' to kill the child after 120s if it hasn't finished opening its handles. (sigalarm was preferred to threading.Timer, presumably because you don't really want to mix threading and signals, and because python threads tend to be flaky anyway because of GIL issues.)

Thinking about it, I really wonder if it should be more like 30s or even less. If Conch asks for a process, it should be able to connect to that process immediately. It may take a while for the process to start, etc. But all my timings show fork + start working happening in <100ms, not 30s. The original 30s to spawn a child was including the ssh handshake, and all the other work. However, this is supposed to be a rather exceptional condition, so I'll leave it at 2min for now.

This is the final step in fixing bug #717345 that I know of. The other patches make sure we clean up as much as we can, and this closes the one last handle the master process holds on to. The socket that tells it the child has exited. It doesn't close with the others, because it is connected to the master forking service process, waiting for child exit codes, etc.

To post a comment you must log in.
Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

On 17 February 2011 08:44, John A Meinel <email address hidden> wrote:
> John A Meinel has proposed merging lp:~jameinel/launchpad/lp-serve-child-hangup into lp:launchpad with lp:~jameinel/launchpad/lp-forking-serve-cleaner-childre as a prerequisite.
>
> Requested reviews:
>  Launchpad code reviewers (launchpad-reviewers)
>
> For more details, see:
> https://code.launchpad.net/~jameinel/launchpad/lp-serve-child-hangup/+merge/50055
>
> This branch builds on my earlier branch, which has the children spawned by LPForkingService clean themselves up more readily.
>
> This adds code so that the blocking calls will be unblocked after 2 minutes. At the moment, if the Conch server requests a fork, but is unable to actually connect to the child, that child ends up hanging forever on a file handle that it will never open.
>
> Originally, I got this working using a loop an O_NONBLOCK. But after doing some testing, fcntl.fcntl(...) was unable to actually change the file descriptor to be blocking again. It looks good in man, and it does unset the flag, but I still get EAGAIN failures in the smart server code.
>
> I don't really like spawning a thread to send SIGUSR1 back to myself, but it seemed the best tradeoff. If we want, we could even make it SIGTERM or something, since we know we are going to kill the process if the connection fails.

I don't really like that either, especially this is already getting to
be a fairly complex set of processes.

Why can't you make the file handles nonblocking and just leave them
so? Is it because they're later inherited by some other process?

If you do need a signal to interrupt things, how about just using alarm(2)?

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> I don't really like spawning a thread to send SIGUSR1 back to myself, but it seemed the best tradeoff. If we want, we could even make it SIGTERM or something, since we know we are going to kill the process if the connection fails.
>
> I don't really like that either, especially this is already getting to
> be a fairly complex set of processes.
>
> Why can't you make the file handles nonblocking and just leave them
> so? Is it because they're later inherited by some other process?

Because the Smart server code expects blocking file handles.
SmartServerPipeStreamMedium IIRC.

>
> If you do need a signal to interrupt things, how about just using alarm(2)?

Because alarm doesn't seem to be exposed to python, and I want the
ability to cancel the callback. threading.Timer() gives me that exact
functionality.

I would be fine using SIGALRM if you prefer that to SIGUSR1.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1cS8UACgkQJdeBCYSNAAMupgCgjFH3eTQXY6vxSoZkp3inrYs7
CjQAnREQQ/NXmuuyz9Rkl8tUMNIFEED6
=iiu5
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

What about using dup2 to duplicate the fd, make one copy nonblocking
and use that until we're up and running?

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/16/2011 4:28 PM, Robert Collins wrote:
> What about using dup2 to duplicate the fd, make one copy nonblocking
> and use that until we're up and running?
>

The issue is we want non-blocking in the 'os.open()' call, until
everything is open, then we want to reset it to blocking.

You can use tricks with fcntl.fcntl(... F_SETFD) according to the man
pages. But testing showed it doesn't actually set the file back to
blocking mode.

I would expect something similar from dup2.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1cUS8ACgkQJdeBCYSNAAND1ACfZtlcchz8RU9L+0XIw/MZdsep
XscAmwRiaaebSzfpYa6dvAxNGU4jQNfP
=Fj9R
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

On Thu, Feb 17, 2011 at 11:38 AM, John A Meinel <email address hidden> wrote:
> I would expect something similar from dup2.

I would expect dup2 to workaround those bugs :) - its why I suggested it.

-Rob

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

On 17 February 2011 10:01, Robert Collins <email address hidden> wrote:
> On Thu, Feb 17, 2011 at 11:38 AM, John A Meinel <email address hidden> wrote:
>> I would expect something similar from dup2.
>
> I would expect dup2 to workaround those bugs :) - its why I suggested it.

We want to do a nonblocking open, so we don't have an fd to dup until
we've done that.

After opening it, maybe we could dup it, then make the new copy
blocking, then close the original. It's possible that would work
around it. On the other hand if in general we can't make open fds
blocking, it may not fix it.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/16/2011 5:43 PM, Martin Pool wrote:
> On 17 February 2011 10:01, Robert Collins <email address hidden> wrote:
>> On Thu, Feb 17, 2011 at 11:38 AM, John A Meinel <email address hidden> wrote:
>>> I would expect something similar from dup2.
>>
>> I would expect dup2 to workaround those bugs :) - its why I suggested it.
>
> We want to do a nonblocking open, so we don't have an fd to dup until
> we've done that.
>
> After opening it, maybe we could dup it, then make the new copy
> blocking, then close the original. It's possible that would work
> around it. On the other hand if in general we can't make open fds
> blocking, it may not fix it.
>

Right, it *might* work to, open non-blocking, set it to blocking, then
dup2 it and use the new duplicated one. I can try it, but honestly, the
code is much clearer with the Timer rather than the
blocking/non-blocking dance. I have to write a loop that times out, etc.
Rather than just getting a SIG* after 120s of inactivity.

If you prefer SIGALRM, or alarm(2), then we can use ctypes/pyrex/whatever.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1dN90ACgkQJdeBCYSNAAP5nwCgkPAFSuCwQP6+l96tNXjB1PAP
XBQAnj5JBVESjYtPmxJiz13EZ7iJeAt3
=KHsN
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>
> If you do need a signal to interrupt things, how about just using alarm(2)?
>

It turns out it is exposed, but only conditionally, and not where I
expected it.

import signal
signal.alarm()

If you prefer, I'd be happy to use that instead. Should I be catching
SIGALRM, or should I just let it propagate to kill the child?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1j/c0ACgkQJdeBCYSNAANc0QCeLE9a+01g55AbrIa5EVWGmNdh
tukAn3Ybnl0SCYTqoxNKIu0VHvz95+r7
=LF2C
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

> It turns out it is exposed, but only conditionally, and not where I
> expected it.
>
> import signal
> signal.alarm()
>
> If you prefer, I'd be happy to use that instead. Should I be catching
> SIGALRM, or should I just let it propagate to kill the child?

Since it's supposed to be an unusual case that the connection fails
half way through, I think just letting it kill the client should be
fine. The parent should wait on them and log their exit status.

Revision history for this message
Gavin Panella (allenap) wrote :

Cool :)

[1]

+ def _get_child_connect_timeout(self):
+ """signal.alarm only supports 1s granularity.
+
+ We have to make sure we don't ever send 0, which would not generate an
+ alarm.
+ """
+ timeout = int(self._child_connect_timeout)
+ if timeout <= 0:
+ timeout = 1
+ return timeout

I can't see where _child_connect_timeout might be set to something
other than an int, so you could remove this and prevent problems
another way:

        signal.alarm(max(1, self._child_connect_timeout))

But I don't really have a strong argument why you ought to change it
other than it's more code to read, more stuff to brain-load when
coming in to maintain this code.

[2]

+ # Even though it timed out, we still cleanup the temp dir
+ self.assertFalse(os.path.exists(tempdir))

I think this only happens because SIGALRM was neutered. If no handler
was installed I doubt the temporary directory would be cleared
out. Perhaps there needs to be a handler for SIGALRM anyway, that
raises a custom exception perhaps, or signal.default_int_handler, or
whatever :)

[3]

+ # We can't actually run this test, because by default 'bzr serve
+ # --inet' does not flush after each message.

Out of interest, why doesn't it flush? Can this test be fixed and
enabled later? Consider removing it if it's just going to bit-rot.

[4]

+ # The master process should clean up after the now deceased child.
+ self.failIfExists(path)

If, for some reason, the master believes the child has failed, but
there's just been a misunderstanding and the child is really still
running, could this mean that both the master and the child try to
clean up the fifos directory? _cleanup_fifos() in the child code
doesn't swallow errors, so this might fail.

Consider using shutil instead:

  shutil.rmtree(path, ignore_errors=True)

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (3.6 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/3/2011 4:43 PM, Gavin Panella wrote:
> Review: Approve
> Cool :)
>
>
> [1]
>
> + def _get_child_connect_timeout(self):
> + """signal.alarm only supports 1s granularity.
> +
> + We have to make sure we don't ever send 0, which would not generate an
> + alarm.
> + """
> + timeout = int(self._child_connect_timeout)
> + if timeout <= 0:
> + timeout = 1
> + return timeout
>
> I can't see where _child_connect_timeout might be set to something
> other than an int, so you could remove this and prevent problems
> another way:
>
> signal.alarm(max(1, self._child_connect_timeout))
>
> But I don't really have a strong argument why you ought to change it
> other than it's more code to read, more stuff to brain-load when
> coming in to maintain this code.

It used to be a float in the code, and I often set it to sub second in
the test suite, etc. I was just being cautious, because setting it to
0.1s caused it to get set to 0, which *unsets* alarms rather than
setting them.

>
>
> [2]
>
> + # Even though it timed out, we still cleanup the temp dir
> + self.assertFalse(os.path.exists(tempdir))
>
> I think this only happens because SIGALRM was neutered. If no handler
> was installed I doubt the temporary directory would be cleared
> out. Perhaps there needs to be a handler for SIGALRM anyway, that
> raises a custom exception perhaps, or signal.default_int_handler, or
> whatever :)
>

It used to be true that the signal was trapped. If we don't trap it at
all, it kills the test suite. I used to install a signal handler (when I
wasn't using signal.alarm). I can take out that assertion, as the parent
process will handle cleaning up the child. All I *really* care about is
that the child doesn't get hung indefinitely trying to open fifos that
will never open.

>
> [3]
>
> + # We can't actually run this test, because by default 'bzr serve
> + # --inet' does not flush after each message.
>
> Out of interest, why doesn't it flush? Can this test be fixed and
> enabled later? Consider removing it if it's just going to bit-rot.
>

At the moment, there is a layering issue. The part that writes the data
only gets a single 'write' callable. Rather than getting both 'write'
and 'flush'. We don't want to flush all the time, just in-between
messages. So that any buffering is used appropriately.

As it stands, I can't really test that blocking mode works correctly.

>
> [4]
>
> + # The master process should clean up after the now deceased child.
> + self.failIfExists(path)
>
> If, for some reason, the master believes the child has failed, but
> there's just been a misunderstanding and the child is really still
> running, could this mean that both the master and the child try to
> clean up the fifos directory? _cleanup_fifos() in the child code
> doesn't swallow errors, so this might fail.
>
> Consider using shutil instead:
>
> shutil.rmtree(path, ignore_errors=True)
>

We use os.waitpid() to tell that the child has exited. It will be pretty
hard for that to give a false positive. And if the f...

Read more...

Revision history for this message
Gavin Panella (allenap) wrote :

[2]

> I can take out that assertion, as the parent process will handle
> cleaning up the child.

I think that's the best thing to do.

[4]

> We use os.waitpid() to tell that the child has exited. It will be
> pretty hard for that to give a false positive. And if the file
> handles get deleted from under the child, it is going to crash
> anyway, because the remote process won't be able to successfully
> connect to the child.

Ah, fair enough, thanks for the explanation.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/3/2011 5:28 PM, Gavin Panella wrote:
> [2]
>
>> I can take out that assertion, as the parent process will handle
>> cleaning up the child.
>
> I think that's the best thing to do.
>
>
> [4]
>
>> We use os.waitpid() to tell that the child has exited. It will be
>> pretty hard for that to give a false positive. And if the file
>> handles get deleted from under the child, it is going to crash
>> anyway, because the remote process won't be able to successfully
>> connect to the child.
>
> Ah, fair enough, thanks for the explanation.
>

Your requests should now be addressed.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1vyTsACgkQJdeBCYSNAAM2/QCgsm9HqRHKvf4w+w47REzw5+MQ
XM4AoMz2vazQl4Mx6Zl7qVRd7yBPFN7D
=0Zuf
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote :

That looks pretty good to me. It's a bit simpler than having a separate thread, and I would say runs less risk of unexpected environmental problems.

One more question:

+ for path, flags in to_open:
+ try:
+ fids.append(os.open(path, flags))
+ except OSError, e:
+ if e.errno == errno.EINTR:
+ error = ('After %.3fs we failed to open %s, exiting'
+ % (time.time() - tstart, path,))
+ trace.warning(error)
+ for fid in fids:
+ try:
+ os.close(fid)
+ except OSError:
+ pass
+ self._cleanup_fifos(base_path)
+ raise errors.BzrError(error)
+ raise

It seems to me that normally you won't actually hit that except block, because the sigalarm will normally immediately terminate the process. Am I wrong? However, you will hit it when it's run from a test where you do set a handler. I think this behaviour is ok, but I would like a comment in the except block warning that it doesn't necessarily run.

(Alternatively, on the main code path, you could set sigalarm to a do-nothing handler, which would probably suffice to get you an eintr. But maybe not: istr mgz discovering that Python has some strange behaviour about automatically restarting some system calls. So on the whole I wouldn't do that.)

so, from me, just 'tweak' to add a comment. annelap may care to review it again.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/4/2011 4:10 AM, Martin Pool wrote:
> That looks pretty good to me. It's a bit simpler than having a separate thread, and I would say runs less risk of unexpected environmental problems.
>
> One more question:
>
> + for path, flags in to_open:
> + try:
> + fids.append(os.open(path, flags))
> + except OSError, e:
> + if e.errno == errno.EINTR:
> + error = ('After %.3fs we failed to open %s, exiting'
> + % (time.time() - tstart, path,))
> + trace.warning(error)
> + for fid in fids:
> + try:
> + os.close(fid)
> + except OSError:
> + pass
> + self._cleanup_fifos(base_path)
> + raise errors.BzrError(error)
> + raise
>
> It seems to me that normally you won't actually hit that except block, because the sigalarm will normally immediately terminate the process. Am I wrong? However, you will hit it when it's run from a test where you do set a handler. I think this behaviour is ok, but I would like a comment in the except block warning that it doesn't necessarily run.
>
> (Alternatively, on the main code path, you could set sigalarm to a do-nothing handler, which would probably suffice to get you an eintr. But maybe not: istr mgz discovering that Python has some strange behaviour about automatically restarting some system calls. So on the whole I wouldn't do that.)
>
> so, from me, just 'tweak' to add a comment. annelap may care to review it again.

Setting SIGALRM to a do-nothing handler does, indeed, cause me to get
EINTR. There is a test that already exercises it, because it is what was
used with the old SIGUSR1 handler.

Would you rather just see it removed, since SIGALRM is killing the child
completely by default?

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1wjhsACgkQJdeBCYSNAAO25wCgvBXZiEQtDl+6P4XnNVWyRDBX
6W0AoI9Cw5qFsDPFSqCT0ARrLRVwwzLx
=EfcM
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote :

On 4 March 2011 18:00, John Arbash Meinel <email address hidden> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 3/4/2011 4:10 AM, Martin Pool wrote:
>> That looks pretty good to me.  It's a bit simpler than having a separate thread, and I would say runs less risk of unexpected environmental problems.
>>
>> One more question:
>>
>> +        for path, flags in to_open:
>> +            try:
>> +                fids.append(os.open(path, flags))
>> +            except OSError, e:
>> +                if e.errno == errno.EINTR:
>> +                    error = ('After %.3fs we failed to open %s, exiting'
>> +                             % (time.time() - tstart, path,))
>> +                    trace.warning(error)
>> +                    for fid in fids:
>> +                        try:
>> +                            os.close(fid)
>> +                        except OSError:
>> +                            pass
>> +                    self._cleanup_fifos(base_path)
>> +                    raise errors.BzrError(error)
>> +                raise
>>
>> It seems to me that normally you won't actually hit that except block, because the sigalarm will normally immediately terminate the process.  Am I wrong?  However, you will hit it when it's run from a test where you do set a handler.  I think this behaviour is ok, but I would like a comment in the except block warning that it doesn't necessarily run.
>>
>> (Alternatively, on the main code path, you could set sigalarm to a do-nothing handler, which would probably suffice to get you an eintr.  But maybe not: istr mgz discovering that Python has some strange behaviour about automatically restarting some system calls.  So on the whole I wouldn't do that.)
>>
>> so, from me, just 'tweak' to add a comment.  annelap may care to review it again.
>

> Setting SIGALRM to a do-nothing handler does, indeed, cause me to get
> EINTR. There is a test that already exercises it, because it is what was
> used with the old SIGUSR1 handler.
>
> Would you rather just see it removed, since SIGALRM is killing the child
> completely by default?

My order of preference is

1- remove it and count on sigalrm actually killing it, including from the test
2- leave it and just add a comment it's not normally hit
3- leave it and also register a do-nothing handler to make sure it
actually is hit

I don't like 2 and 3 so much because there's some chance it will break
if something else changes, and there may be other reasons we get an
eintr earlier than we actually wanted to time out. (The latter is not
very likely admittedly.) Also, all else being equal, less code is
better.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...
>> Would you rather just see it removed, since SIGALRM is killing the child
>> completely by default?
>
> My order of preference is
>
> 1- remove it and count on sigalrm actually killing it, including from the test
> 2- leave it and just add a comment it's not normally hit
> 3- leave it and also register a do-nothing handler to make sure it
> actually is hit
>
> I don't like 2 and 3 so much because there's some chance it will break
> if something else changes, and there may be other reasons we get an
> eintr earlier than we actually wanted to time out. (The latter is not
> very likely admittedly.) Also, all else being equal, less code is
> better.
>

If we want to test the function directly, the test suite has to have a
sigalrm handler, or it will kill the test run. So I left in a bit that
ensures we at least do a bit of cleanup, and raise an exception right
away. I suppose we could go one step further, and just not trap the
OSError and close the file descriptors. However, that causes the test
suite to leak file descriptors. Only 1 or 2, but it isn't very nice.

We do have a test that a spawned process dies with SIGALRM (because it
is part of the return code.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1yCfgACgkQJdeBCYSNAANRPwCgqHAvtGiUDweF6oYhfjul1N6x
jS8AoKVB83ldRCiCxsI2qPbgkwRpEUAL
=C2vK
-----END PGP SIGNATURE-----

Revision history for this message
Benji York (benji) :
review: Approve (release-critical)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrplugins/lpserve/__init__.py'
2--- bzrplugins/lpserve/__init__.py 2011-02-17 21:47:07 +0000
3+++ bzrplugins/lpserve/__init__.py 2011-03-06 00:03:56 +0000
4@@ -15,6 +15,7 @@
5
6
7 import errno
8+import fcntl
9 import logging
10 import os
11 import resource
12@@ -31,6 +32,7 @@
13 from bzrlib.option import Option
14 from bzrlib import (
15 commands,
16+ errors,
17 lockdir,
18 osutils,
19 trace,
20@@ -309,6 +311,11 @@
21 SLEEP_FOR_CHILDREN_TIMEOUT = 1.0
22 WAIT_FOR_REQUEST_TIMEOUT = 1.0 # No request should take longer than this to
23 # be read
24+ CHILD_CONNECT_TIMEOUT = 120 # If we get a fork() request, but nobody
25+ # connects just exit
26+ # On a heavily loaded server, it could take a
27+ # couple secs, but it should never take
28+ # minutes
29
30 _fork_function = os.fork
31
32@@ -324,6 +331,7 @@
33 # Map from pid => (temp_path_for_handles, request_socket)
34 self._child_processes = {}
35 self._children_spawned = 0
36+ self._child_connect_timeout = self.CHILD_CONNECT_TIMEOUT
37
38 def _create_master_socket(self):
39 self._server_socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
40@@ -372,28 +380,84 @@
41 signal.signal(signal.SIGCHLD, signal.SIG_DFL)
42 signal.signal(signal.SIGTERM, signal.SIG_DFL)
43
44+ def _compute_paths(self, base_path):
45+ stdin_path = os.path.join(base_path, 'stdin')
46+ stdout_path = os.path.join(base_path, 'stdout')
47+ stderr_path = os.path.join(base_path, 'stderr')
48+ return (stdin_path, stdout_path, stderr_path)
49+
50 def _create_child_file_descriptors(self, base_path):
51- stdin_path = os.path.join(base_path, 'stdin')
52- stdout_path = os.path.join(base_path, 'stdout')
53- stderr_path = os.path.join(base_path, 'stderr')
54+ stdin_path, stdout_path, stderr_path = self._compute_paths(base_path)
55 os.mkfifo(stdin_path)
56 os.mkfifo(stdout_path)
57 os.mkfifo(stderr_path)
58
59- def _bind_child_file_descriptors(self, base_path):
60- stdin_path = os.path.join(base_path, 'stdin')
61- stdout_path = os.path.join(base_path, 'stdout')
62- stderr_path = os.path.join(base_path, 'stderr')
63+ def _set_blocking(self, fd):
64+ """Change the file descriptor to unset the O_NONBLOCK flag."""
65+ flags = fcntl.fcntl(fd, fcntl.F_GETFD)
66+ flags = flags & (~os.O_NONBLOCK)
67+ fcntl.fcntl(fd, fcntl.F_SETFD, flags)
68+
69+ def _open_handles(self, base_path):
70+ """Open the given file handles.
71+
72+ This will attempt to open all of these file handles, but will not block
73+ while opening them, timing out after self._child_connect_timeout
74+ seconds.
75+
76+ :param base_path: The directory where all FIFOs are located
77+ :return: (stdin_fid, stdout_fid, stderr_fid)
78+ """
79+ stdin_path, stdout_path, stderr_path = self._compute_paths(base_path)
80 # These open calls will block until another process connects (which
81 # must connect in the same order)
82- stdin_fid = os.open(stdin_path, os.O_RDONLY)
83- stdout_fid = os.open(stdout_path, os.O_WRONLY)
84- stderr_fid = os.open(stderr_path, os.O_WRONLY)
85+ fids = []
86+ to_open = [(stdin_path, os.O_RDONLY), (stdout_path, os.O_WRONLY),
87+ (stderr_path, os.O_WRONLY)]
88+ # If we set it to 0, we won't get an alarm, so require some time > 0.
89+ signal.alarm(max(1, self._child_connect_timeout))
90+ tstart = time.time()
91+ for path, flags in to_open:
92+ try:
93+ fids.append(os.open(path, flags))
94+ except OSError, e:
95+ # In production code, signal.alarm will generally just kill
96+ # us. But if something installs a signal handler for SIGALRM,
97+ # do what we can to die gracefully.
98+ error = ('After %.3fs we failed to open %s, exiting'
99+ % (time.time() - tstart, path,))
100+ trace.warning(error)
101+ for fid in fids:
102+ try:
103+ os.close(fid)
104+ except OSError:
105+ pass
106+ raise errors.BzrError(error)
107+ # If we get to here, that means all the handles were opened
108+ # successfully, so cancel the wakeup call.
109+ signal.alarm(0)
110+ return fids
111+
112+ def _cleanup_fifos(self, base_path):
113+ """Remove the FIFO objects and directory from disk."""
114+ stdin_path, stdout_path, stderr_path = self._compute_paths(base_path)
115+ # Now that we've opened the handles, delete everything so that we don't
116+ # leave garbage around. Because the open() is done in blocking mode, we
117+ # know that someone has already connected to them, and we don't want
118+ # anyone else getting confused and connecting.
119+ # See [Decision #5]
120+ os.remove(stdin_path)
121+ os.remove(stdout_path)
122+ os.remove(stderr_path)
123+ os.rmdir(base_path)
124+
125+ def _bind_child_file_descriptors(self, base_path):
126 # Note: by this point bzrlib has opened stderr for logging
127 # (as part of starting the service process in the first place).
128 # As such, it has a stream handler that writes to stderr. logging
129 # tries to flush and close that, but the file is already closed.
130 # This just supresses that exception
131+ stdin_fid, stdout_fid, stderr_fid = self._open_handles(base_path)
132 logging.raiseExceptions = False
133 sys.stdin.close()
134 sys.stdout.close()
135@@ -407,15 +471,7 @@
136 ui.ui_factory.stdin = sys.stdin
137 ui.ui_factory.stdout = sys.stdout
138 ui.ui_factory.stderr = sys.stderr
139- # Now that we've opened the handles, delete everything so that we don't
140- # leave garbage around. Because the open() is done in blocking mode, we
141- # know that someone has already connected to them, and we don't want
142- # anyone else getting confused and connecting.
143- # See [Decision #5]
144- os.remove(stderr_path)
145- os.remove(stdout_path)
146- os.remove(stdin_path)
147- os.rmdir(base_path)
148+ self._cleanup_fifos(base_path)
149
150 def _close_child_file_descriptors(self):
151 sys.stdin.close()
152@@ -634,6 +690,13 @@
153 c_path, sock = self._child_processes.pop(c_id)
154 trace.mutter('%s exited %s and usage: %s'
155 % (c_id, exit_code, rusage))
156+ # Cleanup the child path, before mentioning it exited to the
157+ # caller. This avoids a race condition in the test suite.
158+ if os.path.exists(c_path):
159+ # The child failed to cleanup after itself, do the work here
160+ trace.warning('Had to clean up after child %d: %s\n'
161+ % (c_id, c_path))
162+ shutil.rmtree(c_path, ignore_errors=True)
163 # See [Decision #4]
164 try:
165 sock.sendall('exited\n%s\n' % (exit_code,))
166@@ -643,11 +706,6 @@
167 trace.mutter('%s\'s socket already closed: %s' % (c_id, e))
168 else:
169 sock.close()
170- if os.path.exists(c_path):
171- # The child failed to cleanup after itself, do the work here
172- trace.warning('Had to clean up after child %d: %s\n'
173- % (c_id, c_path))
174- shutil.rmtree(c_path, ignore_errors=True)
175
176 def _wait_for_children(self, secs):
177 start = time.time()
178@@ -724,6 +782,15 @@
179 self._should_terminate.set()
180 conn.sendall('ok\nquit command requested... exiting\n')
181 conn.close()
182+ elif request.startswith('child_connect_timeout '):
183+ try:
184+ value = int(request.split(' ', 1)[1])
185+ except ValueError, e:
186+ conn.sendall('FAILURE: %r\n' % (e,))
187+ else:
188+ self._child_connect_timeout = value
189+ conn.sendall('ok\n')
190+ conn.close()
191 elif request.startswith('fork ') or request.startswith('fork-env '):
192 command_argv, env = self._parse_fork_request(conn, client_addr,
193 request)
194
195=== modified file 'bzrplugins/lpserve/test_lpserve.py'
196--- bzrplugins/lpserve/test_lpserve.py 2010-11-08 22:43:58 +0000
197+++ bzrplugins/lpserve/test_lpserve.py 2011-03-06 00:03:56 +0000
198@@ -3,6 +3,7 @@
199
200 import errno
201 import os
202+import shutil
203 import signal
204 import socket
205 import subprocess
206@@ -13,6 +14,7 @@
207 from testtools import content
208
209 from bzrlib import (
210+ errors,
211 osutils,
212 tests,
213 trace,
214@@ -263,6 +265,46 @@
215 one_byte_at_a_time=True)
216 self.assertStartsWith(response, 'FAILURE\n')
217
218+ def test_child_connection_timeout(self):
219+ self.assertEqual(self.service.CHILD_CONNECT_TIMEOUT,
220+ self.service._child_connect_timeout)
221+ response = self.send_message_to_service('child_connect_timeout 1\n')
222+ self.assertEqual('ok\n', response)
223+ self.assertEqual(1, self.service._child_connect_timeout)
224+
225+ def test_child_connection_timeout_bad_float(self):
226+ self.assertEqual(self.service.CHILD_CONNECT_TIMEOUT,
227+ self.service._child_connect_timeout)
228+ response = self.send_message_to_service('child_connect_timeout 1.2\n')
229+ self.assertStartsWith(response, 'FAILURE:')
230+
231+ def test_child_connection_timeout_no_val(self):
232+ response = self.send_message_to_service('child_connect_timeout \n')
233+ self.assertStartsWith(response, 'FAILURE:')
234+
235+ def test_child_connection_timeout_bad_val(self):
236+ response = self.send_message_to_service('child_connect_timeout b\n')
237+ self.assertStartsWith(response, 'FAILURE:')
238+
239+ def test__open_handles_will_timeout(self):
240+ # signal.alarm() has only 1-second granularity. :(
241+ self.service._child_connect_timeout = 1
242+ tempdir = tempfile.mkdtemp(prefix='testlpserve-')
243+ self.addCleanup(shutil.rmtree, tempdir, ignore_errors=True)
244+ os.mkfifo(os.path.join(tempdir, 'stdin'))
245+ os.mkfifo(os.path.join(tempdir, 'stdout'))
246+ os.mkfifo(os.path.join(tempdir, 'stderr'))
247+ # catch SIGALRM so we don't stop the test suite. It will still
248+ # interupt the blocking open() calls.
249+ def noop_on_alarm(signal, frame):
250+ return
251+ signal.signal(signal.SIGALRM, noop_on_alarm)
252+ self.addCleanup(signal.signal, signal.SIGALRM, signal.SIG_DFL)
253+ e = self.assertRaises(errors.BzrError,
254+ self.service._open_handles, tempdir)
255+ self.assertContainsRe(str(e), r'After \d+.\d+s we failed to open.*')
256+
257+
258
259 class TestCaseWithSubprocess(tests.TestCaseWithTransport):
260 """Override the bzr start_bzr_subprocess command.
261@@ -452,9 +494,9 @@
262 stderr_path = os.path.join(path, 'stderr')
263 # The ordering must match the ordering of the service or we get a
264 # deadlock.
265- child_stdin = open(stdin_path, 'wb')
266- child_stdout = open(stdout_path, 'rb')
267- child_stderr = open(stderr_path, 'rb')
268+ child_stdin = open(stdin_path, 'wb', 0)
269+ child_stdout = open(stdout_path, 'rb', 0)
270+ child_stderr = open(stderr_path, 'rb', 0)
271 return child_stdin, child_stdout, child_stderr
272
273 def communicate_with_fork(self, path, stdin=None):
274@@ -484,6 +526,24 @@
275 self.assertEqual('', stderr_content)
276 self.assertReturnCode(0, sock)
277
278+ def DONT_test_fork_lp_serve_multiple_hello(self):
279+ # This ensures that the fifos are all set to blocking mode
280+ # We can't actually run this test, because by default 'bzr serve
281+ # --inet' does not flush after each message. So we end up blocking
282+ # forever waiting for the server to finish responding to the first
283+ # request.
284+ path, _, sock = self.send_fork_request('lp-serve --inet 2')
285+ child_stdin, child_stdout, child_stderr = self._get_fork_handles(path)
286+ child_stdin.write('hello\n')
287+ child_stdin.flush()
288+ self.assertEqual('ok\x012\n', child_stdout.read())
289+ child_stdin.write('hello\n')
290+ self.assertEqual('ok\x012\n', child_stdout.read())
291+ child_stdin.close()
292+ self.assertEqual('', child_stderr.read())
293+ child_stdout.close()
294+ child_stderr.close()
295+
296 def test_fork_replay(self):
297 path, _, sock = self.send_fork_request('launchpad-replay')
298 stdout_content, stderr_content = self.communicate_with_fork(path,
299@@ -540,6 +600,29 @@
300 def test_sigint_exits_nicely(self):
301 self._check_exits_nicely(signal.SIGINT)
302
303+ def test_child_exits_eventually(self):
304+ # We won't ever bind to the socket the child wants, and after some
305+ # time, the child should exit cleanly.
306+ # First, tell the subprocess that we want children to exit quickly.
307+ # *sigh* signal.alarm only has 1s resolution, so this test is slow.
308+ response = self.send_message_to_service('child_connect_timeout 1\n')
309+ self.assertEqual('ok\n', response)
310+ # Now request a fork
311+ path, pid, sock = self.send_fork_request('rocks')
312+ # # Open one handle, but not all of them
313+ stdin_path = os.path.join(path, 'stdin')
314+ stdout_path = os.path.join(path, 'stdout')
315+ stderr_path = os.path.join(path, 'stderr')
316+ child_stdin = open(stdin_path, 'wb')
317+ # We started opening the child, but stop before we get all handles
318+ # open. After 1 second, the child should get signaled and die.
319+ # The master process should notice, and tell us the status of the
320+ # exited child.
321+ val = sock.recv(4096)
322+ self.assertEqual('exited\n%s\n' % (signal.SIGALRM,), val)
323+ # The master process should clean up after the now deceased child.
324+ self.failIfExists(path)
325+
326
327 class TestCaseWithLPForkingServiceDaemon(
328 TestCaseWithLPForkingServiceSubprocess):