Merge lp:~jameinel/launchpad/twisted-close-on-failure into lp:launchpad

Proposed by John A Meinel
Status: Merged
Approved by: Robert Collins
Approved revision: no longer in the source branch.
Merged at revision: 12415
Proposed branch: lp:~jameinel/launchpad/twisted-close-on-failure
Merge into: lp:launchpad
Diff against target: 137 lines (+62/-31)
1 file modified
lib/lp/codehosting/sshserver/session.py (+62/-31)
To merge this branch: bzr merge lp:~jameinel/launchpad/twisted-close-on-failure
Reviewer Review Type Date Requested Status
Brad Crittenden (community) code Approve
Benji York (community) code* Approve
Review via email: mp+50067@code.launchpad.net

Commit message

[r=bac,benji][ui=none][bug=717345][no-qa] If we fail to open all handles, close the ones we did succeed in opening. (bug #717345)

Description of the change

This is the final step I've put together for bug #717345.

This changes the spawning code so that it more directly handles when we fail to actually connect to a newly spawned child. This should decrease the chance of leaking file handles. And with https://code.launchpad.net/~jameinel/launchpad/lp-serve-child-hangup/+merge/50055 I see that even after forcing the server to run out of handles for a while, it recovers and all remaining handles get closed.

I tested this using 'ulimit -n 100', and then 'hammer_ssh.py --load=20'. So it starts running out of handles, but serves what it can on the remaining connections. Then once the load is gone, I can see all the remaining handles get closed, and the process is back to a clean state.

Revision history for this message
Benji York (benji) wrote :

This branch looks like it will greatly help prevent resource exhaustion.
Approved* with a couple of questions/observations for your consideration:

(* I'm doing mentored reviews, so Brad will be reviewing my review and
giving final approval.)

I'm curious why the check (on line 26 of the diff) that the content of
"response" not start with "FAILURE" was removed? Was it because the
assert will also fire in that case? If so, note two things: 1) that's
an abuse of assert in my book: assert is for asserting the programmer's
beliefs about preconditions, postconditions, and invariants, not
checking for erroneous inputs, 2) asserts are compiled out if run with
-O or -OO, so it sets up a failure mode that won't be easy for someone
to track down.

The original "assert ok == 'ok'" was fine because there was a previous
check for ok == 'FAILURE' and the intent was that the only two possible
values for ok were "ok" and "FAILURE" so at that point in the code any
value other than "ok" signaled a mismatch between the programmer's
internal model and reality.
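To make the -O point concrete, here is a tiny illustration (hypothetical function names, not the session.py code): the assert-based variant is compiled out under `python -O`, while the explicit check always runs.

```python
def parse_response_with_assert(response):
    # Relies on assert: compiled out under "python -O", so a bad
    # response would silently slip through in optimized mode.
    ok, pid, path, tail = response.split('\n')
    assert ok == 'ok'
    return int(pid), path

def parse_response_with_check(response):
    # Explicit check: runs regardless of -O/-OO.
    ok, pid, path, tail = response.split('\n')
    if ok != 'ok':
        raise RuntimeError('unexpected response: %r' % (response,))
    return int(pid), path
```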

This bit of code confuses me (it's the body of openHandleFailures sans
docstring):

    fd = os.open(path, flags)
    call_on_failure.append((os.close, fd))
    p = proc_class(reactor, self, child_fd, fd)
    # The ProcessReader/Writer now handles this fileno, don't close
    # directly anymore
    call_on_failure[-1] = (p.connectionLost, (None,))
    self.pipes[child_fd] = p

In particular, it appends a tuple to call_on_failure, makes a function
call that appears to be unrelated to call_on_failure and then replaces
the just-added item at the end of call_on_failure with a new tuple.

Oh wait! Is this because the responsibility for closing fd moves to p
after it is successfully created? Is that what the comment is trying to
tell me? If so, maybe a different phrasing would help. Something like:

    # Now that p has been successfully created it takes on the
    # responsibility of closing fd, so we'll replace the explicit close
    # of fd with the connectionLost handler of p.

In the following code (lines 88 to 98 of the diff), is it possible that
any of the functions in call_on_failure might raise an exception? If
so, is another layer of try/excepts prudent?

    try:
        ...
    except:
        for func, args in call_on_failure:
            func(*args)
        raise

review: Approve (code*)
Revision history for this message
John A Meinel (jameinel) wrote :

The FAILURE check was removed because the code we called already checked for it.

The code was, roughly:

def _sendRequest(...):

  response = socket.recv()
  if response.startswith("FAILURE"):
    # raise exception
  return response

def _spawn(...):

  response = self._sendRequest()
  if response.startswith('FAILURE'):
    # raise exception

So I was just removing duplication.

Yes, we no longer close the fd, because the ProcessWriter/ProcessReader will close it. I can switch the comment if you find it easier to understand.

For the cleanup code... we could suppress all exceptions at that point, I guess. It's a little ugly, but that may not be preventable. FWIW, bzr has cleanup code for this sort of thing, except it always runs the cleanups (but suppresses failures if there is a current exception running).

Revision history for this message
Benji York (benji) wrote :

On Fri, Feb 18, 2011 at 11:03 AM, John A Meinel <email address hidden> wrote:
> The FAILURE check was removed because the code we called already
> checked for it.
[...]
> So I was just removing duplication.

Ah! That makes sense.

> Yes, we no longer close the fd, because the
> ProcessWriter/ProcessReader will close it. I can switch the comment if
> you find it easier to understand.

I do find it clearer. That being said, I eventually figured it out so
the original comment can't be all bad.

> For the cleanup code... we could suppress all exceptions at that
> point, I guess. Its a little ugly, but that may not be preventable.
> FWIW, bzr has cleanup code for this sort of thing, except it always
> runs the cleanups (but suppresses failures if there is a current
> exception running.)

Right, the central question is: should a failure of the cleanup code
cause an exception to bubble up, or should it be silently ignored? It's
possible that that failure could cause file handles to be leaked, which
is what you're trying hard to prevent. On the other hand, a rare
failure when cleaning up probably shouldn't cause the entire process to
end, so ignoring exceptions from cleanup code seems like the lesser
evil.
--
Benji York

Revision history for this message
Brad Crittenden (bac) wrote :

I concur with Benji's review. I had the same questions regarding the asserts and think Benji's suggestion for rephrasing the comments makes the code more clear.

The code does look to be much more robust than what we currently have.

I'll mark this branch mentored (approved) and let you and Benji sort out the questions he raised.

review: Approve (code)
Revision history for this message
John A Meinel (jameinel) wrote :

On 2/18/2011 10:07 AM, Benji York wrote:
> On Fri, Feb 18, 2011 at 11:03 AM, John A Meinel <email address hidden> wrote:
>> The FAILURE check was removed because the code we called already
>> checked for it.
> [...]
>> So I was just removing duplication.
>
> Ah! That makes sense.
>
>> Yes, we no longer close the fd, because the
>> ProcessWriter/ProcessReader will close it. I can switch the comment if
>> you find it easier to understand.
>
> I do find it clearer. That being said, I eventually figured it out so
> the original comment can't be all bad.

Done.

>
>> For the cleanup code... we could suppress all exceptions at that
>> point, I guess. Its a little ugly, but that may not be preventable.
>> FWIW, bzr has cleanup code for this sort of thing, except it always
>> runs the cleanups (but suppresses failures if there is a current
>> exception running.)
>
> Right, the central question is: should a failure of the cleanup code
> cause an exception to bubble up or should it be silently ignored. It's
> possible that that failure could cause file handles to be leaked, which
> is what you're trying hard to prevent. On the other hand, a rare
> failure when cleaning up probably shouldn't cause the entire process to
> end, so ignoring exceptions from cleanup code seems like the lesser
> evil.

I changed it to trap the original exception using sys.exc_info(), then
iterate the cleanups, and if there is another exception just call
log.err(). Then raise the original exception again.

Which seemed a reasonable tradeoff.
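As a rough sketch of that tradeoff (hypothetical helper names, not the Launchpad code; the real version lives in _connectSpawnToReactor and uses Python 2's three-argument raise, where modern Python can use a bare `raise`):

```python
import traceback

def run_with_cleanups(operation, call_on_failure, log_err=traceback.print_exc):
    """Run operation(); if it fails, invoke every (func, args) cleanup,
    logging (not raising) any secondary errors, then re-raise the original."""
    try:
        return operation()
    except Exception:
        for func, args in call_on_failure:
            try:
                func(*args)
            except Exception:
                # A cleanup failed while the original exception is in
                # flight: log it so the remaining cleanups still run.
                log_err()
        # A bare raise restores and re-raises the original exception.
        raise
```

In Python 3 the bare `raise` at the end of the except block re-raises the original exception even after the nested handler has run, so no explicit exc_info bookkeeping is needed.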

Committed and pushed.

John
=:->

Revision history for this message
Benji York (benji) wrote :

Looks good!

Preview Diff

1=== modified file 'lib/lp/codehosting/sshserver/session.py'
2--- lib/lp/codehosting/sshserver/session.py 2010-10-07 23:43:40 +0000
3+++ lib/lp/codehosting/sshserver/session.py 2011-02-18 16:23:00 +0000
4@@ -11,6 +11,7 @@
5 import os
6 import signal
7 import socket
8+import sys
9 import urlparse
10
11 from zope.event import notify
12@@ -21,7 +22,8 @@
13 interfaces,
14 process,
15 )
16-from twisted.python import log
17+from twisted.python import log, failure
18+from twisted.internet.main import CONNECTION_LOST, CONNECTION_DONE
19
20 from canonical.config import config
21 from lp.codehosting import get_bzr_path
22@@ -129,6 +131,7 @@
23 log.err('Connection failed: %s' % (e,))
24 raise
25 if response.startswith("FAILURE"):
26+ client_sock.close()
27 raise RuntimeError('Failed to send message: %r' % (response,))
28 return response, client_sock
29
30@@ -154,36 +157,62 @@
31 message.append('end\n')
32 message = ''.join(message)
33 response, sock = self._sendMessageToService(message)
34- if response.startswith('FAILURE'):
35- # TODO: Is there a better error to raise?
36- raise RuntimeError("Failed while sending message to forking "
37- "service. message: %r, failure: %r"
38- % (message, response))
39- ok, pid, path, tail = response.split('\n')
40- assert ok == 'ok'
41- assert tail == ''
42- pid = int(pid)
43- log.msg('Forking returned pid: %d, path: %s' % (pid, path))
44+ try:
45+ ok, pid, path, tail = response.split('\n')
46+ assert ok == 'ok'
47+ assert tail == ''
48+ pid = int(pid)
49+ log.msg('Forking returned pid: %d, path: %s' % (pid, path))
50+ except:
51+ sock.close()
52+ raise
53 return pid, path, sock
54
55+ def _openHandleFailures(self, call_on_failure, path, flags, proc_class,
56+ reactor, child_fd):
57+ """Open the given path, adding a cleanup as appropriate.
58+
59+ :param call_on_failure: A list holding (callback, args) tuples. We will
60+ append new entries for things that we open
61+ :param path: The path to open
62+ :param flags: Flags to pass to os.open
63+ :param proc_class: The ProcessWriter/ProcessReader class to wrap this
64+ connection.
65+ :param reactor: The Twisted reactor we are connecting to.
66+ :param child_fd: The child file descriptor number passed to proc_class
67+ """
68+ fd = os.open(path, flags)
69+ call_on_failure.append((os.close, fd))
70+ p = proc_class(reactor, self, child_fd, fd)
71+ # Now that p has been created, it will close fd for us. So switch the
72+ # cleanup to calling p.connectionLost()
73+ call_on_failure[-1] = (p.connectionLost, (None,))
74+ self.pipes[child_fd] = p
75+
76 def _connectSpawnToReactor(self, reactor):
77+ self._exiter = _WaitForExit(reactor, self, self.process_sock)
78+ call_on_failure = [(self._exiter.connectionLost, (None,))]
79 stdin_path = os.path.join(self._fifo_path, 'stdin')
80 stdout_path = os.path.join(self._fifo_path, 'stdout')
81 stderr_path = os.path.join(self._fifo_path, 'stderr')
82- child_stdin_fd = os.open(stdin_path, os.O_WRONLY)
83- self.pipes[0] = process.ProcessWriter(reactor, self, 0,
84- child_stdin_fd)
85- child_stdout_fd = os.open(stdout_path, os.O_RDONLY)
86- # forceReadHack=True ? Used in process.py doesn't seem to be needed
87- # here
88- self.pipes[1] = process.ProcessReader(reactor, self, 1,
89- child_stdout_fd)
90- child_stderr_fd = os.open(stderr_path, os.O_RDONLY)
91- self.pipes[2] = process.ProcessReader(reactor, self, 2,
92- child_stderr_fd)
93- # Note: _exiter forms a GC cycle, since it points to us, and we hold a
94- # reference to it
95- self._exiter = _WaitForExit(reactor, self, self.process_sock)
96+ try:
97+ self._openHandleFailures(call_on_failure, stdin_path, os.O_WRONLY,
98+ process.ProcessWriter, reactor, 0)
99+ self._openHandleFailures(call_on_failure, stdout_path, os.O_RDONLY,
100+ process.ProcessReader, reactor, 1)
101+ self._openHandleFailures(call_on_failure, stderr_path, os.O_RDONLY,
102+ process.ProcessReader, reactor, 2)
103+ except:
104+ exc_class, exc_value, exc_tb = sys.exc_info()
105+ for func, args in call_on_failure:
106+ try:
107+ func(*args)
108+ except:
109+ # Just log any exceptions at this point. This makes sure
110+ # all cleanups get called so we don't get leaks. We know
111+ # there is an active exception, or we wouldn't be here.
112+ log.err()
113+ raise exc_class, exc_value, exc_tb
114 self.pipes['exit'] = self._exiter
115
116 def _getReason(self, status):
117@@ -248,12 +277,14 @@
118 # Implemented because ProcessWriter/ProcessReader want to call it
119 # Copied from twisted.internet.Process
120 def childConnectionLost(self, childFD, reason):
121- close = getattr(self.pipes[childFD], 'close', None)
122- if close is not None:
123- close()
124- else:
125- os.close(self.pipes[childFD].fileno())
126- del self.pipes[childFD]
127+ pipe = self.pipes.get(childFD)
128+ if pipe is not None:
129+ close = getattr(pipe, 'close', None)
130+ if close is not None:
131+ close()
132+ else:
133+ os.close(self.pipes[childFD].fileno())
134+ del self.pipes[childFD]
135 try:
136 self.proto.childConnectionLost(childFD)
137 except: