Merge lp:~jameinel/launchpad/twisted-close-on-failure into lp:launchpad

Proposed by John A Meinel
Status: Merged
Approved by: Robert Collins
Approved revision: no longer in the source branch.
Merged at revision: 12415
Proposed branch: lp:~jameinel/launchpad/twisted-close-on-failure
Merge into: lp:launchpad
Diff against target: 137 lines (+62/-31)
1 file modified
lib/lp/codehosting/sshserver/session.py (+62/-31)
To merge this branch: bzr merge lp:~jameinel/launchpad/twisted-close-on-failure
Reviewer Review Type Date Requested Status
Brad Crittenden (community) code Approve
Benji York (community) code* Approve
Review via email: mp+50067@code.launchpad.net

Commit message

[r=bac,benji][ui=none][bug=717345][no-qa] If we fail to open all handles, close the ones we did succeed in opening. (bug #717345)

Description of the change

This is the final step I've put together for bug #717345.

This changes the spawning code so that it more directly handles when we fail to actually connect to a newly spawned child. This should decrease the chance of leaking file handles. And with https://code.launchpad.net/~jameinel/launchpad/lp-serve-child-hangup/+merge/50055 I see that even after forcing the server to run out of handles for a while, it recovers and all remaining handles get closed.

I tested this using 'ulimit -n 100', and then 'hammer_ssh.py --load=20'. So it starts running out of handles, but serves what it can on the remaining connections. Then once the load is gone, I can see all the remaining handles get closed, and the process is back to a clean state.
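The handle-exhaustion part of that experiment can also be reproduced in-process with the stdlib resource module (Unix-only; the limit of 256 is an arbitrary choice for illustration, like the 100 used with ulimit above):

```python
import os
import resource

# Lower the soft fd limit so opens start failing, as 'ulimit -n' does.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (256, hard))

fds = []
try:
    while True:  # open handles until the process runs out
        fds.append(os.open(os.devnull, os.O_RDONLY))
except OSError:
    pass  # EMFILE: out of handles, like the server under load

# Once the "load" is gone, closing everything returns the process to a
# clean state and new opens succeed again.
for fd in fds:
    os.close(fd)
resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```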

Revision history for this message
Benji York (benji) wrote :

This branch looks like it will greatly help prevent resource exhaustion.
Approved* with a couple of questions/observations for your consideration:

(* I'm doing mentored reviews, so Brad will be reviewing my review and
giving final approval.)

I'm curious why the check (on line 26 of the diff) that the content of
"response" not start with "FAILURE" was removed? Was it because the
assert will also fire in that case? If so, note two things: 1) that's
an abuse of assert in my book: assert is for asserting the programmer's
beliefs about preconditions, postconditions, and invariants, not
checking for erroneous inputs, 2) asserts are compiled out if run with
-O or -OO, so it sets up a failure mode that won't be easy for someone
to track down.
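The -O behaviour mentioned here is easy to verify; a minimal sketch (modern Python used for illustration):

```python
import subprocess
import sys

CHECK = "assert False, 'boom'"

# Under default flags the assert fires and the process exits non-zero...
plain = subprocess.run([sys.executable, "-c", CHECK])
# ...but under -O assert statements are compiled out entirely, so the
# "check" silently disappears and the process exits cleanly.
optimized = subprocess.run([sys.executable, "-O", "-c", CHECK])

print(plain.returncode, optimized.returncode)
```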

The original "assert ok == 'ok'" was fine because there was a previous
check for ok == 'FAILURE' and the intent was that the only two possible
values for ok were "ok" and "FAILURE" so at that point in the code any
value other than "ok" signaled a mismatch between the programmer's
internal model and reality.

This bit of code confuses me (it's the body of openHandleFailures sans
docstring):

    fd = os.open(path, flags)
    call_on_failure.append((os.close, fd))
    p = proc_class(reactor, self, child_fd, fd)
    # The ProcessReader/Writer now handles this fileno, don't close
    # directly anymore
    call_on_failure[-1] = (p.connectionLost, (None,))
    self.pipes[child_fd] = p

In particular, it appends a tuple to call_on_failure, makes a function
call that appears to be unrelated to call_on_failure and then replaces
the just-added item at the end of call_on_failure with a new tuple.

Oh wait! Is this because the responsibility for closing fd moves to p
after it is successfully created? Is that what the comment is trying to
tell me? If so, maybe a different phrasing would help. Something like:

    # Now that p has been successfully created it takes on the
    # responsibility of closing fd, so we'll replace the explicit close
    # of fd with the connectionLost handler of p.
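The ownership-handoff pattern under discussion can be reduced to a standalone sketch; Wrapper and open_with_cleanup are hypothetical names standing in for ProcessReader/ProcessWriter and _openHandleFailures:

```python
import os

class Wrapper:
    """Stand-in for ProcessReader/ProcessWriter: once constructed,
    it owns the fd and closes it via connectionLost()."""
    def __init__(self, fd):
        self.fd = fd
        self.closed = False

    def connectionLost(self, reason):
        os.close(self.fd)
        self.closed = True

def open_with_cleanup(path, flags, call_on_failure):
    fd = os.open(path, flags)
    # Until a wrapper exists, closing the raw fd is the right cleanup.
    call_on_failure.append((os.close, (fd,)))
    w = Wrapper(fd)
    # The wrapper now owns fd, so swap the raw close for its
    # connectionLost handler -- the handoff described above.
    call_on_failure[-1] = (w.connectionLost, (None,))
    return w

# On failure, the caller unwinds whatever was registered:
cleanups = []
w = open_with_cleanup(os.devnull, os.O_RDONLY, cleanups)
for func, args in cleanups:
    func(*args)
print(w.closed)  # True: the wrapper closed the fd for us
```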

In the following code (lines 88 to 98 of the diff), is it possible that
any of the functions in call_on_failure might raise an exception? If
so, is another layer of try/excepts prudent?

    try:
        ...
    except:
        for func, args in call_on_failure:
            func(*args)
        raise

review: Approve (code*)
Revision history for this message
John A Meinel (jameinel) wrote :

The FAILURE check was removed because the code we called already checked for it.

The code was, roughly:

def _sendRequest(...):
  response = socket.recv(...)
  if response.startswith("FAILURE"):
    # raise exception
  return response

def _spawn(...):
  response = self._sendRequest()
  if response.startswith('FAILURE'):
    # raise exception

So I was just removing duplication.

Yes, we no longer close the fd, because the ProcessWriter/ProcessReader will close it. I can switch the comment if you find it easier to understand.

For the cleanup code... we could suppress all exceptions at that point, I guess. It's a little ugly, but that may not be preventable. FWIW, bzr has cleanup code for this sort of thing, except it always runs the cleanups (but suppresses failures if there is a current exception running).

Revision history for this message
Benji York (benji) wrote :

On Fri, Feb 18, 2011 at 11:03 AM, John A Meinel <email address hidden> wrote:
> The FAILURE check was removed because the code we called already
> checked for it.
[...]
> So I was just removing duplication.

Ah! That makes sense.

> Yes, we no longer close the fd, because the
> ProcessWriter/ProcessReader will close it. I can switch the comment if
> you find it easier to understand.

I do find it clearer. That being said, I eventually figured it out so
the original comment can't be all bad.

> For the cleanup code... we could suppress all exceptions at that
> point, I guess. It's a little ugly, but that may not be preventable.
> FWIW, bzr has cleanup code for this sort of thing, except it always
> runs the cleanups (but suppresses failures if there is a current
> exception running.)

Right, the central question is: should a failure of the cleanup code
cause an exception to bubble up, or should it be silently ignored? It's
possible that that failure could cause file handles to be leaked, which
is what you're trying hard to prevent. On the other hand, a rare
failure when cleaning up probably shouldn't cause the entire process to
end, so ignoring exceptions from cleanup code seems like the lesser
evil.
--
Benji York

Revision history for this message
Brad Crittenden (bac) wrote :

I concur with Benji's review. I had the same questions regarding the asserts and think Benji's suggestion for rephrasing the comments makes the code more clear.

The code does look to be much more robust than what we currently have.

I'll mark this branch mentored (approved) and let you and Benji sort out the questions he raised.

review: Approve (code)
Revision history for this message
John A Meinel (jameinel) wrote :

On 2/18/2011 10:07 AM, Benji York wrote:
> On Fri, Feb 18, 2011 at 11:03 AM, John A Meinel <email address hidden> wrote:
>> The FAILURE check was removed because the code we called already
>> checked for it.
> [...]
>> So I was just removing duplication.
>
> Ah! That makes sense.
>
>> Yes, we no longer close the fd, because the
>> ProcessWriter/ProcessReader will close it. I can switch the comment if
>> you find it easier to understand.
>
> I do find it clearer. That being said, I eventually figured it out so
> the original comment can't be all bad.

Done.

>
>> For the cleanup code... we could suppress all exceptions at that
>> point, I guess. It's a little ugly, but that may not be preventable.
>> FWIW, bzr has cleanup code for this sort of thing, except it always
>> runs the cleanups (but suppresses failures if there is a current
>> exception running.)
>
> Right, the central question is: should a failure of the cleanup code
> cause an exception to bubble up, or should it be silently ignored? It's
> possible that that failure could cause file handles to be leaked, which
> is what you're trying hard to prevent. On the other hand, a rare
> failure when cleaning up probably shouldn't cause the entire process to
> end, so ignoring exceptions from cleanup code seems like the lesser
> evil.

I changed it to trap the original exception using sys.exc_info(), then
iterate the cleanups; if a cleanup raises another exception, I just call
log.err() and continue. Then I raise the original exception again.

Which seemed a reasonable tradeoff.
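In Python 3 terms (the branch itself is Python 2, where `raise exc_class, exc_value, exc_tb` is valid syntax), that tradeoff might be sketched like this; connect_or_unwind and run_cleanups are hypothetical helper names, and traceback.print_exc stands in for Twisted's log.err:

```python
import sys
import traceback

def run_cleanups(call_on_failure):
    """Run every cleanup; log, rather than raise, secondary failures so
    one bad cleanup can neither stop the rest nor mask the original."""
    for func, args in call_on_failure:
        try:
            func(*args)
        except Exception:
            traceback.print_exc()  # stand-in for twisted's log.err()

def connect_or_unwind(connect, call_on_failure):
    """Attempt connect(); on any failure, unwind the registered
    cleanups and re-raise the original exception."""
    try:
        return connect()
    except BaseException:
        original = sys.exc_info()[1]
        run_cleanups(call_on_failure)
        raise original  # the original error, not a cleanup error
```

A failing cleanup in the middle of the list is logged and skipped, while the cleanups after it still run and the caller still sees the original exception.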

Committed and pushed.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1em8QACgkQJdeBCYSNAAOipACeJJOl3BSzYoRelLS2JrSGt9YU
4xgAnR/lWKNZutRqJB4enGW9Q/WOALQU
=+lBJ
-----END PGP SIGNATURE-----

Revision history for this message
Benji York (benji) wrote :

Looks good!

Preview Diff

=== modified file 'lib/lp/codehosting/sshserver/session.py'
--- lib/lp/codehosting/sshserver/session.py 2010-10-07 23:43:40 +0000
+++ lib/lp/codehosting/sshserver/session.py 2011-02-18 16:23:00 +0000
@@ -11,6 +11,7 @@
 import os
 import signal
 import socket
+import sys
 import urlparse
 
 from zope.event import notify
@@ -21,7 +22,8 @@
     interfaces,
     process,
     )
-from twisted.python import log
+from twisted.python import log, failure
+from twisted.internet.main import CONNECTION_LOST, CONNECTION_DONE
 
 from canonical.config import config
 from lp.codehosting import get_bzr_path
@@ -129,6 +131,7 @@
             log.err('Connection failed: %s' % (e,))
             raise
         if response.startswith("FAILURE"):
+            client_sock.close()
             raise RuntimeError('Failed to send message: %r' % (response,))
         return response, client_sock
 
@@ -154,36 +157,62 @@
         message.append('end\n')
         message = ''.join(message)
         response, sock = self._sendMessageToService(message)
-        if response.startswith('FAILURE'):
-            # TODO: Is there a better error to raise?
-            raise RuntimeError("Failed while sending message to forking "
-                               "service. message: %r, failure: %r"
-                               % (message, response))
-        ok, pid, path, tail = response.split('\n')
-        assert ok == 'ok'
-        assert tail == ''
-        pid = int(pid)
-        log.msg('Forking returned pid: %d, path: %s' % (pid, path))
+        try:
+            ok, pid, path, tail = response.split('\n')
+            assert ok == 'ok'
+            assert tail == ''
+            pid = int(pid)
+            log.msg('Forking returned pid: %d, path: %s' % (pid, path))
+        except:
+            sock.close()
+            raise
         return pid, path, sock
 
+    def _openHandleFailures(self, call_on_failure, path, flags, proc_class,
+                            reactor, child_fd):
+        """Open the given path, adding a cleanup as appropriate.
+
+        :param call_on_failure: A list holding (callback, args) tuples. We will
+            append new entries for things that we open
+        :param path: The path to open
+        :param flags: Flags to pass to os.open
+        :param proc_class: The ProcessWriter/ProcessReader class to wrap this
+            connection.
+        :param reactor: The Twisted reactor we are connecting to.
+        :param child_fd: The child file descriptor number passed to proc_class
+        """
+        fd = os.open(path, flags)
+        call_on_failure.append((os.close, fd))
+        p = proc_class(reactor, self, child_fd, fd)
+        # Now that p has been created, it will close fd for us. So switch the
+        # cleanup to calling p.connectionLost()
+        call_on_failure[-1] = (p.connectionLost, (None,))
+        self.pipes[child_fd] = p
+
     def _connectSpawnToReactor(self, reactor):
+        self._exiter = _WaitForExit(reactor, self, self.process_sock)
+        call_on_failure = [(self._exiter.connectionLost, (None,))]
         stdin_path = os.path.join(self._fifo_path, 'stdin')
         stdout_path = os.path.join(self._fifo_path, 'stdout')
         stderr_path = os.path.join(self._fifo_path, 'stderr')
-        child_stdin_fd = os.open(stdin_path, os.O_WRONLY)
-        self.pipes[0] = process.ProcessWriter(reactor, self, 0,
-                                              child_stdin_fd)
-        child_stdout_fd = os.open(stdout_path, os.O_RDONLY)
-        # forceReadHack=True ? Used in process.py doesn't seem to be needed
-        # here
-        self.pipes[1] = process.ProcessReader(reactor, self, 1,
-                                              child_stdout_fd)
-        child_stderr_fd = os.open(stderr_path, os.O_RDONLY)
-        self.pipes[2] = process.ProcessReader(reactor, self, 2,
-                                              child_stderr_fd)
-        # Note: _exiter forms a GC cycle, since it points to us, and we hold a
-        # reference to it
-        self._exiter = _WaitForExit(reactor, self, self.process_sock)
+        try:
+            self._openHandleFailures(call_on_failure, stdin_path, os.O_WRONLY,
+                                     process.ProcessWriter, reactor, 0)
+            self._openHandleFailures(call_on_failure, stdout_path, os.O_RDONLY,
+                                     process.ProcessReader, reactor, 1)
+            self._openHandleFailures(call_on_failure, stderr_path, os.O_RDONLY,
+                                     process.ProcessReader, reactor, 2)
+        except:
+            exc_class, exc_value, exc_tb = sys.exc_info()
+            for func, args in call_on_failure:
+                try:
+                    func(*args)
+                except:
+                    # Just log any exceptions at this point. This makes sure
+                    # all cleanups get called so we don't get leaks. We know
+                    # there is an active exception, or we wouldn't be here.
+                    log.err()
+            raise exc_class, exc_value, exc_tb
         self.pipes['exit'] = self._exiter
 
     def _getReason(self, status):
@@ -248,12 +277,14 @@
     # Implemented because ProcessWriter/ProcessReader want to call it
     # Copied from twisted.internet.Process
     def childConnectionLost(self, childFD, reason):
-        close = getattr(self.pipes[childFD], 'close', None)
-        if close is not None:
-            close()
-        else:
-            os.close(self.pipes[childFD].fileno())
-        del self.pipes[childFD]
+        pipe = self.pipes.get(childFD)
+        if pipe is not None:
+            close = getattr(pipe, 'close', None)
+            if close is not None:
+                close()
+            else:
+                os.close(self.pipes[childFD].fileno())
+            del self.pipes[childFD]
         try:
             self.proto.childConnectionLost(childFD)
         except: