Merge lp:~vila/bzr/869366-test-server-race into lp:bzr/2.4

Proposed by Vincent Ladeuil
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 6053
Proposed branch: lp:~vila/bzr/869366-test-server-race
Merge into: lp:bzr/2.4
Diff against target: 159 lines (+61/-21)
3 files modified
bzrlib/tests/test_server.py (+3/-2)
bzrlib/tests/test_test_server.py (+55/-19)
doc/en/release-notes/bzr-2.4.txt (+3/-0)
To merge this branch: bzr merge lp:~vila/bzr/869366-test-server-race
Reviewer Review Type Date Requested Status
John A Meinel Approve
Review via email: mp+78634@code.launchpad.net

Commit message

Fix random test failure due to race in test_server_crash_while_responding

Description of the change

This is was one of the worst races I had to chase.

For context, this is in bzr-2.4 and only the tests are concerned.

My deepest thanks goes to gz for pushing me enough to dive into this madness
again.

Here is the story with its gory details:

These tests are about exception handling in the tests servers.

First of all, two kind of servers are tested here:

* TestingTCPServer runs in a single thread with basically runs a loop
  around:

    accept
    handle_request

  which means that an exception during a request processing just kill the
  whole server

* TestingThreadingTCPServer runs in a thread but spawns a thread for each
  connection, so basically a loop:

     accept
     handle_request by spawning a thread where the processing occurs

Now, test_server_crash_while_responding (the one bug 869366 is about) but
also test_exception_swallowed_while_serving are about what happens server
side when an exception is encountered while processing a request.

For TestingTCPServer it's simpler, the exception kills the server, easy,
just collect the exception.

TestingThreadingTCPServer is harder to grasp because there are two cases:

- the server thread spawns the connection thread and continues until it
  blocks in the accept() call. Here, the exception is caught in the
  connection thread and can be checked there.

  That was the first race I started chasing and ended up with: the exception
  should be checked in the connection thread, not the server thread.

- the server thread spawns the connection thread but doesn't get any cycle
  until the connection thread processes the request, raises and captures the
  exception. At that point, the server thread get some cycles and captures
  the exception *clearing* it in the connection thread *before* reaching the
  accept() call and as such re-raises, captures the exception and never
  reaches the accept() call !

So the fix is that we should check both the connection thread and the server
thread with some care:

- first the connection thread, if we get the exception, we're good,

- if it's not in the connection thread, we need to wait for the server to
  stop before getting... our precious.

Quite a journey.

Now as mentioned in the bug report, this has been encountered ~20 times
during the last year.

To post a comment you must log in.
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/7/2011 6:34 PM, Vincent Ladeuil wrote:
> Vincent Ladeuil has proposed merging
> lp:~vila/bzr/869366-test-server-race into lp:bzr/2.4.
>
> Requested reviews: bzr-core (bzr-core) Related bugs: Bug #869366 in
> Bazaar: "TestTCPServerInAThread.test_server_crash_while_responding
> random failure" https://bugs.launchpad.net/bzr/+bug/869366
>
> For more details, see:
> https://code.launchpad.net/~vila/bzr/869366-test-server-race/+merge/78634
>
> This is was one of the worst races I had to chase.
>
> For context, this is in bzr-2.4 and only the tests are concerned.
>

 merge: approve

I like the change, and I like that it makes it more deterministic. I
have a couple of comments that might improve clarity, but nothing that
must be done. So feel free to land this, though think about my comments.

John
=:->

...

     def serve(self):
         self.serving = True
- - self.stopped.clear()
         # We are listening and ready to accept connections
         self.started.set()

[comment] This doesn't seem related or necessary. It would allow us to
run 'serve' multiple times, though I suppose we never specifically
need to.

         class FailingDuringResponseHandler(TCPConnectionHandler):

- - def handle_connection(self):
- - req = self.rfile.readline()
- - threading.currentThread().set_sync_event(sync)
+ def handle_connection(request):

^- [comment] changing the 'self' parameter to not be self seems
confusing. I think I preferred your form using:

test = self

Or even better would be to pass 'self' to the
FailingDuringResponseHandler's __init__ constructor, and saving it as
an attribute. This particular change will conflict landing on bzr.dev,
but should be obvious to fix.

The tests are meant to be roughly identical, except the latter case
marks the exception as ignored. Can we move the common code to a
helper to make it purely obvious that this is the case?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6POyQACgkQJdeBCYSNAAPEMQCeN5NABM6irUH7z0BRwK9Cr9od
9MwAni0HIw57Kw7XTuWmWbh03xR2P96U
=/GQC
-----END PGP SIGNATURE-----

review: Approve
Revision history for this message
Martin Packman (gz) wrote :

> ^- [comment] changing the 'self' parameter to not be self seems
> confusing. I think I preferred your form using:
>
> test = self
>
>
> Or even better would be to pass 'self' to the
> FailingDuringResponseHandler's __init__ constructor, and saving it as
> an attribute. This particular change will conflict landing on bzr.dev,
> but should be obvious to fix.

I agree with John, passing in or setting a temporary class attribute is neater.

His other points sound good too, but PQM needs feeding so will try landing this and see what falls out.

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Revision history for this message
Vincent Ladeuil (vila) wrote :

<snip/>

    > merge: approve

    > I like the change, and I like that it makes it more deterministic. I
    > have a couple of comments that might improve clarity, but nothing that
    > must be done. So feel free to land this, though think about my comments.

Thanks, I answer below.

    > John
    > =:->

    > ...

    > def serve(self):
    > self.serving = True
    > - self.stopped.clear()
    > # We are listening and ready to accept connections
    > self.started.set()

    > [comment] This doesn't seem related or necessary. It would allow us to
    > run 'serve' multiple times, though I suppose we never specifically
    > need to.

That's true, we don't. And the way events are used doesn't really allow
that anyway.

    > class FailingDuringResponseHandler(TCPConnectionHandler):

    > - def handle_connection(self):
    > - req = self.rfile.readline()
    > - threading.currentThread().set_sync_event(sync)
    > + def handle_connection(request):

    > ^- [comment] changing the 'self' parameter to not be self seems
    > confusing. I think I preferred your form using:

    > test = self

Right, I did that first, quick-and-dirty to debug but wasn't really
happy with it, then chatting with mgz he objected so I reverted it.

I think both approaches have their pro and cons. You can argue that
'self' should only be used to refer to the object the method is defined
for.

But overall, the three of use having issues with the double use of self
calls for something to be done.

I wanted to add a comment:

  # We use 'request' instead of 'self' below because the test matters
  # more and we need some container to properly set connection_thread

I'll add it before landing.

    > Or even better would be to pass 'self' to the
    > FailingDuringResponseHandler's __init__ constructor, and saving it as
    > an attribute. This particular change will conflict landing on bzr.dev,
    > but should be obvious to fix.

This cannot be done. The handler is instantiated in a place where one
cannot add more parameters.

    > The tests are meant to be roughly identical, except the latter case
    > marks the exception as ignored. Can we move the common code to a
    > helper to make it purely obvious that this is the case?

Hehe, I did try that... and realized the tests were more different than
they appear, they really override a different method in the handler and
need to do so.

Revision history for this message
Vincent Ladeuil (vila) wrote :

Argh, messages crossing on the wire ;-/

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'bzrlib/tests/test_server.py'
--- bzrlib/tests/test_server.py 2011-02-10 12:37:27 +0000
+++ bzrlib/tests/test_server.py 2011-10-07 16:32:33 +0000
@@ -287,7 +287,6 @@
287287
288 def serve(self):288 def serve(self):
289 self.serving = True289 self.serving = True
290 self.stopped.clear()
291 # We are listening and ready to accept connections290 # We are listening and ready to accept connections
292 self.started.set()291 self.started.set()
293 try:292 try:
@@ -453,6 +452,8 @@
453 if debug_threads():452 if debug_threads():
454 sys.stderr.write('Client thread %s started\n' % (t.name,))453 sys.stderr.write('Client thread %s started\n' % (t.name,))
455 # If an exception occured during the thread start, it will get raised.454 # If an exception occured during the thread start, it will get raised.
455 # In rare cases, an exception raised during the request processing may
456 # also get caught here (see http://pad.lv/869366)
456 t.pending_exception()457 t.pending_exception()
457458
458 # The following methods are called by the main thread459 # The following methods are called by the main thread
@@ -508,7 +509,7 @@
508 sync_event=self.server.started,509 sync_event=self.server.started,
509 target=self.run_server)510 target=self.run_server)
510 self._server_thread.start()511 self._server_thread.start()
511 # Wait for the server thread to start (i.e release the lock)512 # Wait for the server thread to start (i.e. release the lock)
512 self.server.started.wait()513 self.server.started.wait()
513 # Get the real address, especially the port514 # Get the real address, especially the port
514 self.host, self.port = self.server.server_address515 self.host, self.port = self.server.server_address
515516
=== modified file 'bzrlib/tests/test_test_server.py'
--- bzrlib/tests/test_test_server.py 2011-02-08 16:26:23 +0000
+++ bzrlib/tests/test_test_server.py 2011-10-07 16:32:33 +0000
@@ -86,9 +86,6 @@
86 for name in86 for name in
87 ('TestingTCPServer', 'TestingThreadingTCPServer')]87 ('TestingTCPServer', 'TestingThreadingTCPServer')]
8888
89 # Set by load_tests()
90 server_class = None
91
92 def get_server(self, server_class=None, connection_handler_class=None):89 def get_server(self, server_class=None, connection_handler_class=None):
93 if server_class is not None:90 if server_class is not None:
94 self.server_class = server_class91 self.server_class = server_class
@@ -172,16 +169,26 @@
172 self.assertRaises(CantConnect, server.stop_server)169 self.assertRaises(CantConnect, server.stop_server)
173170
174 def test_server_crash_while_responding(self):171 def test_server_crash_while_responding(self):
175 sync = threading.Event()172 # We want to ensure the exception has been caught
176 sync.clear()173 caught = threading.Event()
174 caught.clear()
175 # The thread that will serve the client, this needs to be an attribute
176 # so the handler below can modify it when it's executed (it's
177 # instantiated when the request is processed)
178 self.connection_thread = None
179
177 class FailToRespond(Exception):180 class FailToRespond(Exception):
178 pass181 pass
179182
180 class FailingDuringResponseHandler(TCPConnectionHandler):183 class FailingDuringResponseHandler(TCPConnectionHandler):
181184
182 def handle_connection(self):185 def handle_connection(request):
183 req = self.rfile.readline()186 req = request.rfile.readline()
184 threading.currentThread().set_sync_event(sync)187 # Capture the thread and make it use 'caught' so we can wait on
188 # the even that will be set when the exception is caught. We
189 # also capture the thread to know where to look.
190 self.connection_thread = threading.currentThread()
191 self.connection_thread.set_sync_event(caught)
185 raise FailToRespond()192 raise FailToRespond()
186193
187 server = self.get_server(194 server = self.get_server(
@@ -189,21 +196,45 @@
189 client = self.get_client()196 client = self.get_client()
190 client.connect((server.host, server.port))197 client.connect((server.host, server.port))
191 client.write('ping\n')198 client.write('ping\n')
192 sync.wait()199 # Wait for the exception to be caught
193 self.assertRaises(FailToRespond, server.pending_exception)200 caught.wait()
201 # Check that the connection thread did catch the exception,
202 # http://pad.lv/869366 was wrongly checking the server thread which
203 # works for TestingTCPServer where the connection is handled in the
204 # same thread than the server one but is racy for
205 # TestingThreadingTCPServer where the server thread may be in a
206 # blocking accept() call (or not).
207 try:
208 self.connection_thread.pending_exception()
209 except FailToRespond:
210 # Great, the test succeeded
211 pass
212 else:
213 # If the exception is not in the connection thread anymore, it's in
214 # the server's one.
215 server.server.stopped.wait()
216 # The exception is available now
217 self.assertRaises(FailToRespond, server.pending_exception)
194218
195 def test_exception_swallowed_while_serving(self):219 def test_exception_swallowed_while_serving(self):
196 sync = threading.Event()220 # We need to ensure the exception has been caught
197 sync.clear()221 caught = threading.Event()
222 caught.clear()
223 # The thread that will serve the client, this needs to be an attribute
224 # so the handler below can access it when it's executed (it's
225 # instantiated when the request is processed)
226 self.connection_thread = None
198 class CantServe(Exception):227 class CantServe(Exception):
199 pass228 pass
200229
201 class FailingWhileServingConnectionHandler(TCPConnectionHandler):230 class FailingWhileServingConnectionHandler(TCPConnectionHandler):
202231
203 def handle(self):232 def handle(request):
204 # We want to sync with the thread that is serving the233 # Capture the thread and make it use 'caught' so we can wait on
205 # connection.234 # the even that will be set when the exception is caught. We
206 threading.currentThread().set_sync_event(sync)235 # also capture the thread to know where to look.
236 self.connection_thread = threading.currentThread()
237 self.connection_thread.set_sync_event(caught)
207 raise CantServe()238 raise CantServe()
208239
209 server = self.get_server(240 server = self.get_server(
@@ -213,8 +244,13 @@
213 client = self.get_client()244 client = self.get_client()
214 # Connect to the server so the exception is raised there245 # Connect to the server so the exception is raised there
215 client.connect((server.host, server.port))246 client.connect((server.host, server.port))
216 # Wait for the exception to propagate.247 # Wait for the exception to be caught
217 sync.wait()248 caught.wait()
218 # The connection wasn't served properly but the exception should have249 # The connection wasn't served properly but the exception should have
219 # been swallowed.250 # been swallowed (see test_server_crash_while_responding remark about
251 # http://pad.lv/869366 explaining why we can't check the server thread
252 # here). More precisely, the exception *has* been caught and captured
253 # but it is cleared when joining the thread (or trying to acquire the
254 # exception) and as such won't propagate to the server thread.
255 self.connection_thread.pending_exception()
220 server.pending_exception()256 server.pending_exception()
221257
=== modified file 'doc/en/release-notes/bzr-2.4.txt'
--- doc/en/release-notes/bzr-2.4.txt 2011-10-06 16:27:04 +0000
+++ doc/en/release-notes/bzr-2.4.txt 2011-10-07 16:32:33 +0000
@@ -75,6 +75,9 @@
75 suite. This can include new facilities for writing tests, fixes to 75 suite. This can include new facilities for writing tests, fixes to
76 spurious test failures and changes to the way things should be tested.76 spurious test failures and changes to the way things should be tested.
7777
78* Fix the race for TestingThreadingTCPServer in
79 test_server_crash_while_responding. (Vincent Ladeuil, #869366)
80
78* Really corrupt the pack file without depending on a special length or value.81* Really corrupt the pack file without depending on a special length or value.
79 (Vincent Ladeuil, #807032)82 (Vincent Ladeuil, #807032)
8083

Subscribers

People subscribed via source and target branches