Merge lp:~vila/bzr/869366-test-server-race into lp:bzr/2.4

Proposed by Vincent Ladeuil
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 6053
Proposed branch: lp:~vila/bzr/869366-test-server-race
Merge into: lp:bzr/2.4
Diff against target: 159 lines (+61/-21)
3 files modified
bzrlib/tests/test_server.py (+3/-2)
bzrlib/tests/test_test_server.py (+55/-19)
doc/en/release-notes/bzr-2.4.txt (+3/-0)
To merge this branch: bzr merge lp:~vila/bzr/869366-test-server-race
Reviewer Review Type Date Requested Status
John A Meinel Approve
Review via email: mp+78634@code.launchpad.net

Commit message

Fix random test failure due to race in test_server_crash_while_responding

Description of the change

This is was one of the worst races I had to chase.

For context, this is in bzr-2.4 and only the tests are concerned.

My deepest thanks goes to gz for pushing me enough to dive into this madness
again.

Here is the story with its gory details:

These tests are about exception handling in the tests servers.

First of all, two kind of servers are tested here:

* TestingTCPServer runs in a single thread with basically runs a loop
  around:

    accept
    handle_request

  which means that an exception during a request processing just kill the
  whole server

* TestingThreadingTCPServer runs in a thread but spawns a thread for each
  connection, so basically a loop:

     accept
     handle_request by spawning a thread where the processing occurs

Now, test_server_crash_while_responding (the one bug 869366 is about) but
also test_exception_swallowed_while_serving are about what happens server
side when an exception is encountered while processing a request.

For TestingTCPServer it's simpler, the exception kills the server, easy,
just collect the exception.

TestingThreadingTCPServer is harder to grasp because there are two cases:

- the server thread spawns the connection thread and continues until it
  blocks in the accept() call. Here, the exception is caught in the
  connection thread and can be checked there.

  That was the first race I started chasing and ended up with: the exception
  should be checked in the connection thread, not the server thread.

- the server thread spawns the connection thread but doesn't get any cycle
  until the connection thread processes the request, raises and captures the
  exception. At that point, the server thread get some cycles and captures
  the exception *clearing* it in the connection thread *before* reaching the
  accept() call and as such re-raises, captures the exception and never
  reaches the accept() call !

So the fix is that we should check both the connection thread and the server
thread with some care:

- first the connection thread, if we get the exception, we're good,

- if it's not in the connection thread, we need to wait for the server to
  stop before getting... our precious.

Quite a journey.

Now as mentioned in the bug report, this has been encountered ~20 times
during the last year.

To post a comment you must log in.
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/7/2011 6:34 PM, Vincent Ladeuil wrote:
> Vincent Ladeuil has proposed merging
> lp:~vila/bzr/869366-test-server-race into lp:bzr/2.4.
>
> Requested reviews: bzr-core (bzr-core) Related bugs: Bug #869366 in
> Bazaar: "TestTCPServerInAThread.test_server_crash_while_responding
> random failure" https://bugs.launchpad.net/bzr/+bug/869366
>
> For more details, see:
> https://code.launchpad.net/~vila/bzr/869366-test-server-race/+merge/78634
>
> This is was one of the worst races I had to chase.
>
> For context, this is in bzr-2.4 and only the tests are concerned.
>

 merge: approve

I like the change, and I like that it makes it more deterministic. I
have a couple of comments that might improve clarity, but nothing that
must be done. So feel free to land this, though think about my comments.

John
=:->

...

     def serve(self):
         self.serving = True
- - self.stopped.clear()
         # We are listening and ready to accept connections
         self.started.set()

[comment] This doesn't seem related or necessary. It would allow us to
run 'serve' multiple times, though I suppose we never specifically
need to.

         class FailingDuringResponseHandler(TCPConnectionHandler):

- - def handle_connection(self):
- - req = self.rfile.readline()
- - threading.currentThread().set_sync_event(sync)
+ def handle_connection(request):

^- [comment] changing the 'self' parameter to not be self seems
confusing. I think I preferred your form using:

test = self

Or even better would be to pass 'self' to the
FailingDuringResponseHandler's __init__ constructor, and saving it as
an attribute. This particular change will conflict landing on bzr.dev,
but should be obvious to fix.

The tests are meant to be roughly identical, except the latter case
marks the exception as ignored. Can we move the common code to a
helper to make it purely obvious that this is the case?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6POyQACgkQJdeBCYSNAAPEMQCeN5NABM6irUH7z0BRwK9Cr9od
9MwAni0HIw57Kw7XTuWmWbh03xR2P96U
=/GQC
-----END PGP SIGNATURE-----

review: Approve
Revision history for this message
Martin Packman (gz) wrote :

> ^- [comment] changing the 'self' parameter to not be self seems
> confusing. I think I preferred your form using:
>
> test = self
>
>
> Or even better would be to pass 'self' to the
> FailingDuringResponseHandler's __init__ constructor, and saving it as
> an attribute. This particular change will conflict landing on bzr.dev,
> but should be obvious to fix.

I agree with John, passing in or setting a temporary class attribute is neater.

His other points sound good too, but PQM needs feeding so will try landing this and see what falls out.

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Revision history for this message
Vincent Ladeuil (vila) wrote :

<snip/>

    > merge: approve

    > I like the change, and I like that it makes it more deterministic. I
    > have a couple of comments that might improve clarity, but nothing that
    > must be done. So feel free to land this, though think about my comments.

Thanks, I answer below.

    > John
    > =:->

    > ...

    > def serve(self):
    > self.serving = True
    > - self.stopped.clear()
    > # We are listening and ready to accept connections
    > self.started.set()

    > [comment] This doesn't seem related or necessary. It would allow us to
    > run 'serve' multiple times, though I suppose we never specifically
    > need to.

That's true, we don't. And the way events are used doesn't really allow
that anyway.

    > class FailingDuringResponseHandler(TCPConnectionHandler):

    > - def handle_connection(self):
    > - req = self.rfile.readline()
    > - threading.currentThread().set_sync_event(sync)
    > + def handle_connection(request):

    > ^- [comment] changing the 'self' parameter to not be self seems
    > confusing. I think I preferred your form using:

    > test = self

Right, I did that first, quick-and-dirty to debug but wasn't really
happy with it, then chatting with mgz he objected so I reverted it.

I think both approaches have their pro and cons. You can argue that
'self' should only be used to refer to the object the method is defined
for.

But overall, the three of use having issues with the double use of self
calls for something to be done.

I wanted to add a comment:

  # We use 'request' instead of 'self' below because the test matters
  # more and we need some container to properly set connection_thread

I'll add it before landing.

    > Or even better would be to pass 'self' to the
    > FailingDuringResponseHandler's __init__ constructor, and saving it as
    > an attribute. This particular change will conflict landing on bzr.dev,
    > but should be obvious to fix.

This cannot be done. The handler is instantiated in a place where one
cannot add more parameters.

    > The tests are meant to be roughly identical, except the latter case
    > marks the exception as ignored. Can we move the common code to a
    > helper to make it purely obvious that this is the case?

Hehe, I did try that... and realized the tests were more different than
they appear, they really override a different method in the handler and
need to do so.

Revision history for this message
Vincent Ladeuil (vila) wrote :

Argh, messages crossing on the wire ;-/

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/tests/test_server.py'
2--- bzrlib/tests/test_server.py 2011-02-10 12:37:27 +0000
3+++ bzrlib/tests/test_server.py 2011-10-07 16:32:33 +0000
4@@ -287,7 +287,6 @@
5
6 def serve(self):
7 self.serving = True
8- self.stopped.clear()
9 # We are listening and ready to accept connections
10 self.started.set()
11 try:
12@@ -453,6 +452,8 @@
13 if debug_threads():
14 sys.stderr.write('Client thread %s started\n' % (t.name,))
15 # If an exception occured during the thread start, it will get raised.
16+ # In rare cases, an exception raised during the request processing may
17+ # also get caught here (see http://pad.lv/869366)
18 t.pending_exception()
19
20 # The following methods are called by the main thread
21@@ -508,7 +509,7 @@
22 sync_event=self.server.started,
23 target=self.run_server)
24 self._server_thread.start()
25- # Wait for the server thread to start (i.e release the lock)
26+ # Wait for the server thread to start (i.e. release the lock)
27 self.server.started.wait()
28 # Get the real address, especially the port
29 self.host, self.port = self.server.server_address
30
31=== modified file 'bzrlib/tests/test_test_server.py'
32--- bzrlib/tests/test_test_server.py 2011-02-08 16:26:23 +0000
33+++ bzrlib/tests/test_test_server.py 2011-10-07 16:32:33 +0000
34@@ -86,9 +86,6 @@
35 for name in
36 ('TestingTCPServer', 'TestingThreadingTCPServer')]
37
38- # Set by load_tests()
39- server_class = None
40-
41 def get_server(self, server_class=None, connection_handler_class=None):
42 if server_class is not None:
43 self.server_class = server_class
44@@ -172,16 +169,26 @@
45 self.assertRaises(CantConnect, server.stop_server)
46
47 def test_server_crash_while_responding(self):
48- sync = threading.Event()
49- sync.clear()
50+ # We want to ensure the exception has been caught
51+ caught = threading.Event()
52+ caught.clear()
53+ # The thread that will serve the client, this needs to be an attribute
54+ # so the handler below can modify it when it's executed (it's
55+ # instantiated when the request is processed)
56+ self.connection_thread = None
57+
58 class FailToRespond(Exception):
59 pass
60
61 class FailingDuringResponseHandler(TCPConnectionHandler):
62
63- def handle_connection(self):
64- req = self.rfile.readline()
65- threading.currentThread().set_sync_event(sync)
66+ def handle_connection(request):
67+ req = request.rfile.readline()
68+ # Capture the thread and make it use 'caught' so we can wait on
69+ # the even that will be set when the exception is caught. We
70+ # also capture the thread to know where to look.
71+ self.connection_thread = threading.currentThread()
72+ self.connection_thread.set_sync_event(caught)
73 raise FailToRespond()
74
75 server = self.get_server(
76@@ -189,21 +196,45 @@
77 client = self.get_client()
78 client.connect((server.host, server.port))
79 client.write('ping\n')
80- sync.wait()
81- self.assertRaises(FailToRespond, server.pending_exception)
82+ # Wait for the exception to be caught
83+ caught.wait()
84+ # Check that the connection thread did catch the exception,
85+ # http://pad.lv/869366 was wrongly checking the server thread which
86+ # works for TestingTCPServer where the connection is handled in the
87+ # same thread than the server one but is racy for
88+ # TestingThreadingTCPServer where the server thread may be in a
89+ # blocking accept() call (or not).
90+ try:
91+ self.connection_thread.pending_exception()
92+ except FailToRespond:
93+ # Great, the test succeeded
94+ pass
95+ else:
96+ # If the exception is not in the connection thread anymore, it's in
97+ # the server's one.
98+ server.server.stopped.wait()
99+ # The exception is available now
100+ self.assertRaises(FailToRespond, server.pending_exception)
101
102 def test_exception_swallowed_while_serving(self):
103- sync = threading.Event()
104- sync.clear()
105+ # We need to ensure the exception has been caught
106+ caught = threading.Event()
107+ caught.clear()
108+ # The thread that will serve the client, this needs to be an attribute
109+ # so the handler below can access it when it's executed (it's
110+ # instantiated when the request is processed)
111+ self.connection_thread = None
112 class CantServe(Exception):
113 pass
114
115 class FailingWhileServingConnectionHandler(TCPConnectionHandler):
116
117- def handle(self):
118- # We want to sync with the thread that is serving the
119- # connection.
120- threading.currentThread().set_sync_event(sync)
121+ def handle(request):
122+ # Capture the thread and make it use 'caught' so we can wait on
123+ # the even that will be set when the exception is caught. We
124+ # also capture the thread to know where to look.
125+ self.connection_thread = threading.currentThread()
126+ self.connection_thread.set_sync_event(caught)
127 raise CantServe()
128
129 server = self.get_server(
130@@ -213,8 +244,13 @@
131 client = self.get_client()
132 # Connect to the server so the exception is raised there
133 client.connect((server.host, server.port))
134- # Wait for the exception to propagate.
135- sync.wait()
136+ # Wait for the exception to be caught
137+ caught.wait()
138 # The connection wasn't served properly but the exception should have
139- # been swallowed.
140+ # been swallowed (see test_server_crash_while_responding remark about
141+ # http://pad.lv/869366 explaining why we can't check the server thread
142+ # here). More precisely, the exception *has* been caught and captured
143+ # but it is cleared when joining the thread (or trying to acquire the
144+ # exception) and as such won't propagate to the server thread.
145+ self.connection_thread.pending_exception()
146 server.pending_exception()
147
148=== modified file 'doc/en/release-notes/bzr-2.4.txt'
149--- doc/en/release-notes/bzr-2.4.txt 2011-10-06 16:27:04 +0000
150+++ doc/en/release-notes/bzr-2.4.txt 2011-10-07 16:32:33 +0000
151@@ -75,6 +75,9 @@
152 suite. This can include new facilities for writing tests, fixes to
153 spurious test failures and changes to the way things should be tested.
154
155+* Fix the race for TestingThreadingTCPServer in
156+ test_server_crash_while_responding. (Vincent Ladeuil, #869366)
157+
158 * Really corrupt the pack file without depending on a special length or value.
159 (Vincent Ladeuil, #807032)
160

Subscribers

People subscribed via source and target branches