Merge lp:~gz/bzr/test_ssh_client_medium_eintr__read_bytes into lp:bzr

Proposed by Martin Packman
Status: Work in progress
Proposed branch: lp:~gz/bzr/test_ssh_client_medium_eintr__read_bytes
Merge into: lp:bzr
Diff against target: 74 lines (+64/-0)
1 file modified
bzrlib/tests/test_smart_transport.py (+64/-0)
To merge this branch: bzr merge lp:~gz/bzr/test_ssh_client_medium_eintr__read_bytes
Reviewer Review Type Date Requested Status
Andrew Bennetts Needs Fixing
Review via email: mp+19439@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Martin Packman (gz) wrote :

For background, see this thread (and particularly the three changesets linked ffrom the initial post):
<https://lists.ubuntu.com/archives/bazaar/2010q1/066869.html>

There seemed to be some doubt about exactly how harmful the current code attempting to handle EINTR is, despite my best attempts to explain it, so here's a test that demonstrates a couple of the failure states. It uses a reading method, but these problems apply equally to those that write, it's just harder to get them to block where you want reliably enough to make an automated test.

As an aside, testtools doesn't seem to do anything sensible with TestCase.knownFailure (and the TestCase.expectFailure isn't much better).

5038. By Martin Packman

Turn the expectFailure the right way round, remove sleep trailing whitespace

Revision history for this message
Martin Pool (mbp) wrote :

This does make the bug more clear, thanks for posting it. Clearly we
should fix _read_bytes.

I'm not super keen to actually merge this because history shows that
any test that depends on timing interactions between threads will
intermittently fail in the future. Do you really want us to?

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Packman (gz) wrote :

> I'm not super keen to actually merge this because history shows that
> any test that depends on timing interactions between threads will
> intermittently fail in the future. Do you really want us to?

Your call, at the moment the code risks a random UnexpectedSuccess - which as mentioned by John AM on the list is currently harmless, but shouldn't be. When the underlying issues are fixed it'll pass, and risk wrongly passing if there's a regression, but won't cause random orange.

This kind of timing thing *is* hard to test correctly though, and it may be worth not merging just for the sanity of people reading through the module.

Revision history for this message
Andrew Bennetts (spiv) wrote :

First, I've got a branch that builds on this patch at <lp:~spiv/bzr/test_ssh_client_medium_eintr__read_bytes>.

Ok, I see the issue. It appears it's only a problem with the SSH medium, not the simple pipes (--inet) or TCP media. I haven't finished digging, but it appears that the root cause is that socket_object.makefile() is returning a file object that is buggy when EINTR occurs? If so, I think we can workaround that without losing the ability to handle EINTR.

I've rewritten your test so that it doesn't rely on timeouts or time.sleep, and so that it is an implementation test that runs against all of SmartTCPClientMedium, SmartSimplePipesClientMedium, and SmartSSHClientMedium. I do have to do a bit of ugly monkey-patching and fiddly thread interactions instead, but I think the resulting test will pass and fail with much greater reliability than one that can be affected by timing.

I also changed the test to check for the existence of the EINTR symbol rather than just checking the platform name. So now test should run on POSIX-like platforms that don't say os.name == 'posix', yet might be vulnerable to this bug.

I still need to do a bit more refactoring in my branch to remove duplication with existing test code, but this has finally pushed me into making per-medium implementation tests, so thank you! I'll probably put this new test in a new per_smart_medium test module.

Finally, your test is a bit overly precise about the interface. It's ok for _read_bytes to return too few (but not 0 unless the connection has ended), or even too many bytes. Of course, of the bytes it returns, it has to return them in the right order, so I've fixed the test to assert just that.

I'm not totally satisfied with the shape of the new test, but I think it's getting closer, and it does clearly pinpoint the bug. Thanks for your persistence on this issue!

We may to backport the eventual fix to 2.0, if it isn't too hairy. Thinking of which, is there a bug number for this?

review: Needs Fixing
Revision history for this message
Martin Packman (gz) wrote :

> First, I've got a branch that builds on this patch at
> <lp:~spiv/bzr/test_ssh_client_medium_eintr__read_bytes>.

Great, thanks for looking at this, the test needed to be less specific and monolithic if it was going to stick around.

> Ok, I see the issue. It appears it's only a problem with the SSH medium, not
> the simple pipes (--inet) or TCP media. I haven't finished digging, but it
> appears that the root cause is that socket_object.makefile() is returning a
> file object that is buggy when EINTR occurs? If so, I think we can workaround
> that without losing the ability to handle EINTR.

I urge you to read the mailing list thread if you've not already, it included a complete list of which interfaces are affected and how in the first post. The only tricky bit is indeed SmartSSHClientMedium but the best fix is already upstream.

> I also changed the test to check for the existence of the EINTR symbol rather
> than just checking the platform name. So now test should run on POSIX-like
> platforms that don't say os.name == 'posix', yet might be vulnerable to this
> bug.

That is incorrect. EINTR *exists* on other platforms but this PC loser-ing issue is specific to unix.

> Finally, your test is a bit overly precise about the interface. It's ok for
> _read_bytes to return too few (but not 0 unless the connection has ended), or
> even too many bytes. Of course, of the bytes it returns, it has to return
> them in the right order, so I've fixed the test to assert just that.

It's demonstrating a specific problem with one implementation, poking it to work with others too is fine.

> I'm not totally satisfied with the shape of the new test, but I think it's
> getting closer, and it does clearly pinpoint the bug. Thanks for your
> persistence on this issue!

Your changes keeps around an interface I'm trying to junk and is a little harder to reason about, but losing the sleeps is good. Please CC me on the review when you post it, as I've some specific comments.

> We may to backport the eventual fix to 2.0, if it isn't too hairy. Thinking
> of which, is there a bug number for this?

Nope, given that there are about five different bugs with the same underlying cause, it seemed easier to take it to the mailing list so all the discussion stayed in one place.

Revision history for this message
Andrew Bennetts (spiv) wrote :

Martin [gz] wrote:
[...]
> I urge you to read the mailing list thread if you've not already, it included
> a complete list of which interfaces are affected and how in the first post.
> The only tricky bit is indeed SmartSSHClientMedium but the best fix is already
> upstream.

I have read it, but will re-read. It's been a long, slightly convoluted
discussion!

> > I also changed the test to check for the existence of the EINTR symbol rather
> > than just checking the platform name. So now test should run on POSIX-like
> > platforms that don't say os.name == 'posix', yet might be vulnerable to this
> > bug.
>
> That is incorrect. EINTR *exists* on other platforms but this PC loser-ing
> issue is specific to unix.

But isn't it valid to test for correct behaviour in the face of EINTR on all
platforms that have EINTR?

[...]
> > I'm not totally satisfied with the shape of the new test, but I think it's
> > getting closer, and it does clearly pinpoint the bug. Thanks for your
> > persistence on this issue!
>
> Your changes keeps around an interface I'm trying to junk and is a little
> harder to reason about, but losing the sleeps is good. Please CC me on the
> review when you post it, as I've some specific comments.

Which interface is it that you're trying to junk?

I'll definitely CC you on the merge proposal.

> > We may to backport the eventual fix to 2.0, if it isn't too hairy. Thinking
> > of which, is there a bug number for this?
>
> Nope, given that there are about five different bugs with the same underlying
> cause, it seemed easier to take it to the mailing list so all the discussion
> stayed in one place.

I see what you mean. On the other hand, I have to search for and then read
through a multi-message thread with no clear summary. Neither a cluster of
bugs or a mailing list thread is entirely satisfying. Hmm.

Thanks for your comments!

Revision history for this message
Martin Packman (gz) wrote :

> But isn't it valid to test for correct behaviour in the face of EINTR on all
> platforms that have EINTR?

Depends on the test. For mocks, I think it's useful, when dealing with real platform code where the outcome will either be a spurious pass or a hang depending on the details, I think it's clearer to skip. Note that os.name == "posix" covers linux, bsd (which actually has a different signal interrupt paradigm but python switches into the posix style), mac os x, solaris... Name me a platform bzr supports that you're worried about skipping the test on?

> Which interface is it that you're trying to junk?

I see you've now found:
<https://code.launchpad.net/~gz/bzr/no_until_no_eintr/+merge/19615>

Revision history for this message
Andrew Bennetts (spiv) wrote :

Martin [gz] wrote:
> > But isn't it valid to test for correct behaviour in the face of EINTR on all
> > platforms that have EINTR?
>
> Depends on the test. For mocks, I think it's useful, when dealing with real
> platform code where the outcome will either be a spurious pass or a hang
> depending on the details, I think it's clearer to skip. Note that os.name ==
> "posix" covers linux, bsd (which actually has a different signal interrupt
> paradigm but python switches into the posix style), mac os x, solaris... Name
> me a platform bzr supports that you're worried about skipping the test on?

It's more that I'd rather *not* worry about making sure all relevant platforms
are covered, and instead aim for the broadest possible coverage. If that turns
out to be impractical then os.name == 'posix' will do.

Revision history for this message
Martin Packman (gz) wrote :

Changing the status to WIP here as Andrew has a branch building on this.

Unmerged revisions

5038. By Martin Packman

Turn the expectFailure the right way round, remove sleep trailing whitespace

5037. By Martin Packman

Test demonstrating a few of the issues with the current code attempting to handle EINTR being raised

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/tests/test_smart_transport.py'
2--- bzrlib/tests/test_smart_transport.py 2010-02-11 09:27:55 +0000
3+++ bzrlib/tests/test_smart_transport.py 2010-02-16 23:34:14 +0000
4@@ -367,6 +367,70 @@
5 client_medium.disconnect()
6 self.assertEqual(['flush'], flush_calls)
7
8+ def test_ssh_client_medium_eintr__read_bytes(self):
9+ """
10+ Verify that _read_bytes is robust against EINTR being raised
11+
12+ It's probably not possible to test this in such a way as it won't
13+ sometimes randomly pass, but this fails in the expected manners
14+ pretty reliably.
15+ """
16+ if os.name != "posix":
17+ raise tests.TestNotApplicable("Only nix needs to handle EINTR")
18+ import errno, signal, sys, time
19+ # Use timeouts we can join the server thread without risking hanging
20+ timeout = 3
21+ sock_server = socket.socket()
22+ sock_server.settimeout(timeout)
23+ sock_server.bind(("127.0.0.1", 0))
24+ sock_server.listen(1)
25+ about_to_read = threading.Event()
26+ def _send_with_interrupt():
27+ """Send a message split by a signal to a connecting client"""
28+ sock_response, addr = sock_server.accept()
29+ about_to_read.wait(timeout)
30+ sock_response.sendall("head")
31+ # Let the main thread recv before raising signal
32+ time.sleep(0.1)
33+ os.kill(os.getpid(), signal.SIGUSR1)
34+ # Let the signal be handled before sending more
35+ time.sleep(0.1)
36+ sock_response.sendall("tail")
37+ server_thread = threading.Thread(target=_send_with_interrupt)
38+ siglog = []
39+ def _record_signal(signum, frame):
40+ """Record the signal received and the functions being called"""
41+ stack = []
42+ while frame:
43+ co_name = frame.f_code.co_name
44+ if co_name == "test_ssh_client_medium_eintr__read_bytes":
45+ break
46+ stack.append(co_name)
47+ frame = frame.f_back
48+ siglog.append((signum, stack))
49+ self.addCleanup(signal.signal, signal.SIGUSR1,
50+ signal.signal(signal.SIGUSR1, _record_signal))
51+ server_thread.start()
52+ self.addCleanup(server_thread.join)
53+ sock_request = socket.socket()
54+ sock_request.connect(sock_server.getsockname())
55+ addr, port = sock_request.getsockname()
56+ client_medium = medium.SmartSSHClientMedium(addr, port,
57+ vendor=StringIOSSHVendor(sock_request.makefile(), None))
58+ client_medium._ensure_connection()
59+ about_to_read.set()
60+ try:
61+ bytes = client_medium._read_bytes(8)
62+ except socket.error, e:
63+ if e.args[0] == errno.EINTR and sys.version_info < (2, 6):
64+ self.knownFailure("Interruptions not handled on Python < 2.6")
65+ raise
66+ self.assertEqual(siglog,
67+ [(signal.SIGUSR1, ["read", "until_no_eintr", "_read_bytes"])])
68+ self.expectFailure("An interrupted read loses bytes already read",
69+ self.assertNotEqual, bytes, "tail")
70+ self.assertEqual(bytes, "headtail")
71+
72 def test_construct_smart_tcp_client_medium(self):
73 # the TCP client medium takes a host and a port. Constructing it won't
74 # connect to anything.