Merge into bzr.dev : test_ssh_client_medium_eintr__read

Status:	Work in progress
Proposed branch:	lp:~gz/bzr/test_ssh_client_medium_eintr__read_bytes
Merge into:	lp:bzr
Diff against target:	74 lines (+64/-0) 1 file modified bzrlib/tests/test_smart_transport.py (+64/-0)
To merge this branch:	bzr merge lp:~gz/bzr/test_ssh_client_medium_eintr__read_bytes

Reviewer	Review Type	Date Requested	Status
Andrew Bennetts		2010-02-16	Needs Fixing on 2010-02-18
Review via email: mp+19439@code.launchpad.net

Martin Packman (gz) wrote on 2010-02-16:

For background, see this thread (and particularly the three changesets linked from the initial post):
<https://lists.ubuntu.com/archives/bazaar/2010q1/066869.html>

There seemed to be some doubt about exactly how harmful the current code attempting to handle EINTR is, despite my best attempts to explain it, so here's a test that demonstrates a couple of the failure states. It uses a reading method, but these problems apply equally to those that write, it's just harder to get them to block where you want reliably enough to make an automated test.

As an aside, testtools doesn't seem to do anything sensible with TestCase.knownFailure (and the TestCase.expectFailure isn't much better).

lp:~gz/bzr/test_ssh_client_medium_eintr__read_bytes updated on 2010-02-16

5038. By Martin Packman on 2010-02-16: Turn the expectFailure the right way round, remove sleep trailing whitespace

Martin Pool (mbp) wrote on 2010-02-17:

This does make the bug more clear, thanks for posting it. Clearly we
should fix _read_bytes.

I'm not super keen to actually merge this because history shows that
any test that depends on timing interactions between threads will
intermittently fail in the future. Do you really want us to?

--
Martin <http://launchpad.net/~mbp/>

Martin Packman (gz) wrote on 2010-02-17:

> I'm not super keen to actually merge this because history shows that
> any test that depends on timing interactions between threads will
> intermittently fail in the future. Do you really want us to?

Your call, at the moment the code risks a random UnexpectedSuccess - which as mentioned by John AM on the list is currently harmless, but shouldn't be. When the underlying issues are fixed it'll pass, and risk wrongly passing if there's a regression, but won't cause random orange.

This kind of timing thing *is* hard to test correctly though, and it may be worth not merging just for the sanity of people reading through the module.

Andrew Bennetts (spiv) wrote on 2010-02-18:

First, I've got a branch that builds on this patch at <lp:~spiv/bzr/test_ssh_client_medium_eintr__read_bytes>.

Ok, I see the issue. It appears it's only a problem with the SSH medium, not the simple pipes (--inet) or TCP media. I haven't finished digging, but it appears that the root cause is that socket_object.makefile() is returning a file object that is buggy when EINTR occurs? If so, I think we can work around that without losing the ability to handle EINTR.
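
A minimal sketch of that suspected failure mode, assuming (this is not confirmed above) that the buffered file object drops whatever it has already collected when EINTR propagates out of read(). LossyBufferedReader, SIGNAL and retry_on_eintr are hypothetical stand-ins rather than bzrlib or socket code, and the sketch is written for Python 3, so the interruption is simulated instead of being delivered by a real signal:

```python
import errno

# Marker placed in the event list where a signal would interrupt the read.
SIGNAL = object()


class LossyBufferedReader:
    """Hypothetical buffered reader that forgets partial data on EINTR."""

    def __init__(self, events):
        # events: byte chunks to deliver, with SIGNAL marking an interruption
        self._events = list(events)

    def read(self, size):
        buf = b""  # local buffer, thrown away if an exception propagates
        while len(buf) < size and self._events:
            event = self._events.pop(0)
            if event is SIGNAL:
                raise OSError(errno.EINTR, "Interrupted system call")
            buf += event
        return buf


def retry_on_eintr(func, *args):
    """Naive retry loop: call func again whenever it fails with EINTR."""
    while True:
        try:
            return func(*args)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise


reader = LossyBufferedReader([b"head", SIGNAL, b"tail"])
result = retry_on_eintr(reader.read, 8)
# "head" was consumed before the interruption and never reaches the caller.
print(result)
```

Under that assumption the retry itself is what hides the loss: the caller sees a successful read that is simply missing the first four bytes.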

I've rewritten your test so that it doesn't rely on timeouts or time.sleep, and so that it is an implementation test that runs against all of SmartTCPClientMedium, SmartSimplePipesClientMedium, and SmartSSHClientMedium. I do have to do a bit of ugly monkey-patching and fiddly thread interactions instead, but I think the resulting test will pass and fail with much greater reliability than one that can be affected by timing.
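
Roughly, the idea of injecting the interruption at a fixed point instead of racing a signal against a sleep can be sketched as follows. This is not the test in the branch: InterruptOnce, recv_retrying and run_demo are hypothetical names, the sketch uses a plain socket pair rather than the smart media, and it is written for Python 3:

```python
import errno
import socket


class InterruptOnce:
    """Wrap a socket so that the first recv() fails with EINTR."""

    def __init__(self, sock):
        self._sock = sock
        self.interrupted = False

    def recv(self, size):
        if not self.interrupted:
            self.interrupted = True
            # Raised before any data is consumed from the real socket.
            raise OSError(errno.EINTR, "Interrupted system call")
        return self._sock.recv(size)


def recv_retrying(sock_like, size):
    """Retry recv() until it completes without EINTR."""
    while True:
        try:
            return sock_like.recv(size)
        except InterruptedError:
            continue


def run_demo():
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(b"headtail")
        wrapped = InterruptOnce(receiver)
        data = recv_retrying(wrapped, 8)
        return data, wrapped.interrupted
    finally:
        sender.close()
        receiver.close()


data, interrupted = run_demo()
```

Because the failure is raised by the wrapper rather than by a signal handler, the interruption happens on exactly the first call every time, with no timeouts involved.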

I also changed the test to check for the existence of the EINTR symbol rather than just checking the platform name. So now the test should run on POSIX-like platforms that don't say os.name == 'posix', yet might be vulnerable to this bug.
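
For comparison, the two guards could look something like this. It is only a sketch: ExampleEintrTests is a hypothetical test case, and it uses the standard unittest skip decorators (Python 3) rather than bzrlib's tests.TestNotApplicable:

```python
import errno
import os
import unittest

# Broad guard: run wherever the errno module defines EINTR at all.
HAS_EINTR = hasattr(errno, "EINTR")
# Narrow guard, as in the original patch: run only on POSIX platforms.
IS_POSIX = os.name == "posix"


class ExampleEintrTests(unittest.TestCase):
    """Hypothetical test case showing only the two skip conditions."""

    @unittest.skipUnless(HAS_EINTR, "errno.EINTR is not defined here")
    def test_guarded_by_symbol(self):
        self.assertIsInstance(errno.EINTR, int)

    @unittest.skipUnless(IS_POSIX, "only POSIX needs to handle EINTR")
    def test_guarded_by_platform(self):
        self.assertEqual(os.name, "posix")
```

The symbol check only says that the constant exists; it says nothing about whether a blocking call on that platform can actually fail with it.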

I still need to do a bit more refactoring in my branch to remove duplication with existing test code, but this has finally pushed me into making per-medium implementation tests, so thank you! I'll probably put this new test in a new per_smart_medium test module.

Finally, your test is a bit overly precise about the interface. It's ok for _read_bytes to return too few (but not 0 unless the connection has ended), or even too many bytes. Of course, of the bytes it returns, it has to return them in the right order, so I've fixed the test to assert just that.
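
A sketch of what asserting "just that" might look like, assuming a read callable with the contract described above. read_in_order and the fake readers are hypothetical and this is not the assertion used in the branch (Python 3):

```python
def read_in_order(read_bytes, stream, wanted):
    """Read until at least `wanted` bytes arrive, checking order only.

    read_bytes(n) may return fewer or more than n bytes, but never zero
    before the stream is exhausted, and never out-of-order data.
    """
    received = b""
    while len(received) < wanted:
        chunk = read_bytes(wanted - len(received))
        if not chunk:
            raise AssertionError("connection ended after %r" % (received,))
        received += chunk
        # Everything received so far must be a prefix of what was sent.
        if not stream.startswith(received):
            raise AssertionError("bytes lost or reordered: %r" % (received,))
    return received


# A reader that returns short chunks is acceptable under this contract.
short_chunks = [b"he", b"adt", b"ail"]
ok = read_in_order(lambda n: short_chunks.pop(0), b"headtail", 8)

# A reader that silently dropped "head" is not.
lossy_chunks = [b"tail"]
try:
    read_in_order(lambda n: lossy_chunks.pop(0), b"headtail", 8)
    lost_bytes_detected = False
except AssertionError:
    lost_bytes_detected = True
```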

I'm not totally satisfied with the shape of the new test, but I think it's getting closer, and it does clearly pinpoint the bug. Thanks for your persistence on this issue!

We may want to backport the eventual fix to 2.0, if it isn't too hairy. Thinking of which, is there a bug number for this?

review: Needs Fixing

Martin Packman (gz) wrote on 2010-02-18:

> First, I've got a branch that builds on this patch at
> <lp:~spiv/bzr/test_ssh_client_medium_eintr__read_bytes>.

Great, thanks for looking at this, the test needed to be less specific and monolithic if it was going to stick around.

> Ok, I see the issue. It appears it's only a problem with the SSH medium, not
> the simple pipes (--inet) or TCP media. I haven't finished digging, but it
> appears that the root cause is that socket_object.makefile() is returning a
> file object that is buggy when EINTR occurs? If so, I think we can work around
> that without losing the ability to handle EINTR.

I urge you to read the mailing list thread if you've not already, it included a complete list of which interfaces are affected and how in the first post. The only tricky bit is indeed SmartSSHClientMedium but the best fix is already upstream.

> I also changed the test to check for the existence of the EINTR symbol rather
> than just checking the platform name. So now the test should run on POSIX-like
> platforms that don't say os.name == 'posix', yet might be vulnerable to this
> bug.

That is incorrect. EINTR *exists* on other platforms but this PC loser-ing issue is specific to unix.

> Finally, your test is a bit overly precise about the interface. It's ok for
> _read_bytes to return too few (but not 0 unless the connection has ended), or
> even too many bytes. Of course, of the bytes it returns, it has to return
> them in the right order, so I've fixed the test to assert just that.

It's demonstrating a specific problem with one implementation, poking it to work with others too is fine.

> I'm not totally satisfied with the shape of the new test, but I think it's
> getting closer, and it does clearly pinpoint the bug. Thanks for your
> persistence on this issue!

Your changes keep around an interface I'm trying to junk and are a little harder to reason about, but losing the sleeps is good. Please CC me on the review when you post it, as I've some specific comments.

> We may want to backport the eventual fix to 2.0, if it isn't too hairy. Thinking
> of which, is there a bug number for this?

Nope, given that there are about five different bugs with the same underlying cause, it seemed easier to take it to the mailing list so all the discussion stayed in one place.

Andrew Bennetts (spiv) wrote on 2010-02-18:

Martin [gz] wrote:
[...]
> I urge you to read the mailing list thread if you've not already, it included
> a complete list of which interfaces are affected and how in the first post.
> The only tricky bit is indeed SmartSSHClientMedium but the best fix is already
> upstream.

I have read it, but will re-read. It's been a long, slightly convoluted
discussion!

> > I also changed the test to check for the existence of the EINTR symbol rather
> > than just checking the platform name. So now the test should run on POSIX-like
> > platforms that don't say os.name == 'posix', yet might be vulnerable to this
> > bug.
>
> That is incorrect. EINTR *exists* on other platforms but this PC loser-ing
> issue is specific to unix.

But isn't it valid to test for correct behaviour in the face of EINTR on all
platforms that have EINTR?

[...]
> > I'm not totally satisfied with the shape of the new test, but I think it's
> > getting closer, and it does clearly pinpoint the bug. Thanks for your
> > persistence on this issue!
>
> Your changes keep around an interface I'm trying to junk and are a little
> harder to reason about, but losing the sleeps is good. Please CC me on the
> review when you post it, as I've some specific comments.

Which interface is it that you're trying to junk?

I'll definitely CC you on the merge proposal.

> > We may want to backport the eventual fix to 2.0, if it isn't too hairy. Thinking
> > of which, is there a bug number for this?
>
> Nope, given that there are about five different bugs with the same underlying
> cause, it seemed easier to take it to the mailing list so all the discussion
> stayed in one place.

I see what you mean. On the other hand, I have to search for and then read
through a multi-message thread with no clear summary. Neither a cluster of
bugs nor a mailing list thread is entirely satisfying. Hmm.

Thanks for your comments!

Martin Packman (gz) wrote on 2010-02-19:

> But isn't it valid to test for correct behaviour in the face of EINTR on all
> platforms that have EINTR?

Depends on the test. For mocks, I think it's useful; when dealing with real platform code, where the outcome will either be a spurious pass or a hang depending on the details, I think it's clearer to skip. Note that os.name == "posix" covers linux, bsd (which actually has a different signal interrupt paradigm, but python switches into the posix style), mac os x, solaris... Name me a platform bzr supports that you're worried about skipping the test on?

> Which interface is it that you're trying to junk?

I see you've now found:
<https://code.launchpad.net/~gz/bzr/no_until_no_eintr/+merge/19615>

Andrew Bennetts (spiv) wrote on 2010-02-19:

Martin [gz] wrote:
> > But isn't it valid to test for correct behaviour in the face of EINTR on all
> > platforms that have EINTR?
>
> Depends on the test. For mocks, I think it's useful, when dealing with real
> platform code where the outcome will either be a spurious pass or a hang
> depending on the details, I think it's clearer to skip. Note that os.name ==
> "posix" covers linux, bsd (which actually has a different signal interrupt
> paradigm but python switches into the posix style), mac os x, solaris... Name
> me a platform bzr supports that you're worried about skipping the test on?

It's more that I'd rather *not* worry about making sure all relevant platforms
are covered, and instead aim for the broadest possible coverage. If that turns
out to be impractical then os.name == 'posix' will do.

Martin Packman (gz) wrote on 2010-02-25:

Changing the status to WIP here as Andrew has a branch building on this.

=== modified file 'bzrlib/tests/test_smart_transport.py'
--- bzrlib/tests/test_smart_transport.py	2010-02-11 09:27:55 +0000
+++ bzrlib/tests/test_smart_transport.py	2010-02-16 23:34:14 +0000
@@ -367,6 +367,70 @@
         client_medium.disconnect()
         self.assertEqual(['flush'], flush_calls)
 
+    def test_ssh_client_medium_eintr__read_bytes(self):
+        """
+        Verify that _read_bytes is robust against EINTR being raised
+
+        It's probably not possible to test this in such a way as it won't
+        sometimes randomly pass, but this fails in the expected manners
+        pretty reliably.
+        """
+        if os.name != "posix":
+            raise tests.TestNotApplicable("Only nix needs to handle EINTR")
+        import errno, signal, sys, time
+        # Use timeouts we can join the server thread without risking hanging
+        timeout = 3
+        sock_server = socket.socket()
+        sock_server.settimeout(timeout)
+        sock_server.bind(("127.0.0.1", 0))
+        sock_server.listen(1)
+        about_to_read = threading.Event()
+        def _send_with_interrupt():
+            """Send a message split by a signal to a connecting client"""
+            sock_response, addr = sock_server.accept()
+            about_to_read.wait(timeout)
+            sock_response.sendall("head")
+            # Let the main thread recv before raising signal
+            time.sleep(0.1)
+            os.kill(os.getpid(), signal.SIGUSR1)
+            # Let the signal be handled before sending more
+            time.sleep(0.1)
+            sock_response.sendall("tail")
+        server_thread = threading.Thread(target=_send_with_interrupt)
+        siglog = []
+        def _record_signal(signum, frame):
+            """Record the signal received and the functions being called"""
+            stack = []
+            while frame:
+                co_name = frame.f_code.co_name
+                if co_name == "test_ssh_client_medium_eintr__read_bytes":
+                    break
+                stack.append(co_name)
+                frame = frame.f_back
+            siglog.append((signum, stack))
+        self.addCleanup(signal.signal, signal.SIGUSR1,
+            signal.signal(signal.SIGUSR1, _record_signal))
+        server_thread.start()
+        self.addCleanup(server_thread.join)
+        sock_request = socket.socket()
+        sock_request.connect(sock_server.getsockname())
+        addr, port = sock_request.getsockname()
+        client_medium = medium.SmartSSHClientMedium(addr, port,
+            vendor=StringIOSSHVendor(sock_request.makefile(), None))
+        client_medium._ensure_connection()
+        about_to_read.set()
+        try:
+            bytes = client_medium._read_bytes(8)
+        except socket.error, e:
+            if e.args[0] == errno.EINTR and sys.version_info < (2, 6):
+                self.knownFailure("Interruptions not handled on Python < 2.6")
+            raise
+        self.assertEqual(siglog,
+            [(signal.SIGUSR1, ["read", "until_no_eintr", "_read_bytes"])])
+        self.expectFailure("An interrupted read loses bytes already read",
+            self.assertNotEqual, bytes, "tail")
+        self.assertEqual(bytes, "headtail")
+
      def test_construct_smart_tcp_client_medium(self):
          # the TCP client medium takes a host and a port.  Constructing it won't
          # connect to anything.