Mir

Merge lp:~afrantzis/mir/fix-1441759-screencast into lp:mir

Proposed by Alexandros Frantzis
Status: Merged
Approved by: Alexandros Frantzis
Approved revision: no longer in the source branch.
Merged at revision: 2522
Proposed branch: lp:~afrantzis/mir/fix-1441759-screencast
Merge into: lp:mir
Diff against target: 73 lines (+38/-5)
2 files modified
src/server/frontend/session_mediator.cpp (+8/-5)
tests/unit-tests/frontend/test_session_mediator.cpp (+30/-0)
To merge this branch: bzr merge lp:~afrantzis/mir/fix-1441759-screencast
Reviewer                      Review Type              Date Requested   Status
Alan Griffiths                                                          Approve
Chris Halse Rogers                                                      Needs Fixing
Kevin DuBois (community)                                                Approve
PS Jenkins bot (community)    continuous-integration                    Approve
Alberto Aguirre (community)                                             Approve
Review via email: mp+256942@code.launchpad.net

Commit message

frontend: Properly handle asynchronous completion function invocations (LP: #1441759)

Under some circumstances (e.g., when screencasting) the completion function for a client buffer request may be invoked asynchronously, but from the same thread that initiated the client buffer request (i.e. one of the IPC threads). Our code didn't handle this situation properly, causing clients to hang waiting for a reply that never arrived.

Description of the change

frontend: Properly handle asynchronous swap buffer completion function invocations (LP: #1441759)

Under some circumstances (e.g., when screencasting) the completion function for a client buffer request may be invoked asynchronously, but from the same thread that initiated the client buffer request (i.e. one of the IPC threads). Our code didn't handle this situation properly, causing clients to hang waiting for a reply that never arrived.
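
For context, a standalone sketch (hypothetical names, not Mir code) of why a thread-id check alone misfires here: a completion that is deferred and later run on the same IPC thread looks, to that check, exactly like a completion that ran synchronously while the caller's stack frame (and its lock) still existed.

    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::vector<std::function<void()>> deferred;  // stands in for work completed later

    void swap_buffers_stub(bool buffer_ready, std::function<void()> complete)
    {
        if (buffer_ready)
            complete();                    // synchronous completion, same stack
        else
            deferred.push_back(complete);  // asynchronous completion, possibly same thread
    }

    int main()
    {
        std::mutex session_mutex;

        auto request = [&](bool buffer_ready)
        {
            std::unique_lock<std::mutex> lock{session_mutex};
            auto const tid = std::this_thread::get_id();
            swap_buffers_stub(buffer_ready, [tid]
            {
                // The old check: "same thread id, therefore same stack and the
                // caller's lock is still held" -- wrong for the deferred case.
                if (tid == std::this_thread::get_id())
                    std::cout << "thread-id check claims synchronous completion\n";
            });
        };

        request(true);                 // genuinely synchronous: the claim is correct
        request(false);                // deferred (screencast-like) case ...
        for (auto& f : deferred) f();  // ... later run on the same thread: claim is wrong
    }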

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

LGTM.

review: Approve
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Kevin DuBois (kdub) :
review: Approve
Revision history for this message
Chris Halse Rogers (raof) wrote :

I'm a bit confused here. Under what situation will weak_tid.lock() return a valid pointer but *weak_tid.lock() != this_thread::get_id()?

Yes; the only time that it's valid to call lock.unlock() is when the lambda is executed in a stack frame below advance_buffer, so really we're not shuttling a thread_id into the swap_buffers callback, we're checking whether it's being called on the same stack or not.

It looks like this could be made simpler by just passing ownership of lock to advance_buffer.

Actually, what is that lock protecting, anyway? Eyeballing it, I think it'd be more correct to just release the lock before calling advance_buffer.

Marked as needs fixing. At the least, this requires removing the this_thread::get_id() check, because that check is guaranteed to be true by virtue of executing deeper on the same stack as advance_buffer.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I'm a bit confused here. Under what situation will weak_tid.lock() return a valid pointer but *weak_tid.lock() != this_thread::get_id()?

This can happen when a buffer is not available to the client at the time stream.swap_buffers() is called, but becomes available soon after stream.swap_buffers() returns (e.g. because a compositor has released a buffer) and before we leave the advance_buffer function. The completion callback is then invoked on another thread (the compositor thread) while the weak_ptr is still valid.

> It looks like this could be made simpler by just passing ownership of lock to advance_buffer.

The idea is to avoid holding the lock while running the completion function, which (among other things) performs IPC operations. If lock ownership was passed to advance_buffer (and not conditionally unlocked in the completion function), we would be holding the lock if the completion function executed synchronously.
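
To make the scenario above concrete, a self-contained sketch (hypothetical, not the actual Mir code paths) of that window: the weak_ptr is still lockable because advance_buffer has not returned yet, but the completion is running on a different thread, which is why the fix checks both conditions before touching the lock.

    #include <chrono>
    #include <iostream>
    #include <memory>
    #include <thread>

    int main()
    {
        std::thread compositor;

        auto advance_buffer_stub = [&]
        {
            auto const tid = std::make_shared<std::thread::id>(std::this_thread::get_id());
            std::weak_ptr<std::thread::id> const weak_tid{tid};

            // A buffer becomes available "soon after swap_buffers returns": the
            // completion fires on the compositor thread while this frame (and the
            // shared_ptr on its stack) is still alive.
            compositor = std::thread{[weak_tid]
            {
                if (auto const locked = weak_tid.lock())
                    std::cout << "weak_tid lockable, same thread: " << std::boolalpha
                              << (*locked == std::this_thread::get_id()) << '\n';
            }};

            // Keep the frame alive long enough to make the overlap likely; in Mir
            // the overlap is a genuine race rather than a sleep.
            std::this_thread::sleep_for(std::chrono::milliseconds{100});
        };

        advance_buffer_stub();
        compositor.join();
    }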

Revision history for this message
Chris Halse Rogers (raof) wrote :

> > I'm a bit confused here. Under what situation will weak_tid.lock() return a
> > valid pointer but *weak_tid.lock() != this_thread::get_id()?
>
> This can happen when a buffer is not available to the client at the time
> stream.swap_buffers() is called, but becomes available soon after
> stream.swap_buffers() returns (e.g. because a compositor has released a
> buffer) and before we leave the advance_buffer function. The completion
> callback is then invoked on another thread (the compositor thread) while the
> weak_ptr is still valid.

Ah, yeah, fair point.
>
> > It looks like this could be made simpler by just passing ownership of lock
> > to advance_buffer.
>
> The idea is to avoid holding the lock while running the completion function,
> which (among other things) performs IPC operations. If lock ownership was
> passed to advance_buffer (and not conditionally unlocked in the completion
> function), we would be holding the lock if the completion function executed
> synchronously.
Right. advance_buffer would be in charge of releasing the lock, which it would be perfectly safe to do unconditionally.

I think I like my last suggestion the most, though - don't call advance_buffer with the session_mutex locked, because it's unnecessary.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I think I like my last suggestion the most, though - don't call advance_buffer with the session_mutex locked, because it's unnecessary.

I think you may be right, but I want to investigate and test a bit more to ensure it doesn't cause any unintended consequences.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I think you may be right, but I want to investigate and test a bit more to ensure it doesn't cause any unintended consequences.

That being said, I would prefer making such an investigation/change independent of this MP. This MP fixes a bug with minimal changes, so I think it would be reasonable to merge this even if we remove the whole locking scheme soon.

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

> Right. advance_buffer would be in charge of releasing the lock, which it would be perfectly safe to do unconditionally.

No, we moved away from that because you don't want to release a lock from a thread that doesn't own it.

Revision history for this message
Chris Halse Rogers (raof) wrote :

On Thu, Apr 23, 2015 at 12:31 AM, Alberto Aguirre
<email address hidden> wrote:
>> Right. advance_buffer would be in charge of releasing the lock,
>> which it would be perfectly safe to do unconditionally.
>
> No, we moved away from that because you don't want to release a lock
> from a thread that doesn't own it.

No, I mean advance_buffer can release the lock *before* calling
swap_buffers.

This is basically a consequence of “holding the lock while calling
advance_buffer doesn't protect anything”.

@Alexandros: If your goal is a quick bugfix for quick
release/cherry-picking, then yeah, go ahead with this. In the absence
of such a goal, I think removing the locking is more appropriate.

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

> On Thu, Apr 23, 2015 at 12:31 AM, Alberto Aguirre
> <email address hidden> wrote:
> >> Right. advance_buffer would be in charge of releasing the lock,
> >> which it would be perfectly safe to do unconditionally.
> >
> > No, we moved away from that because you don't want to release a lock
> > from a thread that doesn't own it.
>
> No, I mean advance_buffer can release the lock *before* calling
> swap_buffers.
>
> This is basically a consequence of “holding the lock while calling
> advance_buffer doesn't protect anything”.
>
> @Alexandros: If your goal is a quick bugfix for quick
> release/cherry-picking, then yeah, go ahead with this. In the absence
> of such a goal, I think removing the locking is more appropriate.

But do we want to, say, call swap_buffers even if release_surface has been called?
I believe that was the other reason for that mutex existing: making each of the SessionMediator calls atomic up to the IPC callback.

Revision history for this message
Chris Halse Rogers (raof) wrote :

If that's our goal then I think we fail at it now - when swap_buffers is handled asynchronously I can't see any guarantee that any number of RPC calls haven't been *completely* processed before the IPC callback occurs.

I'm not sure why we want that guarantee, anyway? From a client perspective it's perfectly valid to call mir_surface_release without waiting for a previous mir_surface_swap_buffers to complete.

We do need to ensure that we process IPC messages in-connection-order, but locking at the SessionMediator level is insufficient to guarantee that. Our core socket read code is not reentrant, so it's guaranteed that SessionMediator is not going to be asked to process a release_surface call before a previous create_surface/exchange_buffers/next_buffer call has completed.

If we wanted to provide ordering guarantees across connections I think we'd need to provide explicit RPC ordering primitives.

Revision history for this message
Chris Halse Rogers (raof) wrote :

“...has completed.” On the same connection, which corresponds to the same Session.

I think it turns out that I'm arguing that almost all of the locking we do in SessionMediator is pointless? That seems a strong claim. I wonder if ThreadSanitizer can test this hypothesis...

Revision history for this message
Chris Halse Rogers (raof) wrote :

It can, and it's weak evidence in favour. Removing session_mutex entirely (a) doesn't cause any tests to fail, and (b) doesn't cause ThreadSanitizer to squawk.
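
For anyone repeating that experiment, a minimal sketch of the kind of race ThreadSanitizer flags, and a typical way to enable it (the exact flags used for the Mir build here are an assumption):

    // Build with something like:
    //   g++ -std=c++14 -g -fsanitize=thread -pthread tsan_demo.cpp -o tsan_demo
    // A project build would instead add -fsanitize=thread to its own compiler
    // and linker flags.
    #include <thread>

    int shared_counter = 0;  // deliberately unsynchronized

    int main()
    {
        std::thread t{[] { ++shared_counter; }};
        ++shared_counter;  // TSan reports this pair of unsynchronized writes as a data race
        t.join();
        return 0;
    }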

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Shorten please :)

+TEST_F(SessionMediator, completes_exchange_buffer_when_completion_is_invoked_asynchronously_from_thread_that_initiated_exchange)

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

I've not reviewed the locking logic WRT the above discussion, but beware that while we currently run a single IPC thread by default, we can potentially have a pool.

I think we can now be confident that a pool isn't needed and could remove that option. I also think we can massively simplify the code in this area by using coroutines now that asio has spawn and yield_context.

But this is a decent minimal fix for the problem addressed.

review: Approve
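
As a rough illustration of the coroutine direction Alan mentions, a minimal echo-style sketch (hypothetical, not a proposal for Mir's actual dispatch code) using Boost.Asio's spawn and yield_context; it assumes a Boost build with coroutine support and linking against boost_system, boost_coroutine and boost_context.

    #include <array>
    #include <memory>
    #include <boost/asio.hpp>
    #include <boost/asio/spawn.hpp>

    namespace asio = boost::asio;
    using protocol = asio::local::stream_protocol;

    // Each connection gets a stackful coroutine: the handler reads top to bottom,
    // and async_* calls suspend the coroutine instead of blocking an IPC thread.
    void serve_connection(asio::io_service& io, std::shared_ptr<protocol::socket> const& sock)
    {
        asio::spawn(io, [sock](asio::yield_context yield)
        {
            std::array<char, 512> buf;
            boost::system::error_code ec;
            for (;;)
            {
                auto const n = sock->async_read_some(asio::buffer(buf), yield[ec]);
                if (ec) break;  // client went away or the read failed
                asio::async_write(*sock, asio::buffer(buf, n), yield[ec]);  // echo back
                if (ec) break;
            }
        });
    }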
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

A pool (or something more elegant than a pool) would be required to solve bug 1395421. I'd favour moving toward ultra-lightweight (and non-blocking) RPC instead, though, so that the location of the response call becomes irrelevant.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

Top approving. We can revisit this as part of a potential 'frontend' reworking as discussed in previous comments.

Preview Diff

=== modified file 'src/server/frontend/session_mediator.cpp'
--- src/server/frontend/session_mediator.cpp 2015-04-16 20:59:36 +0000
+++ src/server/frontend/session_mediator.cpp 2015-04-21 14:21:34 +0000
@@ -161,23 +161,26 @@
     std::unique_lock<std::mutex>& lock,
     std::function<void(graphics::Buffer*, graphics::BufferIpcMsgType)> complete)
 {
-    auto const tid = std::this_thread::get_id();
+    auto const tid = std::make_shared<std::thread::id>(std::this_thread::get_id());
+    auto const weak_tid = std::weak_ptr<std::thread::id>{tid};

     stream.swap_buffers(
         old_buffer,
         // Note: We assume that the lambda will be executed within swap_buffers
         // (in which case the lock reference is valid) or in a different thread
         // altogether (in which case the dangling reference is not accessed)
-        [this, tid, &lock, stream_id, complete](mg::Buffer* new_buffer)
+        [this, weak_tid, &lock, stream_id, complete](mg::Buffer* new_buffer)
         {
-            if (tid == std::this_thread::get_id())
-                lock.unlock();
+            if (auto const tid = weak_tid.lock())
+            {
+                if (*tid == std::this_thread::get_id())
+                    lock.unlock();
+            }

             if (buffer_stream_tracker.track_buffer(stream_id, new_buffer))
                 complete(new_buffer, mg::BufferIpcMsgType::update_msg);
             else
                 complete(new_buffer, mg::BufferIpcMsgType::full_msg);
-
         });
 }


=== modified file 'tests/unit-tests/frontend/test_session_mediator.cpp'
--- tests/unit-tests/frontend/test_session_mediator.cpp 2015-03-31 02:35:42 +0000
+++ tests/unit-tests/frontend/test_session_mediator.cpp 2015-04-21 14:21:34 +0000
@@ -899,3 +899,33 @@
     EXPECT_THAT(*(reinterpret_cast<int*>(request.data.data())), Eq(magic));
     EXPECT_THAT(drm_response.status_code(), Eq(test_response));
 }
+
+// Regression test for LP: #1441759
+TEST_F(SessionMediator, completes_exchange_buffer_when_completion_is_invoked_asynchronously_from_thread_that_initiated_exchange)
+{
+    using namespace testing;
+    auto const& mock_surface = stubbed_session->mock_surface_at(mf::SurfaceId{0});
+    mtd::StubBuffer stub_buffer1;
+    mtd::StubBuffer stub_buffer2;
+    std::function<void(mg::Buffer*)> completion_func;
+
+    // create
+    InSequence seq;
+    EXPECT_CALL(*mock_surface, swap_buffers(_, _))
+        .WillOnce(InvokeArgument<1>(&stub_buffer1));
+    // exchange, steal completion function
+    EXPECT_CALL(*mock_surface, swap_buffers(_,_))
+        .WillOnce(SaveArg<1>(&completion_func));
+
+    mediator.connect(nullptr, &connect_parameters, &connection, null_callback.get());
+    mediator.create_surface(nullptr, &surface_parameters, &surface_response, null_callback.get());
+
+    buffer_request.mutable_id()->set_value(surface_response.id().value());
+    *buffer_request.mutable_buffer() = surface_response.buffer_stream().buffer();
+
+    mediator.exchange_buffer(nullptr, &buffer_request, &buffer_response, null_callback.get());
+
+    // Execute completion function asynchronously (i.e. not as part of the exchange_buffer
+    // call), but from the same thread that initiated the exchange_buffer operation
+    completion_func(&stub_buffer2);
+}
