Mir

Merge lp:~alan-griffiths/mir/SwitchingBundle-controls-completion-of-client_acquire-without-blocking into lp:mir

Proposed by Alan Griffiths
Status: Merged
Approved by: Alan Griffiths
Approved revision: no longer in the source branch.
Merged at revision: 1398
Proposed branch: lp:~alan-griffiths/mir/SwitchingBundle-controls-completion-of-client_acquire-without-blocking
Merge into: lp:mir
Prerequisite: lp:~alan-griffiths/mir/refactoring-so-SwitchingBundle-can-control-completion-of-client_acquire
Diff against target: 102 lines (+27/-6)
3 files modified
src/server/compositor/switching_bundle.cpp (+23/-5)
src/server/compositor/switching_bundle.h (+3/-1)
src/server/frontend/session_mediator.cpp (+1/-0)
To merge this branch: bzr merge lp:~alan-griffiths/mir/SwitchingBundle-controls-completion-of-client_acquire-without-blocking
Reviewer Review Type Date Requested Status
PS Jenkins bot (community) continuous-integration Approve
Alexandros Frantzis (community) Approve
Andreas Pokorny (community) Approve
Daniel van Vugt Abstain
Kevin DuBois (community) Approve
Alan Griffiths Pending
Review via email: mp+205568@code.launchpad.net

This proposal supersedes a proposal from 2014-01-31.

Commit message

compositor: non-blocking implementation of SwitchingBundle::client_acquire

Description of the change

compositor: non-blocking implementation of SwitchingBundle::client_acquire

This is the second of three passes over the the code.

The final pass will be to clean up some of the mess left in the code - in particular we can get rid of some duplication and kill the legacy "force_requests_to_complete" code that forces unblocking of frontend threads hung in client_acquire.

I'm splitting it this way as it is easier to review in these chunks

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Needs Fixing (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Approve (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Needs Fixing (continuous-integration)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal

Not able to reproduce error. Restarting to see if it "magically" goes away.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Needs Fixing (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Needs Fixing (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Approve (continuous-integration)
Revision history for this message
Kevin DuBois (kdub) wrote : Posted in a previous version of this proposal

looks sensible from a functional perspective.

just non-blocking nits:
38: two lines?
'client_acquisition_callback' makes more sense to me than "client_acquire_todo'

review: Approve
Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal

> looks sensible from a functional perspective.
>
> just non-blocking nits:
> 38: two lines?

http://unity.ubuntu.com/mir/cppguide/index.html?showone=Conditionals#Conditionals

> 'client_acquisition_callback' makes more sense to me than
> "client_acquire_todo'

Then the code would read:
  if (client_acquire_callback) complete_client_acquire(lock);

I think it is clearer as:
  if (client_acquire_todo) complete_client_acquire(lock);

or maybe:
  if (client_acquire_pending) complete_client_acquire(lock);

Revision history for this message
Alexandros Frantzis (afrantzis) wrote : Posted in a previous version of this proposal

Since we want the variable to act as both a boolean flag and a function callback, any name we pick is going to match one situation a bit better than the other.

client_acquire_todo and client_acquire_pending don't tell us much about what is that we want to do or what's pending and they may be a bit misleading, in that they read as pending/todo client_acquire.

I prefer "if (client_acquire_callback)" which to me reads: if we have a client_acquire callback then ...

Possible improvements: client_acquire_pending_callback, or borrowing from interrupt terminology: client_acquire_(pending_)bottom_half ?

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote : Posted in a previous version of this proposal

I dont have a clear opinion on the function name current version looks good to me, since we do not call client_acquire_todo() but move it into another variable with a proper name before calling.

Following the pattern of the multi-personality variable I think that
if (client_acquire_pending) is better than if (client_acquire_pending_callback) .. because again the calling site uses a different variable name.

review: Approve
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal
review: Approve (continuous-integration)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal

I've found a scenario that segfaults the server with code equivalent to this.

Don't top-approve until I've investigated further.

review: Needs Fixing
Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal

> I've found a scenario that segfaults the server with code equivalent to this.

The essence of the problem is that the frontend socket session will clear down on various "error" conditions (like a client process being terminated). That leaves client_acquire_todo pointing into dead objects. This can lead to various failure modes.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal

> > I've found a scenario that segfaults the server with code equivalent to
> this.

Fixed.

I think some cleanup is possible - but I'd prefer to incorporate that in the more general cleanup of the "force_requests_to_complete()" code that is now misnamed (and, at least partly, obsolete).

review: Abstain
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Kevin DuBois (kdub) wrote :

previous approve still stands, it looks like the fix was to just prevent any frontend throws from making it to the compositor. Seems like a test for that scenario might be helpful though :)

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I notice you only eliminated one out of three wait() calls in client_acquire(). I think that's the significant one anyway.

Also, the removal of cond.notify_all() is potentially very dangerous with cond.wait()'s scattered throughout the code still. I can't yet tell if it's safe.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> previous approve still stands, it looks like the fix was to just prevent any
> frontend throws from making it to the compositor. Seems like a test for that
> scenario might be helpful though :)

Actually, adding the missing lock in ~SessionMediator() is at least as significant as it prevents race conditions where the session closes down while swapping buffers. Vis:

97 + std::unique_lock<std::mutex> lock(session_mutex);

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I notice you only eliminated one out of three wait() calls in
> client_acquire(). I think that's the significant one anyway.
>
> Also, the removal of cond.notify_all() is potentially very dangerous with
> cond.wait()'s scattered throughout the code still. I can't yet tell if it's
> safe.

One of the "wait()"s moved to complete_client_acquire() and relates to snapshotting - the "notify()" in snapshot_release() is clearly still in place.

The "wait()" that remains in client_acquire() seems to relate to a scenario that AFAICS doesn't happen in current usage (multiple client buffers having been acquired). But again, there is a "notify()" in client_release()

The "wait()" and "notify()"s I've remove relate to a third wait condition - waiting for compositing to release a buffer.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

One concerning aspect of the changes I realized while reviewing this MP (but not introduced by this MP) is that we moved work from the client thread to the compositor thread. That is, compositor_release() calls complete_client_acquire() which:

1. May block if a snapshot is taking place.
2. Calls the externally provided completion function which may take some time to finish.
3. The provided completion function is called under lock, which needs care.

The compositor's interaction with the SwitchingBundle is supposed to be super-fast and non-blocking, and the changes break these assumptions.

I am OK with this MP per se, so approving, but I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I am OK with this MP per se, so approving, but I think we need to re-evaluate our approach, in light of the points > above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

Oops, sorry, left over draft. Should have been only:

I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

review: Needs Information
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (but not introduced by this MP)

This is not true either, remains of earlier draft...

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

So, I think you're saying: "Needs Discussion"

~~~~

we moved work from the client thread to the compositor thread. That is, compositor_release() calls complete_client_acquire() which:

1. May block if a snapshot is taking place.
2. Calls the externally provided completion function which may take some time to finish.
3. The provided completion function is called under lock, which needs care.

The compositor's interaction with the SwitchingBundle is supposed to be super-fast and non-blocking, and the changes break these assumptions.

I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

~~~~

It is worth mentioning that these assumptions will only fail when we start compositing a surface that we'd previously blocked. Not that it can't be considered an issue, but it isn't the principle path through the code.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

It seems my original concerns [1] about thread safety are somewhat justified now, according to helgrind:

development-branch: 236 errors from 8 contexts
This branch: 1998 errors from 9 contexts

Tested with:
    valgrind --tool=helgrind bin/mir_unit_tests --gtest_filter="SwitchingBundle*"

[1] https://code.launchpad.net/~alan-griffiths/mir/refactoring-so-SwitchingBundle-can-control-completion-of-client_acquire/+merge/204244

I can't immediately see what the new context is, but the significant increase in errors warrants some attention.

review: Needs Fixing
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> It seems my original concerns [1] about thread safety are somewhat justified
> now, according to helgrind:
>
> development-branch: 236 errors from 8 contexts
> This branch: 1998 errors from 9 contexts
>
> Tested with:
> valgrind --tool=helgrind bin/mir_unit_tests
> --gtest_filter="SwitchingBundle*"
>
> [1] https://code.launchpad.net/~alan-griffiths/mir/refactoring-so-
> SwitchingBundle-can-control-completion-of-client_acquire/+merge/204244
>
> I can't immediately see what the new context is, but the significant increase
> in errors warrants some attention.

Giving it some attention.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :
Download full text (3.5 KiB)

> It seems my original concerns [1] about thread safety are somewhat justified
> now, according to helgrind:
>
> development-branch: 236 errors from 8 contexts
> This branch: 1998 errors from 9 contexts
>
> Tested with:
> valgrind --tool=helgrind bin/mir_unit_tests
> --gtest_filter="SwitchingBundle*"
>
> [1] https://code.launchpad.net/~alan-griffiths/mir/refactoring-so-
> SwitchingBundle-can-control-completion-of-client_acquire/+merge/204244
>
> I can't immediately see what the new context is, but the significant increase
> in errors warrants some attention.

With the following suppression file we get a clean run. (Please check you're happy with these suppressions.)

AFAICS the only possibly contentious one is for client_acquire_blocking() in test_swapping_swappers.cpp

###############################################################################
# Part 1: Suppress spurious races in std library functions

{
   std::atomic::load
   Helgrind:Race
   fun:_ZNKSt13__atomic_baseIbE4loadESt12memory_order
}
{
   std::atomic::store
   Helgrind:Race
   fun:_ZNSt13__atomic_baseIbE5storeEbSt12memory_order
}

{
   std::mutex::unlock()
   Helgrind:Race
   fun:_ZNSt5mutex6unlockEv
   obj:*
}

{
   std::mutex::lock()
   Helgrind:Race
   fun:_ZNSt5mutex4lockEv
   obj:*
}

{
   std::thread::join
   Helgrind:Race
   obj:/usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so
   fun:_ZNSt6thread4joinEv
}

{
   std::unique_lock::~unique_lock()
   Helgrind:Race
   fun:_ZNSt11unique_lockISt5mutexED1Ev
   obj:*
}

{
   std::unique_lock::unique_lock(std::mutex&)
   Helgrind:Race
   fun:_ZNSt11unique_lockISt5mutexEC1ERS0_
}

{
   std::shared_ptr<mir::graphics::Buffer, (__gnu_cxx::_Lock_policy)2>::get() const
   Helgrind:Race
   fun:_ZNKSt12__shared_ptrIN3mir8graphics6BufferELN9__gnu_cxx12_Lock_policyE2EE3getEv
   obj:*
}

{
   std::chrono::duration_cast
   Helgrind:Race
   fun:_ZNSt6chrono20__duration_cast_implINS_8durationIlSt5ratioILl1ELl1EEEES2_ILl1ELl1000EElLb1ELb0EE6__castIlS5_EES4_RKNS1_IT_T0_EE
}

{
   std::thread::_Impl_base::~_Impl_base()
   Helgrind:Race
   fun:_ZNSt6thread10_Impl_baseD1Ev
}

{
   std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()
   Helgrind:Race
   fun:_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED1Ev
}

###############################################################################
# Part 2: Suppress spurious races in std library template instantiations

{
   std::thread::_Impl<std::_Bind_simple<void (*(std::reference_wrapper<mir::compositor::SwitchingBundle>, std::reference_wrapper<unsigned long>, std::reference_wrapper<mir::graphics::BufferID>))(mir::compositor::SwitchingBundle&, unsigned long&, mir::graphics::BufferID&)> >::~_Impl()
   Helgrind:Race
   fun:_ZNSt6thread5_ImplISt12_Bind_simpleIFPFvRN3mir10compositor15SwitchingBundleERmRNS2_8graphics8BufferIDEESt17reference_wrapperIS4_ESC_ImESC_IS8_EEEED1Ev
}

{
   std::thread::_Impl<std::_Bind_simple<void (*(std::reference_wrapper<mir::compositor::SwitchingBundle>, int))(mir::compositor::SwitchingBundle&, int)> >::~_Impl()
   Helgrind:Race
   fun:_ZNSt6thread5_ImplISt12_Bind_simpleIFPFvRN3mir10compositor15SwitchingBundleEiESt17reference_wrapperIS4_EiEEED1Ev
}...

Read more...

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I find it hard to believe the C++ library is really that bad. I know if using pure pthread calls then helgrind is always right and you always need to fix your code. Although I can see helgrind is wrong about std::atomic, so it could be wrong elsewhere too. Not sure.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, ignoring helgrind, I don't see any reason why this proposal would be wrong.

I'm nervous about major changes to something that took several months to perfect, just as we were nervous to retire the code it replaced. But I don't have a logical argument against these changes right now.

review: Abstain
Revision history for this message
Andreas Pokorny (andreas-pokorny) :
review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

I am OK with the implementation details, but I am still concerned about introducing delays in the compositing threads. Approving, but we should keep our eyes open for unwanted effects.

review: Approve
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I find it hard to believe the C++ library is really that bad. I know if using
> pure pthread calls then helgrind is always right and you always need to fix
> your code. Although I can see helgrind is wrong about std::atomic, so it could
> be wrong elsewhere too. Not sure.

The difficulty for helgrind is that there are a lot of built-in types that are inherently atomic on Intel. Which means that the generated opcodes are identical to using thew without the atomic wrapper. The same is true of atomic counters used in the threads and smart pointer implementations.

There's discussion on the valgrind, g++ and clang forums about this as it obviously makes checking use of c++11 atomics, threads, lambdas and smart pointers noisy with the potential to lose the signal in the noise.

I don't see a clear solution other than to suppress the false positives (and, I don't immediately see a way to handle generically for template instantiations).

Revision history for this message
PS Jenkins bot (ps-jenkins) :
review: Approve (continuous-integration)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'src/server/compositor/switching_bundle.cpp'
2--- src/server/compositor/switching_bundle.cpp 2014-01-31 14:13:36 +0000
3+++ src/server/compositor/switching_bundle.cpp 2014-02-13 17:06:03 +0000
4@@ -180,6 +180,7 @@
5 void mc::SwitchingBundle::client_acquire(std::function<void(graphics::Buffer* buffer)> complete)
6 {
7 std::unique_lock<std::mutex> lock(guard);
8+ client_acquire_todo = std::move(complete);
9
10 if ((framedropping || force_drop) && nbuffers > 1)
11 {
12@@ -210,10 +211,17 @@
13 1;
14 #endif
15
16- while (nfree() < min_free)
17- cond.wait(lock);
18+ if (nfree() < min_free)
19+ return;
20 }
21
22+ complete_client_acquire(std::move(lock));
23+}
24+
25+void mc::SwitchingBundle::complete_client_acquire(std::unique_lock<std::mutex> lock)
26+{
27+ auto complete = std::move(client_acquire_todo);
28+
29 if (force_drop > 0)
30 force_drop--;
31
32@@ -247,7 +255,16 @@
33 ring[client].buf = ret;
34 }
35
36- complete(ret.get());
37+ lock.unlock();
38+
39+ try
40+ {
41+ complete(ret.get());
42+ }
43+ catch (...)
44+ {
45+ // TODO comms errors should not propagate to compositing threads
46+ }
47 }
48
49 void mc::SwitchingBundle::client_release(graphics::Buffer* released_buffer)
50@@ -339,7 +356,8 @@
51 first_compositor = next(first_compositor);
52 ncompositors--;
53 }
54- cond.notify_all();
55+
56+ if (client_acquire_todo) complete_client_acquire(std::move(lock));
57 }
58 }
59
60@@ -394,7 +412,7 @@
61 std::unique_lock<std::mutex> lock(guard);
62 drop_frames(nready);
63 force_drop = nbuffers + 1;
64- cond.notify_all();
65+ if (client_acquire_todo) complete_client_acquire(std::move(lock));
66 }
67
68 void mc::SwitchingBundle::allow_framedropping(bool allow_dropping)
69
70=== modified file 'src/server/compositor/switching_bundle.h'
71--- src/server/compositor/switching_bundle.h 2014-02-11 00:59:44 +0000
72+++ src/server/compositor/switching_bundle.h 2014-02-13 17:06:03 +0000
73@@ -80,7 +80,7 @@
74 int last_compositor() const;
75
76 const std::shared_ptr<graphics::Buffer> &alloc_buffer(int slot);
77-
78+ void complete_client_acquire(std::unique_lock<std::mutex> lock);
79 struct SharedBuffer
80 {
81 std::shared_ptr<graphics::Buffer> buf;
82@@ -108,6 +108,8 @@
83 bool framedropping;
84 int force_drop;
85
86+ std::function<void(graphics::Buffer* buffer)> client_acquire_todo;
87+
88 friend std::ostream& operator<<(std::ostream& os, const SwitchingBundle& bundle);
89 };
90
91
92=== modified file 'src/server/frontend/session_mediator.cpp'
93--- src/server/frontend/session_mediator.cpp 2014-02-10 10:56:56 +0000
94+++ src/server/frontend/session_mediator.cpp 2014-02-13 17:06:03 +0000
95@@ -82,6 +82,7 @@
96
97 mf::SessionMediator::~SessionMediator() noexcept
98 {
99+ std::unique_lock<std::mutex> lock(session_mutex);
100 auto session = weak_session.lock();
101 if (session)
102 {

Subscribers

People subscribed via source and target branches