Merge into development-branch : SwitchingBundle-controls-completion-of-client

Reviewer	Review Type	Date Requested	Status
PS Jenkins bot (community)	continuous-integration	2014-02-10	Approve on 2014-02-14
Alexandros Frantzis (community)			Approve on 2014-02-14
Andreas Pokorny (community)		2014-02-10	Approve on 2014-02-14
Daniel van Vugt			Abstain on 2014-02-14
Kevin DuBois (community)		2014-02-10	Approve on 2014-02-11
Alan Griffiths			Pending
Review via email: mp+205568@code.launchpad.net

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-01-31: Posted in a previous version of this proposal

#

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/756/rebuild

review: Needs Fixing (continuous-integration)

PS Jenkins bot (ps-jenkins) wrote on 2014-02-03: Posted in a previous version of this proposal

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/763/rebuild

review: Approve (continuous-integration)

PS Jenkins bot (ps-jenkins) wrote on 2014-02-04: Posted in a previous version of this proposal

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/770/rebuild

review: Needs Fixing (continuous-integration)

Alan Griffiths (alan-griffiths) wrote on 2014-02-05: Posted in a previous version of this proposal

Not able to reproduce error. Restarting to see if it "magically" goes away.

PS Jenkins bot (ps-jenkins) wrote on 2014-02-05: Posted in a previous version of this proposal

FAILED: Continuous integration, rev:1377
http://jenkins.qa.ubuntu.com/job/mir-team-mir-development-branch-ci/776/
Executed test runs:
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-android-trusty-i386-build/805/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-clang-trusty-amd64-build/802/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-trusty-touch/400/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-team-mir-development-branch-trusty-amd64-ci/506/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-team-mir-development-branch-trusty-armhf-ci/510/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-trusty-armhf/401/console

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/776/rebuild

review: Needs Fixing (continuous-integration)

PS Jenkins bot (ps-jenkins) wrote on 2014-02-05: Posted in a previous version of this proposal

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/779/rebuild

review: Needs Fixing (continuous-integration)

PS Jenkins bot (ps-jenkins) wrote on 2014-02-06: Posted in a previous version of this proposal

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/793/rebuild

review: Approve (continuous-integration)

Kevin DuBois (kdub) wrote on 2014-02-07: Posted in a previous version of this proposal

looks sensible from a functional perspective.

just non-blocking nits:
38: two lines?
'client_acquisition_callback' makes more sense to me than 'client_acquire_todo'

review: Approve

Alan Griffiths (alan-griffiths) wrote on 2014-02-07: Posted in a previous version of this proposal

> looks sensible from a functional perspective.
>
> just non-blocking nits:
> 38: two lines?

http://unity.ubuntu.com/mir/cppguide/index.html?showone=Conditionals#Conditionals

> 'client_acquisition_callback' makes more sense to me than
> 'client_acquire_todo'

Then the code would read:
if (client_acquire_callback) complete_client_acquire(lock);

I think it is clearer as:
if (client_acquire_todo) complete_client_acquire(lock);

or maybe:
if (client_acquire_pending) complete_client_acquire(lock);

Alexandros Frantzis (afrantzis) wrote on 2014-02-07: Posted in a previous version of this proposal

Since we want the variable to act as both a boolean flag and a function callback, any name we pick is going to match one situation a bit better than the other.

client_acquire_todo and client_acquire_pending don't tell us much about what it is that we want to do or what's pending, and they may be a bit misleading, in that they read as pending/todo client_acquire.

I prefer "if (client_acquire_callback)" which to me reads: if we have a client_acquire callback then ...

Possible improvements: client_acquire_pending_callback, or borrowing from interrupt terminology: client_acquire_(pending_)bottom_half ?

Andreas Pokorny (andreas-pokorny) wrote on 2014-02-07: Posted in a previous version of this proposal

I don't have a clear opinion on the function name; the current version looks good to me, since we do not call client_acquire_todo() directly but move it into another variable with a proper name before calling it.

Following the pattern of the multi-personality variable, I think that
if (client_acquire_pending) is better than if (client_acquire_pending_callback), because again the call site uses a different variable name.
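For readers without the diff to hand, the "multi-personality variable" under discussion can be sketched roughly as follows. This is only an illustration under stated assumptions: BufferSketch, BundleSketch, free_buffer and acquire_pending() are hypothetical names, and the real SwitchingBundle is considerably more involved. Only client_acquire_todo, client_acquire(), compositor_release() and complete_client_acquire(lock) are taken from the thread.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in; not mir::graphics::Buffer.
struct BufferSketch { int id; };

// Hypothetical, heavily simplified bundle; not the real SwitchingBundle.
class BundleSketch
{
public:
    using Completion = std::function<void(BufferSketch*)>;

    // Instead of blocking the caller, remember how to complete the request.
    void client_acquire(Completion complete)
    {
        std::unique_lock<std::mutex> lock(guard);
        client_acquire_todo = std::move(complete);
        if (free_buffer) complete_client_acquire(lock);
    }

    void compositor_release(BufferSketch* released)
    {
        std::unique_lock<std::mutex> lock(guard);
        free_buffer = released;
        // Flag personality: an empty std::function tests false.
        if (client_acquire_todo) complete_client_acquire(lock);
    }

    // Hypothetical helper so callers can observe the flag.
    bool acquire_pending()
    {
        std::unique_lock<std::mutex> lock(guard);
        return static_cast<bool>(client_acquire_todo);
    }

private:
    void complete_client_acquire(std::unique_lock<std::mutex>& lock)
    {
        assert(lock.owns_lock()); // the caller must already hold the lock
        // Callback personality: move into a local with a clearer name.
        Completion const complete = std::move(client_acquire_todo);
        client_acquire_todo = nullptr; // a moved-from std::function is unspecified
        BufferSketch* const buffer = free_buffer;
        free_buffer = nullptr;
        complete(buffer); // note: still invoked with the lock held
    }

    std::mutex guard;
    BufferSketch* free_buffer = nullptr;
    Completion client_acquire_todo;
};
```

In this sketch the stored std::function is tested as a flag at the call sites and only invoked under a different local name, which is the property the naming argument turns on.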

review: Approve

PS Jenkins bot (ps-jenkins) wrote on 2014-02-07: Posted in a previous version of this proposal

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/801/rebuild

review: Approve (continuous-integration)

Alan Griffiths (alan-griffiths) wrote on 2014-02-07: Posted in a previous version of this proposal

I've found a scenario that segfaults the server with code equivalent to this.

Don't top-approve until I've investigated further.

review: Needs Fixing

Alan Griffiths (alan-griffiths) wrote on 2014-02-07: Posted in a previous version of this proposal

> I've found a scenario that segfaults the server with code equivalent to this.

The essence of the problem is that the frontend socket session will clear down on various "error" conditions (like a client process being terminated). That leaves client_acquire_todo pointing into dead objects. This can lead to various failure modes.
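The failure mode described here (a stored completion outliving the objects it refers to) can be illustrated with a small sketch. One common guard, which is not necessarily what this branch does, is to have the completion capture a std::weak_ptr and do nothing once the target has gone. SessionSketch and make_guarded_completion are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the per-client session state.
struct SessionSketch { int completed = 0; };

// Builds a completion that is safe to call after the session has been torn down.
std::function<void()> make_guarded_completion(std::shared_ptr<SessionSketch> const& session)
{
    assert(session != nullptr); // precondition for this sketch
    std::weak_ptr<SessionSketch> const weak = session;
    return [weak]
    {
        // lock() yields an empty shared_ptr if the session was already destroyed.
        if (auto const live = weak.lock())
            ++live->completed;
        // Otherwise the client went away and there is nothing left to complete.
    };
}
```

A completion that captured a raw pointer or reference instead would be left dangling in the same situation, which matches the "pointing into dead objects" description.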

Alan Griffiths (alan-griffiths) wrote on 2014-02-10: Posted in a previous version of this proposal

> > I've found a scenario that segfaults the server with code equivalent to
> this.

Fixed.

I think some cleanup is possible - but I'd prefer to incorporate that in the more general cleanup of the "force_requests_to_complete()" code that is now misnamed (and, at least partly, obsolete).

review: Abstain

PS Jenkins bot (ps-jenkins) wrote on 2014-02-10:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/812/rebuild

review: Approve (continuous-integration)

Kevin DuBois (kdub) wrote on 2014-02-11:

previous approve still stands, it looks like the fix was to just prevent any frontend throws from making it to the compositor. Seems like a test for that scenario might be helpful though :)

review: Approve

Daniel van Vugt (vanvugt) wrote on 2014-02-11:

I notice you only eliminated one out of three wait() calls in client_acquire(). I think that's the significant one anyway.

Also, the removal of cond.notify_all() is potentially very dangerous with cond.wait()'s scattered throughout the code still. I can't yet tell if it's safe.

Alan Griffiths (alan-griffiths) wrote on 2014-02-11:

> previous approve still stands, it looks like the fix was to just prevent any
> frontend throws from making it to the compositor. Seems like a test for that
> scenario might be helpful though :)

Actually, adding the missing lock in ~SessionMediator() is at least as significant, as it prevents race conditions where the session closes down while swapping buffers. Viz.:

97 + std::unique_lock<std::mutex> lock(session_mutex);
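As a rough illustration of why taking the lock in the destructor matters, the sketch below shows a destructor that waits for any operation currently holding session_mutex before clearing state. Apart from session_mutex and the quoted lock line, everything here (MediatorSketch, SessionSketch, swap_buffers) is a hypothetical stand-in and not the real SessionMediator.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the session owned by the mediator.
struct SessionSketch { std::vector<int> swapped; };

class MediatorSketch
{
public:
    explicit MediatorSketch(std::shared_ptr<SessionSketch> s) : session(std::move(s))
    {
        assert(session != nullptr); // precondition for this sketch
    }

    ~MediatorSketch()
    {
        // Blocks until any operation holding session_mutex has finished.
        std::unique_lock<std::mutex> lock(session_mutex);
        session.reset();
    }   // the lock is released here, before the members are destroyed

    bool swap_buffers(int buffer_id)
    {
        std::unique_lock<std::mutex> lock(session_mutex);
        if (!session) return false;
        session->swapped.push_back(buffer_id);
        return true;
    }

private:
    std::mutex session_mutex;
    std::shared_ptr<SessionSketch> session;
};
```

The lock only serializes the destructor against calls that are already in progress; calling a member function after destruction has completed is still undefined behavior, so object lifetime has to be guaranteed separately.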

Alan Griffiths (alan-griffiths) wrote on 2014-02-11:

> I notice you only eliminated one out of three wait() calls in
> client_acquire(). I think that's the significant one anyway.
>
> Also, the removal of cond.notify_all() is potentially very dangerous with
> cond.wait()'s scattered throughout the code still. I can't yet tell if it's
> safe.

One of the "wait()"s moved to complete_client_acquire() and relates to snapshotting - the "notify()" in snapshot_release() is clearly still in place.

The "wait()" that remains in client_acquire() seems to relate to a scenario that AFAICS doesn't happen in current usage (multiple client buffers having been acquired). But again, there is a "notify()" in client_release().

The "wait()" and "notify()"s I've removed relate to a third wait condition: waiting for compositing to release a buffer.

Alexandros Frantzis (afrantzis) wrote on 2014-02-11:

One concerning aspect of the changes I realized while reviewing this MP (but not introduced by this MP) is that we moved work from the client thread to the compositor thread. That is, compositor_release() calls complete_client_acquire() which:

1. May block if a snapshot is taking place.
2. Calls the externally provided completion function which may take some time to finish.
3. The provided completion function is called under lock, which needs care.

The compositor's interaction with the SwitchingBundle is supposed to be super-fast and non-blocking, and the changes break these assumptions.
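One possible way to address point 3 alone, offered as a sketch rather than as what this branch does, is to move the stored completion into a local while holding the lock and invoke it only after unlocking. UnlockedCompletionSketch, pending and has_pending() are hypothetical names. The completion still runs on the compositor thread, so points 1 and 2 are not addressed by this.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical, simplified bundle; buffers are reduced to integer ids.
class UnlockedCompletionSketch
{
public:
    using Completion = std::function<void(int)>;

    void client_acquire(Completion complete)
    {
        std::unique_lock<std::mutex> lock(guard);
        pending = std::move(complete);
    }

    void compositor_release(int buffer_id)
    {
        std::unique_lock<std::mutex> lock(guard);
        Completion complete;
        complete.swap(pending); // pending is left empty
        lock.unlock();
        assert(!lock.owns_lock());
        // Invoked without the lock, so the completion may call back into this object.
        if (complete) complete(buffer_id);
    }

    bool has_pending()
    {
        std::unique_lock<std::mutex> lock(guard);
        return static_cast<bool>(pending);
    }

private:
    std::mutex guard;
    Completion pending;
};
```

The trade-off is that the bundle's state may change between the unlock and the call, so the completion must only receive values that were captured while the lock was still held.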

I am OK with this MP per se, so approving, but I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

review: Approve

Alexandros Frantzis (afrantzis) wrote on 2014-02-11:

> I am OK with this MP per se, so approving, but I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

Oops, sorry, left over draft. Should have been only:

I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

review: Needs Information

Alexandros Frantzis (afrantzis) wrote on 2014-02-11:

> (but not introduced by this MP)

This is not true either, remains of earlier draft...

Alan Griffiths (alan-griffiths) wrote on 2014-02-11:

So, I think you're saying: "Needs Discussion"

~~~~

we moved work from the client thread to the compositor thread. That is, compositor_release() calls complete_client_acquire() which:

1. May block if a snapshot is taking place.
2. Calls the externally provided completion function which may take some time to finish.
3. The provided completion function is called under lock, which needs care.

The compositor's interaction with the SwitchingBundle is supposed to be super-fast and non-blocking, and the changes break these assumptions.

I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).

~~~~

It is worth mentioning that these assumptions will only fail when we start compositing a surface that we'd previously blocked. Not that it can't be considered an issue, but it isn't the principal path through the code.

Daniel van Vugt (vanvugt) wrote on 2014-02-13:

It seems my original concerns [1] about thread safety are somewhat justified now, according to helgrind:

development-branch: 236 errors from 8 contexts
This branch: 1998 errors from 9 contexts

Tested with:
valgrind --tool=helgrind bin/mir_unit_tests --gtest_filter="SwitchingBundle*"

[1] https://code.launchpad.net/~alan-griffiths/mir/refactoring-so-SwitchingBundle-can-control-completion-of-client_acquire/+merge/204244

I can't immediately see what the new context is, but the significant increase in errors warrants some attention.

review: Needs Fixing

Alan Griffiths (alan-griffiths) wrote on 2014-02-13:

> It seems my original concerns [1] about thread safety are somewhat justified
> now, according to helgrind:
>
> development-branch: 236 errors from 8 contexts
> This branch: 1998 errors from 9 contexts
>
> Tested with:
> valgrind --tool=helgrind bin/mir_unit_tests
> --gtest_filter="SwitchingBundle*"
>
> [1] https://code.launchpad.net/~alan-griffiths/mir/refactoring-so-
> SwitchingBundle-can-control-completion-of-client_acquire/+merge/204244
>
> I can't immediately see what the new context is, but the significant increase
> in errors warrants some attention.

Giving it some attention.

Alan Griffiths (alan-griffiths) wrote on 2014-02-13:

> It seems my original concerns [1] about thread safety are somewhat justified
> now, according to helgrind:
>
> development-branch: 236 errors from 8 contexts
> This branch: 1998 errors from 9 contexts
>
> Tested with:
> valgrind --tool=helgrind bin/mir_unit_tests
> --gtest_filter="SwitchingBundle*"
>
> [1] https://code.launchpad.net/~alan-griffiths/mir/refactoring-so-
> SwitchingBundle-can-control-completion-of-client_acquire/+merge/204244
>
> I can't immediately see what the new context is, but the significant increase
> in errors warrants some attention.

With the following suppression file we get a clean run. (Please check you're happy with these suppressions.)

AFAICS the only possibly contentious one is for client_acquire_blocking() in test_swapping_swappers.cpp

###############################################################################
# Part 1: Suppress spurious races in std library functions

{
   std::atomic::load
   Helgrind:Race
   fun:_ZNKSt13__atomic_baseIbE4loadESt12memory_order
}
{
   std::atomic::store
   Helgrind:Race
   fun:_ZNSt13__atomic_baseIbE5storeEbSt12memory_order
}

{
   std::mutex::unlock()
   Helgrind:Race
   fun:_ZNSt5mutex6unlockEv
   obj:*
}

{
   std::mutex::lock()
   Helgrind:Race
   fun:_ZNSt5mutex4lockEv
   obj:*
}

{
   std::thread::join
   Helgrind:Race
   obj:/usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so
   fun:_ZNSt6thread4joinEv
}

{
   std::unique_lock::~unique_lock()
   Helgrind:Race
   fun:_ZNSt11unique_lockISt5mutexED1Ev
   obj:*
}

{
   std::unique_lock::unique_lock(std::mutex&)
   Helgrind:Race
   fun:_ZNSt11unique_lockISt5mutexEC1ERS0_
}

{
   std::shared_ptr<mir::graphics::Buffer, (__gnu_cxx::_Lock_policy)2>::get() const
   Helgrind:Race
   fun:_ZNKSt12__shared_ptrIN3mir8graphics6BufferELN9__gnu_cxx12_Lock_policyE2EE3getEv
   obj:*
}

{
   std::chrono::duration_cast
   Helgrind:Race
   fun:_ZNSt6chrono20__duration_cast_implINS_8durationIlSt5ratioILl1ELl1EEEES2_ILl1ELl1000EElLb1ELb0EE6__castIlS5_EES4_RKNS1_IT_T0_EE
}

{
   std::thread::_Impl_base::~_Impl_base()
   Helgrind:Race
   fun:_ZNSt6thread10_Impl_baseD1Ev
}

{
   std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()
   Helgrind:Race
   fun:_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED1Ev
}

###############################################################################
# Part 2: Suppress spurious races in std library template instantiations

{
   std::thread::_Impl<std::_Bind_simple<void (*(std::reference_wrapper<mir::compositor::SwitchingBundle>, std::reference_wrapper<unsigned long>, std::reference_wrapper<mir::graphics::BufferID>))(mir::compositor::SwitchingBundle&, unsigned long&, mir::graphics::BufferID&)> >::~_Impl()
   Helgrind:Race
   fun:_ZNSt6thread5_ImplISt12_Bind_simpleIFPFvRN3mir10compositor15SwitchingBundleERmRNS2_8graphics8BufferIDEESt17reference_wrapperIS4_ESC_ImESC_IS8_EEEED1Ev
}

{
   std::thread::_Impl<std::_Bind_simple<void (*(std::reference_wrapper<mir::compositor::SwitchingBundle>, int))(mir::compositor::SwitchingBundle&, int)> >::~_Impl()
   Helgrind:Race
   fun:_ZNSt6thread5_ImplISt12_Bind_simpleIFPFvRN3mir10compositor15SwitchingBundleEiESt17reference_wrapperIS4_EiEEED1Ev
}...
