Merge lp:~alan-griffiths/mir/SwitchingBundle-controls-completion-of-client_acquire-without-blocking into lp:mir
- SwitchingBundle-controls-completion-of-client_acquire-without-blocking
- Merge into development-branch
Status: | Merged |
---|---|
Approved by: | Alan Griffiths |
Approved revision: | no longer in the source branch. |
Merged at revision: | 1398 |
Proposed branch: | lp:~alan-griffiths/mir/SwitchingBundle-controls-completion-of-client_acquire-without-blocking |
Merge into: | lp:mir |
Prerequisite: | lp:~alan-griffiths/mir/refactoring-so-SwitchingBundle-can-control-completion-of-client_acquire |
Diff against target: |
102 lines (+27/-6) 3 files modified
src/server/compositor/switching_bundle.cpp (+23/-5) src/server/compositor/switching_bundle.h (+3/-1) src/server/frontend/session_mediator.cpp (+1/-0) |
To merge this branch: | bzr merge lp:~alan-griffiths/mir/SwitchingBundle-controls-completion-of-client_acquire-without-blocking |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
PS Jenkins bot (community) | continuous-integration | Approve | |
Alexandros Frantzis (community) | Approve | ||
Andreas Pokorny (community) | Approve | ||
Daniel van Vugt | Abstain | ||
Kevin DuBois (community) | Approve | ||
Alan Griffiths | Pending | ||
Review via email: mp+205568@code.launchpad.net |
This proposal supersedes a proposal from 2014-01-31.
Commit message
compositor: non-blocking implementation of SwitchingBundle
Description of the change
compositor: non-blocking implementation of SwitchingBundle
This is the second of three passes over the the code.
The final pass will be to clean up some of the mess left in the code - in particular we can get rid of some duplication and kill the legacy "force_
I'm splitting it this way as it is easier to review in these chunks
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
PASSED: Continuous integration, rev:1376
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
FAILED: Continuous integration, rev:1377
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal | # |
Not able to reproduce error. Restarting to see if it "magically" goes away.
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
FAILED: Continuous integration, rev:1377
http://
Executed test runs:
FAILURE: http://
FAILURE: http://
FAILURE: http://
FAILURE: http://
FAILURE: http://
FAILURE: http://
Click here to trigger a rebuild:
http://
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
FAILED: Continuous integration, rev:1378
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
PASSED: Continuous integration, rev:1380
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
Kevin DuBois (kdub) wrote : Posted in a previous version of this proposal | # |
looks sensible from a functional perspective.
just non-blocking nits:
38: two lines?
'client_
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal | # |
> looks sensible from a functional perspective.
>
> just non-blocking nits:
> 38: two lines?
http://
> 'client_
> "client_
Then the code would read:
if (client_
I think it is clearer as:
if (client_
or maybe:
if (client_
Alexandros Frantzis (afrantzis) wrote : Posted in a previous version of this proposal | # |
Since we want the variable to act as both a boolean flag and a function callback, any name we pick is going to match one situation a bit better than the other.
client_acquire_todo and client_
I prefer "if (client_
Possible improvements: client_
Andreas Pokorny (andreas-pokorny) wrote : Posted in a previous version of this proposal | # |
I dont have a clear opinion on the function name current version looks good to me, since we do not call client_
Following the pattern of the multi-personality variable I think that
if (client_
PS Jenkins bot (ps-jenkins) wrote : Posted in a previous version of this proposal | # |
PASSED: Continuous integration, rev:1381
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal | # |
I've found a scenario that segfaults the server with code equivalent to this.
Don't top-approve until I've investigated further.
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal | # |
> I've found a scenario that segfaults the server with code equivalent to this.
The essence of the problem is that the frontend socket session will clear down on various "error" conditions (like a client process being terminated). That leaves client_acquire_todo pointing into dead objects. This can lead to various failure modes.
Alan Griffiths (alan-griffiths) wrote : Posted in a previous version of this proposal | # |
> > I've found a scenario that segfaults the server with code equivalent to
> this.
Fixed.
I think some cleanup is possible - but I'd prefer to incorporate that in the more general cleanup of the "force_
PS Jenkins bot (ps-jenkins) wrote : | # |
PASSED: Continuous integration, rev:1383
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
Kevin DuBois (kdub) wrote : | # |
previous approve still stands, it looks like the fix was to just prevent any frontend throws from making it to the compositor. Seems like a test for that scenario might be helpful though :)
Daniel van Vugt (vanvugt) wrote : | # |
I notice you only eliminated one out of three wait() calls in client_acquire(). I think that's the significant one anyway.
Also, the removal of cond.notify_all() is potentially very dangerous with cond.wait()'s scattered throughout the code still. I can't yet tell if it's safe.
Alan Griffiths (alan-griffiths) wrote : | # |
> previous approve still stands, it looks like the fix was to just prevent any
> frontend throws from making it to the compositor. Seems like a test for that
> scenario might be helpful though :)
Actually, adding the missing lock in ~SessionMediator() is at least as significant as it prevents race conditions where the session closes down while swapping buffers. Vis:
97 + std::unique_
Alan Griffiths (alan-griffiths) wrote : | # |
> I notice you only eliminated one out of three wait() calls in
> client_acquire(). I think that's the significant one anyway.
>
> Also, the removal of cond.notify_all() is potentially very dangerous with
> cond.wait()'s scattered throughout the code still. I can't yet tell if it's
> safe.
One of the "wait()"s moved to complete_
The "wait()" that remains in client_acquire() seems to relate to a scenario that AFAICS doesn't happen in current usage (multiple client buffers having been acquired). But again, there is a "notify()" in client_release()
The "wait()" and "notify()"s I've remove relate to a third wait condition - waiting for compositing to release a buffer.
Alexandros Frantzis (afrantzis) wrote : | # |
One concerning aspect of the changes I realized while reviewing this MP (but not introduced by this MP) is that we moved work from the client thread to the compositor thread. That is, compositor_
1. May block if a snapshot is taking place.
2. Calls the externally provided completion function which may take some time to finish.
3. The provided completion function is called under lock, which needs care.
The compositor's interaction with the SwitchingBundle is supposed to be super-fast and non-blocking, and the changes break these assumptions.
I am OK with this MP per se, so approving, but I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).
Alexandros Frantzis (afrantzis) wrote : | # |
> I am OK with this MP per se, so approving, but I think we need to re-evaluate our approach, in light of the points > above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).
Oops, sorry, left over draft. Should have been only:
I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).
Alexandros Frantzis (afrantzis) wrote : | # |
> (but not introduced by this MP)
This is not true either, remains of earlier draft...
Alan Griffiths (alan-griffiths) wrote : | # |
So, I think you're saying: "Needs Discussion"
~~~~
we moved work from the client thread to the compositor thread. That is, compositor_
1. May block if a snapshot is taking place.
2. Calls the externally provided completion function which may take some time to finish.
3. The provided completion function is called under lock, which needs care.
The compositor's interaction with the SwitchingBundle is supposed to be super-fast and non-blocking, and the changes break these assumptions.
I think we need to re-evaluate our approach, in light of the points above (I am not saying they are a problem necessarily, but we certainly need to investigate/discuss further).
~~~~
It is worth mentioning that these assumptions will only fail when we start compositing a surface that we'd previously blocked. Not that it can't be considered an issue, but it isn't the principle path through the code.
Daniel van Vugt (vanvugt) wrote : | # |
It seems my original concerns [1] about thread safety are somewhat justified now, according to helgrind:
development-branch: 236 errors from 8 contexts
This branch: 1998 errors from 9 contexts
Tested with:
valgrind --tool=helgrind bin/mir_unit_tests --gtest_
I can't immediately see what the new context is, but the significant increase in errors warrants some attention.
Alan Griffiths (alan-griffiths) wrote : | # |
> It seems my original concerns [1] about thread safety are somewhat justified
> now, according to helgrind:
>
> development-branch: 236 errors from 8 contexts
> This branch: 1998 errors from 9 contexts
>
> Tested with:
> valgrind --tool=helgrind bin/mir_unit_tests
> --gtest_
>
> [1] https:/
> SwitchingBundle
>
> I can't immediately see what the new context is, but the significant increase
> in errors warrants some attention.
Giving it some attention.
Alan Griffiths (alan-griffiths) wrote : | # |
> It seems my original concerns [1] about thread safety are somewhat justified
> now, according to helgrind:
>
> development-branch: 236 errors from 8 contexts
> This branch: 1998 errors from 9 contexts
>
> Tested with:
> valgrind --tool=helgrind bin/mir_unit_tests
> --gtest_
>
> [1] https:/
> SwitchingBundle
>
> I can't immediately see what the new context is, but the significant increase
> in errors warrants some attention.
With the following suppression file we get a clean run. (Please check you're happy with these suppressions.)
AFAICS the only possibly contentious one is for client_
#######
# Part 1: Suppress spurious races in std library functions
{
std:
Helgrind:Race
fun:
}
{
std:
Helgrind:Race
fun:
}
{
std:
Helgrind:Race
fun:
obj:*
}
{
std:
Helgrind:Race
fun:
obj:*
}
{
std:
Helgrind:Race
obj:
fun:
}
{
std:
Helgrind:Race
fun:
obj:*
}
{
std:
Helgrind:Race
fun:
}
{
std:
Helgrind:Race
fun:
obj:*
}
{
std:
Helgrind:Race
fun:
}
{
std:
Helgrind:Race
fun:
}
{
std:
Helgrind:Race
fun:
}
#######
# Part 2: Suppress spurious races in std library template instantiations
{
std:
Helgrind:Race
fun:
}
{
std:
Helgrind:Race
fun:
}...
PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:1385
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
Click here to trigger a rebuild:
http://
PS Jenkins bot (ps-jenkins) wrote : | # |
PASSED: Continuous integration, rev:1386
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
Daniel van Vugt (vanvugt) wrote : | # |
I find it hard to believe the C++ library is really that bad. I know if using pure pthread calls then helgrind is always right and you always need to fix your code. Although I can see helgrind is wrong about std::atomic, so it could be wrong elsewhere too. Not sure.
Daniel van Vugt (vanvugt) wrote : | # |
OK, ignoring helgrind, I don't see any reason why this proposal would be wrong.
I'm nervous about major changes to something that took several months to perfect, just as we were nervous to retire the code it replaced. But I don't have a logical argument against these changes right now.
Andreas Pokorny (andreas-pokorny) : | # |
Alexandros Frantzis (afrantzis) wrote : | # |
I am OK with the implementation details, but I am still concerned about introducing delays in the compositing threads. Approving, but we should keep our eyes open for unwanted effects.
Alan Griffiths (alan-griffiths) wrote : | # |
> I find it hard to believe the C++ library is really that bad. I know if using
> pure pthread calls then helgrind is always right and you always need to fix
> your code. Although I can see helgrind is wrong about std::atomic, so it could
> be wrong elsewhere too. Not sure.
The difficulty for helgrind is that there are a lot of built-in types that are inherently atomic on Intel. Which means that the generated opcodes are identical to using thew without the atomic wrapper. The same is true of atomic counters used in the threads and smart pointer implementations.
There's discussion on the valgrind, g++ and clang forums about this as it obviously makes checking use of c++11 atomics, threads, lambdas and smart pointers noisy with the potential to lose the signal in the noise.
I don't see a clear solution other than to suppress the false positives (and, I don't immediately see a way to handle generically for template instantiations).
PS Jenkins bot (ps-jenkins) : | # |
Preview Diff
1 | === modified file 'src/server/compositor/switching_bundle.cpp' |
2 | --- src/server/compositor/switching_bundle.cpp 2014-01-31 14:13:36 +0000 |
3 | +++ src/server/compositor/switching_bundle.cpp 2014-02-13 17:06:03 +0000 |
4 | @@ -180,6 +180,7 @@ |
5 | void mc::SwitchingBundle::client_acquire(std::function<void(graphics::Buffer* buffer)> complete) |
6 | { |
7 | std::unique_lock<std::mutex> lock(guard); |
8 | + client_acquire_todo = std::move(complete); |
9 | |
10 | if ((framedropping || force_drop) && nbuffers > 1) |
11 | { |
12 | @@ -210,10 +211,17 @@ |
13 | 1; |
14 | #endif |
15 | |
16 | - while (nfree() < min_free) |
17 | - cond.wait(lock); |
18 | + if (nfree() < min_free) |
19 | + return; |
20 | } |
21 | |
22 | + complete_client_acquire(std::move(lock)); |
23 | +} |
24 | + |
25 | +void mc::SwitchingBundle::complete_client_acquire(std::unique_lock<std::mutex> lock) |
26 | +{ |
27 | + auto complete = std::move(client_acquire_todo); |
28 | + |
29 | if (force_drop > 0) |
30 | force_drop--; |
31 | |
32 | @@ -247,7 +255,16 @@ |
33 | ring[client].buf = ret; |
34 | } |
35 | |
36 | - complete(ret.get()); |
37 | + lock.unlock(); |
38 | + |
39 | + try |
40 | + { |
41 | + complete(ret.get()); |
42 | + } |
43 | + catch (...) |
44 | + { |
45 | + // TODO comms errors should not propagate to compositing threads |
46 | + } |
47 | } |
48 | |
49 | void mc::SwitchingBundle::client_release(graphics::Buffer* released_buffer) |
50 | @@ -339,7 +356,8 @@ |
51 | first_compositor = next(first_compositor); |
52 | ncompositors--; |
53 | } |
54 | - cond.notify_all(); |
55 | + |
56 | + if (client_acquire_todo) complete_client_acquire(std::move(lock)); |
57 | } |
58 | } |
59 | |
60 | @@ -394,7 +412,7 @@ |
61 | std::unique_lock<std::mutex> lock(guard); |
62 | drop_frames(nready); |
63 | force_drop = nbuffers + 1; |
64 | - cond.notify_all(); |
65 | + if (client_acquire_todo) complete_client_acquire(std::move(lock)); |
66 | } |
67 | |
68 | void mc::SwitchingBundle::allow_framedropping(bool allow_dropping) |
69 | |
70 | === modified file 'src/server/compositor/switching_bundle.h' |
71 | --- src/server/compositor/switching_bundle.h 2014-02-11 00:59:44 +0000 |
72 | +++ src/server/compositor/switching_bundle.h 2014-02-13 17:06:03 +0000 |
73 | @@ -80,7 +80,7 @@ |
74 | int last_compositor() const; |
75 | |
76 | const std::shared_ptr<graphics::Buffer> &alloc_buffer(int slot); |
77 | - |
78 | + void complete_client_acquire(std::unique_lock<std::mutex> lock); |
79 | struct SharedBuffer |
80 | { |
81 | std::shared_ptr<graphics::Buffer> buf; |
82 | @@ -108,6 +108,8 @@ |
83 | bool framedropping; |
84 | int force_drop; |
85 | |
86 | + std::function<void(graphics::Buffer* buffer)> client_acquire_todo; |
87 | + |
88 | friend std::ostream& operator<<(std::ostream& os, const SwitchingBundle& bundle); |
89 | }; |
90 | |
91 | |
92 | === modified file 'src/server/frontend/session_mediator.cpp' |
93 | --- src/server/frontend/session_mediator.cpp 2014-02-10 10:56:56 +0000 |
94 | +++ src/server/frontend/session_mediator.cpp 2014-02-13 17:06:03 +0000 |
95 | @@ -82,6 +82,7 @@ |
96 | |
97 | mf::SessionMediator::~SessionMediator() noexcept |
98 | { |
99 | + std::unique_lock<std::mutex> lock(session_mutex); |
100 | auto session = weak_session.lock(); |
101 | if (session) |
102 | { |
FAILED: Continuous integration, rev:1373 jenkins. qa.ubuntu. com/job/ mir-team- mir-development -branch- ci/756/ jenkins. qa.ubuntu. com/job/ mir-android- trusty- i386-build/ 776 jenkins. qa.ubuntu. com/job/ mir-clang- trusty- amd64-build/ 773 jenkins. qa.ubuntu. com/job/ mir-mediumtests -trusty- touch/377 jenkins. qa.ubuntu. com/job/ mir-team- mir-development -branch- trusty- amd64-ci/ 486/console jenkins. qa.ubuntu. com/job/ mir-team- mir-development -branch- trusty- armhf-ci/ 490 jenkins. qa.ubuntu. com/job/ mir-team- mir-development -branch- trusty- armhf-ci/ 490/artifact/ work/output/ *zip*/output. zip jenkins. qa.ubuntu. com/job/ mir-mediumtests -builder- trusty- armhf/378 jenkins. qa.ubuntu. com/job/ mir-mediumtests -builder- trusty- armhf/378/ artifact/ work/output/ *zip*/output. zip jenkins. qa.ubuntu. com/job/ mir-mediumtests -runner- mako/381 s-jenkins. ubuntu- ci:8080/ job/touch- flash-device/ 3537
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild: s-jenkins. ubuntu- ci:8080/ job/mir- team-mir- development- branch- ci/756/ rebuild
http://