Merge lp:~raof/mir/fix-deadlock-in-glib-alarm into lp:mir
| Status: | Merged |
|---|---|
| Approved by: | Alberto Aguirre on 2015-08-11 |
| Approved revision: | 2808 |
| Merged at revision: | 2832 |
| Proposed branch: | lp:~raof/mir/fix-deadlock-in-glib-alarm |
| Merge into: | lp:mir |
| Diff against target: |
406 lines (+139/-38) 9 files modified
examples/server_example.cpp (+6/-3) examples/server_example_test_client.cpp (+15/-14) examples/server_example_test_client.h (+12/-2) src/include/server/mir/glib_main_loop_sources.h (+4/-2) src/server/glib_main_loop.cpp (+6/-0) src/server/glib_main_loop_sources.cpp (+19/-8) tests/mir_test_framework/stub_input_platform.cpp (+9/-7) tests/mir_test_framework/stub_input_platform.h (+2/-2) tests/unit-tests/test_glib_main_loop.cpp (+66/-0) |
| To merge this branch: | bzr merge lp:~raof/mir/fix-deadlock-in-glib-alarm |
| Related bugs: |
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Alberto Aguirre | Approve on 2015-08-11 | ||
| Alan Griffiths | Approve on 2015-08-11 | ||
| PS Jenkins bot | continuous-integration | Approve on 2015-08-11 | |
| Kevin DuBois (community) | 2015-08-03 | Approve on 2015-08-03 | |
|
Review via email:
|
|||
Commit Message
Fix a deadlock in GLibMainLoop's Alarm implementation.
There's currently a deadlock in the following case:
Thread 1: enters an alarm callback. (And so has taken TimerContext:
Thread 2: calls alarm->
Thread 1: In the alarm callback, calls alarm->
The fundamental problem here is that we use ~GSourceHandle to guarantee that we've left all callbacks. This is needed for the guarantees provided by ~Alarm (and is nice for ::cancel), but is unnecessary for the reschedule_* calls - barring external synchronisation with the MainLoop, calling reschedule_* when there's an alarm already pending is inherently racing against real-time.
So, split the ensure-
Description of the Change
Oh god.
Fix a deadlock in GLibMainLoop's Alarm implementation.
There's currently a deadlock in the following case:
Thread 1: enters an alarm callback. (And so has taken TimerContext:
Thread 2: calls alarm->
Thread 1: In the alarm callback, calls alarm->
The fundamental problem here is that we use ~GSourceHandle to guarantee that we've left all callbacks. This is needed for the guarantees provided by ~Alarm (and is nice for ::cancel), but is unnecessary for the reschedule_* calls - barring external synchronisation with the MainLoop, calling reschedule_* when there's an alarm already pending is inherently racing against real-time.
So, split the ensure-
For bonus points allows us to switch from a std::recursive_
- 2800. By Chris Halse Rogers on 2015-08-04
-
Fix alarm-lifecycle FIXME in example server.
Don't have Alarms that outlive the server.
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2800
http://
Executed test runs:
FAILURE: http://
FAILURE: http://
FAILURE: http://
FAILURE: http://
FAILURE: http://
FAILURE: http://
Click here to trigger a rebuild:
http://
- 2801. By Chris Halse Rogers on 2015-08-04
-
Merge trunk, fixing GSourceHandle conflict
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2801
http://
Executed test runs:
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
FAILURE: http://
FAILURE: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2802
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
- 2802. By Chris Halse Rogers on 2015-08-05
-
Fix race in Alarm::cancel test case
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2802
http://
Executed test runs:
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
- 2803. By Chris Halse Rogers on 2015-08-05
-
Guard access to stub platform pointers.
One more data race down. How many to go?
- 2804. By Chris Halse Rogers on 2015-08-05
-
Simplify GLibMainLoopAla
rmTest: :cancel_ blocks_ until_definitel y_cancelled At the cost of being slightly less accurate. It should also eliminate false-positives.
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2804
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
| Chris Halse Rogers (raof) wrote : | # |
Ok. Failure is:
+ phablet-test-run -x -s 00693fd555c9186a -v mir_demo_server --test-client /usr/bin/
running mir_demo_server --test-client /usr/bin/
…
Surface 0 DPI
Signal 15 received. Good night.
[1438765839.346519] mirserver: Stopping
/bin/bash: line 1: 11723 Segmentation fault (core dumped) mir_demo_server --test-client /usr/bin/
And is the same for all the mir_demo_server --test-client... runs.
Unfortunately, I can't reproduce this failure crash, either on my development system or on a cross-built arale.
| Chris Halse Rogers (raof) wrote : | # |
Maybe it's cosmic rays?
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2804
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
| Alberto Aguirre (albaguirre) wrote : | # |
I can replicate the error with a cross-compile build using -Duse_debflags=ON and
"bin/mir_
Probably related to the static alarm objects used in server_
| Alberto Aguirre (albaguirre) wrote : | # |
Actually even without -Duse_debflags=ON the server still crashes at the end.
| Alberto Aguirre (albaguirre) wrote : | # |
And actually just exiting bin/mir_demo_server results in a crash too.
| Chris Halse Rogers (raof) wrote : | # |
Ok, I'm pretty sure this only happens on mako(!). I've tried, with a debug build and use_debflags=ON, on desktop and arale (rc-proposed, wily), and on none of those platforms does mir_demo_server crash on shutdown.
Time to see if my mako has awoken from its dead-battery fugue.
| Alberto Aguirre (albaguirre) wrote : | # |
This is happening on arale too (image #80 from ubuntu-
Stack trace:
http://
Seems like memory corruption.
| Alberto Aguirre (albaguirre) wrote : | # |
Ahh this seems to be the same TLS issue I encountered in lp:1280086.
The relevant bit from screencast.cpp:
"/* On devices with android based openGL drivers, the vendor dispatcher table
* may be optimized if USE_FAST_TLS_KEY is set, which hardcodes a TLS slot where
* the vendor opengl function pointers live. Since glibc is not aware of this
* use, collisions may happen.
"
- 2805. By Chris Halse Rogers on 2015-08-10
-
Remove static std::mutexes. Looks like this runs into TLS memory corruption issues on hybris
- 2806. By Chris Halse Rogers on 2015-08-10
-
Resolve StubInputPlatform data race with an atomic.
| PS Jenkins bot (ps-jenkins) wrote : | # |
FAILED: Continuous integration, rev:2806
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
- 2807. By Chris Halse Rogers on 2015-08-10
-
Unfix alarm-lifecycle FIXME in example server.
Trying to actually tie the alarms' lifecycles to the server apparently hits
TLS-glibc-hybris- borkage resulting in memory corruption and segfault on exit,
and I'm disinclined to spend *more* time debugging why.Just leak the alarms, since we never needed to run their destructors in the
first place.
| PS Jenkins bot (ps-jenkins) wrote : | # |
PASSED: Continuous integration, rev:2807
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://
| Chris Halse Rogers (raof) wrote : | # |
WOOOOOOO! FINALLY!
I wonder what the triggering condition for std::future clashing with the Android TLS slot is? We use futures elsewhere in the code without issue...
| Chris Halse Rogers (raof) wrote : | # |
Oh! Of course it's the TLS mode. The binary can (and almost certainly will) use the local-exec TLS model. The Mir libraries can at most use initial-exec model, and will probably(?) be using the local-dynamic model, and will be getting a non-conflicting slot.
| Alan Griffiths (alan-griffiths) wrote : | # |
Admittedly the existing code using "static" to extend the lifetime of these alarms is wrong, but I can't help feeling leaking them is also inelegant. (I've not thought of a better solution yet.)
I'd also be inclined to make leaked_kill_action & leaked_exit_action function scope static to make it clear they are not used elsewhere.
| Chris Halse Rogers (raof) wrote : | # |
Well, I *could* resurrect the pass-to-main implementation that ran into TLS problems. We could either try forcing a different TLS model or do what screencast does and just hope adding a couple of unused TLS entries will move stdlib's TLS entries out of the conflicting slot...
-----Original Message-----
From: "Alan Griffiths" <email address hidden>
Sent: 10/08/2015 18:56
To: "Chris Halse Rogers" <email address hidden>
Subject: Re: [Merge] lp:~raof/mir/fix-deadlock-in-glib-alarm into lp:mir
Admittedly the existing code using "static" to extend the lifetime of these alarms is wrong, but I can't help feeling leaking them is also inelegant. (I've not thought of a better solution yet.)
I'd also be inclined to make leaked_kill_action & leaked_exit_action function scope static to make it clear they are not used elsewhere.
--
https:/
You are the owner of lp:~raof/mir/fix-deadlock-in-glib-alarm.
| Alberto Aguirre (albaguirre) wrote : | # |
You can actually leave them as static, I fixed what the old FIXME allured to (https:/
Or also why not just use a condition variable instead of a std::future? TLS won't be an issue then.
- 2808. By Chris Halse Rogers on 2015-08-11
-
Fix alarm-lifecycle FIXME in a simpler manner.
Looking at it fresh, we don't actually need any of the std::future machinery.
We can just pass things around by reference. So do so.
| PS Jenkins bot (ps-jenkins) wrote : | # |
PASSED: Continuous integration, rev:2808
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
SUCCESS: http://
Click here to trigger a rebuild:
http://

FAILED: Continuous integration, rev:2799 jenkins. qa.ubuntu. com/job/ mir-ci/ 4458/ jenkins. qa.ubuntu. com/job/ mir-android- vivid-i386- build/3379 s-jenkins. ubuntu- ci:8080/ job/mir- clang-ts- wily-amd64- build/233 jenkins. qa.ubuntu. com/job/ mir-clang- wily-amd64- build/905 jenkins. qa.ubuntu. com/job/ mir-mediumtests -vivid- touch/3327/ console jenkins. qa.ubuntu. com/job/ mir-wily- amd64-ci/ 607 jenkins. qa.ubuntu. com/job/ mir-wily- amd64-ci/ 607/artifact/ work/output/ *zip*/output. zip jenkins. qa.ubuntu. com/job/ mir-mediumtests -builder- vivid-armhf/ 3327 jenkins. qa.ubuntu. com/job/ mir-mediumtests -builder- vivid-armhf/ 3327/artifact/ work/output/ *zip*/output. zip jenkins. qa.ubuntu. com/job/ mir-mediumtests -runner- mako/6118/ console s-jenkins. ubuntu- ci:8080/ job/touch- flash-device/ 22272
http://
Executed test runs:
SUCCESS: http://
SUCCESS: http://
SUCCESS: http://
FAILURE: http://
SUCCESS: http://
deb: http://
SUCCESS: http://
deb: http://
FAILURE: http://
SUCCESS: http://
Click here to trigger a rebuild: s-jenkins. ubuntu- ci:8080/ job/mir- ci/4458/ rebuild
http://