Mir

Merge lp:~afrantzis/mir/fix-1189770 into lp:mir

Proposed by Alexandros Frantzis
Status: Merged
Approved by: Cemil Azizoglu
Approved revision: no longer in the source branch.
Merged at revision: 1605
Proposed branch: lp:~afrantzis/mir/fix-1189770
Merge into: lp:mir
Diff against target: 405 lines (+197/-38)
7 files modified
3rd_party/android-deps/std/Thread.h (+14/-2)
include/server/mir/run_mir.h (+2/-0)
include/test/mir_test/fake_event_hub.h (+3/-0)
src/server/compositor/multi_threaded_compositor.cpp (+11/-0)
src/server/run_mir.cpp (+22/-0)
tests/acceptance-tests/test_server_shutdown.cpp (+135/-36)
tests/mir_test_doubles/fake_event_hub.cpp (+10/-0)
To merge this branch: bzr merge lp:~afrantzis/mir/fix-1189770
Reviewer Review Type Date Requested Status
Cemil Azizoglu (community) Approve
Alan Griffiths Approve
PS Jenkins bot (community) continuous-integration Approve
Daniel van Vugt Needs Fixing
Alberto Aguirre (community) Approve
Kevin DuBois (community) Approve
Robert Carr (community) Approve
Review via email: mp+217652@code.launchpad.net

Commit message

server: Gracefully handle exceptions thrown from server threads

Description of the change

server: Gracefully handle exceptions thrown from server threads

This MP solves issues with incomplete server shutdown in case an exception is thrown from a server component thread. It achieves this by catching and storing exceptions that are still unhandled at thread boundaries, then terminating the server gracefully and finally rethrowing the unhandled exception from the mir::run_mir() level (because we still need to fail, e.g., for apport to catch the problem). This MP adds graceful termination for the input and compositor threads.

I am not completely happy with the indirect dependence of 3rd-party android Thread class on mir functionality, but I guess it will do for now.

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Robert Carr (robertcarr) wrote :

LGTM

review: Approve
Revision history for this message
Kevin DuBois (kdub) wrote :

okay, although it does seem like a future maintainer/improver would have to figure out the terminate_with_current_exception linking. "I guess it will do for now." is how I felt when searching for a different way too, so +1.

review: Approve
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

Perhaps we can explore a custom terminate_handler (std::set_terminate)? But that probably won't help much...

Otherwise...this....will...do for now.

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Unfortunately this is a regression on bug 1237332, and worse because core files are not longer produced at all.

Not regressing on bug 1237332 is more important than resolving bug 1189770 right now. The ability to get crash dumps from users is critical. Or else you can never fully see what's going wrong and never fix it.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Unfortunately this is a regression on bug 1237332, and worse because core files are not longer produced at all.

We are rethrowing the exception from mir::run_mir() so if that's not caught we will abort() and get a core.

The way I see it, it makes sense to rethrow the exception from mir::run_mir(), since we can gracefully terminate instead, and we should allow our users to do the same. The current situation where we abort() in such cases is just a side effect of an implementation detail (i.e. that we use threads).

I agree that the core file produced (if produced) in such situations is not useful since we are capturing state after teardown. We could use "gcore" to capture the state in mir::terminate_with_exception() and somehow pass that information to interested parties. One way would be to read the command or file pattern from /proc/sys/kernel/core_pattern, and send the gcore output to that. Unfortunately, there are some problems there: we may need root privileges to write or pipe the gcore output to the core_pattern target and also we can't find some core_pattern parameters from userspace (like %P, although this only seem to matter when running in a container).

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Unfortunately, there are some problems there: we may need root privileges to write or pipe the
> gcore output to the core_pattern target

Talked to pitti, and at least for the apport case this is fixable.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I think there are too many serious problems with rethrowing. You lose the original stack trace and therefore critically lose the memory contents and access to the variables leading to the exception. So you can't debug your cores and bugs don't get fixed. And too many users continue to indefinitely experience the same crashes as we experienced in the public testing of XMir.

Plus, all theory aside, when I manually tested this branch no core files came out of crashes. And I think it's naive to assume you can manually craft a core file (of yourself) as reliably and useful as the kernel will produce for you.

But the real solution could be very simple... If we carefully convert unrecoverable fatal exceptions into clean abort()'s with clean core and stack dumping (bug 1285084), then a change like this one could be fine. But we're not there yet.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

At the heart of the above disagreement is the idea that there can be "unrecoverable fatal exceptions" - that is just bad design. (Even Java's questionable "Throwable" hierarchy makes unrecoverable "Error"s something different to "Exception"s.)

I think the requirement for an orderly shutdown and "correct" behavior of the executable beats that of having a core from the throw site. But it is close and I'd very much like to have both.

At yesterday's "standup" we discussed introduced a "core_and_throw" option to some throw sites - but I don't see that as a prerequisite for fixing lp:1189770.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

134 +std::exception_ptr termination_exception;

AFAICS this can be accessed on multiple threads and therefore needs synchronization.

review: Needs Fixing
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Not being able to analyse crashes is an absolutely blocking issue because it prevents resolution of critical bugs. We must ensure we don't go backwards in this area.

"core and throw" is a naive idea. I've seen many attempts to do that on many platforms over the years but nothing ever beats letting the kernel produce clean, complete and accurate cores for you. Most attempts to dump cores of self or stack trace of self make accurate debugging more difficult. You should not do it.

This proposal could be OK only after fatal errors have been converted to abort/raise rather than exceptions.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> 134 +std::exception_ptr termination_exception;
>
> AFAICS this can be accessed on multiple threads and therefore needs
> synchronization.

Fixed, and also changed it so that only the first terminate_with_current_exception request is handled. The rationale is that we can get multiple requests for two reasons:

1. Code in two or more threads throwing at the same time (possibly because of the same underlying reason).
2. Code in one thread throwing, then during shutdown another thread throws.

In case (1) it doesn't matter which exception we catch (which comes first depends on timing anyway). In case (2) we want to catch the original exception more than we want to catch the shutdown exception.

Ideally we would want to record all, and that's possible by storing all exceptions and throwing an exception with all of the original exceptions nested, but that's not critical, so leaving it for the future, unless the team prefers to have it now.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

141 +std::exception_ptr termination_exception;
142 +std::mutex termination_exception_mutex;

termination_exception is accessed in run_mir() without locking the mutex.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> termination_exception is accessed in run_mir() without locking the mutex.

I was thinking that our internal threads would have either not started or finished in the two points we access termination_exception in run_mir(). I guess an external user could be using mir::terminate_with_exception(), although I can't think of a valid use case for it. In any case, fixed.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

LGTM

review: Approve
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

> I think the requirement for an orderly shutdown and "correct" behavior of the
> executable beats that of having a core from the throw site. But it is close
> and I'd very much like to have both.

+1 indeed!

IMHO for a user not being able to recover her display is several orders of magnitude worse than a frustrated development team having a hard time root-causing bugs due to lack of sane core dumps :-).

But let's not frustrate ourselves either. Even though programmer-assembled dumps may be difficult, we should put in that functionality. Soon after this one gets merged.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file '3rd_party/android-deps/std/Thread.h'
2--- 3rd_party/android-deps/std/Thread.h 2013-05-06 22:22:39 +0000
3+++ 3rd_party/android-deps/std/Thread.h 2014-05-02 15:53:48 +0000
4@@ -32,6 +32,11 @@
5 #include <pthread.h>
6 #endif
7
8+namespace mir
9+{
10+void terminate_with_current_exception();
11+}
12+
13 namespace mir_input
14 {
15 class Thread : virtual public RefBase
16@@ -56,8 +61,15 @@
17
18 thread = std::thread([this]() -> void
19 {
20- if (auto result = readyToRun()) status.store(result);
21- else while (!exitPending() && threadLoop());
22+ try
23+ {
24+ if (auto result = readyToRun()) status.store(result);
25+ else while (!exitPending() && threadLoop());
26+ }
27+ catch (...)
28+ {
29+ mir::terminate_with_current_exception();
30+ }
31 });
32
33 #ifdef HAVE_PTHREADS
34
35=== modified file 'include/server/mir/run_mir.h'
36--- include/server/mir/run_mir.h 2013-04-25 16:52:27 +0000
37+++ include/server/mir/run_mir.h 2014-05-02 15:53:48 +0000
38@@ -39,6 +39,8 @@
39 std::function<void(DisplayServer&)> init);
40
41 void report_exception(std::ostream& out);
42+
43+void terminate_with_current_exception();
44 }
45
46
47
48=== modified file 'include/test/mir_test/fake_event_hub.h'
49--- include/test/mir_test/fake_event_hub.h 2013-10-01 17:29:49 +0000
50+++ include/test/mir_test/fake_event_hub.h 2014-05-02 15:53:48 +0000
51@@ -142,6 +142,8 @@
52 FakeDevice* getDevice(int32_t deviceId);
53 size_t eventsQueueSize() const;
54
55+ void throw_exception_in_next_get_events();
56+
57 // list of RawEvents available for consumption via getEvents
58 std::mutex guard;
59 std::list<droidinput::RawEvent> events_available;
60@@ -155,6 +157,7 @@
61
62 private:
63 const KeyInfo* getKey(const FakeDevice* device, int32_t scanCode, int32_t usageCode) const;
64+ bool throw_in_get_events = false;
65
66 };
67 }
68
69=== modified file 'src/server/compositor/multi_threaded_compositor.cpp'
70--- src/server/compositor/multi_threaded_compositor.cpp 2014-04-30 04:46:52 +0000
71+++ src/server/compositor/multi_threaded_compositor.cpp 2014-05-02 15:53:48 +0000
72@@ -26,6 +26,7 @@
73 #include "mir/scene/legacy_scene_change_notification.h"
74 #include "mir/scene/surface_observer.h"
75 #include "mir/scene/surface.h"
76+#include "mir/run_mir.h"
77
78 #include <thread>
79 #include <condition_variable>
80@@ -161,6 +162,7 @@
81 }
82
83 void operator()() noexcept // noexcept is important! (LP: #1237332)
84+ try
85 {
86 /*
87 * Make the buffer the current rendering target, and release
88@@ -180,6 +182,10 @@
89
90 run_compositing_loop([&] { return display_buffer_compositor->composite();});
91 }
92+ catch (...)
93+ {
94+ mir::terminate_with_current_exception();
95+ }
96
97 private:
98 std::shared_ptr<mc::DisplayBufferCompositorFactory> const display_buffer_compositor_factory;
99@@ -196,6 +202,7 @@
100 }
101
102 void operator()() noexcept // noexcept is important! (LP: #1237332)
103+ try
104 {
105 run_compositing_loop(
106 [this]
107@@ -215,6 +222,10 @@
108 return false;
109 });
110 }
111+ catch (...)
112+ {
113+ mir::terminate_with_current_exception();
114+ }
115
116 private:
117 void wait_until_next_fake_vsync()
118
119=== modified file 'src/server/run_mir.cpp'
120--- src/server/run_mir.cpp 2013-10-21 08:17:25 +0000
121+++ src/server/run_mir.cpp 2014-05-02 15:53:48 +0000
122@@ -23,6 +23,8 @@
123 #include "mir/frontend/connector.h"
124 #include "mir/raii.h"
125
126+#include <exception>
127+#include <mutex>
128 #include <csignal>
129 #include <cstdlib>
130 #include <cassert>
131@@ -32,6 +34,8 @@
132 auto const intercepted = { SIGQUIT, SIGABRT, SIGFPE, SIGSEGV, SIGBUS };
133
134 std::weak_ptr<mir::frontend::Connector> weak_connector;
135+std::exception_ptr termination_exception;
136+std::mutex termination_exception_mutex;
137
138 extern "C" void delete_endpoint()
139 {
140@@ -58,6 +62,10 @@
141 void mir::run_mir(ServerConfiguration& config, std::function<void(DisplayServer&)> init)
142 {
143 DisplayServer* server_ptr{nullptr};
144+ {
145+ std::lock_guard<std::mutex> lock{termination_exception_mutex};
146+ termination_exception = nullptr;
147+ }
148 auto main_loop = config.the_main_loop();
149
150 main_loop->register_signal_handler(
151@@ -88,4 +96,18 @@
152
153 init(server);
154 server.run();
155+
156+ std::lock_guard<std::mutex> lock{termination_exception_mutex};
157+ if (termination_exception)
158+ std::rethrow_exception(termination_exception);
159+}
160+
161+void mir::terminate_with_current_exception()
162+{
163+ std::lock_guard<std::mutex> lock{termination_exception_mutex};
164+ if (!termination_exception)
165+ {
166+ termination_exception = std::current_exception();
167+ raise(SIGTERM);
168+ }
169 }
170
171=== modified file 'tests/acceptance-tests/test_server_shutdown.cpp'
172--- tests/acceptance-tests/test_server_shutdown.cpp 2014-03-06 06:05:17 +0000
173+++ tests/acceptance-tests/test_server_shutdown.cpp 2014-05-02 15:53:48 +0000
174@@ -19,10 +19,14 @@
175 #include "mir_toolkit/mir_client_library.h"
176 #include "mir/compositor/renderer.h"
177 #include "mir/compositor/renderer_factory.h"
178+#include "mir/compositor/display_buffer_compositor.h"
179+#include "mir/compositor/display_buffer_compositor_factory.h"
180 #include "mir/input/composite_event_filter.h"
181+#include "mir/run_mir.h"
182
183 #include "mir_test_framework/display_server_test_fixture.h"
184 #include "mir_test/fake_event_hub_input_configuration.h"
185+#include "mir_test/fake_event_hub.h"
186 #include "mir_test_framework/cross_process_sync.h"
187 #include "mir_test_doubles/stub_renderer.h"
188
189@@ -55,6 +59,25 @@
190 }
191 };
192
193+class ExceptionThrowingDisplayBufferCompositorFactory : public mc::DisplayBufferCompositorFactory
194+{
195+public:
196+ std::unique_ptr<mc::DisplayBufferCompositor>
197+ create_compositor_for(mg::DisplayBuffer&) override
198+ {
199+ struct ExceptionThrowingDisplayBufferCompositor : mc::DisplayBufferCompositor
200+ {
201+ bool composite() override
202+ {
203+ throw std::runtime_error("ExceptionThrowingDisplayBufferCompositor");
204+ }
205+ };
206+
207+ return std::unique_ptr<mc::DisplayBufferCompositor>(
208+ new ExceptionThrowingDisplayBufferCompositor{});
209+ }
210+};
211+
212 void null_surface_callback(MirSurface*, void*)
213 {
214 }
215@@ -94,6 +117,46 @@
216 std::string const flag_file;
217 };
218
219+struct FakeEventHubServerConfig : TestingServerConfiguration
220+{
221+ std::shared_ptr<mi::InputConfiguration> the_input_configuration() override
222+ {
223+ if (!input_configuration)
224+ {
225+ input_configuration =
226+ std::make_shared<mtd::FakeEventHubInputConfiguration>(
227+ the_composite_event_filter(),
228+ the_input_region(),
229+ std::shared_ptr<mi::CursorListener>(),
230+ the_input_report());
231+ }
232+
233+ return input_configuration;
234+ }
235+
236+ std::shared_ptr<mi::InputManager> the_input_manager() override
237+ {
238+ return DefaultServerConfiguration::the_input_manager();
239+ }
240+
241+ std::shared_ptr<mir::shell::InputTargeter> the_input_targeter() override
242+ {
243+ return DefaultServerConfiguration::the_input_targeter();
244+ }
245+ std::shared_ptr<mir::scene::InputRegistrar> the_input_registrar() override
246+ {
247+ return DefaultServerConfiguration::the_input_registrar();
248+ }
249+
250+ mia::FakeEventHub* the_fake_event_hub()
251+ {
252+ the_input_configuration();
253+ return input_configuration->the_fake_event_hub();
254+ }
255+
256+ std::shared_ptr<mtd::FakeEventHubInputConfiguration> input_configuration;
257+};
258+
259 }
260
261 using ServerShutdown = BespokeDisplayServerTestFixture;
262@@ -194,42 +257,7 @@
263 Flag resources_freed_success{"resources_free_success_7e9c69fc.tmp"};
264 Flag resources_freed_failure{"resources_free_failure_7e9c69fc.tmp"};
265
266- /* Use the real input manager, but with a fake event hub */
267- struct ServerConfig : TestingServerConfiguration
268- {
269- std::shared_ptr<mi::InputConfiguration> the_input_configuration() override
270- {
271- if (!input_configuration)
272- {
273- input_configuration =
274- std::make_shared<mtd::FakeEventHubInputConfiguration>(
275- the_composite_event_filter(),
276- the_input_region(),
277- std::shared_ptr<mi::CursorListener>(),
278- the_input_report());
279- }
280-
281- return input_configuration;
282- }
283-
284- std::shared_ptr<mi::InputManager> the_input_manager() override
285- {
286- return DefaultServerConfiguration::the_input_manager();
287- }
288-
289- std::shared_ptr<mir::shell::InputTargeter> the_input_targeter() override
290- {
291- return DefaultServerConfiguration::the_input_targeter();
292- }
293- std::shared_ptr<mir::scene::InputRegistrar> the_input_registrar() override
294- {
295- return DefaultServerConfiguration::the_input_registrar();
296- }
297-
298- std::shared_ptr<mi::InputConfiguration> input_configuration;
299- };
300-
301- auto server_config = std::make_shared<ServerConfig>();
302+ auto server_config = std::make_shared<FakeEventHubServerConfig>();
303 launch_server_process(*server_config);
304
305 struct ClientConfig : TestingClientConfiguration
306@@ -427,3 +455,74 @@
307 INSTANTIATE_TEST_CASE_P(ServerShutdown,
308 OnSignal,
309 ::testing::Values(SIGQUIT, SIGABRT, SIGFPE, SIGSEGV, SIGBUS));
310+
311+TEST(ServerShutdownWithThreadException,
312+ server_releases_resources_on_abnormal_input_thread_termination)
313+{
314+ auto server_config = std::make_shared<FakeEventHubServerConfig>();
315+ auto fake_event_hub = server_config->the_fake_event_hub();
316+
317+ std::thread server{
318+ [&]
319+ {
320+ EXPECT_THROW(
321+ mir::run_mir(*server_config, [](mir::DisplayServer&){}),
322+ std::runtime_error);
323+ }};
324+
325+ fake_event_hub->throw_exception_in_next_get_events();
326+ server.join();
327+
328+ std::weak_ptr<mir::graphics::Display> display = server_config->the_display();
329+ std::weak_ptr<mir::compositor::Compositor> compositor = server_config->the_compositor();
330+ std::weak_ptr<mir::frontend::Connector> connector = server_config->the_connector();
331+ std::weak_ptr<mir::input::InputManager> input_manager = server_config->the_input_manager();
332+
333+ server_config.reset();
334+
335+ EXPECT_EQ(0, display.use_count());
336+ EXPECT_EQ(0, compositor.use_count());
337+ EXPECT_EQ(0, connector.use_count());
338+ EXPECT_EQ(0, input_manager.use_count());
339+}
340+
341+TEST(ServerShutdownWithThreadException,
342+ server_releases_resources_on_abnormal_compositor_thread_termination)
343+{
344+ struct ServerConfig : TestingServerConfiguration
345+ {
346+ std::shared_ptr<mc::DisplayBufferCompositorFactory>
347+ the_display_buffer_compositor_factory() override
348+ {
349+ return display_buffer_compositor_factory(
350+ [this]()
351+ {
352+ return std::make_shared<ExceptionThrowingDisplayBufferCompositorFactory>();
353+ });
354+ }
355+ };
356+
357+ auto server_config = std::make_shared<ServerConfig>();
358+
359+ std::thread server{
360+ [&]
361+ {
362+ EXPECT_THROW(
363+ mir::run_mir(*server_config, [](mir::DisplayServer&){}),
364+ std::runtime_error);
365+ }};
366+
367+ server.join();
368+
369+ std::weak_ptr<mir::graphics::Display> display = server_config->the_display();
370+ std::weak_ptr<mir::compositor::Compositor> compositor = server_config->the_compositor();
371+ std::weak_ptr<mir::frontend::Connector> connector = server_config->the_connector();
372+ std::weak_ptr<mir::input::InputManager> input_manager = server_config->the_input_manager();
373+
374+ server_config.reset();
375+
376+ EXPECT_EQ(0, display.use_count());
377+ EXPECT_EQ(0, compositor.use_count());
378+ EXPECT_EQ(0, connector.use_count());
379+ EXPECT_EQ(0, input_manager.use_count());
380+}
381
382=== modified file 'tests/mir_test_doubles/fake_event_hub.cpp'
383--- tests/mir_test_doubles/fake_event_hub.cpp 2013-10-02 11:04:00 +0000
384+++ tests/mir_test_doubles/fake_event_hub.cpp 2014-05-02 15:53:48 +0000
385@@ -221,6 +221,10 @@
386 (void) timeoutMillis;
387 {
388 std::lock_guard<std::mutex> lg(guard);
389+
390+ if (throw_in_get_events)
391+ throw std::runtime_error("FakeEventHub::getEvents() exception");
392+
393 for (size_t i = 0; i < bufferSize && events_available.size() > 0; ++i)
394 {
395 buffer[i] = events_available.front();
396@@ -688,3 +692,9 @@
397 {
398 return events_available.size();
399 }
400+
401+void FakeEventHub::throw_exception_in_next_get_events()
402+{
403+ std::lock_guard<std::mutex> lg(guard);
404+ throw_in_get_events = true;
405+}

Subscribers

People subscribed via source and target branches