Mir

Merge lp:~afrantzis/mir/fix-1350207-unresponsive-clients into lp:mir

Proposed by Alexandros Frantzis
Status: Merged
Approved by: Daniel van Vugt
Approved revision: no longer in the source branch.
Merged at revision: 1920
Proposed branch: lp:~afrantzis/mir/fix-1350207-unresponsive-clients
Merge into: lp:mir
Diff against target: 293 lines (+158/-15)
9 files modified
src/server/frontend/socket_messenger.cpp (+21/-5)
tests/acceptance-tests/CMakeLists.txt (+1/-0)
tests/acceptance-tests/test_unresponsive_client.cpp (+122/-0)
tests/include/mir_test/cross_process_action.h (+2/-1)
tests/include/mir_test_framework/display_server_test_fixture.h (+1/-1)
tests/include/mir_test_framework/testing_process_manager.h (+1/-1)
tests/mir_test/cross_process_action.cpp (+2/-2)
tests/mir_test_framework/display_server_test_fixture.cpp (+2/-2)
tests/mir_test_framework/testing_process_manager.cpp (+6/-3)
To merge this branch: bzr merge lp:~afrantzis/mir/fix-1350207-unresponsive-clients
Reviewer Review Type Date Requested Status
Daniel van Vugt Approve
Alberto Aguirre (community) Approve
Kevin DuBois (community) Approve
Chris Halse Rogers Approve
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+233934@code.launchpad.net

Commit message

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

Description of the change

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

This MP makes the sockets used for communicating with clients non-blocking. When the server tries to write to a socket with a full send buffer it will fail instead of blocking, resulting in either dropping the event (when sending events), or disconnecting the client (when responding to a client rpc request).

This MP also increases the socket send buffer to 64KiB (default is 16KiB) to give clients a little more breathing room for transient freezes. We can increase it even more if needed (max is 4MiB).

Note that this is only a workaround, not a final fix, which would involve making the data writes asynchronous, essentially providing some more buffering on the server side (instead of the kernel). Unfortunately, this is not easily achievable with asio in this particular scenario because:

1. Asio doesn't support sending ancillary data (e.g. file descriptors) asynchronously and in coordination with normal data.

2. There is no guaranteed ordering and atomicity of asio asynchronous writes, and messages may end up being interleaved without additional care.

More discussion is needed in this area:
* what are the criteria for dropping a client (should we decide or the shell or both?)
* application-not-responding notification to shell
* if we drop messages to clients does it make sense for the clients to continue in a possibly inconsistent state (it depends on the kind of dropped messages)

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

I haven't looked at the code in detail, but it sounds like to actually fix the problem, we need to spawn a comms thread and not block the main thread.

Revision history for this message
Chris Halse Rogers (raof) wrote :

No, I don't think a comms thread helps particularly; that just changes the problem to “blocked clients cause unbounded memory consumption”.

I think that properly handling this would be:
a) Mark the client as unresponsive in a way that's visible to shells
b) Register a callback for when the socket becomes writable
c) Drop any additional writes to the socket
d) When the socket becomes writable, send a full state-update message so that we know the client and server agree on what the current state is.

For now bumping the buffer size and hoping that the client didn't really need the messages we dropped seems like a non-terrible improvement to the status quo. Except...

Doesn't your new while (nread != ba::buffer_size) loop replace a wait in the kernel with a busy-wait in the server, or does transfer_exactly() not work on a non-blocking socket?

You'll certainly busy-wait until there's something to read, which seems bad.

Does anyone else find ‘ba::mutable_buffers_1 const&’ hilarious? :)

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Doesn't your new while (nread != ba::buffer_size) loop replace a wait in the kernel
> with a busy-wait in the server, or does transfer_exactly() not work on a
> non-blocking socket?
>
> You'll certainly busy-wait until there's something to read, which seems bad.

This is indeed a busy-wait in the server, but one that doesn't currently affect us because
the receive_msg() method is only used in this context:

if (message_receiver->available_bytes() >= body_size)
{
    on_new_message(message_receiver->receive_msg(ba::buffer(body)));
}

essentially guaranteeing that we never busy-loop. I introduced the loop to ensure we retain the blocking semantics this method had previously. It's too bad that we can't set (non-)blocking mode for reads and writes separately.

Revision history for this message
Chris Halse Rogers (raof) wrote :

Yeah, that seems reasonable. I'm a bit surprised that boost::transfer_exactly() doesn't do the right thing, but eh.

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Tested. Bug fixed!

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Whoops. Now that I look more closely:

(1) This looks too much like an infinite loop waiting to happen. I think "!=" needs to change to "<" for safety:
  while (nread != ba::buffer_size(buffer))
Also you could incorporate the if statement by changing to a "do .. while".

(2) This worries me too:
  "* if we drop messages to clients does it make sense for the clients to continue in a possibly inconsistent state (it depends on the kind of dropped messages)"
If a busy client gets blocked doing other things just a bit too long, there's the real risk it won't be able to recover any more (rendering will freeze and never resume).

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) This looks too much like an infinite loop waiting to happen.
> I think "!=" needs to change to "<" for safety: while (nread != ba::buffer_size(buffer))

nread can't ever become larger than ba::buffer_size(buffer), since we are descreasing the buffer size we are passing to ba::read() (ba::mutable_buffers_1{buffer + nread}).

> (2) This worries me too:
> "* if we drop messages to clients does it make sense for the clients to continue in a possibly
> inconsistent state (it depends on the kind of dropped messages)"
> If a busy client gets blocked doing other things just a bit too long, there's the real risk
> it won't be able to recover any more (rendering will freeze and never resume).

As things stand, if a client is blocked and the socket becomes full, then:

1. if the server tries to send an event and it can't the event will be dropped
2. if the server tries to reply to an RPC call (e.g. next_buffer) and it can't, it will disconnect the client

So, at least we won't get a frozen client. Of course this is far from ideal, but it's a trade-off we need to decide if we want to make: we either run the risk of dropping events/disconnecting clients if they become busy for too long, or we run the risk of a blocked client hanging the whole server. Note that we can further increase the socket send buffer to allow more breathing room for transiently frozen clients if we find it necessary.

The idea behind this MP is to provide a reasonable workaround for the issue without changing too much code and risking destabilizing the frontend for rtm. I am open to other ideas.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, weakly "Approved". I recommend fixing still...

(1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.

(2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

Revision history for this message
Kevin DuBois (kdub) wrote :

Well, this looks okay from a code perspective. It seems that we're trading the 'blocked clients can hang server' bug for 'legitimately stopped clients will get forcibly disconnected' bug. I suppose its the lesser of two evils though.

review: Approve
Revision history for this message
Kevin DuBois (kdub) wrote :

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a
> new bug for the dropped response issue immediately?
Sounds appropriate to do

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

30 + while (nread != ba::buffer_size(buffer))

I would prefer a < as well...but not blocking.

+1 on submitting a bug for (2) and also for the TODO.

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.
> I would prefer a < as well...but not blocking.

Fixed.

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

I will open a new bug as soon as this gets merged.

Revision history for this message
Daniel van Vugt (vanvugt) :
review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I will open a new bug as soon as this gets merged.

https://bugs.launchpad.net/mir/+bug/1370117

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'src/server/frontend/socket_messenger.cpp'
2--- src/server/frontend/socket_messenger.cpp 2014-09-10 12:50:53 +0000
3+++ src/server/frontend/socket_messenger.cpp 2014-09-16 08:38:35 +0000
4@@ -36,6 +36,14 @@
5 mfd::SocketMessenger::SocketMessenger(std::shared_ptr<ba::local::stream_protocol::socket> const& socket)
6 : socket(socket)
7 {
8+ // Make the socket non-blocking to avoid hanging the server when a client
9+ // is unresponsive. Also increase the send buffer size to 64KiB to allow
10+ // more leeway for transient client freezes.
11+ // See https://bugs.launchpad.net/mir/+bug/1350207
12+ // TODO: Rework the messenger to support asynchronous sends
13+ socket->non_blocking(true);
14+ boost::asio::socket_base::send_buffer_size option(64*1024);
15+ socket->set_option(option);
16 }
17
18 mf::SessionCredentials mfd::SocketMessenger::creator_creds() const
19@@ -143,11 +151,19 @@
20 ba::mutable_buffers_1 const& buffer)
21 {
22 bs::error_code e;
23- boost::asio::read(
24- *socket,
25- buffer,
26- boost::asio::transfer_exactly(ba::buffer_size(buffer)),
27- e);
28+ size_t nread = 0;
29+
30+ while (nread < ba::buffer_size(buffer))
31+ {
32+ nread += boost::asio::read(
33+ *socket,
34+ ba::mutable_buffers_1{buffer + nread},
35+ e);
36+
37+ if (e && e != ba::error::would_block)
38+ break;
39+ }
40+
41 return e;
42 }
43
44
45=== modified file 'tests/acceptance-tests/CMakeLists.txt'
46--- tests/acceptance-tests/CMakeLists.txt 2014-09-10 12:50:53 +0000
47+++ tests/acceptance-tests/CMakeLists.txt 2014-09-16 08:38:35 +0000
48@@ -46,6 +46,7 @@
49 test_server_without_active_outputs.cpp
50 test_client_input.cpp
51 test_server_startup.cpp
52+ test_unresponsive_client.cpp
53 ${GENERATED_PROTOBUF_SRCS}
54 ${GENERATED_PROTOBUF_HDRS}
55 )
56
57=== added file 'tests/acceptance-tests/test_unresponsive_client.cpp'
58--- tests/acceptance-tests/test_unresponsive_client.cpp 1970-01-01 00:00:00 +0000
59+++ tests/acceptance-tests/test_unresponsive_client.cpp 2014-09-16 08:38:35 +0000
60@@ -0,0 +1,122 @@
61+/*
62+ * Copyright © 2014 Canonical Ltd.
63+ *
64+ * This program is free software: you can redistribute it and/or modify
65+ * it under the terms of the GNU General Public License version 3 as
66+ * published by the Free Software Foundation.
67+ *
68+ * This program is distributed in the hope that it will be useful,
69+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
70+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
71+ * GNU General Public License for more details.
72+ *
73+ * You should have received a copy of the GNU General Public License
74+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
75+ *
76+ * Authored by: Alexandros Frantzis <alexandros.frantzis@canonical.com>
77+ */
78+
79+#include "src/server/scene/session_container.h"
80+#include "mir/scene/session.h"
81+#include "mir/scene/surface.h"
82+
83+#include "mir_test/cross_process_action.h"
84+#include "mir_test_framework/display_server_test_fixture.h"
85+
86+#include "mir_toolkit/mir_client_library.h"
87+
88+#include <gtest/gtest.h>
89+
90+#include <string>
91+#include <thread>
92+
93+namespace mt = mir::test;
94+namespace mtf = mir_test_framework;
95+
96+namespace
97+{
98+char const* const mir_test_socket = mtf::test_socket_file().c_str();
99+}
100+using UnresponsiveClient = mtf::BespokeDisplayServerTestFixture;
101+
102+TEST_F(UnresponsiveClient, does_not_hang_server)
103+{
104+ struct ServerConfig : TestingServerConfiguration
105+ {
106+ void on_start() override
107+ {
108+ std::thread(
109+ [this]
110+ {
111+ send_events.exec(
112+ [this]
113+ {
114+ auto const sessions = the_session_container();
115+
116+ for (int i = 0; i < 1000; ++i)
117+ {
118+ sessions->for_each(
119+ [i] (std::shared_ptr<mir::scene::Session> const& session)
120+ {
121+ session->default_surface()->resize({i + 1, i + 1});
122+ });
123+ }
124+ });
125+ }).detach();
126+ }
127+ mt::CrossProcessAction send_events;
128+ } server_config;
129+
130+ launch_server_process(server_config);
131+
132+ struct ClientConfig : TestingClientConfiguration
133+ {
134+ void exec()
135+ {
136+ MirConnection* connection = nullptr;
137+ MirSurface* surface = nullptr;
138+
139+ connect.exec(
140+ [&]
141+ {
142+ connection = mir_connect_sync(mir_test_socket, __PRETTY_FUNCTION__);
143+
144+ MirSurfaceParameters const request_params =
145+ {
146+ __PRETTY_FUNCTION__,
147+ 100, 100,
148+ mir_pixel_format_abgr_8888,
149+ mir_buffer_usage_hardware,
150+ mir_display_output_id_invalid
151+ };
152+ surface = mir_connection_create_surface_sync(connection, &request_params);
153+ });
154+
155+ release.exec(
156+ [&]
157+ {
158+ // We would normally explicitly release the surface at this
159+ // point. However, because we have been filling the server
160+ // send socket buffer, releasing the surface may cause the
161+ // server to try to write to a full socket buffer when
162+ // responding, leading the server to believe that the client
163+ // is blocked, and causing a premature client disconnection.
164+ mir_connection_release(connection);
165+ });
166+ }
167+
168+ mt::CrossProcessAction connect;
169+ mt::CrossProcessAction release;
170+ } client_config;
171+
172+ auto const client_pid = launch_client_process(client_config);
173+
174+ run_in_test_process([&]
175+ {
176+ client_config.connect();
177+ kill(client_pid, SIGSTOP);
178+ server_config.send_events(std::chrono::seconds{10});
179+ kill(client_pid, SIGCONT);
180+ client_config.release();
181+ });
182+}
183
184=== modified file 'tests/include/mir_test/cross_process_action.h'
185--- tests/include/mir_test/cross_process_action.h 2013-08-20 13:39:31 +0000
186+++ tests/include/mir_test/cross_process_action.h 2014-09-16 08:38:35 +0000
187@@ -23,6 +23,7 @@
188 #include "mir_test_framework/cross_process_sync.h"
189
190 #include <functional>
191+#include <chrono>
192
193 namespace mir
194 {
195@@ -33,7 +34,7 @@
196 {
197 public:
198 void exec(std::function<void()> const& f);
199- void operator()();
200+ void operator()(std::chrono::milliseconds timeout = std::chrono::milliseconds{-1});
201
202 private:
203 mir_test_framework::CrossProcessSync start_sync;
204
205=== modified file 'tests/include/mir_test_framework/display_server_test_fixture.h'
206--- tests/include/mir_test_framework/display_server_test_fixture.h 2014-09-10 12:50:53 +0000
207+++ tests/include/mir_test_framework/display_server_test_fixture.h 2014-09-16 08:38:35 +0000
208@@ -62,7 +62,7 @@
209
210 void launch_server_process(TestingServerConfiguration& config);
211
212- void launch_client_process(TestingClientConfiguration& config);
213+ pid_t launch_client_process(TestingClientConfiguration& config);
214
215 bool shutdown_server_process();
216 Result wait_for_shutdown_server_process();
217
218=== modified file 'tests/include/mir_test_framework/testing_process_manager.h'
219--- tests/include/mir_test_framework/testing_process_manager.h 2014-09-10 12:50:53 +0000
220+++ tests/include/mir_test_framework/testing_process_manager.h 2014-09-16 08:38:35 +0000
221@@ -51,7 +51,7 @@
222 ~TestingProcessManager();
223
224 void launch_server_process(TestingServerConfiguration& config);
225- void launch_client_process(TestingClientConfiguration& config,
226+ pid_t launch_client_process(TestingClientConfiguration& config,
227 mir::options::Option const& test_options);
228
229 void tear_down_clients();
230
231=== modified file 'tests/mir_test/cross_process_action.cpp'
232--- tests/mir_test/cross_process_action.cpp 2013-08-20 13:39:31 +0000
233+++ tests/mir_test/cross_process_action.cpp 2014-09-16 08:38:35 +0000
234@@ -27,8 +27,8 @@
235 finish_sync.signal_ready();
236 }
237
238-void mt::CrossProcessAction::operator()()
239+void mt::CrossProcessAction::operator()(std::chrono::milliseconds timeout)
240 {
241 start_sync.signal_ready();
242- finish_sync.wait_for_signal_ready();
243+ finish_sync.wait_for_signal_ready_for(timeout);
244 }
245
246=== modified file 'tests/mir_test_framework/display_server_test_fixture.cpp'
247--- tests/mir_test_framework/display_server_test_fixture.cpp 2014-09-10 12:50:53 +0000
248+++ tests/mir_test_framework/display_server_test_fixture.cpp 2014-09-16 08:38:35 +0000
249@@ -60,9 +60,9 @@
250 process_manager.launch_server_process(functor);
251 }
252
253-void BespokeDisplayServerTestFixture::launch_client_process(TestingClientConfiguration& config)
254+pid_t BespokeDisplayServerTestFixture::launch_client_process(TestingClientConfiguration& config)
255 {
256- process_manager.launch_client_process(config, *test_options);
257+ return process_manager.launch_client_process(config, *test_options);
258 }
259
260 bool BespokeDisplayServerTestFixture::shutdown_server_process()
261
262=== modified file 'tests/mir_test_framework/testing_process_manager.cpp'
263--- tests/mir_test_framework/testing_process_manager.cpp 2014-09-10 12:50:53 +0000
264+++ tests/mir_test_framework/testing_process_manager.cpp 2014-09-16 08:38:35 +0000
265@@ -73,15 +73,16 @@
266 }
267 }
268
269-void mtf::TestingProcessManager::launch_client_process(TestingClientConfiguration& config, mo::Option const& test_options)
270+pid_t mtf::TestingProcessManager::launch_client_process(TestingClientConfiguration& config, mo::Option const& test_options)
271 {
272 if (!is_test_process)
273 {
274- return; // We're not in the test process, so just return gracefully
275+ return 0; // We're not in the test process, so just return gracefully
276 }
277
278 // We're in the test process, so make sure we started a service
279- ASSERT_TRUE(server_process_was_started);
280+ if (!server_process_was_started)
281+ throw std::runtime_error("Trying to launch client process, but server process has not started");
282
283 pid_t pid = fork();
284
285@@ -121,6 +122,8 @@
286 {
287 clients.push_back(std::shared_ptr<Process>(new Process(pid)));
288 }
289+
290+ return pid;
291 }
292
293 void mtf::TestingProcessManager::tear_down_clients()

Subscribers

People subscribed via source and target branches