Mir

Merge lp:~afrantzis/mir/fix-1350207-unresponsive-clients into lp:mir

Proposed by Alexandros Frantzis
Status: Merged
Approved by: Daniel van Vugt
Approved revision: no longer in the source branch.
Merged at revision: 1920
Proposed branch: lp:~afrantzis/mir/fix-1350207-unresponsive-clients
Merge into: lp:mir
Diff against target: 293 lines (+158/-15)
9 files modified
src/server/frontend/socket_messenger.cpp (+21/-5)
tests/acceptance-tests/CMakeLists.txt (+1/-0)
tests/acceptance-tests/test_unresponsive_client.cpp (+122/-0)
tests/include/mir_test/cross_process_action.h (+2/-1)
tests/include/mir_test_framework/display_server_test_fixture.h (+1/-1)
tests/include/mir_test_framework/testing_process_manager.h (+1/-1)
tests/mir_test/cross_process_action.cpp (+2/-2)
tests/mir_test_framework/display_server_test_fixture.cpp (+2/-2)
tests/mir_test_framework/testing_process_manager.cpp (+6/-3)
To merge this branch: bzr merge lp:~afrantzis/mir/fix-1350207-unresponsive-clients
Reviewer Review Type Date Requested Status
Daniel van Vugt Approve
Alberto Aguirre (community) Approve
Kevin DuBois (community) Approve
Chris Halse Rogers Approve
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+233934@code.launchpad.net

Commit message

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

Description of the change

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

This MP makes the sockets used for communicating with clients non-blocking. When the server tries to write to a socket with a full send buffer it will fail instead of blocking, resulting in either dropping the event (when sending events), or disconnecting the client (when responding to a client rpc request).

This MP also increases the socket send buffer to 64KiB (default is 16KiB) to give clients a little more breathing room for transient freezes. We can increase it even more if needed (max is 4MiB).

Note that this is only a workaround, not a final fix, which would involve making the data writes asynchronous, essentially providing some more buffering on the server side (instead of the kernel). Unfortunately, this is not easily achievable with asio in this particular scenario because:

1. Asio doesn't support sending ancillary data (e.g. file descriptors) asynchronously and in coordination with normal data.

2. There is no guaranteed ordering and atomicity of asio asynchronous writes, and messages may end up being interleaved without additional care.

More discussion is needed in this area:
* what are the criteria for dropping a client (should we decide or the shell or both?)
* application-not-responding notification to shell
* if we drop messages to clients does it make sense for the clients to continue in a possibly inconsistent state (it depends on the kind of dropped messages)

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

I haven't looked at the code in detail, but it sounds like to actually fix the problem, we need to spawn a comms thread and not block the main thread.

Revision history for this message
Chris Halse Rogers (raof) wrote :

No, I don't think a comms thread helps particularly; that just changes the problem to “blocked clients cause unbounded memory consumption”.

I think that properly handling this would be:
a) Mark the client as unresponsive in a way that's visible to shells
b) Register a callback for when the socket becomes writable
c) Drop any additional writes to the socket
d) When the socket becomes writable, send a full state-update message so that we know the client and server agree on what the current state is.

For now bumping the buffer size and hoping that the client didn't really need the messages we dropped seems like a non-terrible improvement to the status quo. Except...

Doesn't your new while (nread != ba::buffer_size) loop replace a wait in the kernel with a busy-wait in the server, or does transfer_exactly() not work on a non-blocking socket?

You'll certainly busy-wait until there's something to read, which seems bad.

Does anyone else find ‘ba::mutable_buffers_1 const&’ hilarious? :)

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Doesn't your new while (nread != ba::buffer_size) loop replace a wait in the kernel
> with a busy-wait in the server, or does transfer_exactly() not work on a
> non-blocking socket?
>
> You'll certainly busy-wait until there's something to read, which seems bad.

This is indeed a busy-wait in the server, but one that doesn't currently affect us because
the receive_msg() method is only used in this context:

if (message_receiver->available_bytes() >= body_size)
{
    on_new_message(message_receiver->receive_msg(ba::buffer(body)));
}

essentially guaranteeing that we never busy-loop. I introduced the loop to ensure we retain the blocking semantics this method had previously. It's too bad that we can't set (non-)blocking mode for reads and writes separately.

Revision history for this message
Chris Halse Rogers (raof) wrote :

Yeah, that seems reasonable. I'm a bit surprised that boost::transfer_exactly() doesn't do the right thing, but eh.

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Tested. Bug fixed!

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Whoops. Now that I look more closely:

(1) This looks too much like an infinite loop waiting to happen. I think "!=" needs to change to "<" for safety:
  while (nread != ba::buffer_size(buffer))
Also you could incorporate the if statement by changing to a "do .. while".

(2) This worries me too:
  "* if we drop messages to clients does it make sense for the clients to continue in a possibly inconsistent state (it depends on the kind of dropped messages)"
If a busy client gets blocked doing other things just a bit too long, there's the real risk it won't be able to recover any more (rendering will freeze and never resume).

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) This looks too much like an infinite loop waiting to happen.
> I think "!=" needs to change to "<" for safety: while (nread != ba::buffer_size(buffer))

nread can't ever become larger than ba::buffer_size(buffer), since we are descreasing the buffer size we are passing to ba::read() (ba::mutable_buffers_1{buffer + nread}).

> (2) This worries me too:
> "* if we drop messages to clients does it make sense for the clients to continue in a possibly
> inconsistent state (it depends on the kind of dropped messages)"
> If a busy client gets blocked doing other things just a bit too long, there's the real risk
> it won't be able to recover any more (rendering will freeze and never resume).

As things stand, if a client is blocked and the socket becomes full, then:

1. if the server tries to send an event and it can't the event will be dropped
2. if the server tries to reply to an RPC call (e.g. next_buffer) and it can't, it will disconnect the client

So, at least we won't get a frozen client. Of course this is far from ideal, but it's a trade-off we need to decide if we want to make: we either run the risk of dropping events/disconnecting clients if they become busy for too long, or we run the risk of a blocked client hanging the whole server. Note that we can further increase the socket send buffer to allow more breathing room for transiently frozen clients if we find it necessary.

The idea behind this MP is to provide a reasonable workaround for the issue without changing too much code and risking destabilizing the frontend for rtm. I am open to other ideas.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, weakly "Approved". I recommend fixing still...

(1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.

(2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

Revision history for this message
Kevin DuBois (kdub) wrote :

Well, this looks okay from a code perspective. It seems that we're trading the 'blocked clients can hang server' bug for 'legitimately stopped clients will get forcibly disconnected' bug. I suppose its the lesser of two evils though.

review: Approve
Revision history for this message
Kevin DuBois (kdub) wrote :

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a
> new bug for the dropped response issue immediately?
Sounds appropriate to do

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

30 + while (nread != ba::buffer_size(buffer))

I would prefer a < as well...but not blocking.

+1 on submitting a bug for (2) and also for the TODO.

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.
> I would prefer a < as well...but not blocking.

Fixed.

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

I will open a new bug as soon as this gets merged.

Revision history for this message
Daniel van Vugt (vanvugt) :
review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I will open a new bug as soon as this gets merged.

https://bugs.launchpad.net/mir/+bug/1370117

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'src/server/frontend/socket_messenger.cpp'
--- src/server/frontend/socket_messenger.cpp 2014-09-10 12:50:53 +0000
+++ src/server/frontend/socket_messenger.cpp 2014-09-16 08:38:35 +0000
@@ -36,6 +36,14 @@
36mfd::SocketMessenger::SocketMessenger(std::shared_ptr<ba::local::stream_protocol::socket> const& socket)36mfd::SocketMessenger::SocketMessenger(std::shared_ptr<ba::local::stream_protocol::socket> const& socket)
37 : socket(socket)37 : socket(socket)
38{38{
39 // Make the socket non-blocking to avoid hanging the server when a client
40 // is unresponsive. Also increase the send buffer size to 64KiB to allow
41 // more leeway for transient client freezes.
42 // See https://bugs.launchpad.net/mir/+bug/1350207
43 // TODO: Rework the messenger to support asynchronous sends
44 socket->non_blocking(true);
45 boost::asio::socket_base::send_buffer_size option(64*1024);
46 socket->set_option(option);
39}47}
4048
41mf::SessionCredentials mfd::SocketMessenger::creator_creds() const49mf::SessionCredentials mfd::SocketMessenger::creator_creds() const
@@ -143,11 +151,19 @@
143 ba::mutable_buffers_1 const& buffer)151 ba::mutable_buffers_1 const& buffer)
144{152{
145 bs::error_code e;153 bs::error_code e;
146 boost::asio::read(154 size_t nread = 0;
147 *socket,155
148 buffer,156 while (nread < ba::buffer_size(buffer))
149 boost::asio::transfer_exactly(ba::buffer_size(buffer)),157 {
150 e);158 nread += boost::asio::read(
159 *socket,
160 ba::mutable_buffers_1{buffer + nread},
161 e);
162
163 if (e && e != ba::error::would_block)
164 break;
165 }
166
151 return e;167 return e;
152}168}
153169
154170
=== modified file 'tests/acceptance-tests/CMakeLists.txt'
--- tests/acceptance-tests/CMakeLists.txt 2014-09-10 12:50:53 +0000
+++ tests/acceptance-tests/CMakeLists.txt 2014-09-16 08:38:35 +0000
@@ -46,6 +46,7 @@
46 test_server_without_active_outputs.cpp46 test_server_without_active_outputs.cpp
47 test_client_input.cpp47 test_client_input.cpp
48 test_server_startup.cpp48 test_server_startup.cpp
49 test_unresponsive_client.cpp
49 ${GENERATED_PROTOBUF_SRCS}50 ${GENERATED_PROTOBUF_SRCS}
50 ${GENERATED_PROTOBUF_HDRS}51 ${GENERATED_PROTOBUF_HDRS}
51)52)
5253
=== added file 'tests/acceptance-tests/test_unresponsive_client.cpp'
--- tests/acceptance-tests/test_unresponsive_client.cpp 1970-01-01 00:00:00 +0000
+++ tests/acceptance-tests/test_unresponsive_client.cpp 2014-09-16 08:38:35 +0000
@@ -0,0 +1,122 @@
1/*
2 * Copyright © 2014 Canonical Ltd.
3 *
4 * This program is free software: you can redistribute it and/or modify
5 * it under the terms of the GNU General Public License version 3 as
6 * published by the Free Software Foundation.
7 *
8 * This program is distributed in the hope that it will be useful,
9 * but WITHOUT ANY WARRANTY; without even the implied warranty of
10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11 * GNU General Public License for more details.
12 *
13 * You should have received a copy of the GNU General Public License
14 * along with this program. If not, see <http://www.gnu.org/licenses/>.
15 *
16 * Authored by: Alexandros Frantzis <alexandros.frantzis@canonical.com>
17 */
18
19#include "src/server/scene/session_container.h"
20#include "mir/scene/session.h"
21#include "mir/scene/surface.h"
22
23#include "mir_test/cross_process_action.h"
24#include "mir_test_framework/display_server_test_fixture.h"
25
26#include "mir_toolkit/mir_client_library.h"
27
28#include <gtest/gtest.h>
29
30#include <string>
31#include <thread>
32
33namespace mt = mir::test;
34namespace mtf = mir_test_framework;
35
36namespace
37{
38char const* const mir_test_socket = mtf::test_socket_file().c_str();
39}
40using UnresponsiveClient = mtf::BespokeDisplayServerTestFixture;
41
42TEST_F(UnresponsiveClient, does_not_hang_server)
43{
44 struct ServerConfig : TestingServerConfiguration
45 {
46 void on_start() override
47 {
48 std::thread(
49 [this]
50 {
51 send_events.exec(
52 [this]
53 {
54 auto const sessions = the_session_container();
55
56 for (int i = 0; i < 1000; ++i)
57 {
58 sessions->for_each(
59 [i] (std::shared_ptr<mir::scene::Session> const& session)
60 {
61 session->default_surface()->resize({i + 1, i + 1});
62 });
63 }
64 });
65 }).detach();
66 }
67 mt::CrossProcessAction send_events;
68 } server_config;
69
70 launch_server_process(server_config);
71
72 struct ClientConfig : TestingClientConfiguration
73 {
74 void exec()
75 {
76 MirConnection* connection = nullptr;
77 MirSurface* surface = nullptr;
78
79 connect.exec(
80 [&]
81 {
82 connection = mir_connect_sync(mir_test_socket, __PRETTY_FUNCTION__);
83
84 MirSurfaceParameters const request_params =
85 {
86 __PRETTY_FUNCTION__,
87 100, 100,
88 mir_pixel_format_abgr_8888,
89 mir_buffer_usage_hardware,
90 mir_display_output_id_invalid
91 };
92 surface = mir_connection_create_surface_sync(connection, &request_params);
93 });
94
95 release.exec(
96 [&]
97 {
98 // We would normally explicitly release the surface at this
99 // point. However, because we have been filling the server
100 // send socket buffer, releasing the surface may cause the
101 // server to try to write to a full socket buffer when
102 // responding, leading the server to believe that the client
103 // is blocked, and causing a premature client disconnection.
104 mir_connection_release(connection);
105 });
106 }
107
108 mt::CrossProcessAction connect;
109 mt::CrossProcessAction release;
110 } client_config;
111
112 auto const client_pid = launch_client_process(client_config);
113
114 run_in_test_process([&]
115 {
116 client_config.connect();
117 kill(client_pid, SIGSTOP);
118 server_config.send_events(std::chrono::seconds{10});
119 kill(client_pid, SIGCONT);
120 client_config.release();
121 });
122}
0123
=== modified file 'tests/include/mir_test/cross_process_action.h'
--- tests/include/mir_test/cross_process_action.h 2013-08-20 13:39:31 +0000
+++ tests/include/mir_test/cross_process_action.h 2014-09-16 08:38:35 +0000
@@ -23,6 +23,7 @@
23#include "mir_test_framework/cross_process_sync.h"23#include "mir_test_framework/cross_process_sync.h"
2424
25#include <functional>25#include <functional>
26#include <chrono>
2627
27namespace mir28namespace mir
28{29{
@@ -33,7 +34,7 @@
33{34{
34public:35public:
35 void exec(std::function<void()> const& f);36 void exec(std::function<void()> const& f);
36 void operator()();37 void operator()(std::chrono::milliseconds timeout = std::chrono::milliseconds{-1});
3738
38private:39private:
39 mir_test_framework::CrossProcessSync start_sync;40 mir_test_framework::CrossProcessSync start_sync;
4041
=== modified file 'tests/include/mir_test_framework/display_server_test_fixture.h'
--- tests/include/mir_test_framework/display_server_test_fixture.h 2014-09-10 12:50:53 +0000
+++ tests/include/mir_test_framework/display_server_test_fixture.h 2014-09-16 08:38:35 +0000
@@ -62,7 +62,7 @@
6262
63 void launch_server_process(TestingServerConfiguration& config);63 void launch_server_process(TestingServerConfiguration& config);
6464
65 void launch_client_process(TestingClientConfiguration& config);65 pid_t launch_client_process(TestingClientConfiguration& config);
6666
67 bool shutdown_server_process();67 bool shutdown_server_process();
68 Result wait_for_shutdown_server_process();68 Result wait_for_shutdown_server_process();
6969
=== modified file 'tests/include/mir_test_framework/testing_process_manager.h'
--- tests/include/mir_test_framework/testing_process_manager.h 2014-09-10 12:50:53 +0000
+++ tests/include/mir_test_framework/testing_process_manager.h 2014-09-16 08:38:35 +0000
@@ -51,7 +51,7 @@
51 ~TestingProcessManager();51 ~TestingProcessManager();
5252
53 void launch_server_process(TestingServerConfiguration& config);53 void launch_server_process(TestingServerConfiguration& config);
54 void launch_client_process(TestingClientConfiguration& config,54 pid_t launch_client_process(TestingClientConfiguration& config,
55 mir::options::Option const& test_options);55 mir::options::Option const& test_options);
5656
57 void tear_down_clients();57 void tear_down_clients();
5858
=== modified file 'tests/mir_test/cross_process_action.cpp'
--- tests/mir_test/cross_process_action.cpp 2013-08-20 13:39:31 +0000
+++ tests/mir_test/cross_process_action.cpp 2014-09-16 08:38:35 +0000
@@ -27,8 +27,8 @@
27 finish_sync.signal_ready();27 finish_sync.signal_ready();
28}28}
2929
30void mt::CrossProcessAction::operator()()30void mt::CrossProcessAction::operator()(std::chrono::milliseconds timeout)
31{31{
32 start_sync.signal_ready();32 start_sync.signal_ready();
33 finish_sync.wait_for_signal_ready();33 finish_sync.wait_for_signal_ready_for(timeout);
34}34}
3535
=== modified file 'tests/mir_test_framework/display_server_test_fixture.cpp'
--- tests/mir_test_framework/display_server_test_fixture.cpp 2014-09-10 12:50:53 +0000
+++ tests/mir_test_framework/display_server_test_fixture.cpp 2014-09-16 08:38:35 +0000
@@ -60,9 +60,9 @@
60 process_manager.launch_server_process(functor);60 process_manager.launch_server_process(functor);
61}61}
6262
63void BespokeDisplayServerTestFixture::launch_client_process(TestingClientConfiguration& config)63pid_t BespokeDisplayServerTestFixture::launch_client_process(TestingClientConfiguration& config)
64{64{
65 process_manager.launch_client_process(config, *test_options);65 return process_manager.launch_client_process(config, *test_options);
66}66}
6767
68bool BespokeDisplayServerTestFixture::shutdown_server_process()68bool BespokeDisplayServerTestFixture::shutdown_server_process()
6969
=== modified file 'tests/mir_test_framework/testing_process_manager.cpp'
--- tests/mir_test_framework/testing_process_manager.cpp 2014-09-10 12:50:53 +0000
+++ tests/mir_test_framework/testing_process_manager.cpp 2014-09-16 08:38:35 +0000
@@ -73,15 +73,16 @@
73 }73 }
74}74}
7575
76void mtf::TestingProcessManager::launch_client_process(TestingClientConfiguration& config, mo::Option const& test_options)76pid_t mtf::TestingProcessManager::launch_client_process(TestingClientConfiguration& config, mo::Option const& test_options)
77{77{
78 if (!is_test_process)78 if (!is_test_process)
79 {79 {
80 return; // We're not in the test process, so just return gracefully80 return 0; // We're not in the test process, so just return gracefully
81 }81 }
8282
83 // We're in the test process, so make sure we started a service83 // We're in the test process, so make sure we started a service
84 ASSERT_TRUE(server_process_was_started);84 if (!server_process_was_started)
85 throw std::runtime_error("Trying to launch client process, but server process has not started");
8586
86 pid_t pid = fork();87 pid_t pid = fork();
8788
@@ -121,6 +122,8 @@
121 {122 {
122 clients.push_back(std::shared_ptr<Process>(new Process(pid)));123 clients.push_back(std::shared_ptr<Process>(new Process(pid)));
123 }124 }
125
126 return pid;
124}127}
125128
126void mtf::TestingProcessManager::tear_down_clients()129void mtf::TestingProcessManager::tear_down_clients()

Subscribers

People subscribed via source and target branches