Mir

Merge lp:~alan-griffiths/mir/fix-1526248 into lp:mir

Proposed by Alan Griffiths
Status: Merged
Approved by: Daniel van Vugt
Approved revision: 3200
Merged at revision: 3201
Proposed branch: lp:~alan-griffiths/mir/fix-1526248
Merge into: lp:mir
Diff against target: 158 lines (+24/-18)
5 files modified
src/client/connection_surface_map.h (+2/-2)
src/client/rpc/mir_basic_rpc_channel.cpp (+4/-2)
src/client/rpc/mir_basic_rpc_channel.h (+3/-1)
src/client/rpc/mir_protobuf_rpc_channel.cpp (+7/-3)
src/client/surface_map.cpp (+8/-10)
To merge this branch: bzr merge lp:~alan-griffiths/mir/fix-1526248
Reviewer                      Review Type              Date Requested   Status
Daniel van Vugt                                                         Abstain
Alberto Aguirre (community)                                             Approve
PS Jenkins bot (community)    continuous-integration                    Approve
Kevin DuBois (community)                                                Approve
Review via email: mp+280578@code.launchpad.net

Commit message

client: Fix racy access to surfaces and streams in ConnectionSurfaceMap and to message objects in MirProtobufRpcChannel

Description of the change

client: Fix racy access to surfaces and streams in ConnectionSurfaceMap and to message objects in MirProtobufRpcChannel
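
The MirProtobufRpcChannel part of the fix stops handing the pending response pointer back to the caller (who would then parse into it after the cache lock had been released) and instead runs the caller's parsing code while the lock is held. A minimal sketch of that pattern, simplified from the preview diff below; the plain int id and std::map are stand-ins for the real wire Result and container types:

#include <functional>
#include <map>
#include <mutex>

#include <google/protobuf/message_lite.h>

// Simplified sketch of the PendingCallCache pattern from the diff: rather than
// returning the stored response pointer (which another thread could invalidate
// once the lock is dropped), run the caller's populator under the lock.
class PendingCallCacheSketch
{
public:
    void populate_message_for_result(
        int id,  // stand-in for mir::protobuf::wire::Result::id()
        std::function<void(google::protobuf::MessageLite*)> const& populator)
    {
        std::unique_lock<std::mutex> lock(mutex);
        // Parsing and fd handling happen here, while the entry is guaranteed
        // to still be present in pending_calls.
        populator(pending_calls.at(id).response);
    }

private:
    struct PendingCall { google::protobuf::MessageLite* response; };

    std::mutex mutex;
    std::map<int, PendingCall> pending_calls;
};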

Revision history for this message
Kevin DuBois (kdub) wrote :

I have a more involved fix coming (probably today actually), where we switch to using shared ptrs.

It's a bit unintuitive, but since the resources in the map are raw pointers, they can actually be deleted before the map gets the lock and removes them. The branch I have switches to shared_ptrs, so the object can't be deleted while under SurfaceMap's lock.
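
A minimal sketch of that shared_ptr idea (hypothetical names, not necessarily how Kevin's branch does it): holding shared ownership in the map means a local reference taken under the lock keeps the object alive even if another thread erases the entry and the owner drops its own reference.

#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch: with shared_ptr values, an entry erased by another
// thread keeps the pointed-to object alive until the last reference
// (including the local copy taken under the lock) goes away.
template <typename Key, typename T>
class SharedPtrMap
{
public:
    void insert(Key key, std::shared_ptr<T> value)
    {
        std::lock_guard<std::mutex> lock(mutex);
        entries[key] = std::move(value);
    }

    void erase(Key key)
    {
        std::lock_guard<std::mutex> lock(mutex);
        entries.erase(key);
    }

    void with_entry_do(Key key, std::function<void(T&)> const& exec) const
    {
        std::shared_ptr<T> entry;
        {
            std::lock_guard<std::mutex> lock(mutex);
            auto const it = entries.find(key);
            if (it != entries.end())
                entry = it->second;
        }
        // Safe even if erase(key) runs now: the local shared_ptr keeps
        // the object alive for the duration of exec().
        if (entry)
            exec(*entry);
    }

private:
    std::mutex mutable mutex;
    std::unordered_map<Key, std::shared_ptr<T>> entries;
};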

Revision history for this message
Kevin DuBois (kdub) wrote :
lp:~alan-griffiths/mir/fix-1526248 updated
3200. By Alan Griffiths

Fix race in mclr::MirProtobufRpcChannel::on_data_available()

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I have a more involved fix coming (probably today actually), where we switch
> to using shared ptrs.

Do you want to block this version?

Is there an additional reason to switch to shared_ptr<>?

Revision history for this message
Kevin DuBois (kdub) wrote :

The additional reason was that the shutdown sequence is somewhat confused, and it's difficult to come up with proper shutdown/disconnect message handling with NBS (which has more rapid events being sent).

No reason to block this one though (especially if it has a second fix in it outside of SurfaceMap). LGTM.

review: Approve
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:3199
http://jenkins.qa.ubuntu.com/job/mir-ci/5849/
Executed test runs:
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-android-vivid-i386-build/5316
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-clang-vivid-amd64-build/4222
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-vivid-touch/5265
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-xenial-touch/159
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/177
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/177/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/177
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/177/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5265
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5265/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-touch/7785
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26067
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/157
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/157/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-xenial-touch/16
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26078


review: Approve (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:3200
http://jenkins.qa.ubuntu.com/job/mir-ci/5855/
Executed test runs:
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-android-vivid-i386-build/5322
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-clang-vivid-amd64-build/4228
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-vivid-touch/5271
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-xenial-touch/161
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/183
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/183/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/183
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/183/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5271
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5271/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-touch/7787
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26088
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/159
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/159/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-xenial-touch/18
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26092


review: Approve (continuous-integration)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I recall a long time ago that locking during exec() caused deadlocks there. How sure are you that's now fixed?

- lk.unlock();
         exec(surface);

review: Needs Information
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I recall a long time ago that locking during exec() caused deadlocks there.
> How sure are you that's now fixed?
>
> - lk.unlock();
> exec(surface);

I've left the tests of this area looping over lunch and have been running exploratory tests.

That isn't mathematical proof that there is no deadlock scenario, but the deadlock is hypothetical and this fixes a demonstrable problem:

Unlocking is clearly wrong, as it allows another thread to call erase() and then delete "surface" between unlock() and exec(). Holding a shared lock prevents that while still allowing non-mutating access.
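
Concretely, keeping the shared (reader) lock held across the callback looks roughly like this; a simplified sketch modelled on the with_surface_do() change in the diff below, with an int key and the error path omitted:

#include <functional>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

class MirSurface;

// Simplified sketch of the fixed lookup: the reader holds a shared lock for
// the whole callback, so erase() (which needs the exclusive lock) cannot
// remove the entry, and the surface cannot then be deleted by its owner,
// while exec() is still running.
class SurfaceMapSketch
{
public:
    void with_surface_do(int surface_id,
                         std::function<void(MirSurface*)> const& exec) const
    {
        std::shared_lock<std::shared_timed_mutex> lk(guard);
        auto const it = surfaces.find(surface_id);
        if (it != surfaces.end())
            exec(it->second);   // still under the shared (read) lock
    }

    void erase(int surface_id)
    {
        std::lock_guard<std::shared_timed_mutex> lk(guard);  // exclusive (write) lock
        surfaces.erase(surface_id);
    }

private:
    std::shared_timed_mutex mutable guard;
    std::unordered_map<int, MirSurface*> surfaces;
};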

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

There's a lock-order inversion report (with a TSan build), but it looks benign since the only time the locks in question are acquired in a different order is during surface creation.

During creation:

Callback MirSurface::created is invoked
-locks MirSurface::mutex
-calls buffer_stream_factory->make_producer_stream
--calls MirConnection::on_stream_created
---calls ConnectionSurfaceMap::insert
----write locks ConnectionSurfaceMap::guard

During normal operation:

MirProtobufRpcChannel::process_event_sequence
-calls ConnectionSurfaceMap::with_surface_do
--Read locks ConnectionSurfaceMap::guard
---Invokes callback which does MirSurface::handle_event
----locks MirSurface::mutex

http://paste.ubuntu.com/14054652/

I don't think the buffer stream creation needs to be done with MirSurface::mutex locked. We should take a closer look at the uses of MirSurface::mutex, as some don't seem necessary, but that's a separate issue outside the scope of this MP.
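
For reference, the reported inversion boils down to those two paths taking the same pair of locks in opposite orders; a minimal, hypothetical illustration (the globals stand in for MirSurface::mutex and ConnectionSurfaceMap::guard):

#include <mutex>
#include <shared_mutex>

std::mutex surface_mutex;           // stands in for MirSurface::mutex
std::shared_timed_mutex map_guard;  // stands in for ConnectionSurfaceMap::guard

// Creation path: surface lock first, then the map's write lock.
void creation_path()
{
    std::lock_guard<std::mutex> surface_lock(surface_mutex);
    std::lock_guard<std::shared_timed_mutex> map_lock(map_guard);
    // ... on_stream_created -> ConnectionSurfaceMap::insert ...
}

// Event-dispatch path: the map's read lock first, then the surface lock.
void event_path()
{
    std::shared_lock<std::shared_timed_mutex> map_lock(map_guard);
    std::lock_guard<std::mutex> surface_lock(surface_mutex);
    // ... with_surface_do callback -> MirSurface::handle_event ...
}

// A deadlock needs both paths to run concurrently for the same surface;
// since the creation path only runs while the surface is being set up,
// the report is treated as benign here.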

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) :
review: Abstain

Preview Diff

=== modified file 'src/client/connection_surface_map.h'
--- src/client/connection_surface_map.h 2015-07-16 07:03:19 +0000
+++ src/client/connection_surface_map.h 2015-12-15 15:42:58 +0000
@@ -21,8 +21,8 @@
 
 #include "surface_map.h"
 
+#include <shared_mutex>
 #include <unordered_map>
-#include <mutex>
 
 namespace mir
 {
@@ -49,7 +49,7 @@
     void erase(frontend::BufferStreamId surface_id);
 
 private:
-    std::mutex mutable guard;
+    std::shared_timed_mutex mutable guard;
     std::unordered_map<frontend::SurfaceId, MirSurface*> surfaces;
     std::unordered_map<frontend::BufferStreamId, StreamInfo> streams;
 };

=== modified file 'src/client/rpc/mir_basic_rpc_channel.cpp'
--- src/client/rpc/mir_basic_rpc_channel.cpp 2015-08-21 12:08:03 +0000
+++ src/client/rpc/mir_basic_rpc_channel.cpp 2015-12-15 15:42:58 +0000
@@ -66,10 +66,12 @@
     pending_calls[invoke.id()] = PendingCall(response, complete);
 }
 
-google::protobuf::MessageLite* mclrd::PendingCallCache::message_for_result(mir::protobuf::wire::Result& result)
+void mclrd::PendingCallCache::populate_message_for_result(
+    mir::protobuf::wire::Result& result,
+    std::function<void(google::protobuf::MessageLite*)> const& populator)
 {
     std::unique_lock<std::mutex> lock(mutex);
-    return pending_calls.at(result.id()).response;
+    populator(pending_calls.at(result.id()).response);
 }
 
 void mclrd::PendingCallCache::complete_response(mir::protobuf::wire::Result& result)

=== modified file 'src/client/rpc/mir_basic_rpc_channel.h'
--- src/client/rpc/mir_basic_rpc_channel.h 2015-08-21 12:08:03 +0000
+++ src/client/rpc/mir_basic_rpc_channel.h 2015-12-15 15:42:58 +0000
@@ -67,7 +67,9 @@
         google::protobuf::Closure* complete);
 
 
-    google::protobuf::MessageLite* message_for_result(mir::protobuf::wire::Result& result);
+    void populate_message_for_result(
+        mir::protobuf::wire::Result& result,
+        std::function<void(google::protobuf::MessageLite*)> const& populator);
 
     void complete_response(mir::protobuf::wire::Result& result);
 

=== modified file 'src/client/rpc/mir_protobuf_rpc_channel.cpp'
--- src/client/rpc/mir_protobuf_rpc_channel.cpp 2015-10-07 12:07:00 +0000
+++ src/client/rpc/mir_protobuf_rpc_channel.cpp 2015-12-15 15:42:58 +0000
@@ -382,9 +382,13 @@
 
     if (result->has_id())
     {
-        auto result_message = pending_calls.message_for_result(*result);
-        result_message->ParseFromString(result->response());
-        receive_file_descriptors(result_message);
+        pending_calls.populate_message_for_result(
+            *result,
+            [&](google::protobuf::MessageLite* result_message)
+            {
+                result_message->ParseFromString(result->response());
+                receive_file_descriptors(result_message);
+            });
 
         if (id_to_wait_for)
         {

=== modified file 'src/client/surface_map.cpp'
--- src/client/surface_map.cpp 2015-07-27 07:24:50 +0000
+++ src/client/surface_map.cpp 2015-12-15 15:42:58 +0000
@@ -37,7 +37,7 @@
     //Prevent TSAN from flagging lock ordering issues
     //as the surface/buffer_stream destructors acquire internal locks
    //The mutex lock is used mainly as a memory barrier here
-    std::lock_guard<std::mutex> lk(guard);
+    std::lock_guard<decltype(guard)> lk(guard);
     surface_map = std::move(surfaces);
     stream_map = std::move(streams);
 }
@@ -60,12 +60,11 @@
 void mcl::ConnectionSurfaceMap::with_surface_do(
     mf::SurfaceId surface_id, std::function<void(MirSurface*)> const& exec) const
 {
-    std::unique_lock<std::mutex> lk(guard);
+    std::shared_lock<decltype(guard)> lk(guard);
     auto const it = surfaces.find(surface_id);
     if (it != surfaces.end())
     {
         auto const surface = it->second;
-        lk.unlock();
         exec(surface);
     }
     else
@@ -81,14 +80,14 @@
     // get_buffer_stream has internal locks - call before locking mutex to
     // avoid locking ordering issues
     auto const stream = surface->get_buffer_stream();
-    std::lock_guard<std::mutex> lk(guard);
+    std::lock_guard<decltype(guard)> lk(guard);
     surfaces[surface_id] = surface;
     streams[mf::BufferStreamId(surface_id.as_value())] = {stream, false};
 }
 
 void mcl::ConnectionSurfaceMap::erase(mf::SurfaceId surface_id)
 {
-    std::lock_guard<std::mutex> lk(guard);
+    std::lock_guard<decltype(guard)> lk(guard);
     surfaces.erase(surface_id);
     streams.erase(mf::BufferStreamId(surface_id.as_value()));
 }
@@ -96,12 +95,11 @@
 void mcl::ConnectionSurfaceMap::with_stream_do(
     mf::BufferStreamId stream_id, std::function<void(ClientBufferStream*)> const& exec) const
 {
-    std::unique_lock<std::mutex> lk(guard);
+    std::shared_lock<decltype(guard)> lk(guard);
     auto const it = streams.find(stream_id);
     if (it != streams.end())
     {
         auto const stream = it->second.stream;
-        lk.unlock();
         exec(stream);
     }
     else
@@ -114,19 +112,19 @@
 
 void mcl::ConnectionSurfaceMap::with_all_streams_do(std::function<void(ClientBufferStream*)> const& fn) const
 {
-    std::unique_lock<std::mutex> lk(guard);
+    std::shared_lock<decltype(guard)> lk(guard);
     for(auto const& stream : streams)
         fn(stream.second.stream);
 }
 
 void mcl::ConnectionSurfaceMap::insert(mf::BufferStreamId stream_id, ClientBufferStream* stream)
 {
-    std::lock_guard<std::mutex> lk(guard);
+    std::lock_guard<decltype(guard)> lk(guard);
     streams[stream_id] = {stream, true};
 }
 
 void mcl::ConnectionSurfaceMap::erase(mf::BufferStreamId stream_id)
 {
-    std::lock_guard<std::mutex> lk(guard);
+    std::lock_guard<decltype(guard)> lk(guard);
     streams.erase(stream_id);
 }
