Mir

Merge lp:~alan-griffiths/mir/fix-1526248 into lp:mir

Proposed by Alan Griffiths
Status: Merged
Approved by: Daniel van Vugt
Approved revision: 3200
Merged at revision: 3201
Proposed branch: lp:~alan-griffiths/mir/fix-1526248
Merge into: lp:mir
Diff against target: 158 lines (+24/-18)
5 files modified
src/client/connection_surface_map.h (+2/-2)
src/client/rpc/mir_basic_rpc_channel.cpp (+4/-2)
src/client/rpc/mir_basic_rpc_channel.h (+3/-1)
src/client/rpc/mir_protobuf_rpc_channel.cpp (+7/-3)
src/client/surface_map.cpp (+8/-10)
To merge this branch: bzr merge lp:~alan-griffiths/mir/fix-1526248
Reviewer Review Type Date Requested Status
Daniel van Vugt Abstain
Alberto Aguirre (community) Approve
PS Jenkins bot (community) continuous-integration Approve
Kevin DuBois (community) Approve
Review via email: mp+280578@code.launchpad.net

Commit message

client: Fix racy access surfaces and streams in ConnectionSurfaceMap and to message objects in MirProtobufRpcChannel

Description of the change

client: Fix racy access surfaces and streams in ConnectionSurfaceMap and to message objects in MirProtobufRpcChannel

To post a comment you must log in.
Revision history for this message
Kevin DuBois (kdub) wrote :

I have a more involved fix coming (probably today actually), where we switch to using shared ptrs.

Its a bit unintuitive, but since the resources in the map are raw ptrs, they can actually be deleted before the map gets the lock, and removes them. The branch I have switches to shared_ptrs, so the object can't get deleted while under SurfaceMap's lock.

Revision history for this message
Kevin DuBois (kdub) wrote :
lp:~alan-griffiths/mir/fix-1526248 updated
3200. By Alan Griffiths

Fix race in mclr::MirProtobufRpcChannel::on_data_available()

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I have a more involved fix coming (probably today actually), where we switch
> to using shared ptrs.

Do you want to block this version?

Is there an additional reason to switch to shared_ptr<>?

Revision history for this message
Kevin DuBois (kdub) wrote :

The additional reason was that the shutdown sequence is somewhat confused, and its difficult to come up with a proper shutdown/disconnect message handling with NBS (which has more rapid events being sent)

No reason to block this one though, (esp if it has a second fix in it outside of SurfaceMap) lgtm.

review: Approve
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:3199
http://jenkins.qa.ubuntu.com/job/mir-ci/5849/
Executed test runs:
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-android-vivid-i386-build/5316
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-clang-vivid-amd64-build/4222
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-vivid-touch/5265
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-xenial-touch/159
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/177
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/177/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/177
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/177/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5265
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5265/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-touch/7785
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26067
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/157
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/157/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-xenial-touch/16
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26078

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-ci/5849/rebuild

review: Approve (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:3200
http://jenkins.qa.ubuntu.com/job/mir-ci/5855/
Executed test runs:
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-android-vivid-i386-build/5322
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-clang-vivid-amd64-build/4228
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-vivid-touch/5271
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-xenial-touch/161
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/183
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-amd64-ci/183/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/183
        deb: http://jenkins.qa.ubuntu.com/job/mir-xenial-i386-ci/183/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5271
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-vivid-armhf/5271/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-touch/7787
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26088
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/159
        deb: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-xenial-armhf/159/artifact/work/output/*zip*/output.zip
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-runner-xenial-touch/18
    SUCCESS: http://s-jenkins.ubuntu-ci:8080/job/touch-flash-device/26092

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-ci/5855/rebuild

review: Approve (continuous-integration)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I recall a long time ago that locking during exec() caused deadlocks there. How sure are you that's now fixed?

- lk.unlock();
         exec(surface);

review: Needs Information
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I recall a long time ago that locking during exec() caused deadlocks there.
> How sure are you that's now fixed?
>
> - lk.unlock();
> exec(surface);

I've left the tests of this area looping over lunch and been running exploratory tests.

This isn't mathematical proof that there isn't a deadlock scenario, but deadlock is hypothetical and this fixes a demonstrable problem:

Unlocking is clearly wrong as it allows another thread to call erase() followed by deleting "surface" between unlock() and exec(). Holding a shared lock prevents that while still allowing non-mutating access.

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

There's a lock order inversion report (with a TSan build)

but It looks benign since the only time the locks in question are acquired in different order is during surface creation.

During creation:

Callback MirSurface::created is invoked
-locks MirSurface::mutex
-calls buffer_stream_factory->make_producer_stream
--calls MirConnection::on_stream_created
---calls ConnectionSurfaceMap::insert
----write locks ConnectionSurfaceMap::guard

During normal operation:

MirProtobufRpcChannel::process_event_sequence
-calls ConnectionSurfaceMap::with_surface_do
--Read locks ConnectionSurfaceMap::guard
---Invokes callback which does MirSurface::handle_event
----locks MirSurface::mutex

http://paste.ubuntu.com/14054652/

I don't think the buffer stream creation needs to be done with a locked MirSurface::mutex; We should take a closer look at the uses of MirSurface::mutex, some don't seem necessary -but that's a separate issue outside the scope of this MP.

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) :
review: Abstain

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'src/client/connection_surface_map.h'
--- src/client/connection_surface_map.h 2015-07-16 07:03:19 +0000
+++ src/client/connection_surface_map.h 2015-12-15 15:42:58 +0000
@@ -21,8 +21,8 @@
2121
22#include "surface_map.h"22#include "surface_map.h"
2323
24#include <shared_mutex>
24#include <unordered_map>25#include <unordered_map>
25#include <mutex>
2626
27namespace mir27namespace mir
28{28{
@@ -49,7 +49,7 @@
49 void erase(frontend::BufferStreamId surface_id);49 void erase(frontend::BufferStreamId surface_id);
5050
51private:51private:
52 std::mutex mutable guard;52 std::shared_timed_mutex mutable guard;
53 std::unordered_map<frontend::SurfaceId, MirSurface*> surfaces;53 std::unordered_map<frontend::SurfaceId, MirSurface*> surfaces;
54 std::unordered_map<frontend::BufferStreamId, StreamInfo> streams;54 std::unordered_map<frontend::BufferStreamId, StreamInfo> streams;
55};55};
5656
=== modified file 'src/client/rpc/mir_basic_rpc_channel.cpp'
--- src/client/rpc/mir_basic_rpc_channel.cpp 2015-08-21 12:08:03 +0000
+++ src/client/rpc/mir_basic_rpc_channel.cpp 2015-12-15 15:42:58 +0000
@@ -66,10 +66,12 @@
66 pending_calls[invoke.id()] = PendingCall(response, complete);66 pending_calls[invoke.id()] = PendingCall(response, complete);
67}67}
6868
69google::protobuf::MessageLite* mclrd::PendingCallCache::message_for_result(mir::protobuf::wire::Result& result)69void mclrd::PendingCallCache::populate_message_for_result(
70 mir::protobuf::wire::Result& result,
71 std::function<void(google::protobuf::MessageLite*)> const& populator)
70{72{
71 std::unique_lock<std::mutex> lock(mutex);73 std::unique_lock<std::mutex> lock(mutex);
72 return pending_calls.at(result.id()).response;74 populator(pending_calls.at(result.id()).response);
73}75}
7476
75void mclrd::PendingCallCache::complete_response(mir::protobuf::wire::Result& result)77void mclrd::PendingCallCache::complete_response(mir::protobuf::wire::Result& result)
7678
=== modified file 'src/client/rpc/mir_basic_rpc_channel.h'
--- src/client/rpc/mir_basic_rpc_channel.h 2015-08-21 12:08:03 +0000
+++ src/client/rpc/mir_basic_rpc_channel.h 2015-12-15 15:42:58 +0000
@@ -67,7 +67,9 @@
67 google::protobuf::Closure* complete);67 google::protobuf::Closure* complete);
6868
6969
70 google::protobuf::MessageLite* message_for_result(mir::protobuf::wire::Result& result);70 void populate_message_for_result(
71 mir::protobuf::wire::Result& result,
72 std::function<void(google::protobuf::MessageLite*)> const& populator);
7173
72 void complete_response(mir::protobuf::wire::Result& result);74 void complete_response(mir::protobuf::wire::Result& result);
7375
7476
=== modified file 'src/client/rpc/mir_protobuf_rpc_channel.cpp'
--- src/client/rpc/mir_protobuf_rpc_channel.cpp 2015-10-07 12:07:00 +0000
+++ src/client/rpc/mir_protobuf_rpc_channel.cpp 2015-12-15 15:42:58 +0000
@@ -382,9 +382,13 @@
382382
383 if (result->has_id())383 if (result->has_id())
384 {384 {
385 auto result_message = pending_calls.message_for_result(*result);385 pending_calls.populate_message_for_result(
386 result_message->ParseFromString(result->response());386 *result,
387 receive_file_descriptors(result_message);387 [&](google::protobuf::MessageLite* result_message)
388 {
389 result_message->ParseFromString(result->response());
390 receive_file_descriptors(result_message);
391 });
388392
389 if (id_to_wait_for)393 if (id_to_wait_for)
390 {394 {
391395
=== modified file 'src/client/surface_map.cpp'
--- src/client/surface_map.cpp 2015-07-27 07:24:50 +0000
+++ src/client/surface_map.cpp 2015-12-15 15:42:58 +0000
@@ -37,7 +37,7 @@
37 //Prevent TSAN from flagging lock ordering issues37 //Prevent TSAN from flagging lock ordering issues
38 //as the surface/buffer_stream destructors acquire internal locks38 //as the surface/buffer_stream destructors acquire internal locks
39 //The mutex lock is used mainly as a memory barrier here39 //The mutex lock is used mainly as a memory barrier here
40 std::lock_guard<std::mutex> lk(guard);40 std::lock_guard<decltype(guard)> lk(guard);
41 surface_map = std::move(surfaces);41 surface_map = std::move(surfaces);
42 stream_map = std::move(streams);42 stream_map = std::move(streams);
43 }43 }
@@ -60,12 +60,11 @@
60void mcl::ConnectionSurfaceMap::with_surface_do(60void mcl::ConnectionSurfaceMap::with_surface_do(
61 mf::SurfaceId surface_id, std::function<void(MirSurface*)> const& exec) const61 mf::SurfaceId surface_id, std::function<void(MirSurface*)> const& exec) const
62{62{
63 std::unique_lock<std::mutex> lk(guard);63 std::shared_lock<decltype(guard)> lk(guard);
64 auto const it = surfaces.find(surface_id);64 auto const it = surfaces.find(surface_id);
65 if (it != surfaces.end())65 if (it != surfaces.end())
66 {66 {
67 auto const surface = it->second;67 auto const surface = it->second;
68 lk.unlock();
69 exec(surface);68 exec(surface);
70 }69 }
71 else70 else
@@ -81,14 +80,14 @@
81 // get_buffer_stream has internal locks - call before locking mutex to80 // get_buffer_stream has internal locks - call before locking mutex to
82 // avoid locking ordering issues81 // avoid locking ordering issues
83 auto const stream = surface->get_buffer_stream();82 auto const stream = surface->get_buffer_stream();
84 std::lock_guard<std::mutex> lk(guard);83 std::lock_guard<decltype(guard)> lk(guard);
85 surfaces[surface_id] = surface;84 surfaces[surface_id] = surface;
86 streams[mf::BufferStreamId(surface_id.as_value())] = {stream, false};85 streams[mf::BufferStreamId(surface_id.as_value())] = {stream, false};
87}86}
8887
89void mcl::ConnectionSurfaceMap::erase(mf::SurfaceId surface_id)88void mcl::ConnectionSurfaceMap::erase(mf::SurfaceId surface_id)
90{89{
91 std::lock_guard<std::mutex> lk(guard);90 std::lock_guard<decltype(guard)> lk(guard);
92 surfaces.erase(surface_id);91 surfaces.erase(surface_id);
93 streams.erase(mf::BufferStreamId(surface_id.as_value()));92 streams.erase(mf::BufferStreamId(surface_id.as_value()));
94}93}
@@ -96,12 +95,11 @@
96void mcl::ConnectionSurfaceMap::with_stream_do(95void mcl::ConnectionSurfaceMap::with_stream_do(
97 mf::BufferStreamId stream_id, std::function<void(ClientBufferStream*)> const& exec) const96 mf::BufferStreamId stream_id, std::function<void(ClientBufferStream*)> const& exec) const
98{97{
99 std::unique_lock<std::mutex> lk(guard);98 std::shared_lock<decltype(guard)> lk(guard);
100 auto const it = streams.find(stream_id);99 auto const it = streams.find(stream_id);
101 if (it != streams.end())100 if (it != streams.end())
102 {101 {
103 auto const stream = it->second.stream;102 auto const stream = it->second.stream;
104 lk.unlock();
105 exec(stream);103 exec(stream);
106 }104 }
107 else105 else
@@ -114,19 +112,19 @@
114112
115void mcl::ConnectionSurfaceMap::with_all_streams_do(std::function<void(ClientBufferStream*)> const& fn) const113void mcl::ConnectionSurfaceMap::with_all_streams_do(std::function<void(ClientBufferStream*)> const& fn) const
116{114{
117 std::unique_lock<std::mutex> lk(guard);115 std::shared_lock<decltype(guard)> lk(guard);
118 for(auto const& stream : streams)116 for(auto const& stream : streams)
119 fn(stream.second.stream);117 fn(stream.second.stream);
120}118}
121119
122void mcl::ConnectionSurfaceMap::insert(mf::BufferStreamId stream_id, ClientBufferStream* stream)120void mcl::ConnectionSurfaceMap::insert(mf::BufferStreamId stream_id, ClientBufferStream* stream)
123{121{
124 std::lock_guard<std::mutex> lk(guard);122 std::lock_guard<decltype(guard)> lk(guard);
125 streams[stream_id] = {stream, true};123 streams[stream_id] = {stream, true};
126}124}
127125
128void mcl::ConnectionSurfaceMap::erase(mf::BufferStreamId stream_id)126void mcl::ConnectionSurfaceMap::erase(mf::BufferStreamId stream_id)
129{127{
130 std::lock_guard<std::mutex> lk(guard);128 std::lock_guard<decltype(guard)> lk(guard);
131 streams.erase(stream_id);129 streams.erase(stream_id);
132}130}

Subscribers

People subscribed via source and target branches