Mir

Merge lp:~mir-team/mir/fix-1339700 into lp:mir

Proposed by Alberto Aguirre
Status: Rejected
Rejected by: Alberto Aguirre
Proposed branch: lp:~mir-team/mir/fix-1339700
Merge into: lp:mir
Diff against target: 120 lines (+42/-16)
2 files modified
src/server/compositor/buffer_queue.cpp (+14/-2)
src/server/compositor/timeout_frame_dropping_policy_factory.cpp (+28/-14)
To merge this branch: bzr merge lp:~mir-team/mir/fix-1339700
Reviewer Review Type Date Requested Status
PS Jenkins bot (community) continuous-integration Approve
Cemil Azizoglu (community) Approve
Alexandros Frantzis (community) Needs Information
Alan Griffiths Approve
Daniel van Vugt Needs Fixing
Review via email: mp+226233@code.launchpad.net

Commit message

Fix deadlock in BufferQueue when calling framedrop_policy (LP: #1339700)

Description of the change

Fix deadlock in BufferQueue when calling framedrop_policy (LP: #1339700)

To post a comment you must log in.
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I'm not really familiar with this. It's all new stuff that appeared while I was on vacation. But simple logic still applies:

(1) Accessing a member variable after having given up the lock that was protecting it, I think is a bad idea:
36 + give_buffer_to_client(buffer, std::move(lock));
37 framedrop_policy->swap_unblocked();
38 - give_buffer_to_client(buffer, std::move(lock));

(2) This unlock also results in a member variable no longer being safely locked. It looks like it should stay locked. Certainly it doesn't appear to do any callbacks...
28 + lock.unlock();
29 framedrop_policy->swap_now_blocking();

(3) No regression test!
Maybe I can help with that as I remember seeing the offending assertion fail for me a couple of times this week. I just need to find where...

review: Needs Fixing
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Chris Halse Rogers (raof) wrote :

Cross post from the 0.4 merge:

Hm. This would appear to be a bandaid solution. The underlying problem is that Alarm is calling out to code it doesn't control while holding a lock; we can't guarantee correct locking in this case.

This was introduced in https://code.launchpad.net/~alan-griffiths/mir/fix-1334287/+merge/224457, which fixes a bug, so we can't simply drop the lock. We might need a second lock in AlarmImpl guarding the destructor and data->callback() call?

I'm not averse to this as an urgent fix, but unless we also fix it properly this sort of timebomb will keep exploding :)

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I admit Mir is the first project I've ever seen that employs callbacks from opaque and potentially unsafe locations, where you can't guarantee which locks are already held.

Maybe we need to start considering synchronization primitives without callbacks. That would also afford us the freedom to be more sensible with locking, and only have locks in regions of code which explicitly create/use threads in their own locality. One should start with no locking, and then only introduce it adjacently to the threads that necessitate it. Certainly you should never lock inside class methods if that class isn't the one creating the threads. Because you're just hiding more locks which when combined with callbacks will cause deadlocks like this.

But now I'm hand-waving. We need shorter term practical solutions first. Maybe not as short-term as removing locking like is proposed here though...

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

@Daniel

1) 2) How does the lock protect framedrop_policy? And what does it protect? From what I can see there's no need to hold the lock. The implementation of the policy has its own internal synchronization.

3) I could not come up with a reasonable test against deadlock that doesn't involve modifying production code

@Chris, yes definitely a band-aid for now. I think applying a similar custom lock as in Alan's deadlock free branch will be needed for alarm callbacks.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Alberto: framedrop_policy is a member of BufferQueue and as such is protected by BufferQueue.guard. Giving framedrop_policy its own internal locking and making it an exceptional member I guess could be safe (maybe?), but it's better if we don't force anyone to have to think that hard. Because tricky exceptions like that just result in future misunderstandings and bugs.

I've proposed a simpler fix here with a test (which successfully detects the deadlock without hanging itself):
https://code.launchpad.net/~vanvugt/mir/fix-1339700-alarm/+merge/226252

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> Alberto: framedrop_policy is a member of BufferQueue and as such is protected
> by BufferQueue.guard.

const members do not need to be protected by a lock.

While framedrop_policy isn't declared const it is only set during constructor so it is an "honorary const" member.

This isn't clear and I'd prefer a rewrite that makes framedrop_policy const but this does seem safe and fix the deadlock.

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> > Alberto: framedrop_policy is a member of BufferQueue and as such is
> protected
> > by BufferQueue.guard.
>
> const members do not need to be protected by a lock.
>
> While framedrop_policy isn't declared const it is only set during constructor
> so it is an "honorary const" member.
>
> This isn't clear and I'd prefer a rewrite that makes framedrop_policy const
> but this does seem safe and fix the deadlock.

std::unique_ptr<FrameDroppingPolicy> is effectively const but the FrameDroppingPolicy object itself is not, so we need to ensure it's thread-safe. It's not obvious that our TimeoutFrameDroppingPolicy implementation is thread-safe; calling Alarm methods muddies the water.

review: Needs Information
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

I 've done my own testing for regressions and didn't encounter issues with this fix. It does bother me that it doesn't have a test. If you think one up, let's add it. For now, in the absence of a better alternative, letting this one through...

review: Approve
Revision history for this message
kevin gunn (kgunn72) wrote :

top approving to get the silo rolling which always takes a while.
still needs followup with Alberto

Revision history for this message
kevin gunn (kgunn72) wrote :

> top approving to get the silo rolling which always takes a while.
> still needs followup with Alberto

top approving only https://code.launchpad.net/~mir-team/mir/fix-1339700-in-0.4/+merge/226232
leaving this open, follow up on 0.4 & devel will need to happen in sync

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

> Alberto: framedrop_policy is a member of BufferQueue and as such is protected
> by BufferQueue.guard. Giving framedrop_policy its own internal locking and
> making it an exceptional member I guess could be safe (maybe?), but it's
> better if we don't force anyone to have to think that hard. Because tricky
> exceptions like that just result in future misunderstandings and bugs.
>

Mutual exclusion protection is only needed if:
1) The member variable itself is mutating - it's not in this case even though it's not const.
2) The methods called on the object are not thread safe - from what I can see the implementation (TimeoutFrameDroppingPolicy) seems to be thread safe with respect to the two methods we invoke (swap_now_blocking and swap_unblocked)

I agree, we need to fix the root of the issue - but I think this is a viable short term bandaid.

> I've proposed a simpler fix here with a test (which successfully detects the
> deadlock without hanging itself):
> https://code.launchpad.net/~vanvugt/mir/fix-1339700-alarm/+merge/226252

I explicitly avoided dropping the lock at the AlarmImpl since it brings up lifetime issues which Alan addressed.

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

> std::unique_ptr<FrameDroppingPolicy> is effectively const but the
> FrameDroppingPolicy object itself is not, so we need to ensure it's thread-
> safe. It's not obvious that our TimeoutFrameDroppingPolicy implementation is
> thread-safe; calling Alarm methods muddies the water.

So the methods of interest are swap_now_blocking and swap_unblocked in TimeoutFrameDroppingPolicy

The question is really should those methods be completely atomic or can we get by if they are not (i.e. is pending_swaps being atomic sufficient)- so I will stare at them a bit more...

lp:~mir-team/mir/fix-1339700 updated
1759. By Alberto Aguirre

merge lp:mir/devel

1760. By Alberto Aguirre

merge lp:~alan-griffiths/mir/doodle-in-TimeoutFrameDroppingPolicy-code

1761. By Alberto Aguirre

more doodling to prevent pending_swaps going negative

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

Ok, after staring at TimeoutFrameDroppingPolicy there are scenarios that can lead to an assert.

Removing the assert creates other racing conditions, which are mostly addressed with the updates. However, there's the following scenario:

Assume there's a pending client notification,
Thread A calls BufferQueue::release, let's go of the lock and gets pre-empted before calling framedrop_policy->swap_unblocked();

Thread B calls client_release (because Thread A already gave a buffer) then calls client_acquire.
Thread B calls framedrop_policy->swap_now_blocking() and an alarm is scheduled.

Thread A resumes and calls framedrop_policy->swap_unblocked(), the alarm is now cancelled but the client did not receive any buffer.

Which is an issue if the compositor is stopped before another compositor_acquire/compositor_release cycle.

The bottom line is pending_client_notifications should be in sync with pending_swaps in TimeoutFrameDroppingPolicy, which is not possible if we drop the lock...

Revision history for this message
kevin gunn (kgunn72) wrote :

> > top approving to get the silo rolling which always takes a while.
> > still needs followup with Alberto
>
> top approving only https://code.launchpad.net/~mir-
> team/mir/fix-1339700-in-0.4/+merge/226232
> leaving this open, follow up on 0.4 & devel will need to happen in sync

actually not top-approving anything for the moment, waiting for a bit more doodling resolution

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)

Unmerged revisions

1761. By Alberto Aguirre

more doodling to prevent pending_swaps going negative

1760. By Alberto Aguirre

merge lp:~alan-griffiths/mir/doodle-in-TimeoutFrameDroppingPolicy-code

1759. By Alberto Aguirre

merge lp:mir/devel

1758. By Alberto Aguirre

Fix deadlock in BufferQueue when calling framedrop_policy (LP: #1339700)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'src/server/compositor/buffer_queue.cpp'
2--- src/server/compositor/buffer_queue.cpp 2014-06-12 15:35:08 +0000
3+++ src/server/compositor/buffer_queue.cpp 2014-07-10 18:40:31 +0000
4@@ -133,7 +133,18 @@
5 framedrop_policy = policy_provider.create_policy([this]
6 {
7 std::unique_lock<decltype(guard)> lock{guard};
8- assert(!pending_client_notifications.empty());
9+
10+ if (pending_client_notifications.empty())
11+ {
12+ /*
13+ * This is a benign race between the policy calling back due to a
14+ * timeout and invoking framedrop_policy->swap_unblocked().
15+ * A client request may already have been fulfilled just before
16+ * invoking framedrop_policy->swap_unblocked()
17+ */
18+ return;
19+ }
20+
21 if (ready_to_composite_queue.empty())
22 {
23 /*
24@@ -193,6 +204,7 @@
25 * until the compositor is done with an old frame, or the policy
26 * says they've waited long enough.
27 */
28+ lock.unlock();
29 framedrop_policy->swap_now_blocking();
30 }
31
32@@ -434,8 +446,8 @@
33 {
34 if (!pending_client_notifications.empty())
35 {
36+ give_buffer_to_client(buffer, std::move(lock));
37 framedrop_policy->swap_unblocked();
38- give_buffer_to_client(buffer, std::move(lock));
39 }
40 else
41 free_buffers.push_back(buffer);
42
43=== modified file 'src/server/compositor/timeout_frame_dropping_policy_factory.cpp'
44--- src/server/compositor/timeout_frame_dropping_policy_factory.cpp 2014-06-25 16:30:46 +0000
45+++ src/server/compositor/timeout_frame_dropping_policy_factory.cpp 2014-07-10 18:40:31 +0000
46@@ -20,8 +20,7 @@
47
48 #include "timeout_frame_dropping_policy_factory.h"
49
50-#include <cassert>
51-#include <atomic>
52+#include <mutex>
53 #include <chrono>
54 #include <boost/throw_exception.hpp>
55
56@@ -42,7 +41,8 @@
57 private:
58 std::chrono::milliseconds const timeout;
59 std::unique_ptr<mir::time::Alarm> const alarm;
60- std::atomic<unsigned int> pending_swaps;
61+ std::mutex mutex;
62+ unsigned int pending_swaps;
63 };
64
65 TimeoutFrameDroppingPolicy::TimeoutFrameDroppingPolicy(std::shared_ptr<mir::time::Timer> const& timer,
66@@ -51,10 +51,16 @@
67 : timeout{timeout},
68 alarm{timer->create_alarm([this, drop_frame]
69 {
70- assert(pending_swaps.load() > 0);
71- drop_frame();
72- if (--pending_swaps > 0)
73- alarm->reschedule_in(this->timeout);
74+ bool call_drop{false};
75+ bool reschedule{false};
76+
77+ {
78+ std::lock_guard<std::mutex> lock{mutex};
79+ call_drop = pending_swaps > 0;
80+ reschedule = call_drop && --pending_swaps > 0;
81+ }
82+ if (call_drop) drop_frame();
83+ if (reschedule) alarm->reschedule_in(this->timeout);
84 })},
85 pending_swaps{0}
86 {
87@@ -62,18 +68,26 @@
88
89 void TimeoutFrameDroppingPolicy::swap_now_blocking()
90 {
91- if (pending_swaps++ == 0)
92+ if ([&]{ std::lock_guard<std::mutex> lock{mutex}; return pending_swaps++ == 0; }())
93 alarm->reschedule_in(timeout);
94 }
95
96 void TimeoutFrameDroppingPolicy::swap_unblocked()
97 {
98- if (alarm->state() != mir::time::Alarm::cancelled && alarm->cancel())
99- {
100- if (--pending_swaps > 0)
101- {
102- alarm->reschedule_in(timeout);
103- }
104+ if (alarm->state() == mir::time::Alarm::cancelled) return;
105+
106+ bool update_alarm{false};
107+ bool reschedule{false};
108+ {
109+ std::lock_guard<std::mutex> lock{mutex};
110+ update_alarm = pending_swaps > 0;
111+ reschedule = update_alarm && --pending_swaps > 0;
112+ }
113+
114+ if (update_alarm)
115+ {
116+ if (reschedule) alarm->reschedule_in(timeout);
117+ else alarm->cancel();
118 }
119 }
120 }

Subscribers

People subscribed via source and target branches