Mir

Merge lp:~dandrader/mir/switchingBundle_lp1270964 into lp:mir

switchingBundle_lp1270964
Merge into development-branch

Proposed by Daniel d'Andrada on 2014-01-21

Status:

Merged

Approved by:

Daniel van Vugt on 2014-01-23

Approved revision:

no longer in the source branch.

Merged at revision:

1350

Proposed branch:

lp:~dandrader/mir/switchingBundle_lp1270964

Merge into:

lp:mir

Diff against target:

88 lines (+34/-4)

3 files modified

src/server/compositor/switching_bundle.cpp (+1/-2)
src/server/compositor/switching_bundle.h (+3/-1)
tests/unit-tests/compositor/test_switching_bundle.cpp (+30/-1)

To merge this branch:

bzr merge lp:~dandrader/mir/switchingBundle_lp1270964

Low

Fix Released


Reviewer	Review Type	Date Requested	Status
PS Jenkins bot (community)	continuous-integration	2014-01-21	Approve on 2014-01-23
Daniel van Vugt		2014-01-21	Approve on 2014-01-23
Alan Griffiths		2014-01-21	Abstain on 2014-01-22
Andreas Pokorny (community)			Approve on 2014-01-22
Alexandros Frantzis (community)			Approve on 2014-01-22
Review via email: mp+202446@code.launchpad.net

This proposal supersedes a proposal from 2014-01-20.

Commit message

Only use SwitchingBundle::last_consumed after it has been set.

Otherwise SwitchingBundle::compositor_acquire could follow a bogus code path.
(LP: #1270964)

Description of the change

Fixes bug 1270964.
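
To make the failure mode concrete, here is a minimal stand-alone sketch of the comparison involved. It is a simplified model, not Mir's code: the helper names same_frame_buggy and same_frame_fixed are hypothetical, and std::optional (C++17) stands in for the boost::optional used in the branch.

```cpp
#include <optional>

// Simplified model of the "same frame" check in
// SwitchingBundle::compositor_acquire. Hypothetical helpers, not Mir code.

// Before the fix: last_consumed is initialized to 0, so a very first
// acquire with frameno == 0 looks like a repeat of an already-consumed frame.
inline bool same_frame_buggy(unsigned long frameno, unsigned long last_consumed)
{
    return frameno == last_consumed;
}

// After the fix: an empty optional means "nothing consumed yet", so the
// comparison is only made once a real value has been stored.
inline bool same_frame_fixed(unsigned long frameno,
                             std::optional<unsigned long> const& last_consumed)
{
    return last_consumed && (frameno == *last_consumed);
}
```

With these helpers, same_frame_buggy(0, 0) is true even though no frame was ever consumed, while same_frame_fixed(0, std::nullopt) is false.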

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-01-21: Posted in a previous version of this proposal

FAILED: Continuous integration, rev:1342
http://jenkins.qa.ubuntu.com/job/mir-team-mir-development-branch-ci/689/
Executed test runs:
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-android-trusty-i386-build/677
    SUCCESS: http://jenkins.qa.ubuntu.com/job/mir-clang-trusty-amd64-build/673
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-trusty-touch/279/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-team-mir-development-branch-trusty-amd64-ci/419/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-team-mir-development-branch-trusty-armhf-ci/423/console
    FAILURE: http://jenkins.qa.ubuntu.com/job/mir-mediumtests-builder-trusty-armhf/279/console

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/689/rebuild

review: Needs Fixing (continuous-integration)

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-01-21: Posted in a previous version of this proposal

(1) The new test fails:
[ RUN ] SwitchingBundleTest.compositor_client_interleaved
/home/dan/bzr/mir/tmp.964/tests/unit-tests/compositor/test_switching_bundle.cpp:389: Failure
Value of: bundle.first_ready
Actual: 1
Expected: 0
[ FAILED ] SwitchingBundleTest.compositor_client_interleaved (0 ms)
The same failure is occurring on desktop amd64 and armhf (Nexus 10).

(2) I'm undecided about this approach:
47 + friend class SwitchingBundleTest_compositor_client_interleaved_Test;
It's certainly a convenient solution for introspection. But that then allows the test to couple to internal members it shouldn't be coupled to.

And before anyone asks, this has nothing to do with bug 1270245 (which still happens with this branch).

review: Needs Fixing

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-01-21: Posted in a previous version of this proposal

(2) Please remove "friend class ..." and all the ASSERT_EQ statements that depend on it. They're not needed to produce a correct regression test, as shown here:
https://code.launchpad.net/~vanvugt/mir/test-1270964/+merge/202406
Feel free to merge that branch into this one.

review: Needs Fixing

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-21: Posted in a previous version of this proposal

> (1) The new test fails:
> [ RUN ] SwitchingBundleTest.compositor_client_interleaved
> /home/dan/bzr/mir/tmp.964/tests/unit-
> tests/compositor/test_switching_bundle.cpp:389: Failure
> Value of: bundle.first_ready
> Actual: 1
> Expected: 0
> [ FAILED ] SwitchingBundleTest.compositor_client_interleaved (0 ms)
> The same failure is occurring on desktop amd64 and armhf (Nexus 10).

Sorry, I rushed with that proposal right before my EOD.

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-21:

> (2) Please remove "friend class ..." and all the ASSERT_EQ statements that depend on it. They're not needed to produce a correct regression test

Sure, but I do think there's a lot of value in testing the internal state, as you can identify problems at an earlier stage before they manifest themselves externally, in this case, locking in a subsequent call.

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-01-21:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/697/rebuild

review: Approve (continuous-integration)

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-01-21:

52 +/*
53 + Regression test for LP#1270964
54 + In the situation emulated here SwitchingBundle::last_consumed is used without ever being set,
55 + thus its initial value of zero will wrongly be used and lead to bogus behavior.
56 + */
57 +TEST_F(SwitchingBundleTest, compositor_client_interleaved)

Would a better name be "does_not_hang_as_described_in_bug_1270964"?

71 + //(nbufs=3,fcomp=2,ncomp=1,fready=0,nready=1,fclient=1,nclient=1) <-- BUG starts here
72 + // Should be:
73 + //(nbufs=3,fcomp=0,ncomp=1,fready=0,nready=0,fclient=1,nclient=1)
...
77 + //(nbufs=3,fcomp=2,ncomp=1,fready=0,nready=2,fclient=2,nclient=0)
78 + // Should be:
79 + //(nbufs=3,fcomp=0,ncomp=1,fready=1,nready=1,fclient=2,nclient=0)

These comments are too cryptic for me.

81 + client_buffer = bundle.client_acquire(); // <- would lock here if in buggy state

I'm not a fan of tests that fail by hanging.

review: Needs Fixing

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-21:

> 71 + //(nbufs=3,fcomp=2,ncomp=1,fready=0,nready=1,fclient=1,nclient=1)
> <-- BUG starts here
> 72 + // Should be:
> 73 + //(nbufs=3,fcomp=0,ncomp=1,fready=0,nready=0,fclient=1,nclient=1)
> ...
> 77 + //(nbufs=3,fcomp=2,ncomp=1,fready=0,nready=2,fclient=2,nclient=0)
> 78 + // Should be:
> 79 + //(nbufs=3,fcomp=0,ncomp=1,fready=1,nready=1,fclient=2,nclient=0)
>
> These comments are too cryptic for me.

This is the output of operator<<, telling the internal state of SwitchingBundle.

>
> 81 + client_buffer = bundle.client_acquire(); // <- would lock here if in
> buggy state
>
> I'm not a fan of tests that fail by hanging.

Me neither, suggestions?

That's the external, "visible", symptom of this bug. The other option is checking the internal state of SwitchingBundle after each step to see if it's doing things correctly. But duflu doesn't like that.

I vote for internal state checks, as explained on a previous comment.

So, what do we do?
1 - test if it locks (current approach)
2 - check the internal state (doesn't hang when it fails as it detects the issue at earlier steps)
3 - Suggestions?

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-01-21:

> > I'm not a fan of tests that fail by hanging.
>
> Me neither, suggestions?
>
> That's the external, "visible", symptom of this bug. The other option is
> checking the internal state of SwitchingBundle after each step to see if it's
> doing things correctly. But duflu doesn't like that.
>
> I vote for internal state checks, as explained on a previous comment.
>
> So, what do we do?
> 1 - test if it locks (current approach)
> 2 - check the internal state (doesn't hang when it fails as it detects the
> issue at earlier steps)
> 3 - Suggestions?

Some (not necessarily good) suggestions...

Expose enough of the state to enable a check through the public interface. e.g.

ASSERT_TRUE(bundle.is_valid());

Or check the result of streaming?

    std::ostringstream out;
    out << bundle;
    ASSERT_THAT(out.str(), HasSubstr(...));

Have a timer thread that breaks the lock.
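
The streaming suggestion could look roughly like the sketch below. BundleState and describe are hypothetical stand-ins (the real operator<< is defined for SwitchingBundle itself), and the field names simply follow the state dumps quoted earlier in this thread. A test would then match on a substring of the result, for instance with Google Mock's HasSubstr.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the counters that SwitchingBundle prints.
struct BundleState
{
    int nbufs, fcomp, ncomp, fready, nready, fclient, nclient;
};

// Streams the state in the same shape as the dumps quoted in the review.
inline std::ostream& operator<<(std::ostream& os, BundleState const& s)
{
    return os << "(nbufs=" << s.nbufs
              << ",fcomp=" << s.fcomp
              << ",ncomp=" << s.ncomp
              << ",fready=" << s.fready
              << ",nready=" << s.nready
              << ",fclient=" << s.fclient
              << ",nclient=" << s.nclient << ")";
}

// Captures the streamed form so a test can inspect it without
// reaching into private members.
inline std::string describe(BundleState const& s)
{
    std::ostringstream out;
    out << s;
    return out.str();
}
```

The trade-off is that the test becomes coupled to the text format instead of to private members, which may or may not be an improvement.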

Revision history for this message

Kevin DuBois (kdub) wrote on 2014-01-21:

could this be related to https://bugs.launchpad.net/mir/+bug/1270245 ?

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-01-22:

As stated yesterday:
"And before anyone asks, this has nothing to do with bug 1270245 (which still happens with this branch)."

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-01-22:

This solution is indeed technically correct. Although I would personally take a simpler approach and initialize last_consumed to 0xdeadbeef or something similar. Then the caller has much less (1 in 4 billion) chance of accidentally writing code that could trigger the bug. And no need for a new member.

Fortunately I just realized and remembered Mir presently has 0% chance of experiencing the bug. Because our real compositor starts at frame #1.

review: Approve

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-01-22:

> Fortunately I just realized and remembered Mir presently has 0% chance of
> experiencing the bug. Because our real compositor starts at frame #1.

Then a comment "use 1 for the first frameno" would appear to suffice.

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-22:

> Fortunately I just realized and remembered Mir presently has 0% chance of
> experiencing the bug. Because our real compositor starts at frame #1.

For what it's worth, I consistently get it with the prototype qml compositor.

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-22:

> > Fortunately I just realized and remembered Mir presently has 0% chance of
> > experiencing the bug. Because our real compositor starts at frame #1.
>
> Then a comment "use 1 for the first frameno" would appear to suffice.

As a workaround.

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-22:

> > Fortunately I just realized and remembered Mir presently has 0% chance of
> > experiencing the bug. Because our real compositor starts at frame #1.
>
> For what it's worth, I consistently get it with the prototype qml compositor.

Because I start with frameno 0, duh.

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-01-22:

Then it could be argued that the prototype qml compositor is buggy, as we now know it should not start with frameno = 0.

I'm now tending to the view that 0 is a bad initial value (as it is otherwise reasonable for a compositor to start with this value) and that there ought to be better way to manage frame numbers (e.g. a generator class or just a constant "initial_frameno").

I don't like the proposed test (as discussed above) and the proposed solution feels complex for what is really a poor interface relying on an undocumented convention.

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-22:

Now the test doesn't lock on failure or probe SwitchingBundle's internal state. Should make everybody happy. :)

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-01-22:

Since 0 is a valid value it may come up again when our global/local frameno counts wrap around. So we either:
1. Need to allow it (proposed solution)
2. Make the value invalid and ensure 100% that it never comes up during normal operation

I am OK with (1). It allows the user to start with any initial value they want, and this is IMO a better interface since less is expected from the user and there is no opportunity for error.

That being said, I would also be happy with something like the following for approach (2):

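struct FrameNumber
{
    // Skip invalid_value so it is never handed out as a frame number.
    void operator++() { if (++number == invalid_value) ++number; }
    // Conversion operators take no return type.
    operator int() const { return number; }

    static int const invalid_value = 0;

private:
    int number = invalid_value + 1;
};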

BufferBundle::compositor_acquire(FrameNumber frameno);

review: Approve

Revision history for this message

Andreas Pokorny (andreas-pokorny) wrote on 2014-01-22:

As a gradual improvement I would suggest an encapsulation of these two member variables into one:

i.e. using boost::optional<unsigned long> last_consumed;

But Alexandros' proposal for (2) is nicer.

review: Approve

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-01-22:

> Since 0 is a valid value it may come up again when our global/local frameno
> counts wrap around.

that doesn't matter if last_consumed has been written - it is only the first frame number used that can trigger the failure.

review: Abstain

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-01-22:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/708/rebuild

review: Approve (continuous-integration)

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-01-22:

> > Since 0 is a valid value it may come up again when our global/local frameno
> > counts wrap around.
>
> that doesn't matter if last_consumed has been written - it is only the first
> frame number used that can trigger the failure.

The fine point here is that this is a potential problem for all new surfaces, since they may be created just after the count wraps around and therefore get 0 as their first frameno value.

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-22:

> As a gradual improvement I would suggest an encapsulation of these two member
> variables into one:
>
> i.e. using boost::optional<unsigned long> last_consumed;

I liked it! Done.

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-01-22:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/711/rebuild

review: Approve (continuous-integration)

Revision history for this message

Kevin DuBois (kdub) wrote on 2014-01-23:

> As stated yesterday:
> "And before anyone asks, this has nothing to do with bug 1270245 (which still
> happens with this branch)."

must have missed the message, sorry

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-01-23:

I'm not sure that using boost is a better solution. Mostly because boost is so hideously documented that it's often very hard for the reader to learn what a particular template is/does. If using an additional variable lets you create a solution that the reader will understand immediately (no new library to learn), then that would have been better.

Now that I kind of understand boost::optional, I think it's a dangerous template to use, because assigning to an optional and testing its boolean value deal with essentially unrelated values. That's asking to be misinterpreted (as I have already today).

But this new version looks correct also.
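
The distinction being drawn can be shown in a few lines. This sketch uses std::optional (C++17) as a stand-in for boost::optional, and the helper names are hypothetical: an optional holding 0 is still "set", so its boolean test says nothing about the stored value.

```cpp
#include <optional>

// std::optional is used here as a stand-in for boost::optional.

// True once any value has been stored, including 0.
inline bool has_been_set(std::optional<unsigned long> const& value)
{
    return value.has_value();
}

// True only if a value has been stored AND that value is non-zero.
// This is a different question from has_been_set().
inline bool is_set_and_nonzero(std::optional<unsigned long> const& value)
{
    return value && (*value != 0);
}
```

After assigning 0 to an empty optional, has_been_set returns true while is_set_and_nonzero returns false, which is the kind of mismatch that is easy to misread.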

review: Approve

Revision history for this message

PS Jenkins bot (ps-jenkins) on 2014-01-23:

review: Approve (continuous-integration)

Revision history for this message

Daniel d'Andrada (dandrader) wrote on 2014-01-23:

> [...] Mostly because boost is so hideously documented that it's often very hard for the reader to learn what a particular template is/does. [...]

Gotta agree with you on that.

Wasn't easy to learn boost::optional from its own documentation, which is more an essay on its motivation and design than an explanation on how to actually use it (the header file ended up doing the job).

Preview Diff


=== modified file 'src/server/compositor/switching_bundle.cpp'
--- src/server/compositor/switching_bundle.cpp	2014-01-21 06:27:40 +0000
+++ src/server/compositor/switching_bundle.cpp	2014-01-22 17:13:38 +0000
@@ -76,7 +76,6 @@
       first_ready{0}, nready{0},
       first_client{0}, nclients{0},
       snapshot{-1}, nsnapshotters{0},
-      last_consumed{0},
       overlapping_compositors{false},
       framedropping{false}, force_drop{0}
 {
@@ -273,7 +272,7 @@
     int compositor;
     // Multi-monitor acquires close to each other get the same frame:
-    bool same_frame = (frameno == last_consumed);
+    bool same_frame = last_consumed && (frameno == *last_consumed);
     int avail = nfree();
     bool can_recycle = ncompositors || avail;
=== modified file 'src/server/compositor/switching_bundle.h'
--- src/server/compositor/switching_bundle.h	2014-01-13 06:12:33 +0000
+++ src/server/compositor/switching_bundle.h	2014-01-22 17:13:38 +0000
@@ -25,6 +25,8 @@
 #include <mutex>
 #include <memory>
+#include <boost/optional/optional.hpp>
+
 namespace mir
 {
 namespace graphics
@@ -98,7 +100,7 @@
     mutable std::mutex guard;
     std::condition_variable cond;
-    unsigned long last_consumed;
+    boost::optional<unsigned long> last_consumed;
     bool overlapping_compositors;
=== modified file 'tests/unit-tests/compositor/test_switching_bundle.cpp'
--- tests/unit-tests/compositor/test_switching_bundle.cpp	2014-01-13 06:12:33 +0000
+++ tests/unit-tests/compositor/test_switching_bundle.cpp	2014-01-22 17:13:38 +0000
@@ -329,6 +329,36 @@
     }
 }
+/*
+ Regression test for LP#1270964
+ In the original bug, SwitchingBundle::last_consumed would be used without ever being set
+ in the compositor_acquire() call, thus its initial value of zero would wrongly be used
+ and lead to the wrong buffer being given to the compositor
+ */
+TEST_F(SwitchingBundleTest, compositor_client_interleaved)
+{
+    int nbuffers = 3;
+    mc::SwitchingBundle bundle(nbuffers, allocator, basic_properties);
+    mg::Buffer* client_buffer = nullptr;
+    std::shared_ptr<mg::Buffer> compositor_buffer = nullptr;
+
+    client_buffer = bundle.client_acquire();
+    mg::BufferID first_ready_buffer_id = client_buffer->id();
+    bundle.client_release(client_buffer);
+
+    client_buffer = bundle.client_acquire();
+
+    // in the original bug, compositor would be given the wrong buffer here
+    compositor_buffer = bundle.compositor_acquire(0 /*frameno*/);
+
+    ASSERT_EQ(first_ready_buffer_id, compositor_buffer->id());
+
+    // Clean up
+    bundle.client_release(client_buffer);
+    bundle.compositor_release(compositor_buffer);
+    compositor_buffer.reset();
+}
+
 TEST_F(SwitchingBundleTest, overlapping_compositors_get_different_frames)
 {
     // This test simulates bypass behaviour
@@ -868,4 +898,3 @@
         }
     }
 }
-