Mir

Merge lp:~vanvugt/mir/predictive-bypass-2 into lp:mir

Proposed by Daniel van Vugt
Status: Merged
Merged at revision: 2719
Proposed branch: lp:~vanvugt/mir/predictive-bypass-2
Merge into: lp:mir
Diff against target: 377 lines (+190/-0)
9 files modified
src/platforms/android/server/hwc_device.cpp (+45/-0)
src/platforms/mesa/server/kms/display_buffer.cpp (+62/-0)
src/platforms/mesa/server/kms/kms_output.h (+8/-0)
src/platforms/mesa/server/kms/real_kms_output.cpp (+8/-0)
src/platforms/mesa/server/kms/real_kms_output.h (+1/-0)
tests/mir_test_doubles/mock_drm.cpp (+5/-0)
tests/unit-tests/graphics/android/test_hwc_device.cpp (+23/-0)
tests/unit-tests/graphics/mesa/kms/mock_kms_output.h (+1/-0)
tests/unit-tests/graphics/mesa/kms/test_display_buffer.cpp (+37/-0)
To merge this branch: bzr merge lp:~vanvugt/mir/predictive-bypass-2
Reviewer Review Type Date Requested Status
Alan Griffiths Needs Information
Andreas Pokorny (community) Needs Information
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+261941@code.launchpad.net

Commit message

Introducing "predictive bypass"; this provides a constant ~10ms reduction
in latency when fully bypassed/overlayed. This benefit is in addition to
any lag reductions provided by other branches.

Additional unexpected benefits (free!):
  * In some cases even smoothness/frame rate is improved by this branch
    (LP: #1447896).
  * Software cursors/touchspots appear to "stick" to the client app better
    with this branch, because the underlying client surface has the
    additional time it needs to update for the new cursor/touch position
    before the frame is posted.

Description of the change

Background research where this algorithm was first proposed (in a slightly different form):
https://docs.google.com/document/d/116i4TC0rls4wKFmbaRrHL_UT_Jg22XI8YqpXGLLTVCc/edit#

Possible variations on the algorithm:
  * The same technique can be applied to GL compositing too, but has
    explicitly NOT been enabled for compositing right now. This is
    because you can't trust most hardware (especially mobile) to be fast
    enough to GL render and still have time to sleep without missing a
    frame. Only the bypass/overlay code path is safe enough because we
    know there is no GL compositing required.
  * Detect render times and missed frames intelligently... That's what
    the first version of this optimization did. Unfortunately it was not
    reliable and prone to false positives which resulted in the
    optimization backing off and losing benefit. That approach is also
    not very portable to Android (where we need it most).

If you're interested in manual testing:
Seeing 10ms improvement requires very careful attention to detail. If
you're trying to see it with your own eyes, these techniques help:

  * Always start from the lowest latency configuration, so that any
    improvement will be relatively more significant. This means do not
    test with nesting, and do add --nbuffers=2 to your server. Your test
    clients also should be run in lowest latency mode to most easily
    see the improvement. This means using the -n option for framedropping
    (on desktop, but don't use it on Android due to bugs, see below):
       mir_demo_client_target -n
  * Configure your input devices for minimal latency. USB mice for example
    default to about 8ms latency (125Hz). You can reduce that to 1ms
    (1000Hz) using the kernel parameter: usbhid.mousepoll=1
  * Android warning! There's a bunch of Android performance bugs you need
    to be aware of and are likely to run into making testing difficult and
    confusing, so familiarise yourself:
       https://bugs.launchpad.net/mir/+bugs?field.tag=android
  * Remember only the bypass/overlay path gets the optimization so on
    desktop at least that means only fullscreen GL clients benefit.

Of course, we aim for greater than 10ms improvement in future. This
branch is just one step on that journey.

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Argh, vivid is not implementing C++14 but wily is?

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Finally got this tested on the slowest machine I have: an Atom N270. This machine struggles with Unity7, and with Unity8 it struggles even more. However, running Mir demos I can get its end-to-end latency down to approximately 6ms with this branch [1ms input latency, 1ms render time, 4ms grace time to page flip]. And full 60 FPS.

So even a machine that's arguably too slow for Unity can easily keep up with predictive bypass.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

*Needs Discussion*

We really need some decent way of testing end-to-end performance improvements like the one this branch claims.

It is impractical to rely on manual testing to detect regressions. And a bunch of unit tests for the throttling logic does not adequately prove the real world behaviour of the system.

review: Needs Information
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

It's good to have some scepticism. And it's reasonable to not expect most people to do exhaustive manual testing to convince themselves this works.

Before anyone says it, I think sleeping sounds like a terrible hack. I have questioned myself for months over this. However, the gain of 10ms less lag consistently across all devices is a big win worth living with some ugliness for. You can see the improvement on a phone in something as simple as swiping across the screen.

To help with peoples' mental models, consider this (time in milliseconds for illustration):

Present behaviour:

t0: Compositor snapshots scene and decides it can bypass/overlay it
t0: post() begins
t16: post() returns (waited for vblank and page flip completed)

Result: What's on screen is at least 16ms old (time since the snapshot)
Result: Only the first surface that woke up the compositor loop is likely to have a fresh frame displayed, because the time between the frame_posted signal and the scene snapshot is negligibly small. Other surfaces won't have a fresh frame ready and appear to stutter instead (LP: #1447896).

New behaviour in this branch:

t0: Still sleeping ~10ms from last frame
t10: Compositor snapshots scene and decides it can bypass/overlay it
t10: post() begins
t16: post() returns (waited for vblank and page flip completed)

Result: What's on screen is at least 6ms old, but that's an improvement of 10ms.
Result: Many surfaces have time to provide a fresh frame, composited into the new frame (bug 1447896 fixed).

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

I agree that we should schedule the frame posting as late as possible.

But why sleep at the end of post() instead of prior to it?

Instead I would have thought

while (running)
{
    auto sleepTime = estimateTimeNeededToUpdateDisplayBasedOnRecentFrames(refreshRate);

    sleepIfPossible(sleepTime);

    TimeSpan span;
    compositeFrame();
    post();
    updateEstimator(span.diff());
}

review: Needs Information
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

I suppose this is why newer versions of android schedule on vsync ticks instead.

Revision history for this message
Kevin DuBois (kdub) wrote :

> Before anyone says it, I think sleeping sounds like a terrible hack. I have
> questioned myself for months over this. However, the gain of 10ms less lag
> consistently across all devices is a big win worth living with some ugliness
> for. You can see the improvement on a phone in something as simple as swiping
> across the screen.

Eh, I think it's tricky to improve our latency past the vsync period amount, and given the limitations of the vsync system, sleeping is okay to get us a fresher frame. The vsync period is the minimum we can hope to guarantee, but with some guessing about the GPU load and sleeping based on that, we can push the latency down lower than the vsync period. Now, this means we have to make some good guesses as to how much to sleep based on the GPU load, as bad guesses can also increase latency.

In the meantime, why is this done on the mg{a,m}::DisplayBuffer level? It seems like a scheduling optimization, and our scheduler is mc::MultiThreadedCompositor. It obviously needs some hints (feedback really) from the DisplayBuffer about some specifics, and a feedback system that takes into account the DisplayBuffer, the need to draw, and the GPU load (as best as the system can determine) would have less chance of making a bad guess.

Also, can we use the mir::options instead of getenv?

I'm still chewing on whether this will play nice with the android drivers (my initial intuition is that it should, but it's worth thinking through).

Revision history for this message
Kevin DuBois (kdub) wrote :

@Alan's needs discussion

I think that measurements are becoming a more and more pressing need as we optimize the system further. We shouldn't increase latency once we have landed latency-reducing features, and we don't have a well-known way to measure this.

Also, we're getting to the tricky part where our latency optimizations are dynamic... they're good in some cases, and not so good in others. A tricky problem like optimizing latency really needs a tool to measure with a variety of interesting loads. This lets us see what tradeoffs we're making as we change the code, and guard against true regressions, where we accidentally increase latency across the board.

Revision history for this message
Kevin DuBois (kdub) wrote :

Hah, no sooner had I said that than I noticed Alexandros's https://trello.com/c/UK9uIdnd off the backlog; hopefully we have the tool soon!

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Those are reasonable questions, but they are issues I have already encountered and dealt with, so they need some explaining...

"Why sleep at the end of post() instead of prior to it?"
I did, in the first prototype. It was unfortunately complex and only ever prototyped for Mesa. It was disliked by kdub, and I also realised it failed to address the issue of snapshotting the scene late (buffer contents were younger but attributes like window position were not). Sleeping at the end of the frame guarantees the scene snapshot (surface attributes and buffer contents) is never fixed too early.

"The vsync period is the minimum we can hope to guarantee"
False. Proof 1: This branch. Proof 2: Try it with a framedropping client like "mir_demo_client_target -n" and you will see one frame latency is noticeable and we can visibly do better than one frame latency using this branch.

"Now, this means we have to make some good guesses as to how much to sleep, based on the GPU load though, as bad guesses can also increase latency."
Only partly true. I have explicitly only applied the optimization to bypass/overlays so that there is no GPU GL render load to contend with. And to make sure the timing is right I have spent two weeks testing all our devices, and the slowest netbooks I can find.

"Also, we're getting to the tricky part where our latency optimizations are dynamic"
It's not dynamic on Android at least. I discussed that in the description.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

"In the meantime, why is this done on the mg{a,m}::DisplayBuffer level?"
Again, to ensure safety and that we are never guessing about render times, the optimization is only applied when there is no render time -- bypass/overlay mode. This is good for Unity8 because it's always bypassed/overlayed.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Also remember "6ms latency" means from the scene snapshot to display. For swap-interval-1 clients that are double buffered you need to add another 16ms to that (or 33ms for triple). But you can avoid the 16-32ms addition by using swap interval zero, which is why I've recommended trying "mir_demo_client_target -n" on desktop.

This branch only shortens the top layer (system compositor) in my original diagram. While using a swap interval zero shortens the middle/bottom layer:
https://docs.google.com/document/d/116i4TC0rls4wKFmbaRrHL_UT_Jg22XI8YqpXGLLTVCc/edit

The eye and the brain can easily perceive latency lower than 16ms and you quickly get used to it (and _want_ it lower than 16ms). Low enough latency and you start to imagine a fixed line or bar between your mouse and the screen. And then you never want it any other way...

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Forgot to mention: On desktop it's essentially the hardware cursor we're trying to catch up to here so that when moving things on screen, those things update fast enough to stick to the hardware cursor (which is sampled from the kernel super-late for minimal latency).

This branch represents the first time we've actually been able to display frames fast enough to stick to the hardware cursor (with a slight margin of error of a few milliseconds).

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

> "In the meantime, why is this done on the mg{a,m}::DisplayBuffer level?"
> Again, to ensure safety and that we are never guessing about render times, the
> optimization is only applied when there is no render time -- bypass/overlay
> mode. This is good for Unity8 because it's always bypassed/overlayed.

Ok, I complained about the "at the end of post()" mostly because the decision is made inside DisplayBuffer instead of inside the rendering loop, just as kdub stated. I think this behavior also makes sense if you have to render and the render time is low enough. Of course, then one has to deal with nesting. For a nested bypass to make sense it should submit its updates as soon as possible (or as late as possible while ensuring that the host compositor still has enough time to redraw or bypass/overlay).

Revision history for this message
Kevin DuBois (kdub) wrote :

>
> "The vsync period is the minimum we can hope to guarantee"
> False. Proof 1: This branch. Proof 2: Try it with a framedropping client like
> "mir_demo_client_target -n" and you will see one frame latency is noticeable
> and we can visibly do better than one frame latency using this branch.
>

I think we're on the same page here... In a vsync system, we can /guarantee/ that the minimum is one vsync period, but by delaying like in this branch we can reliably get smaller-than-one-vsync-period latency. I'm pointing out that sleeping or scheduling is the only way to dance around the limitations that we're given by vsync.

> "Now, this means we have to make some good guesses as to how much to sleep,
> based on the GPU load though, as bad guesses can also increase latency."
> Only partly true. I have explicitly only applied the optimization to
> bypass/overlays so that there is no GPU GL render load to contend with. And to
> make sure the timing is right I have spent two weeks testing all our devices,
> and the slowest netbooks I can find.

We can't make sure that there's no GPU render load, as the rest of the system could be drawing and waiting on Buffers. An example is unity8/usc... USC knows it hasn't drawn, but doesn't know if U8 or U8-clients have used the gpu.

>
> "Also, we're getting to the tricky part where our latency optimizations are
> dynamic"
> It's not dynamic on Android at least. I discussed that in the description.

Right, they aren't in this branch. Can't sleeping eat up time that the clients (or clients-of-nested) would want to draw in? So USC doesn't miss the deadline, but the clients might start to. Having a feedback system that adjusts the amount of sleep seems like it could be helpful.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

On desktop (which I understand better than Android) GPU render load is irrelevant. Because the display hardware we're doing page flipping with does not use a GPU. I was doing this stuff years before GPUs existed. They are functionally separate even if they exist in the same silicon these days.

On Android I suspect the driver (which is tightly woven with OpenGL) might be affected. But I have spent weeks testing many devices making sure it is not affected.

"Can't sleeping eat up time that the clients (or clients-of-nested) would want to draw in?"
Absolutely not. Clients are separate processes that are unaffected by the server sleeping in a place where it's not holding any shared resources. Only GL compositing (the server) would be affected but I've made sure that's impossible by only implementing the sleep on bypass/overlay code paths where there is no GL compositing. This is mentioned a few times in this proposal as well as the original Google doc.

"Having a feedback system that adjusts the amount of sleep seems like it could be helpful."
I thought so too and the first prototype of this used feedback, for a couple of months. Unfortunately it was unreliable. Once you notice a hiccup and back off you can't (shouldn't) un-back-off. And so if that hiccup was just a hiccup (e.g. bug 1452579 is a problem) then you've lost all benefit, effectively turning the optimization off. A feedback system is also deeply platform-specific and I only ever got it going on Mesa (lp:~vanvugt/mir/old-predictive-bypass), but it was unacceptably unreliable. This proposal however avoids the detection unreliability problem and brings the optimization to Android too.

Revision history for this message
Kevin DuBois (kdub) wrote :

@android
The part that I'm worried about is that on android with surfaceflinger, the driver schedules its own compositions, and tinkering with that timing (as we've done to adapt the android code to the way the mir compositors work) can expose some bugs in the driver that we have to fix (case-in-point is krillin). I guess though that's just a past scar that might not be around in the future.
Also, for what it's worth, most android drivers will look like mesa and just post to the display controller, although your intuition is right that they can activate different parts of the GPU sometimes (like, to scale or crop if the display controller can't handle this).

@sleeping
I'll think a bit more about this... It seems like it can't truly be unaffected though, as we are holding the resources a bit differently and shifting the timing of the system. It is getting difficult to tell whether a tradeoff is good or bad, as the system is complex enough that we need tools to inspect the data that different scenarios create. I trust that in your manual study of the effects of the change, the things that were tested improved. In situations where changing how long we're sleeping can change the system quite a bit, introspection with tools would help us figure out what tradeoffs are being made.

@feedback system
If we have a new device to support, it seems strange to have to manually check and adjust the amount we're sleeping in the loop. A feedback system would alleviate this, but if that was too unreliable, then it seems like we're still missing enough data to make a good decision.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

If it makes people feel better I can propose the Mesa implementation first (which is safer in theory) and Android second. They're completely independent. But in my mind both are finished.

As for sleeping and safety, I added the environment variables late in the game as a safety net (and also as a testing tool) so you can evaluate the benefits in production. Especially if you suspect that turning off predictive bypass will help you; you can test that theory with MIR_DRIVER_FORCE_BYPASS_SLEEP=0. (better variable name?)

Yes, it's possible future devices will require additional tweaks to avoid skipping frames, but unlikely. I find it difficult to believe we'll encounter any device that is slower than arale (which seems to require 6ms just to post a frame without any GL rendering).

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Note: If you are thinking about doing any testing on desktop, remember to add:
   usbhid.mousepoll=1
to your kernel command line.

That will reduce your mouse latency from 8ms to 1ms, which becomes quite noticeable when you're on the bleeding edge of performance. Having your mouse configured for lower latency also means it's then easier to see improvements in Mir's performance (like this branch), without having what you see clouded by high latency kernel events.
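For reference, one way to make that kernel parameter persistent (assuming an Ubuntu-style GRUB configuration; file locations and existing options may differ on your system):

```shell
# /etc/default/grub: append usbhid.mousepoll=1 to the kernel command line,
# then run `sudo update-grub` and reboot.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash usbhid.mousepoll=1"

# After reboot, the active value can be checked with:
#   cat /sys/module/usbhid/parameters/mousepoll
```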

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

*Needs Discussion*

Like Kevin I think this is a special case of "schedule compositing as late as possible (but no later)" which logically belongs in the compositor. Detecting a special case in the drivers and delaying the return from post() (as in this MP) provides this effect from outside the compositor.

I think it would be better preparation for handling further cases to provide timing feedback to the compositor and allow the scheduling decision to be made there. The compositor also has the opportunity to detect scene changes (e.g. surfaces switching to/from fullscreen) that affect the applicability of this logic.

In any case, we really need some automated benchmarking scenarios to track the effect of changes like this.

review: Needs Information
Revision history for this message
Kevin DuBois (kdub) wrote :

> I think it would be better preparation for handling further cases to provide
> timing feedback to the compositor and allowed the scheduling decision to be
> made there. The compositor also has the opportunity to detect scene changes
> (e.g. surfaces switching to/from fullscreen) that affect the applicability of
> this logic.

Having a role in the scheduling is how the android platform natively works (surfaceflinger). The display hardware starts the event loop that does the composition and posting. So having the platform inform the MTC about appropriate times would be good to solve this problem of how to drive sub-vsync-period latencies. It would also avoid problems like what we saw on krillin bringup, because the driver is operating along a more well-tested path.

Revision history for this message
Kevin DuBois (kdub) wrote :

@tweaking: 6ms seems reasonable, but if a new device bringup goes wrong, it'll be painful to have to explain to non-Mir bringup folks what is being tweaked, why this number has to be tuned, and how to tune it manually (and it has to be recompiled to be reset, which bringup teams often aren't optimized to do).

So, it seems best to tie this to the MTC via an active scheduling request, or a set of parameters that the MTC queries from the driver. Failing that, we have to have the options plumbed up properly (ie, the standard mir::options stuff), so that it's changeable without compiling.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I did consider and wanted to do it in the compositor too. However two major factors make that impractical:

  1. In a single algorithm you need to (a) know that bypass/overlays are in use and (b) sleep at the _end_ of the post function. That used to be easy but the introduction of posting via DisplayGroup outside of DisplayBufferCompositor has made it unworkable.

  2. Ideally the algorithm should vary timing based on the refresh rate of the display, and also disable itself in clone mode. I do all that for Mesa already, but it would be impossible to do in the compositor logic because you have know knowledge of the output devices there. Only in the DisplayBuffer.

Obviously a portable solution in the compositor would be nicer (and less effort). But there are solid reasons why it hasn't been done that way.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

* it would be impossible to do in the compositor logic because you have "no" knowledge of the output devices there

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> I did consider and wanted to do it in the compositor too. However two major
> factors make that impractical:

It sounds like all three of us think the right place is the compositor.

At a minimum we ought to think about what change is needed to make that possible. (You're ahead of me here as I've not worked it through in any detail.)

> 1. In a single algorithm you need to (a) know that bypass/overlays are in
> use and (b) sleep at the _end_ of the post function. That used to be easy but
> the introduction of posting via DisplayGroup outside of
> DisplayBufferCompositor has made it unworkable.

You keep saying that we need to sleep *at the end of post()*. Am I missing something? Or is that just a proxy for delaying the start of the next composite cycle for that output?

The reason I ask is that compositing is already scheduled based on triggers from scene changes/buffer posts and that logic seems the right place to incorporate any delay.

> 2. Ideally the algorithm should vary timing based on the refresh rate of the
> display, and also disable itself in clone mode. I do all that for Mesa
> already, but it would be impossible to do in the compositor logic because you
> have know knowledge of the output devices there. Only in the DisplayBuffer.

Yes, we've discussed that some timing information would need to be fed into compositor scheduling. But why can't that be provided by the DisplayBuffer?

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Moving things around:
I have tried. Undoing or modifying the DisplayGroup work is a very large effort, as it was a very large effort when introduced. On a smaller scale I have proposed a branch that simply ignored the DisplayGroup approach for Mesa where it is irrelevant but kdub didn't like it because it made the architecture inconsistent with Android. Although they are already quite different for platform-specific reasons anyway.

End of the post function:
Keep in mind the "prediction" is all about predicting the next frame will be bypassed. That's the core concept and the prediction is extremely reliable because we don't switch in/out of bypass/overlays that often compared to the total number of frames displayed. Yes, you could emit information from DisplayBuffer about how the previous frame was rendered and what the refresh rate is and whether we are cloning, but I feel that's much less clean than what's proposed here. Because all that would require new interfaces and using them to expose information to the caller that we otherwise don't need to expose.
  Although... one could distil all that information into a smaller piece of information like a "recommended sleep duration" passed from DisplayBuffer to the Compositor. I could try that but again, it's just the same algorithm implemented using more lines of code and less information hiding than what's proposed here. Although sleeping instead of providing information about a recommended sleep is arguably less ideal.

Using the existing compositor with info from DisplayBuffer:
See above.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Now working on a predictive-bypass-v3 branch separately to see if the above suggestions wind up being nicer than this branch...

Revision history for this message
Kevin DuBois (kdub) wrote :

/me prefers other branch

Preview Diff

=== modified file 'src/platforms/android/server/hwc_device.cpp'
--- src/platforms/android/server/hwc_device.cpp 2015-05-19 17:17:39 +0000
+++ src/platforms/android/server/hwc_device.cpp 2015-06-16 07:34:11 +0000
@@ -27,6 +27,9 @@
 #include "mir/raii.h"
 #include <limits>
 #include <algorithm>
+#include <chrono>
+#include <thread>
+#include <mutex>

 namespace mg = mir::graphics;
 namespace mga=mir::graphics::android;
@@ -42,6 +45,20 @@
 };
 return (renderable.alpha() < 1.0f - tolerance);
 }
+
+std::once_flag checked_environment;
+std::chrono::milliseconds force_bypass_sleep(-1);
+
+void check_environment()
+{
+ std::call_once(checked_environment, []()
+ {
+ char const* env = getenv("MIR_DRIVER_FORCE_BYPASS_SLEEP");
+ if (env != NULL)
+ force_bypass_sleep = std::chrono::milliseconds(atoi(env));
+ });
+}
+
 }

 bool mga::HwcDevice::compatible_renderlist(RenderableList const& list)
@@ -65,6 +82,7 @@
 mga::HwcDevice::HwcDevice(std::shared_ptr<HwcWrapper> const& hwc_wrapper) :
 hwc_wrapper(hwc_wrapper)
 {
+ check_environment();
 }

 bool mga::HwcDevice::buffer_is_onscreen(mg::Buffer const& buffer) const
@@ -97,6 +115,8 @@

 hwc_wrapper->prepare(lists);

+ bool purely_overlays = true;
+
 for (auto& content : contents)
 {
 if (content.list.needs_swapbuffers())
@@ -111,6 +131,7 @@
 content.compositor.render(std::move(rejected_renderables), content.list_offset, content.context);
 content.list.setup_fb(content.context.last_rendered_buffer());
 content.list.swap_occurred();
+ purely_overlays = false;
 }

 //setup overlays
@@ -136,6 +157,30 @@

 mir::Fd retire_fd(content.list.retirement_fence());
 }
+
+ if (purely_overlays)
+ {
+ /*
+ * "Predictive bypass": If we're just displaying pure overlays on this
+ * frame then it's very likely the same will be true on the next frame.
+ * In that case we don't need to spare any time for GL rendering.
+ * Instead just sleep for a significant portion of the frame. This will
+ * ensure the scene snapshot that goes into the next frame is
+ * younger when it hits the screen, and so has visibly lower latency.
+ * This has extra benefit for the overlays of software cursors and
+ * touchpoints as the client surface underneath is then given enough
+ * time to update itself and better match (ie. "stick to") the cursor
+ * above it.
+ */
+
+ // Test results (how long can we sleep for without missing a frame?):
+ // arale: 10ms (TODO: Find out why arale is so slow)
+ // mako: 15ms
+ // krillin: 11ms (to be fair, the display is 67Hz)
+ using namespace std;
+ auto delay = force_bypass_sleep >= 0ms ? force_bypass_sleep : 10ms;
+ std::this_thread::sleep_for(delay);
+ }
 }

 void mga::HwcDevice::content_cleared()

=== modified file 'src/platforms/mesa/server/kms/display_buffer.cpp'
--- src/platforms/mesa/server/kms/display_buffer.cpp 2015-06-03 12:07:08 +0000
+++ src/platforms/mesa/server/kms/display_buffer.cpp 2015-06-16 07:34:11 +0000
@@ -28,6 +28,9 @@
 #include <GLES2/gl2.h>

 #include <stdexcept>
+#include <chrono>
+#include <thread>
+#include <mutex>

 namespace mgm = mir::graphics::mesa;
 namespace geom = mir::geometry;
@@ -85,6 +88,19 @@
 BOOST_THROW_EXCEPTION(std::runtime_error("GLES2 implementation doesn't support GL_OES_EGL_image extension"));
 }

+std::once_flag checked_environment;
+std::chrono::milliseconds force_bypass_sleep(-1);
+
+void check_environment()
+{
+ std::call_once(checked_environment, []()
+ {
+ char const* env = getenv("MIR_DRIVER_FORCE_BYPASS_SLEEP");
+ if (env != NULL)
+ force_bypass_sleep = std::chrono::milliseconds(atoi(env));
+ });
+}
+
 }

 mgm::DisplayBuffer::DisplayBuffer(
@@ -109,6 +125,8 @@
 needs_set_crtc{false},
 page_flips_pending{false}
 {
+ check_environment();
+
 uint32_t area_width = area.size.width.as_uint32_t();
 uint32_t area_height = area.size.height.as_uint32_t();
 if (rotation == mir_orientation_left || rotation == mir_orientation_right)
@@ -282,6 +300,11 @@
 needs_set_crtc = false;
 }

+ using namespace std; // For operator""ms()
+
+ // Predicted worst case render time for the next frame...
+ auto predicted_render_time = 50ms;
+
 if (bypass_buf)
 {
 /*
@@ -295,6 +318,10 @@
 */
 scheduled_bypass_frame = bypass_buf;
 wait_for_page_flip();
+
+ // It's very likely the next frame will be bypassed like this one so
+ // we only need time for kernel page flip scheduling...
+ predicted_render_time = 5ms;
 }
 else
 {
@@ -306,11 +333,46 @@
 scheduled_composite_frame = bufobj;
 if (outputs.size() == 1)
 wait_for_page_flip();
+
+ /*
+ * TODO: If you're optimistic about your GPU performance and/or
+ * measure it carefully you may wish to set predicted_render_time
+ * to a lower value here for lower latency.
+ *
+ *predicted_render_time = 9ms; // e.g. about the same as Weston
+ */
 }

 // Buffer lifetimes are managed exclusively by scheduled*/visible* now
 bypass_buf = nullptr;
 bypass_bufobj = nullptr;
+
+ /*
+ * Introducing "predictive bypass":
+ * If the current frame is bypassed then there is an extremely high
+ * likelihood the next one will be too. If it is then we can reduce
+ * the latency of that next frame (make the compositor sample the
+ * scene later) by almost a whole frame. Because we don't need to
+ * spare any time for rendering. Just milliseconds at most for the
+ * kernel to get around to scheduling a pageflip. Note: this prediction
+ * only works for non-clone modes as the full set of outputs must be
+ * perfectly in phase and we only know how to guarantee that with one.
+ */
+ if (outputs.size() == 1)
+ {
+ if (force_bypass_sleep >= 0ms)
+ {
+ std::this_thread::sleep_for(force_bypass_sleep);
+ }
+ else
+ {
+ auto const& output = outputs.front();
+ auto const min_frame_interval = 1000ms / output->max_refresh_rate();
+ auto const delay = min_frame_interval - predicted_render_time;
+ if (delay > 0ms)
+ std::this_thread::sleep_for(delay);
199+ }
200+ }
201 }
202
203 mgm::BufferObject* mgm::DisplayBuffer::get_front_buffer_object()
204
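The core of "predictive bypass" in the mesa hunk above is a single subtraction: sleep for one frame interval minus the worst-case render-time prediction (5ms when bypassed, 50ms when composited), clamped at zero. A minimal sketch of that arithmetic, using a hypothetical free function in place of the member logic:

```cpp
#include <chrono>

// Hypothetical helper mirroring DisplayBuffer::post(): given the
// output's maximum refresh rate and a worst-case prediction of the
// next frame's render time, return how long the compositor may sleep
// before sampling the scene again.
std::chrono::milliseconds bypass_sleep(
    int max_refresh_rate_hz,
    std::chrono::milliseconds predicted_render_time)
{
    using std::chrono::milliseconds;
    // Integer division, matching the diff's 1000ms / max_refresh_rate()
    auto const min_frame_interval = milliseconds(1000 / max_refresh_rate_hz);
    auto const delay = min_frame_interval - predicted_render_time;
    return delay > milliseconds::zero() ? delay : milliseconds::zero();
}
```

At 60Hz a bypassed frame (5ms prediction) yields an 11ms sleep, while the conservative composited prediction of 50ms yields no sleep at all, so GL compositing is unaffected unless someone opts in to a lower prediction.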
205=== modified file 'src/platforms/mesa/server/kms/kms_output.h'
206--- src/platforms/mesa/server/kms/kms_output.h 2015-05-28 21:16:37 +0000
207+++ src/platforms/mesa/server/kms/kms_output.h 2015-06-16 07:34:11 +0000
208@@ -43,6 +43,14 @@
209 virtual void configure(geometry::Displacement fb_offset, size_t kms_mode_index) = 0;
210 virtual geometry::Size size() const = 0;
211
212+ /**
213+ * Approximate maximum refresh rate of this output to within 1Hz.
214+ * Typically the rate is fixed (e.g. 60Hz) but it may also be variable as
215+ * in Nvidia G-Sync/AMD FreeSync/VESA Adaptive Sync. So this function
216+ * returns the maximum rate to expect.
217+ */
218+ virtual int max_refresh_rate() const = 0;
219+
220 virtual bool set_crtc(uint32_t fb_id) = 0;
221 virtual void clear_crtc() = 0;
222 virtual bool schedule_page_flip(uint32_t fb_id) = 0;
223
224=== modified file 'src/platforms/mesa/server/kms/real_kms_output.cpp'
225--- src/platforms/mesa/server/kms/real_kms_output.cpp 2015-05-28 21:16:37 +0000
226+++ src/platforms/mesa/server/kms/real_kms_output.cpp 2015-06-16 07:34:11 +0000
227@@ -188,6 +188,14 @@
228 return {mode.hdisplay, mode.vdisplay};
229 }
230
231+int mgm::RealKMSOutput::max_refresh_rate() const
232+{
233+ // TODO: In future when DRM exposes FreeSync/Adaptive Sync/G-Sync info
234+ // this value may be calculated differently.
235+ drmModeModeInfo const& current_mode = connector->modes[mode_index];
236+ return current_mode.vrefresh;
237+}
238+
239 void mgm::RealKMSOutput::configure(geom::Displacement offset, size_t kms_mode_index)
240 {
241 fb_offset = offset;
242
243=== modified file 'src/platforms/mesa/server/kms/real_kms_output.h'
244--- src/platforms/mesa/server/kms/real_kms_output.h 2015-05-28 21:16:37 +0000
245+++ src/platforms/mesa/server/kms/real_kms_output.h 2015-06-16 07:34:11 +0000
246@@ -44,6 +44,7 @@
247 void reset();
248 void configure(geometry::Displacement fb_offset, size_t kms_mode_index);
249 geometry::Size size() const;
250+ int max_refresh_rate() const;
251
252 bool set_crtc(uint32_t fb_id);
253 void clear_crtc();
254
255=== modified file 'tests/mir_test_doubles/mock_drm.cpp'
256--- tests/mir_test_doubles/mock_drm.cpp 2015-01-21 07:34:50 +0000
257+++ tests/mir_test_doubles/mock_drm.cpp 2015-06-16 07:34:11 +0000
258@@ -216,6 +216,11 @@
259 mode.clock = clock;
260 mode.htotal = htotal;
261 mode.vtotal = vtotal;
262+
263+ uint32_t total = htotal;
264+ total *= vtotal; // extend to 32 bits
265+ mode.vrefresh = clock * 1000UL / total;
266+
267 if (preferred)
268 mode.type |= DRM_MODE_TYPE_PREFERRED;
269
270
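The mock_drm change derives vrefresh from the mode's timings because a DRM pixel clock is stored in kHz, so refresh (Hz) = clock * 1000 / (htotal * vtotal), with the 16-bit totals widened before multiplying to avoid overflow. A standalone sketch of that calculation:

```cpp
#include <cstdint>

// Sketch of the refresh-rate derivation used in the mock above.
// htotal/vtotal are uint16_t in drmModeModeInfo, so their product is
// widened to 32 bits before multiplying; clock is in kHz.
uint32_t vrefresh_hz(uint32_t clock_khz, uint16_t htotal, uint16_t vtotal)
{
    uint32_t total = htotal;
    total *= vtotal;  // extend to 32 bits, as in the diff
    return static_cast<uint32_t>(clock_khz * 1000UL / total);
}
```

For example, the standard 1920x1080@60 CEA timing (148500 kHz clock, 2200x1125 total) comes out to 60Hz, which is the 1Hz-accurate value the new max_refresh_rate() interface promises.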
271=== modified file 'tests/unit-tests/graphics/android/test_hwc_device.cpp'
272--- tests/unit-tests/graphics/android/test_hwc_device.cpp 2015-05-29 17:53:49 +0000
273+++ tests/unit-tests/graphics/android/test_hwc_device.cpp 2015-06-16 07:34:11 +0000
274@@ -439,6 +439,29 @@
275 EXPECT_THAT(stub_buffer1.use_count(), Eq(use_count_before));
276 }
277
278+TEST_F(HwcDevice, overlays_are_throttled_per_predictive_bypass)
279+{
280+ using namespace testing;
281+ EXPECT_CALL(*mock_device, prepare(_))
282+ .WillRepeatedly(Invoke(set_all_layers_to_overlay));
283+
284+ mga::HwcDevice device(mock_device);
285+
286+ mga::LayerList list(layer_adapter, {stub_renderable1}, {0,0});
287+ mga::DisplayContents content{primary, list, offset, stub_context,
288+ stub_compositor};
289+
290+ for (int frame = 0; frame < 5; ++frame)
291+ {
292+ using namespace std::chrono;
293+ auto start = system_clock::now();
294+ device.commit({content});
295+ auto duration = system_clock::now() - start;
296+ // Duration cast to a simple type so that test failures are readable
297+ ASSERT_THAT(duration_cast<milliseconds>(duration).count(), Ge(8));
298+ }
299+}
300+
301 TEST_F(HwcDevice, does_not_set_acquirefences_when_it_has_set_them_previously_without_update)
302 {
303 using namespace testing;
304
305=== modified file 'tests/unit-tests/graphics/mesa/kms/mock_kms_output.h'
306--- tests/unit-tests/graphics/mesa/kms/mock_kms_output.h 2015-06-03 15:57:44 +0000
307+++ tests/unit-tests/graphics/mesa/kms/mock_kms_output.h 2015-06-16 07:34:11 +0000
308@@ -32,6 +32,7 @@
309 MOCK_METHOD0(reset, void());
310 MOCK_METHOD2(configure, void(geometry::Displacement, size_t));
311 MOCK_CONST_METHOD0(size, geometry::Size());
312+ MOCK_CONST_METHOD0(max_refresh_rate, int());
313
314 MOCK_METHOD1(set_crtc, bool(uint32_t));
315 MOCK_METHOD0(clear_crtc, void());
316
317=== modified file 'tests/unit-tests/graphics/mesa/kms/test_display_buffer.cpp'
318--- tests/unit-tests/graphics/mesa/kms/test_display_buffer.cpp 2015-06-04 23:48:38 +0000
319+++ tests/unit-tests/graphics/mesa/kms/test_display_buffer.cpp 2015-06-16 07:34:11 +0000
320@@ -46,6 +46,8 @@
321 class MesaDisplayBufferTest : public Test
322 {
323 public:
324+ int const mock_refresh_rate = 60;
325+
326 MesaDisplayBufferTest()
327 : mock_bypassable_buffer{std::make_shared<NiceMock<MockBuffer>>()}
328 , fake_bypassable_renderable{
329@@ -78,6 +80,8 @@
330 .WillByDefault(Return(true));
331 ON_CALL(*mock_kms_output, schedule_page_flip(_))
332 .WillByDefault(Return(true));
333+ ON_CALL(*mock_kms_output, max_refresh_rate())
334+ .WillByDefault(Return(mock_refresh_rate));
335
336 ON_CALL(*mock_bypassable_buffer, size())
337 .WillByDefault(Return(display_area.size));
338@@ -154,6 +158,39 @@
339 EXPECT_EQ(original_count, mock_bypassable_buffer.use_count());
340 }
341
342+TEST_F(MesaDisplayBufferTest, predictive_bypass_is_throttled)
343+{
344+ graphics::mesa::DisplayBuffer db(
345+ create_platform(),
346+ null_display_report(),
347+ {mock_kms_output},
348+ nullptr,
349+ display_area,
350+ mir_orientation_normal,
351+ gl_config,
352+ mock_egl.fake_egl_context);
353+
354+ /*
355+ * Test that predictive bypass does not return from post for at least half
356+ * the frame time. This is a reliable test regardless of system load.
357+ * We would also like the test the converse but that would be unreliable...
358+ */
359+ for (int frame = 0; frame < 5; ++frame)
360+ {
361+ ASSERT_TRUE(db.post_renderables_if_optimizable(bypassable_list));
362+
363+ using namespace std::chrono;
364+ auto start = system_clock::now();
365+ db.post();
366+ auto duration = system_clock::now() - start;
367+
368+ // Duration cast to a simple type so that test failures are readable
369+ int milliseconds_per_frame = 1000 / mock_refresh_rate;
370+ ASSERT_THAT(duration_cast<milliseconds>(duration).count(),
371+ Ge(milliseconds_per_frame/2));
372+ }
373+}
374+
375 TEST_F(MesaDisplayBufferTest, bypass_buffer_only_referenced_once_by_db)
376 {
377 graphics::mesa::DisplayBuffer db(
