Mir

Merge lp:~hikiko/mir/mir.drm-master-check-at-startup into lp:mir

Proposed by Eleni Maria Stea
Status: Work in progress
Proposed branch: lp:~hikiko/mir/mir.drm-master-check-at-startup
Merge into: lp:mir
Diff against target: 76 lines (+19/-5)
2 files modified
src/server/graphics/gbm/gbm_display_helpers.cpp (+16/-4)
tests/unit-tests/graphics/gbm/test_gbm_display.cpp (+3/-1)
To merge this branch: bzr merge lp:~hikiko/mir/mir.drm-master-check-at-startup
Reviewer Review Type Date Requested Status
Daniel van Vugt Needs Information
Kevin DuBois (community) Abstain
Alexandros Frantzis (community) Needs Fixing
Alan Griffiths Needs Fixing
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+187450@code.launchpad.net

Commit message

if mir doesnt manage to become the drm master after attempting that for 1 second it exits with a message (this is useful in case that another drm master is running along with mir)

Description of the change

if mir doesnt manage to become the drm master after attempting that for 1 second it exits with a message (this is useful in case that another drm master is running along with mir)

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

19 int drm_set_master(int fd)
20 +{
21 + if (fd < 0)
22 + return -1;
23 +
24 + return drmSetMaster(fd);
25 +}

Do we need a wrapper around drmSetMaster()? drmSetMaster() seems to be able to handle the invalid fd case properly itself (i.e. it returns -1). Plus, in this particular case, drm_set_master() is never invoked with an fd < 0.

53 + if (std::chrono::system_clock::now() > time_limit)

What's the benefit of waiting? Do we have a use case where waiting actually enables us to become DRM master?

review: Needs Fixing
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Is this the right logic?

What I see is that we wait for a second for each drm device in turn hammering it with almost continuous drmSetMaster() calls. And if any of them succeed, then we're happy. So, if we had a hundred drm devices (I realize that's an exaggeration), we could wait 99 seconds before we tried the last one - which might have worked immediately.

But why not have a single timeout around a the loop trying each drm device in turn? That would means a maximum wait of 1 second. And why not save CPU power by having a pause (not just a yield()) before each retry - we need to allow time for whatever change might release the drm to us.

review: Needs Fixing
Revision history for this message
Eleni Maria Stea (hikiko) wrote :

Well, the problem we were trying to solve was the following:

If there's a process running (eg plymouth) which is the drm master already, mir wont be the drm master at startup, so in open_drm_device: the drmSetInterfaceVersion will fail for all the devices (because it can only be called by the master) and all the drm devices will be marked as unusable although some of them can be used. At first, I added a "drm_set_master" call before the drmSetInterfaceVersion check to solve this problem, but then we had to solve the following race condition:
when a program (for example plymouth is about to terminate) and mir has already started, it might need to wait for a very small amount of time for the drm device used by plymouth to become available. So, I was told that waiting for a sec is fine because there are only a few drm devices and there's no significant delay.

I checked myself the following case: I wrote a small program that becomes the drm master and then ran mir to get the exception. Mir checks all the devices and exits with time:

real 0m 1.409s
user 0m 0.536s
sys 0m 0.493s

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> then we had to solve the following race condition:
> when a program (for example plymouth is about to terminate) and mir has already started, it might need to wait for a > very small amount of time for the drm device used by plymouth to become available. So, I was told that waiting for a > sec is fine because there are only a few drm devices and there's no significant delay.

IMO, Mir/u-s-c is not the correct location to solve this synchronization problem. u-s-c should only be started when plymouth has released the DRM device. Ideally no wait should be involved (that's what upstart is for!), but even if a wait is required, it should be handled externally, for example in the upstart job or in the unity-system-compositor.sleep script (which already has "sleep 0.1" in it). Mir doesn't and shouldn't care about sync issues like this.

Revision history for this message
Eleni Maria Stea (hikiko) wrote :

I could probably just set the drm master then, and if the race cond occurs handle it elsewhere. Opinions?

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Eleni Maria Stea (hikiko) wrote :

It seems that lightdm only starts mir after plymouth exits so, there's no need to add a delay.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

67 + //the first correct return is there because drmSetMaster gets called during startup
68 EXPECT_CALL(mock_drm, drmSetMaster(_))
69 + .WillOnce(Return(0))

Surely there ought to be a new test in which drmSetMaster() returns an error on the first call.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

There is an issue with error reporting (some aspects of it are pre-existing). We are using the 'error' variable to hold the error number but unfortunately the situation is complicated by the inconsistency in DRM's error returning scheme:

* drmSetMaster() returns -1 on error, and sets errno
* drmSetInterfaceVersion() returns -errno (errno is also set of course)
* open() returns -1, and sets errno

We are
1. Currently using '-error' in the exception code, which means the drmSetMaster() and open() cases will be broken (pre-existing)
2. We are not setting error to errno at all when the drmSetMaster call fails (new).

One solution would be to set error = errno in all three cases and use 'error' in the exception code (instead of '-error').

The whole error/exception mechanism in open_drm_device() is a bit strange since later errors overwrite previous ones, causing us to miss potentially helpful information if we have many candidate devices.

review: Needs Fixing
Revision history for this message
Kevin DuBois (kdub) wrote :

pre-existing issues, but I've thought a few times that we could have a resource safe fd wrapper... then we wouldn't have to worry about the close() before the throw.

Revision history for this message
Kevin DuBois (kdub) :
review: Abstain
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Where is the 1 second delay? I can't see it...

More broadly however, I'm not sure this is the right answer. Shouldn't we be fixing the DRM mastership race condition (with correct upstart scripting?) instead of just working around it?

review: Needs Information
Revision history for this message
Eleni Maria Stea (hikiko) wrote :

I removed the 1 sec delay, because the lightdm starts mir only after plymouth has exited so, there's no need to solve the race condition. I 'll try to improve the error reporting mechanism.

Revision history for this message
Eleni Maria Stea (hikiko) wrote :

Well, I changed the error to be the errno but I don't like it very much to be honest :) I think that like Alexandros said this code is a bit strange... If you have any suggestions on how to improve it, I'd like to hear them. The wrapper (Kevin) sounds a good idea actually, I could add one small function that closes the device and sets the error, tmp_fd etc for the moment.

Revision history for this message
Eleni Maria Stea (hikiko) wrote :

+We still override the errors... Maybe we could have a vector with all the errors?

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'src/server/graphics/gbm/gbm_display_helpers.cpp'
2--- src/server/graphics/gbm/gbm_display_helpers.cpp 2013-09-02 11:32:59 +0000
3+++ src/server/graphics/gbm/gbm_display_helpers.cpp 2013-09-30 10:46:49 +0000
4@@ -209,9 +209,11 @@
5 }
6
7 int mggh::DRMHelper::open_drm_device(UdevHelper const& udev)
8-{
9+{
10 int tmp_fd = -1;
11- int error;
12+ int error = 0;
13+
14+ std::string err;
15
16 // TODO: Wrap this up in a nice class
17 udev_enumerate* enumerator = udev_enumerate_new(udev.ctx);
18@@ -254,6 +256,14 @@
19
20 udev_device_unref(dev);
21
22+ if ((error = drmSetMaster(tmp_fd)) < 0)
23+ {
24+ err = "Failed to become the DRM master. Is there another process using the device?";
25+ close(tmp_fd);
26+ tmp_fd = -1;
27+ continue;
28+ }
29+
30 // Check that the drm device is usable by setting the interface version we use (1.4)
31 drmSetVersion sv;
32 sv.drm_di_major = 1;
33@@ -261,8 +271,10 @@
34 sv.drm_dd_major = -1; /* Don't care */
35 sv.drm_dd_minor = -1; /* Don't care */
36
37- if ((error = drmSetInterfaceVersion(tmp_fd, &sv)))
38+ if ((error = drmSetInterfaceVersion(tmp_fd, &sv)) < 0)
39 {
40+ err = "Failed to set interface version";
41+ error = -error;
42 close(tmp_fd);
43 tmp_fd = -1;
44 continue;
45@@ -281,7 +293,7 @@
46 {
47 BOOST_THROW_EXCEPTION(
48 boost::enable_error_info(
49- std::runtime_error("Error opening DRM device")) << boost::errinfo_errno(-error));
50+ std::runtime_error("Error opening DRM device: " + err)) << boost::errinfo_errno(error));
51 }
52
53 return tmp_fd;
54
55=== modified file 'tests/unit-tests/graphics/gbm/test_gbm_display.cpp'
56--- tests/unit-tests/graphics/gbm/test_gbm_display.cpp 2013-08-28 03:41:48 +0000
57+++ tests/unit-tests/graphics/gbm/test_gbm_display.cpp 2013-09-30 10:46:49 +0000
58@@ -659,7 +659,7 @@
59 using namespace testing;
60
61 EXPECT_CALL(mock_drm, drmSetMaster(mock_drm.fake_drm.fd()))
62- .Times(1);
63+ .Times(2);
64
65 auto display = create_display(create_platform());
66
67@@ -673,7 +673,9 @@
68 EXPECT_CALL(mock_drm, drmDropMaster(_))
69 .WillOnce(SetErrnoAndReturn(EACCES, -EACCES));
70
71+ //the first correct return is there because drmSetMaster gets called during startup
72 EXPECT_CALL(mock_drm, drmSetMaster(_))
73+ .WillOnce(Return(0))
74 .WillOnce(SetErrnoAndReturn(EPERM, -EPERM));
75
76 EXPECT_CALL(*mock_report, report_drm_master_failure(EACCES));

Subscribers

People subscribed via source and target branches