Merge lp:~jamesodhunt/upstart/bug-901038 into lp:upstart

Proposed by James Hunt
Status: Merged
Merged at revision: 1664
Proposed branch: lp:~jamesodhunt/upstart/bug-901038
Merge into: lp:upstart
Diff against target: 123 lines (+77/-16) (has conflicts)
2 files modified
ChangeLog (+15/-0)
util/telinit.c (+62/-16)
Text conflict in ChangeLog
To merge this branch: bzr merge lp:~jamesodhunt/upstart/bug-901038
Reviewer Review Type Date Requested Status
Steve Langasek Approve
Review via email: mp+231705@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Stéphane Graber (stgraber) wrote :

Should there be some kind of timeout, say a minute or so to catch the unlikely case where upstart is messed up an never starts listening again on dbus?

Also, you mention that a rejected reexec request is unlikely, but what happens if a non-root user does it, wouldn't that fail and then have the rest of the code return success anyway?

Revision history for this message
James Hunt (jamesodhunt) wrote :

>Should there be some kind of timeout, say a minute or so to catch the unlikely case where upstart
> is messed up an never starts listening again on dbus?
I can understand the concern, but I don't think we can realistically provide a timeout as we don't know what a reasonable re-exec time is in all circumstances: infinity's >1 minute re-exec time was clearly the result of a bug (bug 1338637), but the re-exec _did_ complete correctly - it just took a very long time.

Also, telinit is connecting to the private D-Bus socket and if that isn't available, Upstarts capabilities are then severly restricted.

> Also, you mention that a rejected reexec request is unlikely, but what happens if a non-root user
> does it, wouldn't that fail and then have the rest of the code return success anyway?
If a non-root user runs 'telinit u', they get rejected immediately as telinit ensures the user running the command is root. I think the only scenarios where the re-exec request would be rejected are:

1) Where telinit is attempting to talk to an old version of upstart that doesn't support re-exec.
   We can ignore this scenario as we don't support down-grade scenarios.

2) Where telinit is attempting to talk to a non-Upstart init daemon.

   I guess this could be an issue in Ubuntu, depending on exactly how we manage the init
   transition. However, we should be able to arrange for maintainer scripts to set
   UPSTART_TELINIT_U_NO_WAIT to avoid that issue.

Revision history for this message
Steve Langasek (vorlon) wrote :

> 1) Where telinit is attempting to talk to an old version of upstart that doesn't support
> re-exec. We can ignore this scenario as we don't support down-grade scenarios.

> 2) Where telinit is attempting to talk to a non-Upstart init daemon.

> I guess this could be an issue in Ubuntu, depending on exactly how we manage the init
> transition. However, we should be able to arrange for maintainer scripts to set
> UPSTART_TELINIT_U_NO_WAIT to avoid that issue.

I don't think any solution should require maintainer scripts to set an extra variable to get sane behavior out of telinit.

If telinit is trying to talk to a non-upstart init, shouldn't it know it? In that case, it surely should not wait for a reconnect on an upstart-specific private socket, but should automatically fall back gracefully to a non-blocking restart.

However, the current upstart 'telinit u' code simply exits non-zero when pid 1 is not upstart - as does the proposed new code. So this is something we need to deal with if/when adding support for telinit u on top of non-upstart pid 1, but not something that should block this change now.

Given this, what is the expected use case for "UPSTART_TELINIT_U_NO_WAIT"? If you don't have a concrete user for this, please omit. Unnecessary interfaces always carry an incremental cost, and I can't see any case where it would ever be correct to use this interface, so this looks like overengineering to me.

Otherwise, this looks fine to me for merging.

review: Needs Fixing
lp:~jamesodhunt/upstart/bug-901038 updated
1662. By James Hunt

* util/telinit.c: Remove UPSTART_TELINIT_U_NO_WAIT check as it
  shouldn't realistically be needed.

Revision history for this message
James Hunt (jamesodhunt) wrote :

OK - removed the UPSTART_TELINIT_U_NO_WAIT logic.

Revision history for this message
Steve Langasek (vorlon) wrote :

Ok, merging.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'ChangeLog'
2--- ChangeLog 2014-09-04 11:07:22 +0000
3+++ ChangeLog 2014-09-08 10:12:37 +0000
4@@ -1,3 +1,4 @@
5+<<<<<<< TREE
6 2014-09-04 James Hunt <james.hunt@ubuntu.com>
7
8 * NEWS: Release 1.13.2
9@@ -23,6 +24,20 @@
10 * init/tests/test_job.c: test_job_last_process(): New test for
11 job_last_process().
12
13+=======
14+2014-09-08 James Hunt <james.hunt@ubuntu.com>
15+
16+ * util/telinit.c: Remove UPSTART_TELINIT_U_NO_WAIT check as it
17+ shouldn't realistically be needed.
18+
19+2014-08-21 James Hunt <james.hunt@ubuntu.com>
20+
21+ * util/telinit.c: restart_upstart():
22+ - Revert to synchronous behaviour coupled with unavoidable
23+ poll to ensure telinit only returns once a re-exec has
24+ completed (LP: #901038).
25+
26+>>>>>>> MERGE-SOURCE
27 2014-08-14 James Hunt <james.hunt@ubuntu.com>
28
29 * init/control.c: Disallow modifying system jobs via SetEnv,
30
31=== modified file 'util/telinit.c'
32--- util/telinit.c 2014-03-10 13:43:50 +0000
33+++ util/telinit.c 2014-09-08 10:12:37 +0000
34@@ -160,27 +160,73 @@
35 int
36 restart_upstart (void)
37 {
38- nih_local NihDBusProxy *upstart = NULL;
39- DBusPendingCall *ret;
40+ nih_local NihDBusProxy *upstart = NULL;
41+ NihError *err;
42+ int ret;
43
44 upstart = upstart_open (NULL);
45 if (! upstart)
46 return -1;
47
48- /* Fire and forget:
49- *
50- * Ask Upstart to restart itself using the async interface to
51- * avoid the client-side complaining if and when it detects that
52- * Upstart has severed all connections to perform the re-exec.
53- */
54- ret = upstart_restart (upstart, NULL, NULL, NULL, 0);
55- dbus_connection_flush(upstart->connection);
56-
57- /* We don't care about the return code, but we have to keep
58- * the compiler happy.
59- */
60- if (ret != (DBusPendingCall *)TRUE)
61- return 0;
62+ /* Ask Upstart to restart itself.
63+ *
64+ * Since it is not possible to serialise a D-Bus connection,
65+ * Upstart is forced to sever all D-Bus client connections,
66+ * including this one.
67+ *
68+ * Further, since the user expects telinit to block _until the
69+ * re-exec has finished and Upstart is accepting connections
70+ * once again_, the only solution is to wait for the forced
71+ * disconnect, then poll until it is possible to create a new
72+ * connection.
73+ *
74+ * Note that we don't (can't) care about the return code since
75+ * it's not reliable:
76+ *
77+ * - either the re-exec request completed and D-Bus returned zero
78+ * before Upstart started the re-exec.
79+ *
80+ * - or the re-exec request completed but upstart started the
81+ * re-exec (severing all D-Bus connections) before D-Bus got a
82+ * chance to finish cleanly meaning we receive a return of -1.
83+ *
84+ * We cannot know exactly what happened so have to allow for
85+ * both scenarios. Note the implicit assumption that the re-exec
86+ * request itself was accepted. If this assumption is incorrect
87+ * (should not be possible), the worst case scenario is that
88+ * upstart does not re-exec and then we quickly drop out of the
89+ * reconnect block since it never went offline.
90+ */
91+ ret = upstart_restart_sync (NULL, upstart);
92+
93+ if (ret < 0) {
94+ err = nih_error_get ();
95+ nih_free (err);
96+ }
97+
98+ nih_free (upstart);
99+
100+ nih_debug ("Waiting for upstart to finish re-exec");
101+
102+ /* We believe Upstart is now in the process of
103+ * re-exec'ing so attempt forever to reconnect.
104+ *
105+ * This sounds dangerous but there is no other option,
106+ * and a connection must be possible unless the system
107+ * is completely broken.
108+ */
109+ while (TRUE) {
110+
111+ upstart = upstart_open (NULL);
112+ if (upstart)
113+ break;
114+
115+ err = nih_error_get ();
116+ nih_free (err);
117+
118+ /* Avoid DoS'ing the system whilst we wait */
119+ usleep (100000);
120+ }
121
122 return 0;
123 }

Subscribers

People subscribed via source and target branches