Merge lp:~jamesodhunt/upstart/bug-1227212 into lp:upstart

Proposed by James Hunt
Status: Merged
Merged at revision: 1534
Proposed branch: lp:~jamesodhunt/upstart/bug-1227212
Merge into: lp:upstart
Diff against target: 568 lines (+321/-34)
8 files modified
ChangeLog (+35/-0)
init/event.c (+20/-0)
init/job_process.c (+14/-3)
init/quiesce.c (+176/-23)
init/quiesce.h (+3/-0)
test/test_util_common.c (+24/-5)
test/test_util_common.h (+4/-1)
util/tests/test_initctl.c (+45/-2)
To merge this branch: bzr merge lp:~jamesodhunt/upstart/bug-1227212
Reviewer Review Type Date Requested Status
Dimitri John Ledkov Approve
Review via email: mp+187844@code.launchpad.net

Description of the change

* init/event.c: event_pending_handle_jobs(): Force quiesce when all job
  instances have finished to speed session shutdown.
* init/job_process.c: job_process_jobs_running(): Only consider job
  instances with associated pids to avoid abstract jobs confusing the
  shutdown.
* init/quiesce.c:
  - quiesce(): Optimise session shutdown
    - Skip wait phase if no jobs care about the 'session-end' event
      (LP: #1227212).
    - Stop already running instances if other jobs care about
      'session-end' to allow the already-running jobs to shut down in
       parallel with the newly-started session-end jobs.
  - quiesce_wait_callback():
    - Simplify logic.
    - Improve wait phase checks to detect earliest time to finalise.
  - quiesce_finalise(): Display time to shutdown.
  - quiesce_complete(): New function to force final shutdown phase.
  - quiesce_event_match(): New function to determine if any jobs
    'start on' contains a particular event.
  - quiesce_in_progress(): Determine if shutdown is being handled.
* test/test_util_common.c:
  - _start_upstart(): Call get_upstart_binary() rather than relying on
    UPSTART_BINARY define.
  - start_upstart_common(): Remove '--no-startup-event' as this is now
    needed by a test.
  - get_upstart_binary(): Assert that file exists.
  - file_exists(): New helper function.
* test/test_util_common.h: Typo and prototype.
* util/tests/test_initctl.c: test_quiesce():
  - New test "session shutdown: one long-running job which starts on
    startup".
  - Adjusted expected shutdown times.

-----------------------------

Expected behaviour:

= Common Case =

- If no jobs 'start on session-end', shutdown is "immediate". This is the common case.
  => Shutdown time: 1 second.

= Jobs Specifying 'session-end' =

- If any job specifies 'start on session-end' (no standard jobs do):
  - The shutdown will stop all running job instances.
  - The shutdown will allow session-end jobs to run as long as they complete within 5 seconds.
  => Shutdown time: 5 seconds.

= Jobs Specifying 'session-end' and blocking SIGTERM =

- If no jobs specify 'start on session-end' but some jobs block SIGTERM:
  => Shutdown time: 5 seconds.

- If any job specifies 'start on session-end' and blocks SIGTERM (but
  not other jobs block SIGTERM):
  - The shutdown will wait for up to 5 seconds for the jobs to run.
  - The shutdown will then signal them with SIGTERM and wait for up to a further 5 seconds before sending SIGKILL.
  => Shutdown time: 5+5 = 10 seconds.

- If any job specifies 'start on session-end' and blocks SIGTERM *and* other non-session-end jobs
  running at shutdown also block SIGTERM:
  => Shutdown time: 5+5 = 10 seconds (*).

= Jobs with Long-Running pre-start stanzas =

Job that have long-running pre-start stanzas could also slow down the session shutdown since changing their state will take the duration of the time to execute the pre-start stanza. One of the standard session jobs fell into this category ('logrotate') but has now been fixed.

Note that the SIGTERM case is pathological and should be surmountable using the 'kill signal' stanza (see init(5) and
http://upstart.ubuntu.com/cookbook/#kill-signal).

(*) - Not 15 seconds, since the shutdown will send SIGTERM to all non-session-end jobs such that their kill timer runs in parallel with the session-end jobs wait time.

To post a comment you must log in.
Revision history for this message
Dimitri John Ledkov (xnox) :
review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'ChangeLog'
2--- ChangeLog 2013-09-13 04:44:55 +0000
3+++ ChangeLog 2013-09-26 16:38:20 +0000
4@@ -1,3 +1,38 @@
5+2013-09-26 James Hunt <james.hunt@ubuntu.com>
6+
7+ * init/event.c: event_pending_handle_jobs(): Force quiesce when all job
8+ instances have finished to speed session shutdown.
9+ * init/job_process.c: job_process_jobs_running(): Only consider job
10+ instances with associated pids to avoid abstract jobs confusing the
11+ shutdown.
12+ * init/quiesce.c:
13+ - quiesce(): Optimise session shutdown
14+ - Skip wait phase if no jobs care about the 'session-end' event
15+ (LP: #1227212).
16+ - Stop already running instances if other jobs care about
17+ 'session-end' to allow the already-running jobs to shut down in
18+ parallel with the newly-started session-end jobs.
19+ - quiesce_wait_callback():
20+ - Simplify logic.
21+ - Improve wait phase checks to detect earliest time to finalise.
22+ - quiesce_finalise(): Display time to shutdown.
23+ - quiesce_complete(): New function to force final shutdown phase.
24+ - quiesce_event_match(): New function to determine if any jobs
25+ 'start on' contains a particular event.
26+ - quiesce_in_progress(): Determine if shutdown is being handled.
27+ * test/test_util_common.c:
28+ - _start_upstart(): Call get_upstart_binary() rather than relying on
29+ UPSTART_BINARY define.
30+ - start_upstart_common(): Remove '--no-startup-event' as this is now
31+ needed by a test.
32+ - get_upstart_binary(): Assert that file exists.
33+ - file_exists(): New helper function.
34+ * test/test_util_common.h: Typo and prototype.
35+ * util/tests/test_initctl.c: test_quiesce():
36+ - New test "session shutdown: one long-running job which starts on
37+ startup".
38+ - Adjusted expected shutdown times.
39+
40 2013-09-12 Steve Langasek <steve.langasek@ubuntu.com>
41
42 * configure.ac:
43
44=== modified file 'init/event.c'
45--- init/event.c 2013-06-04 16:51:55 +0000
46+++ init/event.c 2013-09-26 16:38:20 +0000
47@@ -44,6 +44,7 @@
48 #include "blocked.h"
49 #include "control.h"
50 #include "errors.h"
51+#include "quiesce.h"
52
53 #include "com.ubuntu.Upstart.h"
54
55@@ -299,6 +300,8 @@
56 static void
57 event_pending_handle_jobs (Event *event)
58 {
59+ int empty = TRUE;
60+
61 nih_assert (event != NULL);
62
63 job_class_init ();
64@@ -432,6 +435,23 @@
65 event_operator_reset (class->start_on);
66 }
67 }
68+
69+ /* Determine if any job instances remain */
70+ NIH_HASH_FOREACH_SAFE (job_classes, iter) {
71+ JobClass *class = (JobClass *)iter;
72+
73+ NIH_HASH_FOREACH_SAFE (class->instances, job_iter) {
74+ empty = FALSE;
75+ break;
76+ }
77+
78+ if (! empty)
79+ break;
80+ }
81+
82+ /* If no instances remain, force quiesce to finish */
83+ if (empty && quiesce_in_progress ())
84+ quiesce_complete ();
85 }
86
87
88
89=== modified file 'init/job_process.c'
90--- init/job_process.c 2013-07-01 21:05:15 +0000
91+++ init/job_process.c 2013-09-26 16:38:20 +0000
92@@ -1259,7 +1259,9 @@
93 /**
94 * job_process_jobs_running:
95 *
96- * Determine if any jobs are running.
97+ * Determine if any jobs are running. Note that simply checking if class
98+ * instances exist is insufficient: since this call is used for shutdown
99+ * abstract jobs must not be able to block the shutdown.
100 *
101 * Returns: TRUE if jobs are still running, else FALSE.
102 **/
103@@ -1271,8 +1273,17 @@
104 NIH_HASH_FOREACH (job_classes, iter) {
105 JobClass *class = (JobClass *)iter;
106
107- NIH_HASH_FOREACH (class->instances, job_iter)
108- return TRUE;
109+ NIH_HASH_FOREACH (class->instances, job_iter) {
110+ Job *job = (Job *)job_iter;
111+ nih_local char *cmd = NULL;
112+ int i;
113+ nih_local char *pids = NULL;
114+
115+ for (i = 0; i < PROCESS_LAST; i++) {
116+ if (job->pid[i])
117+ return TRUE;
118+ }
119+ }
120 }
121
122 return FALSE;
123
124=== modified file 'init/quiesce.c'
125--- init/quiesce.c 2013-07-31 09:28:48 +0000
126+++ init/quiesce.c 2013-09-26 16:38:20 +0000
127@@ -48,9 +48,16 @@
128 static QuiescePhase quiesce_phase = QUIESCE_PHASE_NOT_QUIESCED;
129
130 /**
131+ * quiesce_reason:
132+ *
133+ * Human-readable string denoting what triggered the quiesce.
134+ **/
135+static char *quiesce_reason = NULL;
136+
137+/**
138 * max_kill_timeout:
139 *
140- * Maxiumum kill_timout value calculated from all running jobs used to
141+ * Maxiumum kill_timeout value calculated from all running jobs used to
142 * determine how long to wait before exiting.
143 **/
144 static time_t max_kill_timeout = 0;
145@@ -62,6 +69,24 @@
146 **/
147 static time_t quiesce_phase_time = 0;
148
149+/**
150+ * quiesce_start_time:
151+ *
152+ * Time quiesce commenced.
153+ **/
154+static time_t quiesce_start_time = 0;
155+
156+/**
157+ * session_end_jobs:
158+ *
159+ * TRUE if any job specifies a 'start on' including SESSION_END_EVENT.
160+ *
161+ **/
162+static int session_end_jobs = FALSE;
163+
164+static int quiesce_event_match (Event *event)
165+ __attribute__ ((warn_unused_result));
166+
167 /* External definitions */
168 extern int disable_respawn;
169
170@@ -76,7 +101,7 @@
171 quiesce (QuiesceRequester requester)
172 {
173 nih_local char **env = NULL;
174- const char *reason;
175+ Event *event;
176
177 job_class_init ();
178
179@@ -98,12 +123,12 @@
180 ? QUIESCE_PHASE_KILL
181 : QUIESCE_PHASE_WAIT;
182
183- reason = (requester == QUIESCE_REQUESTER_SESSION)
184+ quiesce_reason = (requester == QUIESCE_REQUESTER_SESSION)
185 ? _("logout") : _("shutdown");
186
187- nih_info (_("Quiescing due to %s request"), reason);
188+ nih_info (_("Quiescing due to %s request"), quiesce_reason);
189
190- quiesce_phase_time = time (NULL);
191+ quiesce_start_time = quiesce_phase_time = time (NULL);
192
193 /* Stop existing jobs from respawning */
194 disable_respawn = TRUE;
195@@ -116,11 +141,43 @@
196 env = NIH_MUST (nih_str_array_new (NULL));
197
198 NIH_MUST (environ_set (&env, NULL, NULL, TRUE,
199- "TYPE=%s", reason));
200-
201- NIH_MUST (event_new (NULL, SESSION_END_EVENT, env));
202-
203- if (requester == QUIESCE_REQUESTER_SYSTEM) {
204+ "TYPE=%s", quiesce_reason));
205+
206+ event = NIH_MUST (event_new (NULL, SESSION_END_EVENT, env));
207+
208+ /* Check if any jobs care about the session end event. If not,
209+ * the wait phase can be avoided entirely resulting in a much
210+ * faster shutdown.
211+ *
212+ * Note that simply checking if running instances exist is not
213+ * sufficient since if a job cares about the session end event,
214+ * it won't yet have started but needs to be given a chance to
215+ * run.
216+ */
217+ if (quiesce_phase == QUIESCE_PHASE_WAIT) {
218+
219+ session_end_jobs = quiesce_event_match (event);
220+
221+ if (session_end_jobs) {
222+ /* Some as-yet unscheduled jobs care about the
223+ * session end event. They will be started the
224+ * next time through the main loop and will be
225+ * waited for (hence the quiesce phase is not
226+ * changed).
227+ *
228+ * However, already-running jobs *can* be stopped
229+ * at this time since by definition they do not
230+ * care about the session end event and may just
231+ * as well die now to avoid slowing the shutdown.
232+ */
233+ job_process_stop_all ();
234+ } else {
235+ nih_debug ("Skipping wait phase");
236+ quiesce_phase = QUIESCE_PHASE_KILL;
237+ }
238+ }
239+
240+ if (quiesce_phase == QUIESCE_PHASE_KILL) {
241 /* We'll attempt to wait for this long, but system
242 * policy may prevent it such that we just get killed
243 * and job processes reparented to PID 1.
244@@ -153,32 +210,34 @@
245
246 nih_assert (timer);
247 nih_assert (quiesce_phase_time);
248+ nih_assert (quiesce_requester != QUIESCE_REQUESTER_INVALID);
249
250 now = time (NULL);
251
252- nih_assert (quiesce_requester != QUIESCE_REQUESTER_INVALID);
253-
254- if (quiesce_requester == QUIESCE_REQUESTER_SYSTEM) {
255- nih_assert (quiesce_phase == QUIESCE_PHASE_KILL);
256+ if (quiesce_phase == QUIESCE_PHASE_KILL) {
257+ nih_assert (max_kill_timeout);
258
259 if ((now - quiesce_phase_time) > max_kill_timeout)
260- goto out;
261+ goto timed_out;
262
263 } else if (quiesce_phase == QUIESCE_PHASE_WAIT) {
264-
265- if ((now - quiesce_phase_time) > QUIESCE_DEFAULT_JOB_RUNTIME) {
266+ int timed_out = 0;
267+
268+ timed_out = ((now - quiesce_phase_time) >= QUIESCE_DEFAULT_JOB_RUNTIME);
269+
270+ if (timed_out
271+ || (session_end_jobs && ! job_process_jobs_running ())
272+ || ! job_process_jobs_running ()) {
273+
274 quiesce_phase = QUIESCE_PHASE_KILL;
275
276 /* reset for new phase */
277 quiesce_phase_time = time (NULL);
278
279 max_kill_timeout = job_class_max_kill_timeout ();
280+
281 job_process_stop_all ();
282 }
283- } else if (quiesce_phase == QUIESCE_PHASE_KILL) {
284-
285- if ((now - quiesce_phase_time) > max_kill_timeout)
286- goto out;
287 } else {
288 nih_assert_not_reached ();
289 }
290@@ -188,9 +247,10 @@
291
292 return;
293
294+timed_out:
295+ quiesce_show_slow_jobs ();
296+
297 out:
298- quiesce_show_slow_jobs ();
299-
300 /* Note that we might skip the kill phase for the session
301 * requestor if no jobs are actually running at this point.
302 */
303@@ -237,7 +297,100 @@
304 void
305 quiesce_finalise (void)
306 {
307+ static int finalising = FALSE;
308+ time_t diff;
309+
310+ nih_assert (quiesce_start_time);
311 nih_assert (quiesce_phase == QUIESCE_PHASE_CLEANUP);
312
313+ if (finalising)
314+ return;
315+
316+ finalising = TRUE;
317+
318+ diff = time (NULL) - quiesce_start_time;
319+
320+ nih_info (_("Quiesce %s sequence took %s%d second%s"),
321+ quiesce_reason,
322+ ! (int)diff ? "<" : "",
323+ (int)diff ? (int)diff : 1,
324+ diff <= 1 ? "" : "s");
325+
326 nih_main_loop_exit (0);
327+
328+}
329+
330+/**
331+ * quiesce_complete:
332+ *
333+ * Force quiesce phase to finish.
334+ **/
335+void
336+quiesce_complete (void)
337+{
338+ quiesce_phase = QUIESCE_PHASE_CLEANUP;
339+
340+ quiesce_finalise ();
341+}
342+
343+/**
344+ * quiesce_event_match:
345+ * @event: event.
346+ *
347+ * Identify if any jobs _may_ start when the session ends.
348+ *
349+ * A simple heuristic is used such that there is no guarantee that the
350+ * jobs entire start condition will be satisfied at session-end.
351+ *
352+ * Returns: TRUE if any class specifies @event in its start
353+ * condition, else FALSE.
354+ **/
355+static int
356+quiesce_event_match (Event *event)
357+{
358+ nih_assert (event);
359+
360+ job_class_init ();
361+
362+ NIH_HASH_FOREACH (job_classes, iter) {
363+ JobClass *class = (JobClass *)iter;
364+
365+ if (! class->start_on)
366+ continue;
367+
368+ /* Note that only the jobs start on condition is
369+ * relevant.
370+ */
371+ NIH_TREE_FOREACH_POST (&class->start_on->node, iter) {
372+ EventOperator *oper = (EventOperator *)iter;
373+
374+ switch (oper->type) {
375+ case EVENT_OR:
376+ case EVENT_AND:
377+ break;
378+ case EVENT_MATCH:
379+ /* Job may attempt to start as the session ends */
380+ if (event_operator_match (oper, event, NULL))
381+ return TRUE;
382+ break;
383+ default:
384+ nih_assert_not_reached ();
385+ }
386+ }
387+ }
388+
389+ return FALSE;
390+}
391+
392+/**
393+ * quiesce_in_progress:
394+ *
395+ * Determine if shutdown is in progress.
396+ *
397+ * Returns: TRUE if quiesce is in progress, else FALSE.
398+ **/
399+int
400+quiesce_in_progress (void)
401+{
402+ return quiesce_phase != QUIESCE_PHASE_NOT_QUIESCED;
403 }
404
405=== modified file 'init/quiesce.h'
406--- init/quiesce.h 2013-02-08 16:15:23 +0000
407+++ init/quiesce.h 2013-09-26 16:38:20 +0000
408@@ -70,6 +70,9 @@
409 void quiesce_wait_callback (void *data, NihTimer *timer);
410 void quiesce_show_slow_jobs (void);
411 void quiesce_finalise (void);
412+void quiesce_complete (void);
413+int quiesce_in_progress (void)
414+ __attribute__ ((warn_unused_result));
415
416 NIH_END_EXTERN
417
418
419=== modified file 'test/test_util_common.c'
420--- test/test_util_common.c 2013-07-17 14:18:42 +0000
421+++ test/test_util_common.c 2013-09-26 16:38:20 +0000
422@@ -384,7 +384,7 @@
423 argv = NIH_MUST (nih_str_array_new (NULL));
424
425 NIH_MUST (nih_str_array_add (&argv, NULL, NULL,
426- UPSTART_BINARY));
427+ get_upstart_binary ()));
428
429 if (args)
430 NIH_MUST (nih_str_array_append (&argv, NULL, NULL, args));
431@@ -447,9 +447,6 @@
432 }
433
434 NIH_MUST (nih_str_array_add (&args, NULL, NULL,
435- "--no-startup-event"));
436-
437- NIH_MUST (nih_str_array_add (&args, NULL, NULL,
438 "--no-sessions"));
439
440 NIH_MUST (nih_str_array_add (&args, NULL, NULL,
441@@ -561,7 +558,11 @@
442 const char *
443 get_upstart_binary (void)
444 {
445- return UPSTART_BINARY;
446+ static const char *upstart_binary = UPSTART_BINARY;
447+
448+ TEST_TRUE (file_exists (upstart_binary));
449+
450+ return upstart_binary;
451 }
452
453 const char *
454@@ -740,3 +741,21 @@
455
456 return new;
457 }
458+
459+/**
460+ * file_exists:
461+ * @path: file to check.
462+ *
463+ * Determine if specified file exists.
464+ *
465+ * Returns: TRUE if @path exists, else FALSE.
466+ **/
467+int
468+file_exists (const char *path)
469+{
470+ struct stat st;
471+
472+ nih_assert (path);
473+
474+ return ! stat (path, &st);
475+}
476
477=== modified file 'test/test_util_common.h'
478--- test/test_util_common.h 2013-07-17 14:18:42 +0000
479+++ test/test_util_common.h 2013-09-26 16:38:20 +0000
480@@ -10,7 +10,7 @@
481 #define BUFFER_SIZE 1024
482
483 /**
484- * TEST_QUIESCE_WAIT_PHASE:
485+ * TEST_EXIT_TIME:
486 *
487 * Maximum time we expect upstart to wait in the QUIESCE_PHASE_WAIT
488 * phase.
489@@ -733,4 +733,7 @@
490 const char *from, const char *to)
491 __attribute__ ((warn_unused_result));
492
493+int file_exists (const char *path)
494+ __attribute__ ((warn_unused_result));
495+
496 #endif /* TEST_UTIL_COMMON_H */
497
498=== modified file 'util/tests/test_initctl.c'
499--- util/tests/test_initctl.c 2013-09-05 16:19:06 +0000
500+++ util/tests/test_initctl.c 2013-09-26 16:38:20 +0000
501@@ -11590,6 +11590,49 @@
502 DELETE_FILE (confdir, "long-running.conf");
503
504 /*******************************************************************/
505+ TEST_FEATURE ("session shutdown: one long-running job which starts on startup");
506+
507+ CREATE_FILE (confdir, "startup.conf",
508+ "start on startup\n"
509+ "exec sleep 999");
510+
511+ start_upstart_common (&upstart_pid, TRUE, confdir, logdir, NULL);
512+
513+ upstart = upstart_open (NULL);
514+ TEST_NE_P (upstart, NULL);
515+
516+ /* Should be running */
517+ assert0 (kill (upstart_pid, 0));
518+
519+ job_pid = job_to_pid ("startup");
520+ TEST_NE (job_pid, -1);
521+
522+ /* Force reset */
523+ test_user_mode = FALSE;
524+
525+ /* Trigger session shutdown */
526+ assert0 (upstart_end_session_sync (NULL, upstart));
527+
528+ /* Session Init should end very quickly since there will be no
529+ * wait phase.
530+ */
531+ TEST_EQ (timed_waitpid (upstart_pid, TEST_QUIESCE_KILL_PHASE), upstart_pid);
532+
533+ /* Should not now be running */
534+ TEST_EQ (kill (upstart_pid, 0), -1);
535+ TEST_EQ (errno, ESRCH);
536+
537+ session_file = NIH_MUST (nih_sprintf (NULL, "%s/upstart/sessions/%d.session",
538+ sessiondir, (int)upstart_pid));
539+ unlink (session_file);
540+
541+ /* pid should no longer exist */
542+ TEST_EQ (kill (job_pid, SIGKILL), -1);
543+ TEST_EQ (errno, ESRCH);
544+
545+ DELETE_FILE (confdir, "startup.conf");
546+
547+ /*******************************************************************/
548 TEST_FEATURE ("session shutdown: one long-running job which ignores SIGTERM");
549
550 CREATE_FILE (confdir, "long-running-term.conf",
551@@ -11668,7 +11711,7 @@
552 /* Trigger session shutdown */
553 assert0 (upstart_end_session_sync (NULL, upstart));
554
555- TEST_EQ (timed_waitpid (upstart_pid, TEST_QUIESCE_KILL_PHASE), upstart_pid);
556+ TEST_EQ (timed_waitpid (upstart_pid, TEST_QUIESCE_TOTAL_WAIT_TIME), upstart_pid);
557
558 /* Should not now be running */
559 TEST_EQ (kill (upstart_pid, 0), -1);
560@@ -11731,7 +11774,7 @@
561 /* Trigger session shutdown */
562 assert0 (upstart_end_session_sync (NULL, upstart));
563
564- TEST_EQ (timed_waitpid (upstart_pid, TEST_QUIESCE_KILL_PHASE), upstart_pid);
565+ TEST_EQ (timed_waitpid (upstart_pid, TEST_QUIESCE_TOTAL_WAIT_TIME), upstart_pid);
566
567 /* Should not now be running */
568 TEST_EQ (kill (upstart_pid, 0), -1);

Subscribers

People subscribed via source and target branches