upstart

Code review comment for lp:~jamesodhunt/upstart/bug-530779

bug-530779
Merge into trunk

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-12-02:

Hi Steve,

I've now updated the deserialisation to make an attempt on the downgrade scenario. Note the comment in process.h though regarding the ordering of ProcessType. I've also raised bug 1256985 whilst looking at this.

> + TEST_FEATURE ("full double-fork daemon test where parent waits for ultimate child");
>
> does not fit the expected structure of a test feature name. Suggest instead:
> TEST_FEATURE ("with daemon where parent waits for ultimate child before exiting");

Fixed.

>
> + assert0 (prctl (PR_SET_CHILD_SUBREAPER, 1));
>
> Upstart does not depend on PR_SET_CHILD_SUBREAPER being available on the running kernel - the test suite shouldn't either. The test needs to be skipped if this facility is unavailable, or it needs to be rewritten to avoid subreaper entirely.

Fixed.

>
> + /* Low byte is signal */
> + TEST_EQ (siginfo.si_status, SIGTRAP);
> +
> + /* Normally, the next byte would be the ptrace event,
> + * but upstart hasn't yet set PTRACE_O_TRACEEXEC, hence
> + * no ptrace event will be available.
> + */
> + TEST_EQ (((siginfo.si_status & ~0x7f)>>8), 0);
>
> The second test here is redundant with the first; if siginfo.si_status == SIGTRAP, the high byte must be 0. Either the second test should be dropped, or, if you want to keep it for clarity, the first test should be modified to check siginfo.si_status & 0x7f.

Fixed.

>
> + TIMED_BLOCK (5) {
>
> I'm not thrilled about this TIMED_BLOCK() idea. Why should we not just use inotify?

To give some context here, it's worth pointing out that this test has been through a few iterations. An earlier incarnation of it was using signals entirely but that is problematic since:

  - test_daemon needed to raise SIGSTOP on itself.
  - test_daemon is being ptraced hence is also getting SIGSTOP's sent to it by the kernel.
  - This is a multi-process test so we need to guarantee behaviour given that we can't know the order the kernel will schedule each process.
  - We don't want to call nih_main_loop() as we cannot then control the test in "single-step" increments (well, as close as we can get to that anyway :-).

Certainly inotify would be the obvious choice but for that fact that it would add further code and complexity to the test since we're avoiding using nih_main_loop().

TIMED_BLOCK offers a rather useful generic facility I think, particularly since a common requirement for the tests is to perform some operation "within a reasonable amount of time" and then fail. Yes, it's tricky to determine the value of "reasonable", but we do need to guarantee a hard test failure, rather than an indefinite hang.

Admittedly, using TIMED_BLOCK() with file_line_count() is rather grisly, but even if we were to use inotify, we'd still need to check the line count on each event as we need to know when a particular parent is in a particular state.

> For that matter, why is this pidfile needed at all - what is this meant to guard against? The test case already includes its own direct check of the pids with ptrace, so the only thing this pidfile does is make sure the daemon agrees with the test what the PIDs are. That seems unnecessary to me.

It's not immediately obvious, but that's not the whole reason for the pidlist-file specificically: since each parent updates pidlist-file, once the expected number of pids are in that file, we know that the parent has reached a particular state. This goes back to the iterations previously mentioned - this test is attemping as best it can to perform and end-to-end daemon test whilst minimising deviation from how the actual init daemon would react. test_daemon facility allows us to abuse it purely to satisfy this single requirement.

>
> + /* Wait for child to send itself SIGSTOP denoting its
> + * final resting state.
> + */
> + got = FALSE;
> + TIMED_BLOCK (5) {
> +
> + memset (&siginfo, 0, sizeof (siginfo));
> + ret = waitid (P_PID, final_pid, &siginfo, WSTOPPED | WNOWAIT | WUNTRACED);
> +
> + if (! ret && siginfo.si_code == CLD_STOPPED &&
> + siginfo.si_status == SIGSTOP) {
> + got = TRUE;
> + break;
> + }
> + }
>
> This section seems overly complicated. The child process could just call pause() instead, and let the test runner kill the process when it's done.

The child process cannot just call pause() since no notification of that occurs via waitid(), hence the test wouldn't know when the final pid is ready. We could use pidfile I guess, but a signal seemed like a superior option in this particular scenario.

>
> + TIMED_BLOCK (5) {
> + nih_child_poll ();
> +
> + if (kill (final_pid, 0) == -1 && errno == ESRCH) {
> + got = TRUE;
> + break;
> + }
> + }
>
> Definitely no reason for this to be in a TIMED_BLOCK(). Or called at all. We've already received CLD_EXITED above, so the kill() should immediately return ESRCH. But again, this code all goes away if we just kill SIGKILL the final_pid, instead of including these checks that are unrelated to the purpose of the test case.

The test does this so that it knows definitively that once that block has exited, the process has gone. Since nih_child_poll() is called repeatedly in that block (as Upstart does via its main loop), we also know that the Job associated with the final pid will have been updated. We then proceed to check that the update was as expected. If we don't do this, there is a good chance that the Job state will be incorrect and the test will fail intermittently as a result.

>
> static void
> selfpipe_setup (void)
> {
> - static struct sigaction act;
> - int read_flags;
> - int write_flags;
> [...]
>
> This mixes formatting changes with functional changes, making it much harder to follow the history. Please split the format changes into a separate commit and rebase the functional changes on top.
>

Formatting changes undone.

Hi Steve,

> + TEST_FEATURE ("full double-fork daemon test where parent waits for ultimate child");
> 
> does not fit the expected structure of a test feature name. Suggest instead:
>         TEST_FEATURE ("with daemon where parent waits for ultimate child before exiting");

Fixed.

> 
> + assert0 (prctl (PR_SET_CHILD_SUBREAPER, 1));
> 
> Upstart does not depend on PR_SET_CHILD_SUBREAPER being available on the running kernel - the test suite shouldn't either. The test needs to be skipped if this facility is unavailable, or it needs to be rewritten to avoid subreaper entirely.

Fixed.

> 
> + /* Low byte is signal */
> + TEST_EQ (siginfo.si_status, SIGTRAP);
> +
> + /* Normally, the next byte would be the ptrace event,
> + * but upstart hasn't yet set PTRACE_O_TRACEEXEC, hence
> + * no ptrace event will be available.
> + */
> + TEST_EQ (((siginfo.si_status & ~0x7f)>>8), 0);
> 
> The second test here is redundant with the first; if siginfo.si_status == SIGTRAP, the high byte must be 0. Either the second test should be dropped, or, if you want to keep it for clarity, the first test should be modified to check siginfo.si_status & 0x7f.

Fixed.

> 
> + TIMED_BLOCK (5) {
> 
> I'm not thrilled about this TIMED_BLOCK() idea. Why should we not just use inotify?

To give some context here, it's worth pointing out that this test has been through a few iterations.  An earlier incarnation of it was using signals entirely but that is problematic since:

Certainly inotify would be the obvious choice but for that fact that it would add further code and complexity to the test since we're avoiding using nih_main_loop().

> 
> + /* Wait for child to send itself SIGSTOP denoting its
> + * final resting state.
> + */
> + got = FALSE;
> + TIMED_BLOCK (5) {
> +
> + memset (&siginfo, 0, sizeof (siginfo));
> + ret = waitid (P_PID, final_pid, &siginfo, WSTOPPED | WNOWAIT | WUNTRACED);
> +
> + if (! ret && siginfo.si_code == CLD_STOPPED &&
> + siginfo.si_status == SIGSTOP) {
> + got = TRUE;
> + break;
> + }
> + }
> 
> This section seems overly complicated. The child process could just call pause() instead, and let the test runner kill the process when it's done.

> 
> + TIMED_BLOCK (5) {
> + nih_child_poll ();
> +
> + if (kill (final_pid, 0) == -1 && errno == ESRCH) {
> + got = TRUE;
> + break;
> + }
> + }
> 
> Definitely no reason for this to be in a TIMED_BLOCK(). Or called at all. We've already received CLD_EXITED above, so the kill() should immediately return ESRCH. But again, this code all goes away if we just kill SIGKILL the final_pid, instead of including these checks that are unrelated to the purpose of the test case.

The test does this so that it knows definitively that once that block has exited, the process has gone. Since nih_child_poll() is called repeatedly in that block (as Upstart does via its main loop), we also know that the Job associated with the final pid will have been updated.  We then proceed to check that the update was as expected. If we don't do this, there is a good chance that the Job state will be incorrect and the test will fail intermittently as a result.

> 
>  static void
>  selfpipe_setup (void)
>  {
> - static struct sigaction act;
> - int read_flags;
> - int write_flags;
> [...]
> 
> This mixes formatting changes with functional changes, making it much harder to follow the history. Please split the format changes into a separate commit and rebase the functional changes on top.
>

Formatting changes undone.

« Back to merge proposal