init: possible issue with blocked events?

Bug #387216 reported by Scott James Remnant (Canonical)
Affects    Status        Importance    Assigned to    Milestone
upstart    Invalid       Undecided     Unassigned
0.3        Incomplete    Medium        Unassigned

Bug Description

I've found three problems and fixed (patched, more like) two of them. My question is: has this already been found and fixed in later versions?

Details:

1) The test below makes init dump core and restart, triggered by an assert in job.c::job_child_reaper(): the PROCESS_POST_START case expects JOB_POST_START but gets JOB_RUNNING.

service
respawn

script
  exit 0
end script

post-start script
 sleep 4
end script

As far as I could see, JOB_RUNNING was legitimate, so I updated the assert condition to allow it, and respawning worked.
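
For illustration, the relaxed check could look something like this (a sketch only, assuming the 0.3 job_child_reaper() switch uses libnih-style assertions; the surrounding code isn't quoted in this report):

  case PROCESS_POST_START:
          /* Originally this asserted job->state == JOB_POST_START
           * alone; relaxed to tolerate the main process having
           * already exited and moved the job on to JOB_RUNNING. */
          nih_assert (job->state == JOB_POST_START
                      || job->state == JOB_RUNNING);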

2) After the change above, stopping this job produced another core; this time PROCESS_POST_START got JOB_WAITING.

I patched this as well by allowing JOB_WAITING and setting state and stop to FALSE in this case.

I'm not sure the second fix is a good one.

I tried to compare against 0.5, but it's been re-written in C++, so there's no simple diff.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

The bug is that the running process exiting transitions the state - it should stay in POST_START while the post-start script is still running, and then transition directly to stopping afterwards.

0.5 should be checked to see if it's still got the same bug.

If possible, we should backport a patch to 0.3, since it's still in active use.
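
A rough sketch of the shape of that fix (field names here are hypothetical, since the 0.3 Job structure isn't quoted in this thread):

  /* In job_child_reaper(): if the main process is reaped while the
   * post-start script is still running, stay in JOB_POST_START and
   * only record that the main process is gone; the later reap of
   * the post-start process then takes the job straight to stopping. */
  if (process == PROCESS_MAIN
      && job->state == JOB_POST_START
      && job->post_start_pid > 0) {
          job->pid = 0;       /* main process has exited */
          return;             /* don't advance the state yet */
  }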

Revision history for this message
Alex Nekrasov (ennnot) wrote :

This is my second attempt at fixing it in 0.3.8; the first fix was bad.

I added checks to job_next_state() to handle respawning and post-start/pre-stop processes as you suggested.

I also made jobs forget their "blocked" status when advanced to WAITING. Not totally sure about this one.

In main.c, I made it reboot on crash rather than restart init.

As a result, my tests passed, but a new core from init was still reported just today. The problem is, these things are getting harder to reproduce.

Revision history for this message
Alex Nekrasov (ennnot) wrote :

and the main.c

Revision history for this message
Alex Nekrasov (ennnot) wrote :

Here's a test that let me catch the main/pre-stop race, though it usually needs to run for a while:

#!/bin/sh
# Cycle the test2 job until init crashes; a core file appearing
# in /cores signals failure.

while true
do
    if ls /cores/core*
    then
        echo cores!
        cp /var/log/messages ${HOME}
        exit 1
    fi
    initctl status test2
    if ! initctl start test2
    then
        echo start failed
        exit 1
    fi
    initctl status test2
    if ! initctl stop test2
    then
        echo stop failed
        exit 1
    fi
done

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Could you supply your proposed changes as a diff/patch rather than new copies of the source files?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

A similar bug exists in 0.5: bug #381048.

This is caused by the respawn handling being done before the check for not transitioning a state. It's very possible that this is just a duplicate, and the bug has existed that long.

I'm not clear as to why you've modified job_next_state() or why you've introduced checks for JOB_WAITING anywhere - could you explain those? To me it looks like the job_child_reaper() change is sufficient - which is the proposed fix for bug #381048.

Changed in upstart:
milestone: none → 0.3.10
Revision history for this message
Alex Nekrasov (ennnot) wrote :

As far as I understand, the job_child_reaper() fix addresses the respawn problem only. There's a general problem of advancing the job state beyond POST_START when post-start is still running, etc. As far as I could tell, the idea is to remain in PRE/POST_START or PRE/POST_STOP until those processes have finished, since we don't kill them. Is that so?

job_child_reaper() is one place where the job state is advanced. The other is job_handle_event_finished(). I've seen it do this: there is an emission when the job goes into the JOB_STARTING or JOB_STOPPING state, and the job gets a "blocked" flag in job_emit_event(). There are no other jobs subscribed to this event, so job_handle_event_finished() gets called, finds the blocked flag matching the emission, and calls job_change_state(job_next_state()) without checking for a live pre-start, etc. This is why I placed the checks in job_next_state().
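
Roughly, the kind of check being described - an illustrative sketch only, with made-up member names, since the 0.3 structures aren't quoted here:

  /* In job_next_state(): hold the job in POST_START while its
   * post-start script is still running instead of advancing past
   * it; the pre-stop case is handled the same way. */
  case JOB_POST_START:
          if (job->post_start_pid > 0)
                  return JOB_POST_START;
          return JOB_RUNNING;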

The clearing of the "blocked" flag when the job goes to JOB_WAITING was mainly due to my lack of understanding of what DELETED means. I was getting my jobs advanced to JOB_DELETED; in my environment we don't delete jobs, so the fix works for me, but I wouldn't propose it as a permanent fix. If the other places are fixed, it should never come to that situation.

I'll do the diff/patch thing next time. I'm not yet used to working with open source projects.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 387216] Re: 0.3 assert if post-start process ends after main process

On Mon, 2009-06-15 at 15:48 +0000, Alex Nekrasov wrote:

> As far as I understand, the job_child_reaper() fix addresses the
> respawn problem only. There's a general problem of advancing the job
> state beyond POST_START when post-start is still running, etc. As far
> as I could tell, the idea is to remain in PRE/POST_START or
> PRE/POST_STOP until those processes have finished, since we don't kill
> them. Is that so?
>
That's absolutely correct, yes.

> job_child_reaper() is one place where job state is advanced. The other
> is job_handle_event_finished().
>
job_handle_event_finished can only be called if job->blocked matches the
event being handled.

job->blocked will only be non-NULL if the job is in one of the blocking
states, that is, JOB_STARTING and JOB_STOPPING.

Since the only thing that will advance the state beyond these, when
job->blocked has been set, is job_handle_event_finished(), which resets
job->blocked back to NULL again, I'm not sure that you're correct.
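
The invariant being described, as a simplified sketch (not the actual 0.3 signatures):

  /* job->blocked is set only while the job sits in a blocking
   * state, and the one function that advances past those states
   * clears it again first. */
  if (job->blocked != emission)
          return;                  /* not the event this job blocks on */

  nih_assert (job->state == JOB_STARTING
              || job->state == JOB_STOPPING);

  job->blocked = NULL;             /* cleared before moving on */
  job_change_state (job, job_next_state (job));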

Could you provide a test case?

Pure code review:

job.c:970-974: job->blocked should never be non-NULL in the JOB_WAITING
state, so this check seems unnecessary.

job.c:1044-1048, 1063-1067, 1072-1080, 1099-1108: I tried to avoid such
logic in this code. In particular, the assertion that we should stop if
there's no main process is wrong; it's better not to try to change the
state at all.

job.c:1751-1757: this can never be reached; the case asserts that the
job state is only JOB_POST_STOP, so checking for a different value is
meaningless.

Scott
--
Scott James Remnant
<email address hidden>

summary: - 0.3 assert if post-start process ends after main process
+ 0.3 possible issue with blocked events?
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: 0.3 assert if post-start process ends after main process

The respawn/post-start issue is being tracked at bug #381048.

We'll continue the discussion about job->blocked here.

Thanks

Changed in upstart:
milestone: 0.3.10 → 0.3-later
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: 0.3 possible issue with blocked events?

Did you mean "without checking for a live pre-start", or did you mean "post-start/pre-stop" ?

The pre-start script is not run until the JOB_PRE_START state, which is entered only after job_handle_event_finished() moves the job there from JOB_STARTING (clearing blocked).
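
Roughly, the start-up path being described (0.3 state names, simplified):

  waiting -> starting   (job->blocked set; waits on the "starting" event)
          -> pre-start  (entered via job_handle_event_finished(),
                         which clears job->blocked first)
          -> spawned -> post-start -> running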

Changed in upstart:
milestone: 0.3-later → none
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

(status mess to move to 0.3)

Changed in upstart:
importance: Medium → Undecided
status: Incomplete → Invalid
summary: - 0.3 possible issue with blocked events?
+ init: possible issue with blocked events?
Revision history for this message
Alex Nekrasov (ennnot) wrote :

Give me a couple of days. I'll need to revert to the original 0.3.8 and run the test; then I can give you a trace.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 387216] Re: init: possible issue with blocked events?

On Thu, 2009-06-18 at 15:42 +0000, Alex Nekrasov wrote:

> Give me a couple of days. I'll need to revert to the original 0.3.8 and
> run the test; then I can give you a trace.
>
Thanks!

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Err, sorry, wrong bug :-)

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Any luck re-running that test?

Revision history for this message
Alex Nekrasov (ennnot) wrote :

Sorry, been on vacation. But no matter, here we go.

The test2 job tries to create a race between the main process and pre-stop; as you can see, there's no respawn flag. The test.sh script starts and stops test2 until it detects a failure, which it does by looking for a core file in /cores.
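
From that description, a hypothetical reconstruction of the test2 job (the actual attachment isn't quoted here) would pair a short-lived main process with a slow pre-stop and no respawn stanza:

  script
    exit 0
  end script

  pre-stop script
    sleep 4
  end script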

I ran test.sh for quite some time on an otherwise idle machine without failures. Then I added load to the machine in the hope of altering timing, and within several minutes got init to dump core.

Attached are the test, the core file, the init binary, and /var/log/messages filtered by "spa init:" (spa is our hostname).

This must be the original 0.3.8 init binary, i.e., without my changes.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

On Thu, 2009-07-09 at 11:35 +0000, Alex Nekrasov wrote:

> The test2 job tries to create a race between the main process and
> pre-stop; as you can see, there's no respawn flag. The test.sh script
> starts and stops test2 until it detects a failure, which it does by
> looking for a core file in /cores.
>
Can you try the same with 0.6.0?

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Alex Nekrasov (ennnot) wrote :

Not easy. AFAIK 0.6 uses different/incompatible files and locations; I
mean the basic runlevels, stuff like /etc/event.d, etc. Is that so?

On Mon, Jul 13, 2009 at 6:25 PM, Scott James Remnant<email address hidden> wrote:
> On Thu, 2009-07-09 at 11:35 +0000, Alex Nekrasov wrote:
>
>> The test2 job tries to create a race between the main process and
>> pre-stop; as you can see, there's no respawn flag. The test.sh script
>> starts and stops test2 until it detects a failure, which it does by
>> looking for a core file in /cores.
>>
> Can you try the same with 0.6.0?
>
> Scott
> --
> Scott James Remnant
> <email address hidden>

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

On Mon, 2009-07-13 at 18:11 +0000, Alex Nekrasov wrote:

> Not easy. AFAIK 0.6 uses different/incompatible files and locations; I
> mean the basic runlevels, stuff like /etc/event.d, etc. Is that so?
>
They're largely compatible, aside from a few minor differences. The
main one is that:

  start on foo
  start on bar
  start on baz

should be changed to:

  start on foo or bar or baz

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Tracked this one down and marked it as a duplicate of the more recent bug, which records the reason for the crash.
