Ubuntu Distributed Development

Code review comment for lp:~vila/udd/735477-kill-harder

735477-kill-harder
Merge into import-scripts

Revision history for this message

Vincent Ladeuil (vila) wrote on 2011-03-17:

>>>>> John Arbash Meinel <email address hidden> writes:

    > On 3/17/2011 5:39 PM, Vincent Ladeuil wrote:
    >>
    >>> I think 2 seconds is a bit short for a grace period. If bzr is doing
    >>> anything significant.
    >>
    >> Keep in mind that this grace period is between 2 kills not between the job start and the kill.
    >>
    >> There is actually two situations where we kill imports:
    >> - they have exceeded their time quota,
    >> - the mass_import script is killed and should stop as soon as possible.
    >>
    >>> Something like 1 minute would be generous,
    >>
    >> losas will hate us if we make them wait 1 minute just because.
    >>

> I think something like 10s is reasonable to give it a chance to
> shutdown cleanly,

It won't shutdown cleanly if it can't obey SIGTERM, that's why it needs
to be SIGKILLED.

> rather than killing it and leaving locks open, etc. Remember, by
> this point the system is probably under load,

Not especially, the processes that don't respond to SIGTERM do so on
lightly loaded systems as well.

I would even argue that SIGKILL may be the one to use if the system is
under too much load to get back to a situation where the admin/operator
is back in control.

If there is more cleanup to do at that point, then having a responsive
system is a pre-requisite anyway.

> and may take a bit to context switch. I think 2s is a bit short,
> but we can live with it.

Ob both jubany and my local importer, killing the mass_import script and
8 running importers go down in less than 1 second:

2011-03-16 17:01:27,974 - __main__ - INFO - Received signal
....
2011-03-16 17:01:28,916 - __main__ - INFO - Driver finished: stopping

This includes collecting the tracebacks for all of them (which imply
they all got ample time to propagate the exception raised when they
receive the SIGTERM signal).

This may change if we don't use a multi-core CPU once we leave jubany,
but at this point in time, I hope this constant will be obtained from a
config file anyway so we could tune it at will.

<snip/>

    > Honestly, I think mkfifo is going to be significantly more reliable. It
    > is what it was made for, versus hacking around getting half the text
    > written to a file, or any other such thing for a regular file.

But it's not production code and the tests don't care about the file
content, creating the files just allows the test to ensure that the
subprocess has reached a specific point in the execution much like I use
threading.Event's to synchronize threads in tests.

Now if you have working and tested code using mkfifo to implement the
same kind of synchronization, I'd be happy to use it. I just don't want
to invest effort into this area without a need for it.

    >>>
    >>> If you are worried that they'll never open the file and you'll block.
    >>> You can O_NONBLOCK and poll like you are doing now. Or use something
    >>> like signal.alarm() so that the test suite dies if the file never opens.
    >>> (You could even install a SIGALRM handler and just have EINTR triggered
    >>> on the open() call.)
    >>
    >> No, no, no, signals and threads can't be mixed (and even if they
    >> could be used in this particular case, I'd prefer to avoid them
    >> as long as possible).
    >>

> Except you're already using signals...

Only targeted at external processes, I don't mix them with threads nor
do I try to receive them from a thread (which can't be done except for
the main one).

>>>>> John Arbash Meinel <john@arbash-meinel.com> writes:

> On 3/17/2011 5:39 PM, Vincent Ladeuil wrote:
    >> 
    >>> I think 2 seconds is a bit short for a grace period. If bzr is doing
    >>> anything significant. 
    >> 
    >> Keep in mind that this grace period is between 2 kills not between the job start and the kill.
    >> 
    >> There is actually two situations where we kill imports:
    >> - they have exceeded their time quota,
    >> - the mass_import script is killed and should stop as soon as possible.
    >> 
    >>> Something like 1 minute would be generous,
    >> 
    >> losas will hate us if we make them wait 1 minute just because.
    >>

> I think something like 10s is reasonable to give it a chance to
    > shutdown cleanly,

It won't shutdown cleanly if it can't obey SIGTERM, that's why it needs
to be SIGKILLED.

> rather than killing it and leaving locks open, etc. Remember, by
    > this point the system is probably under load,

Not especially, the processes that don't respond to SIGTERM do so on
lightly loaded systems as well.

I would even argue that SIGKILL may be the one to use if the system is
under too much load to get back to a situation where the admin/operator
is back in control.

If there is more cleanup to do at that point, then having a responsive
system is a pre-requisite anyway.

> and may take a bit to context switch. I think 2s is a bit short,
    > but we can live with it.

Ob both jubany and my local importer, killing the mass_import script and
8 running importers go down in less than 1 second:

2011-03-16 17:01:27,974 - __main__ - INFO - Received signal
....
2011-03-16 17:01:28,916 - __main__ - INFO - Driver finished: stopping

This includes collecting the tracebacks for all of them (which imply
they all got ample time to propagate the exception raised when they
receive the SIGTERM signal).

This may change if we don't use a multi-core CPU once we leave jubany,
but at this point in time, I hope this constant will be obtained from a
config file anyway so we could tune it at will.

<snip/>

Now if you have working and tested code using mkfifo to implement the
same kind of synchronization, I'd be happy to use it. I just don't want
to invest effort into this area without a need for it.

>>> 
    >>> If you are worried that they'll never open the file and you'll block.
    >>> You can O_NONBLOCK and poll like you are doing now. Or use something
    >>> like signal.alarm() so that the test suite dies if the file never opens.
    >>> (You could even install a SIGALRM handler and just have EINTR triggered
    >>> on the open() call.)
    >> 
    >> No, no, no, signals and threads can't be mixed (and even if they
    >> could be used in this particular case, I'd prefer to avoid them
    >> as long as possible).
    >>

> Except you're already using signals...

Only targeted at external processes, I don't mix them with threads nor
do I try to receive them from a thread (which can't be done except for
the main one).

« Back to merge proposal