Merge lp:~vila/udd/735477-kill-harder into lp:udd

Proposed by Vincent Ladeuil on 2011-03-17
Status: Merged
Merged at revision: 416
Proposed branch: lp:~vila/udd/735477-kill-harder
Merge into: lp:udd
Diff against target: 366 lines (+156/-56)
4 files modified
icommon.py (+37/-6)
tests.py (+73/-50)
tests/die_on_sigterm.py (+22/-0)
tests/survive_sigterm.py (+24/-0)
To merge this branch: bzr merge lp:~vila/udd/735477-kill-harder
Reviewer         Review Type    Date Requested    Status
John A Meinel                   2011-03-17        Approve on 2011-03-21
Martin Pool                                       Needs Fixing on 2011-03-18
Review via email: mp+53837@code.launchpad.net

Description of the change

Background
==========

Due to the way the import subprocesses are handled, several
attempts to kill them reliably have been tried. This led to
SubprocessMonitor.kill() taking a parameter for the signal to use
as the subprocess was spawned with 'shell=True' and that didn't
fly very well with SIGTERM. To avoid this issue, 'shell=True' is
no longer used so SIGTERM can be used.

Use cases
=========

There are two cases when an import must be killed:

- when it exceeds its time quota,

- when the mass_import script and all the imports it controls
  should be shutdown.

For the first case, we don't really care about what is currently
going on since this denotes a bug: either the time quota is
misconfigured or the import just takes too long for yet unknown
reasons.

The second case is for emergency issues where the imports should
be stopped as soon as possible because "something" is
wrong. Here, we'd like to allow the imports to shut down cleanly
"if possible".

Bugs
====

The ImportDriver implements a time quota check and kills the
imports that exceed it: at each iteration of the driver main
loop it checks the quota and calls kill for every import that
has exceeded it.

There are two bugs here:
- repeatedly calling kill for a given subprocess led to a bloated log,
- some subprocesses just don't obey SIGTERM.

Fix
===

To avoid issuing log messages repeatedly, a grace period is
implemented during which no kills are attempted so "normal"
imports can die gracefully.

If an import doesn't die during this grace period, kills are
allowed again but with SIGKILL to really shut down over-resistant
jobs.

The grace period is arbitrarily set to 10 seconds which, by
consensus, seems enough.

The "if possible" mentioned above for mass_import emergency kills
is implemented by using SIGTERM so we get a traceback that could
help cleanup kill fallouts. But in case of bugs in the kill
handling of import itself, this should still switch to SIGKILL
quickly (think log bloating here or any bogus behaviour).
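
For readers who don't want to dig through the preview diff below, the
escalation boils down to roughly the following. This is a condensed
editor's sketch of the kill_with_escalation()/_kill() pair added to
icommon.py; the standalone class, its constructor and the
DEFAULT_GRACE_PERIOD name are illustrative (the real code hard-codes
10.0 with a FIXME pointing at a future config option):

    import errno
    import os
    import signal
    import time


    class EscalatingKiller(object):
        """Condensed sketch of the SubprocessMonitor kill escalation."""

        DEFAULT_GRACE_PERIOD = 10.0  # seconds between SIGTERM and SIGKILL

        def __init__(self, proc_pid):
            self.proc_pid = proc_pid
            self.killed_with = None
            self.killed_at = None

        def kill_with_escalation(self, grace_period=None):
            if self.killed_with is None:
                # First attempt: SIGTERM, so the import can log a traceback.
                return self._kill(signal.SIGTERM)
            if grace_period is None:
                grace_period = self.DEFAULT_GRACE_PERIOD
            if time.time() - self.killed_at > grace_period:
                # The subprocess ignored SIGTERM, stop being polite.
                return self._kill(signal.SIGKILL)
            # Still inside the grace period: no new kill, no new log message.
            return False

        def _kill(self, signum):
            if self.proc_pid is None:
                # The subprocess hasn't been started, nothing to kill.
                return False
            try:
                os.kill(self.proc_pid, signum)
            except OSError, e:
                if e.errno in (errno.ECHILD, errno.ESRCH):
                    # The process doesn't exist anymore.
                    return False
                raise
            self.killed_with = signum
            self.killed_at = time.time()
            return True

Repeated calls inside the grace period return False without touching the
process, which is what keeps the driver's main loop from bloating the log.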

John A Meinel (jameinel) wrote :


On 3/17/2011 3:50 PM, Vincent Ladeuil wrote:
> Vincent Ladeuil has proposed merging lp:~vila/udd/735477-kill-harder into lp:udd.
>
> Requested reviews:
> Ubuntu Distributed Development Developers (udd)
> Related bugs:
> Bug #735477 in Ubuntu Distributed Development: "some imports survive a kill -SIGTERM leading to massive log output and no kill"
> https://bugs.launchpad.net/udd/+bug/735477
>
> For more details, see:
> https://code.launchpad.net/~vila/udd/735477-kill-harder/+merge/53837
>
> At least one import didn't die when killed with SIGTERM (nexuiz-data).
>
> This led to a bloated log with repeated attempts to kill it again.
>
> While SIGTERM should still be the preferred way to kill the imports (or we won't get meaningful tracebacks), we still need to be able to kill the imports reliably :)
>
> So this patch ensures that only the first kill for a given subprocess uses SIGTERM and that, after a grace period (defaulting to 2 seconds), we'll use SIGKILL instead.

I think 2 seconds is a bit short for a grace period. If bzr is doing
anything significant. Something like 1 minute would be generous, and
still avoid having a child sitting around for days before manually
killing it.

Maybe 30s if you want something lower.

Why not use 'mkfifo' and then actually do inter-process communication,
instead of playing tricks with file content. mkfifo + open() will block
until the file actually is opened by the other side.

If you are worried that they'll never open the file and you'll block.
You can O_NONBLOCK and poll like you are doing now. Or use something
like signal.alarm() so that the test suite dies if the file never opens.
(You could even install a SIGALRM handler and just have EINTR triggered
on the open() call.)
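
(Editor's sketch, not part of the branch: the blocking-FIFO rendezvous
John describes could look roughly like this; the helper names and the
FIFO path are made up for illustration.)

    import os


    def make_ready_fifo(fifo_path='ready_to_die'):
        """Create the rendezvous FIFO; call this before spawning the child."""
        os.mkfifo(fifo_path)


    def wait_for_child_ready(fifo_path='ready_to_die'):
        """Block until the child opens ``fifo_path`` for writing and writes."""
        # open() on a FIFO blocks until the other end is opened as well, so
        # returning from it is the synchronization event.
        f = open(fifo_path)
        try:
            return f.read()
        finally:
            f.close()

    # The child script would then signal readiness with something like:
    #     open('ready_to_die', 'w').write('Yes I am\n')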

Otherwise something like this seems fine. I certainly support falling
back to SIGKILL if SIGTERM/SIGINT aren't doing their job.

 review: needsfixing
 merge: approve

John
=:->

review: Needs Fixing
Vincent Ladeuil (vila) wrote :

> I think 2 seconds is a bit short for a grace period. If bzr is doing
> anything significant.

Keep in mind that this grace period is between two kills, not between the job start and the kill.

There are actually two situations where we kill imports:
- they have exceeded their time quota,
- the mass_import script is killed and should stop as soon as possible.

> Something like 1 minute would be generous,

losas will hate us if we make them wait 1 minute just because.

> and
> still avoid having a child sitting around for days before manually
> killing it.

For the nexuiz-data case, the issue is that SIGTERM is not honored so, again, there is no point in waiting; only SIGKILL will do.

> Why not use 'mkfifo' and then actually do inter-process communication,
> instead of playing tricks with file content. mkfifo + open() will block
> until the file actually is opened by the other side.

Reliability. I don't want to test mkfifo nor dive into whatever problems I can encounter with inter-process communications nor python/Ubuntu specifics. These tricks are used only in the tests and are easy enough to grasp for test writers.

>
> If you are worried that they'll never open the file and you'll block.
> You can O_NONBLOCK and poll like you are doing now. Or use something
> like signal.alarm() so that the test suite dies if the file never opens.
> (You could even install a SIGALRM handler and just have EINTR triggered
> on the open() call.)

No, no, no, signals and threads can't be mixed (and even if they could be used in this particular case, I'd prefer to avoid them as long as possible).

>
> Otherwise something like this seems fine. I certainly support falling
> back to SIGKILL if SIGTERM/SIGINT aren't doing their job.
>
> review: needsfixing
> merge: approve

Thanks.

Vincent Ladeuil (vila) wrote :

Oh, and the mass_import can use a STOP_PLEASE file to stop gracefully, so here we're talking about really *killing* as fast as possible as opposed to letting the imports finish peacefully.

John A Meinel (jameinel) wrote :


On 3/17/2011 5:39 PM, Vincent Ladeuil wrote:
>
>> I think 2 seconds is a bit short for a grace period. If bzr is doing
>> anything significant.
>
> Keep in mind that this grace period is between 2 kills not between the job start and the kill.
>
> There is actually two situations where we kill imports:
> - they have exceeded their time quota,
> - the mass_import script is killed and should stop as soon as possible.
>
>> Something like 1 minute would be generous,
>
> losas will hate us if we make them wait 1 minute just because.
>

I think something like 10s is reasonable to give it a chance to shutdown
cleanly, rather than killing it and leaving locks open, etc. Remember,
by this point the system is probably under load, and may take a bit to
context switch. I think 2s is a bit short, but we can live with it.

>> and
>> still avoid having a child sitting around for days before manually
>> killing it.
>
> For the nexuis-data case, the issue is that SIGTERM is not honored so again, there is no point in waiting, only SIGKILL will do.
>
>> Why not use 'mkfifo' and then actually do inter-process communication,
>> instead of playing tricks with file content. mkfifo + open() will block
>> until the file actually is opened by the other side.
>
> Reliability. I don't want to test mkfifo nor dive into whatever problems I can encounter with inter-process communications nor python/Ubuntu specifics. These tricks are used only in the tests and are easy enough to grasp for test writers.
>

Honestly, I think mkfifo is going to be significantly more reliable. It
is what it was made for, versus hacking around getting half the text
written to a file, or any other such thing for a regular file.

>>
>> If you are worried that they'll never open the file and you'll block.
>> You can O_NONBLOCK and poll like you are doing now. Or use something
>> like signal.alarm() so that the test suite dies if the file never opens.
>> (You could even install a SIGALRM handler and just have EINTR triggered
>> on the open() call.)
>
> No, no, no, signals and threads can't be mixed (and even if they could be used in this particular case, I'd prefer to avoid them as long as possible).
>

Except you're already using signals...

John
=:->

>>
>> Otherwise something like this seems fine. I certainly support falling
>> back to SIGKILL if SIGTERM/SIGINT aren't doing their job.
>>
>> review: needsfixing
>> merge: approve
>
> Thanks.


John A Meinel (jameinel) wrote :


On 3/17/2011 5:58 PM, Vincent Ladeuil wrote:
> Oh, and the mass_import can use a STOP_PLEASE file to stop gracefully, so here we're talking about really *killing* as fast as possible as opposed to letting the imports finish peacefully.

Isn't that for the main script, not the individual importers?

John
=:->


Vincent Ladeuil (vila) wrote :

>>>>> John Arbash Meinel <email address hidden> writes:

    > On 3/17/2011 5:58 PM, Vincent Ladeuil wrote:

    >> Oh, and the mass_import can use a STOP_PLEASE file to stop
    >> gracefully, so here we're talking about really *killing* as fast
    >> as possible as opposed to letting the imports finish peacefully.

    > Isn't that for the main script, not the individual importers?

Yes, mass_import will stop starting new importers and wait for the
existing ones to finish.

Vincent Ladeuil (vila) wrote :

>>>>> John Arbash Meinel <email address hidden> writes:

    > On 3/17/2011 5:39 PM, Vincent Ladeuil wrote:
    >>
    >>> I think 2 seconds is a bit short for a grace period. If bzr is doing
    >>> anything significant.
    >>
    >> Keep in mind that this grace period is between 2 kills not between the job start and the kill.
    >>
    >> There is actually two situations where we kill imports:
    >> - they have exceeded their time quota,
    >> - the mass_import script is killed and should stop as soon as possible.
    >>
    >>> Something like 1 minute would be generous,
    >>
    >> losas will hate us if we make them wait 1 minute just because.
    >>

    > I think something like 10s is reasonable to give it a chance to
    > shutdown cleanly,

It won't shut down cleanly if it doesn't obey SIGTERM; that's why it needs
to be SIGKILLed.

    > rather than killing it and leaving locks open, etc. Remember, by
    > this point the system is probably under load,

Not especially, the processes that don't respond to SIGTERM do so on
lightly loaded systems as well.

I would even argue that SIGKILL may be the one to use when the system is
under too much load, so we can get back to a situation where the
admin/operator is in control.

If there is more cleanup to do at that point, then having a responsive
system is a pre-requisite anyway.

    > and may take a bit to context switch. I think 2s is a bit short,
    > but we can live with it.

On both jubany and my local importer, the mass_import script and
8 running importers go down in less than 1 second when killed:

2011-03-16 17:01:27,974 - __main__ - INFO - Received signal
....
2011-03-16 17:01:28,916 - __main__ - INFO - Driver finished: stopping

This includes collecting the tracebacks for all of them (which implies
they all got ample time to propagate the exception raised when they
received the SIGTERM signal).

This may change if we don't use a multi-core CPU once we leave jubany,
but at this point in time, I hope this constant will be obtained from a
config file anyway so we could tune it at will.

<snip/>

    > Honestly, I think mkfifo is going to be significantly more reliable. It
    > is what it was made for, versus hacking around getting half the text
    > written to a file, or any other such thing for a regular file.

But it's not production code, and the tests don't care about the file
content; creating the files just allows the test to ensure that the
subprocess has reached a specific point in its execution, much like I use
threading.Event objects to synchronize threads in tests.

Now if you have working and tested code using mkfifo to implement the
same kind of synchronization, I'd be happy to use it. I just don't want
to invest effort into this area without a need for it.
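
(For reference, the file-based synchronization Vincent is defending is the
wait_for_file() helper added to tests.py in this branch; condensed, and
with self.fail() replaced by a plain exception so it stands alone, it is
roughly:)

    import time


    def wait_for_file(path, timeout=2.0, a_slice=0.001):
        """Poll until ``path`` appears on the file system.

        Simple inter-process synchronization: the subprocess creates the
        file when it reaches a known point in its execution.
        """
        waited = 0.0
        while waited < timeout:
            try:
                open(path).close()
                return
            except IOError:
                time.sleep(a_slice)
                waited += a_slice
        raise AssertionError('%s never appeared' % (path,))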

    >>>
    >>> If you are worried that they'll never open the file and you'll block.
    >>> You can O_NONBLOCK and poll like you are doing now. Or use something
    >>> like signal.alarm() so that the test suite dies if the file never opens.
    >>> (You could even install a SIGALRM handler and just have EINTR triggered
    >>> on the open() call.)
    >>
    >> No, no, no, signals and threads can't be mixed (and even ...


Martin Pool (mbp) wrote :

I agree with John that 2s is not enough time for a process to finish cleanly if the machine is heavily loaded. If you're not going to wait long enough to give it a chance to finish cleanly, you might as well sigkill it in the first place.

typo 'loating'

The docstring for 'kill' strongly implies it will actually wait that many seconds and then kill it, whereas in fact it will use SIGKILL only if it has already been signalled recently.

kill(signal.SIGINT) and _kill(signal.SIGINT) will both pass, but the first won't do what you expect.

All in all I think you should call your new method something different, like kill_with_escalation.

review: Needs Fixing
lp:~vila/udd/735477-kill-harder updated on 2011-03-18
417. By Vincent Ladeuil on 2011-03-18

Better docstrings for SubprocessMonitor kill methods and a larger default grace period.

Vincent Ladeuil (vila) wrote :

Cover letter rewritten for clarity, in an attempt to address the misunderstandings; please re-review.

Vincent Ladeuil (vila) wrote :

As a data point, in mass-import-in-etc we do:

    restart)
        echo -n "Restarting $NAME"
        d_stop
        sleep 1
        d_start
        echo "."
        ;;

d_stop being:

d_stop() {
    start-stop-daemon --stop --pidfile $PIDFILE --exec /usr/bin/python
}

which by default sends a SIGTERM.

So we wait 1 second for the mass_import.py to shut down before starting it again.

'graceful_stop' now is a different beast, but that's not what this proposal is about.

John A Meinel (jameinel) wrote :


On 3/18/2011 5:56 PM, Vincent Ladeuil wrote:
> As a data point, in mass-import-in-etc we do:
>
> restart)
> echo -n "Restarting $NAME"
> d_stop
> sleep 1
> d_start
> echo "."
> ;;
>
> d_stop being:
>
> d_stop() {
> start-stop-daemon --stop --pidfile $PIDFILE --exec /usr/bin/python
> }
>
> which by default send a SIGTERM.
>
> So we wait 1 second for the mass_import.py to shut down before starting it again.
>
> 'graceful_stop' now is a different beast but that;s not what this proposal is about.

Wouldn't it make more sense to actually wait for the pid to disappear?
Note, though, that if it is something listening on a socket, etc., you
often don't have to wait for it to die completely before restarting, as
long as it won't try to respond to new requests.

I don't think mass_import follows that specific logic, but as long as it
won't try to start new imports, you can bring up the next process, even
if this one isn't dead yet.

I wonder how scripts like apache's start/stop/restart logic work, since
they certainly don't just sleep expecting the service to stop in that time.

John
=:->

John A Meinel (jameinel) wrote :

It seems like this breaks the restart script, since with any grace period the service won't have stopped before the next one is started. But the actual logic seems fine to me.

review: Approve
John A Meinel (jameinel) wrote :

One small nit, it would be nice to have the 10 second constant pulled out of the code and into a module/class level constant.

Vincent Ladeuil (vila) wrote :

> One small nit, it would be nice to have the 10 second constant pulled out of
> the code and into a module/class level constant.

You mean you prefer a module/class constant over a config option as indicated in the FIXME ???

John A Meinel (jameinel) wrote :


On 3/21/2011 4:07 PM, Vincent Ladeuil wrote:
>> One small nit, it would be nice to have the 10 second constant pulled out of
>> the code and into a module/class level constant.
>
> You mean you prefer a module/class constant over a config option as indicated in the FIXME ???

I prefer it to a hard-coded:

 if x is None:
  x = 10

John
=:->
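
(Editor's illustration of the nit: hoist the hard-coded value into a
module-level constant. The names are made up, and the FIXME about reading
the value from a config file still applies.)

    DEFAULT_KILL_GRACE_PERIOD = 10.0  # seconds between SIGTERM and SIGKILL


    def resolve_grace_period(grace_period=None):
        """Fall back to the module-level default when none is given."""
        if grace_period is None:
            return DEFAULT_KILL_GRACE_PERIOD
        return grace_period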


James Westby (james-w) wrote :

On Mon, 21 Mar 2011 14:57:42 -0000, John A Meinel <email address hidden> wrote:
> Review: Approve
> It seems like this breaks the restart script, since any grace period
> is going to not have the service stopped before starting the next
> one. But the actual logic seems fine to me.

The stop call could use "--retry SOMENUMBER" with start-stop-daemon,
which would have it wait some time for the daemon to exit, and then kill
it after that, and wait some more. See the man page for details.

Thanks,

James
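
(Editor's illustration of James's suggestion; the schedule values are
arbitrary. With a retry schedule, d_stop sends SIGTERM, waits up to 10
seconds for the process behind $PIDFILE to exit, then sends SIGKILL and
waits up to 5 more seconds:)

    d_stop() {
        start-stop-daemon --stop --pidfile $PIDFILE --exec /usr/bin/python \
            --retry TERM/10/KILL/5
    }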

Preview Diff

1=== modified file 'icommon.py'
2--- icommon.py 2011-03-16 10:26:17 +0000
3+++ icommon.py 2011-03-18 07:43:31 +0000
4@@ -1980,6 +1980,8 @@
5 self.out = self.err = None
6 self.started_at = self.ended_at = None
7 self.max_duration = max_duration
8+ self.killed_with = None
9+ self.killed_at = None
10
11 def spawn(self):
12 """Spawn the python command in a subprocess."""
13@@ -2015,18 +2017,47 @@
14 """
15 self.join(timeout)
16
17- def kill(self, signum=None):
18- """Kill the subprocess.
19+ def kill_with_escalation(self, grace_period=None):
20+ """Attempt to kill the subprocess.
21+
22+ :param grace_period: A floating number of seconds to wait before using
23+ SIGKILL to kill the subprocess.
24+
25+ When first called, `signal.SIGTERM` is used to allow the subprocess to
26+ handle the kill cleanly. If called repeatedly, this will not try to
27+ kill the process again until the end of the specified
28+ `grace_period`. When the grace period expires, `signal.SIGKILL` is used
29+ so that subprocesses that don't shutdown when receiving
30+ `signal.SIGTERM` can still be shut down.
31
32 This must be called by the controlling thread.
33 """
34 killed = False
35+ if self.killed_with is None:
36+ killed = self._kill(signal.SIGTERM)
37+ else:
38+ # We already tried to kill this this subprocess, if it didn't
39+ # during the grace period, we'll kill it harder
40+ now = time.time()
41+ if grace_period is None:
42+ # FIXME: This should go into a config file -- vila 2011-03-18
43+ # 10 seconds should be more than enough for a process to die or
44+ # something wrong is happening in which case it's better to get
45+ # this problematic subprocess down as quickly as possible.
46+ grace_period = 10.0
47+ if now - self.killed_at > grace_period:
48+ killed = self._kill(signal.SIGKILL)
49+ return killed
50+
51+ def _kill(self, signum):
52+ killed = False
53+ # Don't attempt to kill the subprocess if it hasn't been started
54 if self.proc_pid is not None:
55 try:
56- if signum is None:
57- signum = signal.SIGTERM
58 os.kill(self.proc_pid, signum)
59 killed = True
60+ self.killed_with = signum
61+ self.killed_at = time.time()
62 except OSError, e:
63 if e.errno in (errno.ECHILD, errno.ESRCH):
64 # The process doesn't exist anymore.
65@@ -2105,7 +2136,7 @@
66 for t in self.threads:
67 if t.has_exceeded_time_quota():
68 # Die you sucker
69- t.kill()
70+ t.kill_with_escalation()
71 killed.append(t)
72 return killed
73
74@@ -2144,7 +2175,7 @@
75 def kill_remaining_threads(self):
76 # Kill all the remaining threads
77 for t in self.threads[:]:
78- t.kill()
79+ t.kill_with_escalation()
80 # Collect the thread whether we could kill it or not
81 t.collect()
82 self.threads.remove(t)
83
84=== added directory 'tests'
85=== modified file 'tests.py'
86--- tests.py 2011-03-15 18:35:11 +0000
87+++ tests.py 2011-03-18 07:43:31 +0000
88@@ -20,10 +20,16 @@
89 except ImportError:
90 from debian_bundle import changelog
91
92+
93 sys.path.insert(0, os.path.dirname(__file__))
94 import icommon
95 import import_package
96
97+base_dir = os.path.abspath(os.path.dirname(__file__))
98+def get_test_script_path(name):
99+# global base_dir
100+ return os.path.join(base_dir, 'tests', name)
101+
102
103 class PackageToImportTests(unittest.TestCase):
104
105@@ -503,14 +509,14 @@
106 def test_kill(self):
107 sub = self.run_sleeper_in_subprocess()
108 sub.started.wait()
109- self.assertEquals(True, sub.kill())
110+ self.assertEquals(True, sub.kill_with_escalation())
111 sub.stopped.wait()
112 self.assertEquals(-signal.SIGTERM, sub.retcode)
113
114 def test_kill_SIGKILL(self):
115 sub = self.run_sleeper_in_subprocess()
116 sub.started.wait()
117- self.assertEquals(True, sub.kill(signal.SIGKILL))
118+ self.assertEquals(True, sub._kill(signal.SIGKILL))
119 sub.stopped.wait()
120 self.assertEquals(-signal.SIGKILL, sub.retcode)
121
122@@ -541,7 +547,7 @@
123 # Release the thread so it can call proc.communicate
124 resume.set()
125 # Now pretend we don't know the subprocess is dead
126- self.assertEquals(False, sub.kill())
127+ self.assertEquals(False, sub.kill_with_escalation())
128 sub.join()
129 # No useful retcode available
130 self.assertEquals(None, sub.retcode)
131@@ -561,48 +567,65 @@
132 sub.start()
133 return sub
134
135- def test_kill_caught(self):
136- self.build_tree_contents([('test.py',
137- """\
138-class Killed(Exception):
139-
140- def __str__(self):
141-
142- return "I'm so dead"
143-
144-def die(signum, frame):
145- raise Killed
146-import signal
147-signal.signal(signal.SIGTERM, die)
148-f = open('ready_to_die', 'w')
149-try:
150- f.write('Yes I am\\n')
151-finally:
152- f.close()
153-import time
154-slice = 0.001 # seconds
155-max = 120.0 # seconds
156-# We want to die long after the test fail (it it succeeds we're dead already)
157-while xrange(max/slice):
158- time.sleep(0.001)
159-""")])
160- sub = self.run_in_subprocess(['test.py'])
161- sub.started.wait()
162- # Gross hack to ensure the script ran long enough to catch the signal
163- ready = False
164- while not ready:
165+ def wait_for_file(self, path, timeout=2.0, a_slice=0.001):
166+ """Wait for the given file to appear on the file system.
167+
168+ This is used as a simple inter-process synchronization mechanism.
169+ """
170+ waited_time = 0.0
171+ while waited_time < timeout:
172 try:
173- f = open('ready_to_die')
174- ready = True
175+ f = open(path)
176+ f.close()
177+ break
178 except:
179- pass
180- self.assertEquals(True, sub.kill())
181+ time.sleep(a_slice)
182+ waited_time += a_slice
183+ else:
184+ self.fail('%s never appeared' % (path,))
185+
186+ def test_kill_caught(self):
187+ sub = self.run_in_subprocess(
188+ [get_test_script_path('die_on_sigterm.py')])
189+ sub.started.wait()
190+ self.wait_for_file('ready_to_die')
191+ self.assertEquals(True, sub.kill_with_escalation())
192 sub.stopped.wait()
193- # The signal has been caught, we just get a failure return code, not
194+ # The signal has been caught, we just get a failure return code, no
195 # funky stuff with the signal in the low byte and so on
196 self.assertEquals(1, sub.retcode)
197- # after the split, the last line is empty
198- self.assertEndsWith(sub.err.split('\n')[-2], "I'm so dead")
199+ self.assertEndsWith(sub.err.splitlines()[-1], "I'm so dead")
200+
201+ def test_kill_harder_directly(self):
202+ sub = self.run_in_subprocess(
203+ [get_test_script_path('survive_sigterm.py')])
204+ sub.started.wait()
205+ self.wait_for_file('ready_to_die')
206+ self.assertEquals(True, sub.kill_with_escalation())
207+ self.wait_for_file('surviving')
208+ # The subprocess is still alive !
209+ self.assertEquals(True, sub._kill(signal.SIGKILL))
210+ sub.stopped.wait()
211+ # The SIGKILL was the succeeding one
212+ self.assertEquals(-9, sub.retcode)
213+ self.assertEquals(sub.out, 'Try harder\n')
214+
215+ def test_kill_survivor(self):
216+ sub = self.run_in_subprocess(
217+ [get_test_script_path('survive_sigterm.py')])
218+ sub.started.wait()
219+ self.wait_for_file('ready_to_die')
220+ self.assertEquals(True, sub.kill_with_escalation())
221+ self.wait_for_file('surviving')
222+ # The subprocess is still alive !
223+ self.assertEquals(False, sub.kill_with_escalation(grace_period=1.0))
224+ # Ok, we tried to soon, we'll be less patient now
225+ self.assertEquals(True, sub.kill_with_escalation(grace_period=0.0))
226+ sub.stopped.wait()
227+ # The SIGKILL was the succeeding one
228+ self.assertEquals(-9, sub.retcode)
229+ self.assertEquals(sub.out, 'Try harder\n')
230+
231
232 class Sleeper(icommon.SubprocessMonitor):
233
234@@ -762,7 +785,7 @@
235 for i in range(3):
236 s = Sleeper()
237 self.addCleanup(s.join, 0)
238- self.addCleanup(s.kill)
239+ self.addCleanup(s.kill_with_escalation)
240 q.put(s)
241 driver = self.get_started_driver(queue=q, max_threads=3)
242 driver.started.wait()
243@@ -784,7 +807,7 @@
244 sleepers = [Sleeper() for _ in range(3)]
245 for s in sleepers:
246 self.addCleanup(s.join, 0)
247- self.addCleanup(s.kill)
248+ self.addCleanup(s.kill_with_escalation)
249 q.put(s)
250 sleepers[0].started.wait()
251 sleepers[1].started.wait()
252@@ -792,7 +815,7 @@
253 self.assertEquals([sleepers[0], sleepers[1]], driver.threads)
254 # sleepers[2] is still... awake
255 self.assertFalse(sleepers[2].started.isSet())
256- sleepers[0].kill()
257+ sleepers[0].kill_with_escalation()
258 sleepers[0].stopped.wait()
259 # Now the last sleeper will be started
260 sleepers[2].started.wait()
261@@ -804,7 +827,7 @@
262 sleepers = [Sleeper() for _ in range(4)]
263 for s in sleepers:
264 self.addCleanup(s.join, 0)
265- self.addCleanup(s.kill)
266+ self.addCleanup(s.kill_with_escalation)
267 q.put(s)
268 driver = self.get_started_driver(queue=q, max_threads=3)
269 driver.started.wait()
270@@ -817,9 +840,9 @@
271 driver.threads)
272 driver.max_threads = 2
273 # Kill 2 sleepers and wait
274- sleepers[0].kill()
275+ sleepers[0].kill_with_escalation()
276 sleepers[0].stopped.wait()
277- sleepers[1].kill()
278+ sleepers[1].kill_with_escalation()
279 sleepers[1].stopped.wait()
280 # Wait for another sleeper to start (because it's easier to sync on a
281 # sleeper start than trying to catch when a sleeper goes out of the
282@@ -832,7 +855,7 @@
283 sleepers = [Sleeper() for _ in range(3)]
284 for s in sleepers[:-1]: # the third sleeper will never be run
285 self.addCleanup(s.join, 0)
286- self.addCleanup(s.kill)
287+ self.addCleanup(s.kill_with_escalation)
288 q.put(s)
289 driver = self.get_started_driver(queue=q, max_threads=2)
290 driver.started.wait()
291@@ -841,8 +864,8 @@
292 sleepers[1].started.wait()
293 # close the queue
294 driver.queue_closed.set()
295- sleepers[0].kill()
296- sleepers[1].kill()
297+ sleepers[0].kill_with_escalation()
298+ sleepers[1].kill_with_escalation()
299 driver.stopped.wait()
300
301 def test_check_quotas(self):
302@@ -850,7 +873,7 @@
303 sleeper = Sleeper(sleep_time=1, max_duration=0.0)
304 q = Queue.Queue(1)
305 self.addCleanup(sleeper.join, 0)
306- self.addCleanup(sleeper.kill)
307+ self.addCleanup(sleeper.kill_with_escalation)
308 q.put(sleeper)
309 self.killed = None
310 class TestDriver(icommon.ThreadDriver):
311
312=== added file 'tests/die_on_sigterm.py'
313--- tests/die_on_sigterm.py 1970-01-01 00:00:00 +0000
314+++ tests/die_on_sigterm.py 2011-03-18 07:43:31 +0000
315@@ -0,0 +1,22 @@
316+import signal
317+import time
318+
319+class Killed(Exception):
320+
321+ def __str__(self):
322+
323+ return "I'm so dead"
324+
325+def die(signum, frame):
326+ raise Killed
327+signal.signal(signal.SIGTERM, die)
328+f = open('ready_to_die', 'w')
329+try:
330+ f.write('Yes I am\\n')
331+finally:
332+ f.close()
333+slice = 0.001 # seconds
334+max = 120.0 # seconds
335+# We want to die long after the test fail (it it succeeds we're dead already)
336+for i in xrange(int(max/slice)):
337+ time.sleep(slice)
338
339=== added file 'tests/survive_sigterm.py'
340--- tests/survive_sigterm.py 1970-01-01 00:00:00 +0000
341+++ tests/survive_sigterm.py 2011-03-18 07:43:31 +0000
342@@ -0,0 +1,24 @@
343+import signal
344+import sys
345+import time
346+
347+def survive(signum, frame):
348+ print "Try harder"
349+ sys.stdout.flush()
350+ f = open('surviving', 'w')
351+ try:
352+ f.write("No, you'll have to kill me really hard\\n")
353+ finally:
354+ f.close()
355+
356+signal.signal(signal.SIGTERM, survive)
357+f = open('ready_to_die', 'w')
358+try:
359+ f.write("No, you'll have to kill me really hard\\n")
360+finally:
361+ f.close()
362+slice = 0.001 # seconds
363+max = 120.0 # seconds
364+# We want to die long after the test fail (it it succeeds we're dead already)
365+for i in xrange(int(max/slice)):
366+ time.sleep(slice)
