mailman doesn't shut down cleanly

Bug #753306 reported by Tom Haddon on 2011-04-07

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Fix Released	Critical	Curtis Hovey

Bug Description

During the last few deployments (possibly longer) mailman hasn't shut down cleanly. We run the initscript stop - https://pastebin.canonical.com/45791/ - and the mailmanctl process is still running and needs to be manually killed. This makes mailman a tough target for a "nodowntime" deployment as it would block on failing to shut down.

Related branches

lp:~sinzui/launchpad/shutdown-mailman

Merged into lp:launchpad at revision 16412

Benji York (community): Approve (code) on 2013-01-07

Tom Haddon (mthaddon) wrote on 2011-04-07:

Per Rob Collins, marking as high priority.

tags:	added: canonical-losa-lp
Changed in launchpad:
importance:	Undecided → High

j.c.sackett (jcsackett) on 2011-04-07

Changed in launchpad:
status:	New → Triaged

Tom Haddon (mthaddon) wrote on 2011-04-11:

I'm going to have to remove it from the "nodowntime" deployment target. See https://pastebin.canonical.com/45912/ and https://pastebin.canonical.com/45913/ (taken from after the first paste completed). I had to kill -9 the processes to get them to go away.

Tom Haddon (mthaddon) wrote on 2011-04-11:

Fwiw, looking at https://pastebin.canonical.com/45915/ it looks like the shutdown script is targeting "/usr/bin/python2.6 -S bin/run -i production-mailman" for shutdown rather than "/usr/bin/python2.6 ./mailmanctl -s start". I'm not sure if this is correct or not, but it explains why the mailmanctl process is still around after a shutdown.

Robert Collins (lifeless) wrote on 2011-04-11: Re: [Bug 753306] Re: mailman doesn't shut down cleanly

I'll move these notes to the bug.

Robert Collins (lifeless) wrote on 2011-04-11:

On Tue, Apr 12, 2011 at 7:43 AM, Robert Collins
<email address hidden> wrote:
> I'll move these notes to the bug.

<FAIL> I thought this was the RT; my bad.

Barry Warsaw (barry) wrote on 2011-04-11:

It's been ages since I looked at this stuff, and IIRC we had to play funny games to get Mailman's startup/shutdown procedure hooked in, but Mailman itself really wants to be shut down with `mailmanctl stop`, so that's what you eventually need to ensure gets called.

Tom Haddon (mthaddon) wrote on 2011-06-09:

This happened again at the rollout for 11.06

Barry Warsaw (barry) wrote on 2011-06-09:

Ping me if you need some help looking into this.

Francis J. Lacoste (flacoste) on 2011-06-09

Changed in launchpad:
importance:	High → Critical

Gary Poster (gary) wrote on 2011-07-26:

I wonder if this is tied in with my diagnosis of bug 791492: mailman's xmlrpc client to LP does not have a socket timeout set, so if LP is brought down while Mailman is trying to talk to it, then maybe it's the socket that's not letting the process die. I'll only claim to try and fix the other bug now, but if this problem goes away because of adding a timeout...wouldn't that be nice!
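
To illustrate the socket-timeout idea in this comment, here is a minimal sketch, assuming Python 3's xmlrpc.client (the processes in this bug ran Python 2.6, where the module is xmlrpclib and the details differ). TimeoutTransport and make_proxy are hypothetical names used only for illustration; this is not the change made for bug 791492.

```python
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """Hypothetical transport whose HTTP connections carry a socket timeout."""

    def __init__(self, timeout=5.0, **kwargs):
        super().__init__(**kwargs)
        self.timeout = timeout

    def make_connection(self, host):
        # The base class builds an http.client.HTTPConnection lazily;
        # setting its timeout before it connects bounds every blocking
        # socket operation instead of letting it wait forever.
        conn = super().make_connection(host)
        conn.timeout = self.timeout
        return conn


def make_proxy(url, timeout=5.0):
    """Build a proxy for a plain http:// URL (https would need SafeTransport)."""
    return xmlrpc.client.ServerProxy(url, transport=TimeoutTransport(timeout))
```

With a transport like this, a call made while the server is down should raise a timeout error after the given number of seconds rather than block indefinitely. As later comments in this bug show, a timeout alone did not turn out to be sufficient here.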

Barry Warsaw (barry) wrote on 2011-07-26:

#10

On Jul 26, 2011, at 06:12 PM, Gary Poster wrote:

>I wonder if this is tied in with my diagnosis of bug 791492: mailman's
>xmlrpc client to LP does not have a socket timeout set, so if LP is
>brought down while Mailman is trying to talk to it, then maybe it's the
>socket that's not letting the process die. I'll only claim to try and
>fix the other bug now, but if this problem goes away because of adding a
>timeout...wouldn't that be nice!

One thing to look at. When Mailman doesn't shut down, which of the qrunners
is still running? All of them, or only XMLRPCRunner?

If it's all of them, then there may be a problem propagating signals from the
master to the child qrunners. If it's just the XMLRPCRunner, then your
scenario could indeed be happening. Note that if any of the child qrunners
don't exit, the master won't exit either since it's waiting on the pids of all
its children.
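
The master/child relationship described above can be sketched roughly as follows. This is an illustration only, assuming a POSIX system and using Python's subprocess module; stop_children is a hypothetical helper, not Mailman's mailmanctl code. It sends SIGTERM to every child, waits up to a shared grace period, and reports the children that are still alive, which is the information needed to tell "all qrunners" apart from "only one qrunner".

```python
import signal
import subprocess
import time


def stop_children(children, grace=5.0):
    """Send SIGTERM to each child, then return those that did not exit.

    `children` is a list of subprocess.Popen objects. `grace` is the total
    number of seconds allowed for all of them to exit.
    """
    for child in children:
        child.send_signal(signal.SIGTERM)

    deadline = time.monotonic() + grace
    stragglers = []
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            # A master that waited here without a timeout would hang
            # for as long as any single child refused to exit.
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            stragglers.append(child)
    return stragglers
```

If the returned list contains every child, signal delivery is the likely suspect; if it contains a single child, that child is probably blocked in its own work.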

Gary Poster (gary) wrote on 2011-07-28:

#11

Thanks, Barry. How can the LOSAs determine which qrunners are still running in this circumstance? The ps listing that Tom gave earlier does not show anything clearly that I see. https://pastebin.canonical.com/45913/

Curtis Hovey (sinzui) wrote on 2011-07-28:

#12

The pastebin does show that XMLRPCRunner is running. It is the only qrunner, in fact.

\_ /usr/bin/python2.6 /.../lib/mailman/bin/qrunner --runner=XMLRPCRunner:0:1 -s

Curtis Hovey (sinzui) wrote on 2011-07-28:

#13

Oh, and compare https://pastebin.canonical.com/45915/ from 2011-04-11 with https://pastebin.canonical.com/45913/ from 2011-07-28. It is clear that all are running in the former, but only XMLRPCRunner is up in the latter.

Gary Poster (gary) wrote on 2011-07-28:

#14

Ah, thanks Curtis! My screen wasn't wide enough.

I'm going to claim that my fix for 791492 is also a fix for this bug then, because it is the only clear problem I see. If I am wrong, we can reopen when we discover that.

Changed in launchpad:
assignee:	nobody → Gary Poster (gary)
status:	Triaged → In Progress

Robert Collins (lifeless) on 2011-07-28

Changed in launchpad:
status:	In Progress → Fix Released

Haw Loeung (hloeung) wrote on 2011-08-02:

#15

It seems this is still a problem - https://pastebin.canonical.com/50568/

I have removed it from the nodowntime set for the time being.

Changed in launchpad:
status:	Fix Released → Triaged

Robert Collins (lifeless) wrote on 2011-08-02:

#16

Barry suggested getting a list of the running processes, so we need to
gather that.

Can you please do a deploy to mailman but capture the processes that
keep running?

Haw Loeung (hloeung) wrote on 2011-08-02:

#17

Process list when trying to stop:

https://pastebin.canonical.com/50569/

Process list after the deployment failed:

https://pastebin.canonical.com/50570/

Gary Poster (gary) wrote on 2011-08-02:

#18

Haw's process list shows the same thing Tom's did before my change: the XMLRPCRunner seems to be the one that is hanging.

One interesting thing is that it is also the only runner going *before* it hangs.

I still wonder if this is tied to the XMLRPCRunner starting to talk to LP before it goes down, but simply providing a socket timeout is apparently not sufficient, unfortunately.

I suspect that, for the next hang, we should get a Python gdb analysis of the Mailman processes with backtrace.py, a la what we do for LP (https://dev.launchpad.net/Debugging/GDB).

Gary Poster (gary) on 2012-01-06

Changed in launchpad:
assignee:	Gary Poster (gary) → nobody

Tom Haddon (mthaddon) wrote on 2012-04-23:

#19

Bitten by this again today.

William Grant (wgrant) on 2012-10-19

tags:	added: mailing-lists

Curtis Hovey (sinzui) wrote on 2013-01-04:

#20

Mailman shuts down each queue runner only when it completes a loop. The xmlrpc runner takes 15 minutes to complete oneloop() on the current hardware. On the older server, that loop would have been closer to 30 minutes. Shutting down mailman will always take as much time as is needed to complete the loops.

The two slowest queues are the xmlrpc runner and the archive runner. For the former, we might add checks to exit the loop early between the steps in oneloop(). The archive runner is a separate problem because the runner waits for mhonarc to complete the regeneration of the archive indexes, which can take 3 minutes for the large archives. Since the pastebins only show the xmlrpc runner, I propose just fixing it to exit early.

Curtis Hovey (sinzui) wrote on 2013-01-04:

#21

mailman's Runner class provides shortcircuit(), which can be used by oneloop() to return early. This is not used by the XMLRPCRunner because it ignores slices. Since we know that many minutes pass for each call to oneloop(), the method can check shortcircuit() between the many atomic steps. The method has 3 calls in it which can be guarded with shortcircuit(). The call to get_subscriptions() can be very long. get_subscriptions() has a looping strategy to ensure it works in small batches, which allows shortcircuit() to be checked for an early exit.
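
The guarding strategy described in this comment might look roughly like the sketch below. It is a simplified stand-in, not the code from the merged branch: SketchRunner, first_step, last_step, and process_batch are hypothetical names, and only the method names oneloop(), shortcircuit(), and get_subscriptions() are taken from the comment itself.

```python
class SketchRunner:
    """Hypothetical stand-in for a queue runner; not Mailman's real class."""

    def __init__(self, batches):
        self._batches = batches  # work units for the long-running step
        self._stop = False       # set once a shutdown has been requested
        self.log = []            # records what ran, for illustration only

    def stop(self):
        """Request shutdown; honored at the next shortcircuit() check."""
        self._stop = True

    def shortcircuit(self):
        """Return True when the runner should exit as soon as possible."""
        return self._stop

    def first_step(self):  # hypothetical placeholder for the first call
        self.log.append("first_step")

    def process_batch(self, batch):  # hypothetical per-batch work
        self.log.append(batch)

    def get_subscriptions(self):
        # Work in small batches so the stop flag is checked often.
        for batch in self._batches:
            if self.shortcircuit():
                return
            self.process_batch(batch)

    def last_step(self):  # hypothetical placeholder for the third call
        self.log.append("last_step")

    def oneloop(self):
        # Guard each of the three calls so that a stop request is honored
        # between steps instead of only after the whole loop has finished.
        for step in (self.first_step, self.get_subscriptions, self.last_step):
            if self.shortcircuit():
                return
            step()
```

With guards like these, the worst-case shutdown delay is bounded by the longest single step or batch rather than by the full duration of oneloop().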

Curtis Hovey (sinzui) wrote on 2013-01-04:

#22

This bug is a symptom of Bug #889326, but the proposed changes will separate the two issues.

Curtis Hovey (sinzui) on 2013-01-04

Changed in launchpad:
assignee:	nobody → Curtis Hovey (sinzui)
status:	Triaged → In Progress

Launchpad QA Bot (lpqabot) wrote on 2013-01-08:

#23

Fixed in stable r16412 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/16412>.

tags:	added: qa-needstesting
Changed in launchpad:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) wrote on 2013-01-08:

#24

We confirmed that staging's mailman shut down and restarted correctly with the fix.

tags:	added: qa-ok
removed: qa-needstesting

Steve Kowalik (stevenk) on 2013-01-09

Changed in launchpad:
status:	Fix Committed → Fix Released
