Launchpad itself

Restarting codehosting SSH server drops existing connections

Bug #702024 reported by Andrew Bennetts on 2011-01-12

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Triaged	High	Unassigned

Bug Description

Restarting the codehosting SSH server drops existing connections. This means that deploying new versions is quite disruptive, enough that we apparently don't deploy codehosting updates as frequently as lpnet updates. (i.e. monthly rather than daily, IIUC)

A relatively cheap fix would be to change the server to accept a signal of some sort (possibly a POSIX signal like SIGHUP, possibly some other mechanism) that tells it to stop listening on the SSH port. This would allow a new process to begin listening on that port. Existing connections wouldn't be disrupted. Also the server would need to be changed so that after receiving that signal the process would shut itself down once all outstanding connections have closed.

Other resources like pidfiles and perhaps logfiles may need to be handled differently too, a crude mechanism would be to have “a” and “b” directories for those, and when restarting the admin would simply pick the one not in use.

A more elegant approach may be to run N copies of the server behind a load balancer, similarly to how lpnet is deployed, so that individual processes can be taken out of the rotation and stopped and restarted without affecting the rest of the service. This is probably a good idea for scalability reasons anyway.

Tags:

Related branches

lp:~spiv/launchpad/haproxy-for-twisted-services

Merged into lp:launchpad at revision 13434

Robert Collins (community): Approve on 2011-06-15

John A Meinel (community): Approve on 2011-02-03

Revision history for this message

Robert Collins (lifeless) wrote on 2011-01-12:

See also rt #40480 which requests an HA configuration. This makes it a lot simpler: have a status page HA proxy can query, on shutdown requests start issuing 500's to that page, and when the last active client d/c's actually terminate.

Changed in launchpad:
status:	New → Triaged
importance:	Undecided → High

Andrew Bennetts (spiv) on 2011-01-17

Changed in launchpad:
assignee:	nobody → Andrew Bennetts (spiv)

Andrew Bennetts (spiv) on 2011-01-19

Changed in launchpad:
status:	Triaged → In Progress

Revision history for this message

Andrew Bennetts (spiv) wrote on 2011-01-21:

The linked branch adds support for graceful shutdowns for running behind HAProxy. I have a few thoughts about how it could be better...

Among other things it delays process exit until all the existing connections have closed. It occurs to me that because these are SSH connections maybe we want something slightly smarter, because it's possible that users will be holding SSH connections open with no active channels by using the connection sharing/reuse features of clients like OpenSSH. Eventually those users would trip the inactivity timeouts I suppose.

Also, perhaps when waiting for connections to end so we can exit we should shorten the inactivity timeout?

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-02:

We need to consider how this should interact with forking lp-serve. istm there should be

1:haproxy -> n{1:conch -> 1:lp-serve}

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-02:

I guess we need an RT for the production changes. Is there one?

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-02:

Yes, https://rt.admin.canonical.com/Ticket/Display.html?id=40480 will do for this, but it needs a plan for the specific changes.

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2011-02-02:

Fixed in stable r12303 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/12303>.

Changed in launchpad:
milestone:	none → 11.02
tags:	added: qa-needstesting
Changed in launchpad:
status:	In Progress → Fix Committed

Revision history for this message

Robert Collins (lifeless) wrote on 2011-02-03:

The current patch is ok to deploy, but doesn't switch to 500 after TERM is received.

tags:

added: qa-ok
removed: qa-needstesting

William Grant (wgrant) on 2011-02-03

Changed in launchpad:
status:	Fix Committed → Fix Released
status:	Fix Released → In Progress

Curtis Hovey (sinzui) on 2011-02-09

Changed in launchpad:
milestone:	11.02 → 11.03

Revision history for this message

Andrew Bennetts (spiv) wrote on 2011-02-11:

So I think the status is currently:

* code is deployed
* production is running essentially the same configuration as before: one SSH server instance listening directly on the SSH port
* LOSA testing on staging (or qastaging?) of running instances behind HAProxy (rather than directly on SSH port) failed, as the HTTP service always returns 200 OK even after SIGTERM (should return 503)
* testing by me in development environment in a VM works as expected, and I don't have any idea why staging behaves differently.
* (less importantly, there's an outstanding merge proposal to fix a cosmetic glitch in the debug text of the HTTP service: <https://code.launchpad.net/~spiv/launchpad/haproxy-for-twisted-services/+merge/48427>)

I guess the next steps are to spend more time with a LOSA debugging why their tests on staging fail when it works in a development environment.

Also, it's probably worth trying to configure HAProxy to ignore the HTTP service, and just do TCP proxying without polling the HTTP service first. After an instance receives SIGTERM it should immediately stop listening on its TCP port (even though it will continue to service its existing connections), which AIUI should interact nicely with HAProxy (if can't connect to the first instance it tries it can try another instance instead without any disruption to the client).

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-11: Re: [Bug 702024] Re: Restarting codehosting SSH server drops existing connections

On 11 February 2011 11:48, Andrew Bennetts
<email address hidden> wrote:
> Also, it's probably worth trying to configure HAProxy to ignore the HTTP
> service, and just do TCP proxying without polling the HTTP service
> first. After an instance receives SIGTERM it should immediately stop
> listening on its TCP port (even though it will continue to service its
> existing connections), which AIUI should interact nicely with HAProxy
> (if can't connect to the first instance it tries it can try another
> instance instead without any disruption to the client).

+1. In some ways having a http status indication seems like an
unnecessary frill: if the server is accepting tcp connections it's
reasonable for haproxy to send connections to it. If it's accepting
connections but misbehaving then a sysadmin can kill it.

Revision history for this message

Robert Collins (lifeless) wrote on 2011-02-11:

#10

On Fri, Feb 11, 2011 at 2:21 PM, Martin Pool <email address hidden> wrote:
> +1. In some ways having a http status indication seems like an
> unnecessary frill: if the server is accepting tcp connections it's
> reasonable for haproxy to send connections to it. If it's accepting
> connections but misbehaving then a sysadmin can kill it.

Its not really a frill, its very useful to distinguish graceful
restarts from down services; simply turning off the listening port
shows as down; having the status page change appropriately tells
haproxy to quiesce it. That said, it turns out 404 seems to be the
right status code to use, but I'm not going to chase that until we've
got several moving parts sorted:
- appservers using the quiesce framework
- forking service live

I'm -1 (yes, veto) on shutting the listening port to indicated 'being quiesced'.

-Rob

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-11:

#11

On 11 February 2011 14:07, Robert Collins <email address hidden> wrote:
> On Fri, Feb 11, 2011 at 2:21 PM, Martin Pool <email address hidden> wrote:
>> +1. In some ways having a http status indication seems like an
>> unnecessary frill: if the server is accepting tcp connections it's
>> reasonable for haproxy to send connections to it. If it's accepting
>> connections but misbehaving then a sysadmin can kill it.
>
> Its not really a frill, its very useful to distinguish graceful
> restarts from down services; simply turning off the listening port
> shows as down; having the status page change appropriately tells
> haproxy to quiesce it. That said, it turns out 404 seems to be the
> right status code to use, but I'm not going to chase that until we've
> got several moving parts sorted:
> - appservers using the quiesce framework
> - forking service live
>
> I'm -1 (yes, veto) on shutting the listening port to indicated 'being
> quiesced'.

Fair enough. I can see how having that show up in the haproxy page
would be useful, even if haproxy's actually behaviour of sending
requests elsewhere would be the same. (Or would it actually be
different?)

Revision history for this message

Robert Collins (lifeless) wrote on 2011-02-11:

#12

On Fri, Feb 11, 2011 at 4:36 PM, Martin Pool <email address hidden> wrote:
> Fair enough. I can see how having that show up in the haproxy page
> would be useful, even if haproxy's actually behaviour of sending
> requests elsewhere would be the same. (Or would it actually be
> different?)

For an SSH session the behaviour will be the same, for http it can be different.

Relatedly nagios will be looking at the listening port - this is
another thing we want to be able to tell the difference between
being-rebooted and down.

Curtis Hovey (sinzui) on 2011-03-05

Changed in launchpad:
milestone:	11.03 → none

Revision history for this message

Andrew Bennetts (spiv) wrote on 2011-03-23:

#13

Robert said:
> but I'm not going to chase that until we've
> got several moving parts sorted:
> - appservers using the quiesce framework
> - forking service live

These points appear to be the immediate blockers for this bug, and they aren't issues I'm working on, so I'm going to reassign this to Rob, who's more closely involved with that work. (I'm sure he'll reassign again if that's not right.)

The only other big outstanding problem I know of is that test deployment on staging apparently didn't work correctly, but I expect the experience of nailing down exactly how best to use HAProxy for the appservers will help us to debug that.

Changed in launchpad:
assignee:	Andrew Bennetts (spiv) → Robert Collins (lifeless)

Revision history for this message

Robert Collins (lifeless) wrote on 2011-04-01:

#14

So, this can sit for the moment; capacity fixing is the primary losa task at the moment for LP, and when and only when that is sorted can we get back to improving our nodowntime rollout capacity.

Changed in launchpad:
assignee:	Robert Collins (lifeless) → nobody

William Grant (wgrant) on 2011-05-16

tags:

removed: qa-ok

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2011-07-14:

#15

Fixed in stable r13434 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/13434>.

Changed in launchpad:
assignee:	nobody → Andrew Bennetts (spiv)
tags:	added: qa-needstesting
Changed in launchpad:
status:	In Progress → Fix Committed

William Grant (wgrant) on 2011-07-15

Changed in launchpad:
status:	Fix Committed → In Progress

William Grant (wgrant) on 2011-07-15

tags:

added: qa-ok
removed: qa-needstesting

Curtis Hovey (sinzui) on 2012-10-08

Changed in launchpad:
assignee:	Andrew Bennetts (spiv) → nobody
status:	In Progress → Triaged

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

auto-rt.admin.canonical.com #40480 Edit

Bug watches keep track of this bug in other bug trackers.