MAAS becomes unstable after rack controller restart

Bug #1707971 reported by Jason Hobbs
Affects   Status        Importance  Assigned to   Milestone
MAAS      Fix Released  Critical    Blake Rouse
2.2       Fix Released  Critical    Unassigned

Bug Description

Problem
=======
We have an HA setup - 3 region API nodes and 2 rack controllers. When we restart a rack controller, the MAAS API becomes unresponsive, unstable, or slow for a varying period of time. Sometimes it never responds to an API request, sometimes the UI shows as disconnected, sometimes an API request takes 30+ seconds to get a response, and other times less than a second.

Here's a 'zones read' call that fails once and then succeeds. This is done immediately after restarting both rack controllers:
http://paste.ubuntu.com/25221159/

The amount of time it stays this way varies - we currently have a 5 minute sleep after restarting maas-rackd before trying to set up networks through the API, and that isn't always long enough - we sometimes get API calls disconnected without a response.
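(Instead of a fixed sleep, one could poll the API until it actually answers. A minimal sketch of such a wait loop - the `check` callable standing in for an actual 'zones read' probe is hypothetical, and the injectable `clock`/`sleep` parameters exist only to make the helper easy to test:)

```python
import time

def wait_until_responsive(check, timeout=300.0, interval=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    `check` stands in for any cheap API probe (e.g. a 'zones read' call);
    exceptions from check() are treated as "API not ready yet".
    Returns True if the API became responsive, False on timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            if check():
                return True
        except Exception:
            pass  # connection refused / dropped: keep waiting
        sleep(interval)
    return False
```

(With a real probe, `check` might wrap an HTTP GET against the region VIP; the point is to replace the arbitrary 5-minute sleep with a bounded wait that returns as soon as the region responds.)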

Also, the racks sometimes never show up as fully connected again. They show up as 8% connected here:
http://paste.ubuntu.com/25221156/

The logs are full of questionable stuff; "Successfully configured DNS" is repeated over and over:
2017-08-01 16:35:28 maasserver.region_controller: [info] Successfully configured DNS.
2017-08-01 16:35:30 maasserver.region_controller: [info] Successfully configured DNS.
2017-08-01 16:35:32 maasserver.region_controller: [info] Successfully configured DNS.

So are errors like this:
Failed to register rack controller '4shpr4' into the database. Connection will be dropped.

And repeated messages like this:
Aug 1 16:37:05 infra1 maas.rpc.rackcontrollers: [info] Existing rack controller 'infra2' has connected to region 'infra1'.
Aug 1 16:37:12 infra1 maas.rpc.rackcontrollers: [info] Existing rack controller 'infra2' has connected to region 'infra1'.
Aug 1 16:37:22 infra1 maas.service_monitor: [info] Service 'ntp' has been restarted. Its current state is 'on' and 'running'.
Aug 1 16:37:52 infra1 maas.service_monitor: [info] Service 'ntp' has been restarted. Its current state is 'on' and 'running'.

And this:
2017-08-01 16:37:39 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at b'http://[::ffff:10.245.208.33]/MAAS/rpc/').

Expected Behavior
=================
- Restarting a rack controller should not affect region controller API availability. We should be able to restart rack controllers and immediately use the API.
- Restarted rack controllers should not remain in a 'degraded' 8% connected state.

We're using 2.2.2 (6099-g8751f91-0ubuntu1~16.04.1)

Related branches

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs and config files from the 3 maas nodes.

tags: added: foundation-engine
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

Can you provide more information on how MAAS is configured? E.g., which IPs each region has, how they are configured, where the DB is and how it is configured, the rack controllers, etc.

Also, since you have 3 region controllers all connecting to the same DB, you may need to increase the number of connections Postgres can accept. I think that may be part of the problem.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sure - the region APIs are at 10.245.208.30, 10.245.208.31 and 10.245.208.32. We're using hacluster to load balance, with a VIP in front at 10.245.208.33. There are rack controllers on 10.245.208.30 and 10.245.208.31.

Primary postgres is on 10.245.208.30; it's being replicated to a backup postgres on 10.245.208.31. It has a VIP at 10.245.208.34.

Let me know if there is more config info you need to know there.

Revision history for this message
Chris Gregan (cgregan) wrote :

PostgreSQL DB connections should be increased to 120 and retest.

Changed in maas:
status: New → Incomplete
Revision history for this message
John George (jog) wrote :

max_connections is already set to 300.
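(For reference, the limit lives in postgresql.conf and the running value can be confirmed from psql. The path below is the usual Xenial/PostgreSQL 9.5 default, so treat it as an assumption about this deployment:)

```
# /etc/postgresql/9.5/main/postgresql.conf (path assumed for 16.04)
max_connections = 300

# verify the value the server is actually running with:
#   sudo -u postgres psql -c 'SHOW max_connections;'
```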

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

To test the theory that maxing out the postgres connection count is causing this, I edited one MAAS node's postgres to allow only 5 connections. It gives this error in that case:

http://pastebin.ubuntu.com/25233991/

Since we aren't seeing that error, I don't believe this has anything to do with postgres connections. Is there some evidence to support that theory that I don't know about?

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Jason,

The theory was that you did not have enough connections and that was causing issues with the HA. 300 connections is plenty, so that is not the issue. It is something else, unknown at the moment.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

This seems to be the reason why the rack fails to register (the 'transport' object is None):

    http://paste.ubuntu.com/25240443/

But now we have to ask ourselves why that might be.

tags: added: foundations-engine
removed: foundation-engine
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We tested with 2.3.0 alpha1 and it did not fix the issue or change the behavior.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is our haproxy config: http://paste.ubuntu.com/25285768/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Oops, that was cut off. http://pastebin.ubuntu.com/25285797/
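(Since the paste links will eventually expire, here is a rough sketch of what an haproxy frontend/backend balancing three MAAS regions typically looks like. The IPs are taken from comment above; the port, balance mode, and check options are illustrative assumptions, not the exact contents of the paste:)

```
# /etc/haproxy/haproxy.cfg (illustrative sketch, not our exact config)
frontend maas-api
    bind 10.245.208.33:80          # region VIP managed by hacluster
    default_backend maas-regions

backend maas-regions
    balance roundrobin
    # one maas-regiond per region node; 5240 is regiond's default port
    server region1 10.245.208.30:5240 check
    server region2 10.245.208.31:5240 check
    server region3 10.245.208.32:5240 check
```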

Changed in maas:
milestone: none → 2.2.3
assignee: nobody → Blake Rouse (blake-rouse)
importance: Undecided → High
status: New → Triaged
Changed in maas:
status: Triaged → In Progress
Changed in maas:
importance: High → Critical
milestone: 2.2.3 → 2.3.0
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In our test setup, we removed all of the extra IPs so we only have one per region controller now. We still have this issue, although it doesn't remain unstable as long. We still have to delay talking to the region controller after the rack controller is restarted.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

The reason you are probably still experiencing the issue is that we don't have a 2.2 or 2.3 release with this fix yet.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1707971] Re: MAAS becomes unstable after rack controller restart

OK, well, I thought limiting our machines to one IP each would have the same effect. Let me know when there is a build we can test.


Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Jason,

ppa:maas/next-proposed is available. Mind testing that?


Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Testing with 2.3, the behavior here is definitely improved. We no longer see the region controller not responding. It does slow down for about 40 seconds when we restart all the rack controllers - "zones read" takes up to 11 seconds to respond when it usually takes ~2 seconds:

http://paste.ubuntu.com/25587402/

When will this be released in 2.2?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

This is already part of 2.2.3, which is available in ppa:maas/proposed.

Hope this helps!

