MAAS becomes unstable after rack controller restart

Bug #1707971 reported by Jason Hobbs
Affects   Status        Importance  Assigned to   Milestone
MAAS      Fix Released  Critical    Blake Rouse
2.2       Fix Released  Critical    Unassigned

Bug Description

Problem
=======
We have an HA setup - 3 region API nodes and 2 rack controllers. When we restart a rack controller, the MAAS API becomes unresponsive, unstable, or slow for a varying period of time. Sometimes it never responds to an API request, sometimes the UI shows as disconnected, sometimes an API request takes 30+ seconds to get a response, and other times less than a second.

Here's a 'zones read' call that fails once and then succeeds. This is done immediately after restarting both rack controllers:
http://paste.ubuntu.com/25221159/

The amount of time it stays this way varies - we currently have a 5 minute sleep after restarting maas-rackd before trying to set up networks through the API, and that isn't always long enough - we sometimes get API calls disconnected without a response.
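(Instead of a fixed sleep, one could poll the API until it actually answers. A minimal sketch of such a wait loop - the `check` callable standing in for an actual 'zones read' probe is hypothetical, and the injectable `clock`/`sleep` parameters exist only to make the helper easy to test:)

```python
import time

def wait_until_responsive(check, timeout=300.0, interval=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    `check` stands in for any cheap API probe (e.g. a 'zones read' call);
    exceptions from check() are treated as "API not ready yet".
    Returns True if the API became responsive, False on timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            if check():
                return True
        except Exception:
            pass  # connection refused / dropped: keep waiting
        sleep(interval)
    return False
```

(With a real probe, `check` might wrap an HTTP GET against the region VIP; the point is to replace the arbitrary 5-minute sleep with a bounded wait that returns as soon as the region responds.)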

Also, the racks sometimes never show up as fully connected again. They show up as 8% connected here:
http://paste.ubuntu.com/25221156/

The logs are full of questionable stuff; "Successfully configured DNS" is repeated over and over:
2017-08-01 16:35:28 maasserver.region_controller: [info] Successfully configured DNS.
2017-08-01 16:35:30 maasserver.region_controller: [info] Successfully configured DNS.
2017-08-01 16:35:32 maasserver.region_controller: [info] Successfully configured DNS.

So are errors like this:
Failed to register rack controller '4shpr4' into the database. Connection will be dropped.

And repeated messages like this:
Aug 1 16:37:05 infra1 maas.rpc.rackcontrollers: [info] Existing rack controller 'infra2' has connected to region 'infra1'.
Aug 1 16:37:12 infra1 maas.rpc.rackcontrollers: [info] Existing rack controller 'infra2' has connected to region 'infra1'.
Aug 1 16:37:22 infra1 maas.service_monitor: [info] Service 'ntp' has been restarted. Its current state is 'on' and 'running'.
Aug 1 16:37:52 infra1 maas.service_monitor: [info] Service 'ntp' has been restarted. Its current state is 'on' and 'running'.

And this:
2017-08-01 16:37:39 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at b'http://[::ffff:10.245.208.33]/MAAS/rpc/').

Expected Behavior
=================
- Restarting a rack controller should not affect region controller API availability. We should be able to restart rack controllers and immediately use the API.
- Restarted rack controllers should not remain in a 'degraded' 8% connected state.

We're using 2.2.2 (6099-g8751f91-0ubuntu1~16.04.1)

Related branches

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs and config files from the 3 maas nodes.

tags: added: foundation-engine
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

Can you provide more information on how MAAS is configured? E.g., which IPs each region has, how they are configured, where the DB is and how it is configured, the rack controllers, etc.

Also, since you have 3 region controllers all connecting to the same DB, you may need to increase the number of connections Postgres can accept. I think that may be part of the problem.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sure - the region APIs are at 10.245.208.30, 10.245.208.31 and 10.245.208.32. We're using hacluster to load balance, with a VIP in front at 10.245.208.33. There are rack controllers on 10.245.208.30 and 10.245.208.31.

Primary postgres is on 10.245.208.30; it's being replicated to a backup postgres on 10.245.208.31. It has a VIP at 10.245.208.34.

Let me know if there is more config info you need to know there.

Revision history for this message
Chris Gregan (cgregan) wrote :

PostgreSQL DB connections should be increased to 120 and retest.

Changed in maas:
status: New → Incomplete
Revision history for this message
John George (jog) wrote :

max_connections is already set to 300.
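(For reference, the limit lives in postgresql.conf and the running value can be confirmed from psql. The path below is the usual Xenial/PostgreSQL 9.5 default, so treat it as an assumption about this deployment:)

```
# /etc/postgresql/9.5/main/postgresql.conf (path assumed for 16.04)
max_connections = 300

# verify the value the server is actually running with:
#   sudo -u postgres psql -c 'SHOW max_connections;'
```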

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

To test the theory that maxing out the postgres connection count is causing this, I edited one MAAS node's postgres to allow only 5 connections. It gives this error in that case:

http://pastebin.ubuntu.com/25233991/

Since we aren't seeing that error, I don't believe this has anything to do with postgres connections. Is there some evidence to support that theory that I don't know about?

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Jason,

The theory was that you did not have enough connections and that was causing issues with the HA. 300 connections is plenty, so that is not the issue. It is something else, unknown at the moment.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

This seems to be the reason why the rack fails to register (the 'transport' object is None):

    http://paste.ubuntu.com/25240443/

But now we have to ask ourselves why that might be.

tags: added: foundations-engine
removed: foundation-engine
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We tested with 2.3.0 alpha1 and it did not fix the issue or change the behavior.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is our haproxy config: http://paste.ubuntu.com/25285768/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Oops, that was cut off. http://pastebin.ubuntu.com/25285797/
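(Since the paste links will eventually expire, here is a rough sketch of what an haproxy frontend/backend balancing three MAAS regions typically looks like. The IPs are taken from comment above; the port, balance mode, and check options are illustrative assumptions, not the exact contents of the paste:)

```
# /etc/haproxy/haproxy.cfg (illustrative sketch, not our exact config)
frontend maas-api
    bind 10.245.208.33:80          # region VIP managed by hacluster
    default_backend maas-regions

backend maas-regions
    balance roundrobin
    # one maas-regiond per region node; 5240 is regiond's default port
    server region1 10.245.208.30:5240 check
    server region2 10.245.208.31:5240 check
    server region3 10.245.208.32:5240 check
```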

Changed in maas:
milestone: none → 2.2.3
assignee: nobody → Blake Rouse (blake-rouse)
importance: Undecided → High
status: New → Triaged
Changed in maas:
status: Triaged → In Progress
Changed in maas:
importance: High → Critical
milestone: 2.2.3 → 2.3.0
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In our test setup, we removed all of the extra IPs so we only have one per region controller now. We still have this issue, although it doesn't remain unstable as long. We still have to delay talking to the region controller after the rack controller is restarted.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

The reason you are probably still experiencing the issue is that we don't have a 2.2 or 2.3 release with this fix yet.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1707971] Re: MAAS becomes unstable after rack controller restart

OK, well, I thought limiting our machines to one IP each would have the same effect. Let me know when there is a build we can test.


Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Jason,

ppa:maas/next-proposed is available. Mind testing that?


Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Testing with 2.3, the behavior here is definitely improved. We no longer see the region controller not responding. It does slow down for about 40 seconds when we restart all the rack controllers - "zones read" takes up to 11 seconds to respond when it usually takes ~2 seconds:

http://paste.ubuntu.com/25587402/

When will this be released in 2.2?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

This is already part of 2.2.3, which is available in ppa:maas/proposed.

Hope this helps!

