Merge lp:~rvb/maas/tweak-timeouts into lp:~maas-committers/maas/trunk

Proposed by Raphaël Badin
Status: Merged
Approved by: Raphaël Badin
Approved revision: no longer in the source branch.
Merged at revision: 4072
Proposed branch: lp:~rvb/maas/tweak-timeouts
Merge into: lp:~maas-committers/maas/trunk
Diff against target: 193 lines (+44/-10)
6 files modified
src/maasserver/api/nodes.py (+2/-3)
src/maasserver/plugin.py (+14/-0)
src/maasserver/tests/test_plugin.py (+14/-2)
src/maasserver/websockets/handlers/node.py (+3/-4)
src/provisioningserver/drivers/power/__init__.py (+9/-0)
src/provisioningserver/rpc/power.py (+2/-1)
To merge this branch: bzr merge lp:~rvb/maas/tweak-timeouts
Reviewer Review Type Date Requested Status
Gavin Panella (community) Approve
Review via email: mp+264018@code.launchpad.net

Commit message

Set the maximum thread pool size to 100: this is meant as a workaround for deadlocks such as the one described in the linked bug. It's not meant to be a permanent solution but rather to buy us time to come up with a proper solution (like an "intelligent" thread pool manager). Unify the power query timeout: use a single timeout across the board and use a value that can accommodate seamicro's tendency to take 30s to reply to a power query.

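For context, the sketch below condenses the two changes from the preview diff (MAX_THREADS and POWER_QUERY_TIMEOUT are the names from the diff; the wrapper function is only for illustration):

    # Condensed sketch of the two changes; names match the preview diff below.
    from datetime import timedelta
    from twisted.internet import reactor

    # One shared power-query timeout. 45 seconds leaves headroom for BMCs
    # (notably seamicro) that can take around 30 seconds to answer a query.
    POWER_QUERY_TIMEOUT = timedelta(seconds=45).total_seconds()

    # Workaround ceiling for the reactor's default thread pool; Twisted's
    # default maximum of 10 is low enough to deadlock (bug 1470013).
    MAX_THREADS = 100

    def configure_pool_size():
        # Keep the minimum at 10 threads, raise the maximum to MAX_THREADS.
        reactor.getThreadPool().adjustPoolsize(10, MAX_THREADS)
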
Revision history for this message
Gavin Panella (allenap):
review: Approve
Revision history for this message
Blake Rouse (blake-rouse) wrote:

Remember, last time we increased this we hit database connection issues. That
is why we lowered it back down. I don't know if increasing the number of
threads is a good idea.

Does Gavin's branch not fix the deadlock issue? Is there another deadlock
that needs this increase?
On Jul 7, 2015 6:27 AM, "Raphaël Badin" <email address hidden> wrote:

> Raphaël Badin has proposed merging lp:~rvb/maas/tweak-timeouts into
> lp:maas.
>
> Commit message:
> Set the maximum thread pool size to 100: this is meant as a workaround for
> deadlocks such as the one described in the linked bug. It's not meant to be
> a permanent solution but rather to buy us time to come up with a proper
> solution (like an "intelligent" thread pool manager). Unify the power query
> timeout: use a single timeout across the board and use a value that can
> accommodate seamicro's tendency to take 30s to reply to a power query.
>
> Requested reviews:
> MAAS Maintainers (maas-maintainers)
> Related bugs:
> Bug #1470013 in MAAS: "Failed power off/status for multiple nodes within
> a SM15K chassis"
> https://bugs.launchpad.net/maas/+bug/1470013
>
> For more details, see:
> https://code.launchpad.net/~rvb/maas/tweak-timeouts/+merge/264018
> --
> Your team MAAS Maintainers is requested to review the proposed merge of
> lp:~rvb/maas/tweak-timeouts into lp:maas.

Revision history for this message
Raphaël Badin (rvb) wrote:

You're right, and that's why I used a conservative number (100). This
is meant to be temporary until we come up with a more intelligent thread
pool.

There is indeed another deadlock: the DNS consolidator (see my comments
on the related bug).

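To make the failure mode concrete, here is a purely hypothetical sketch (not the DNS consolidator code; it uses the stdlib executor rather than Twisted for brevity). Every worker is occupied by an "outer" task that blocks waiting on an "inner" task, and the inner tasks can never run because no worker is free, so the outer tasks starve until their timeouts fire:

    # Hypothetical illustration only -- not MAAS code.
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    POOL_SIZE = 10  # Twisted's default reactor pool maximum is also 10
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

    def inner():
        return "done"

    def outer():
        # Holds one worker thread while waiting for another worker thread.
        return pool.submit(inner).result(timeout=2)

    # Saturate the pool: no worker is left for any of the inner tasks.
    outers = [pool.submit(outer) for _ in range(POOL_SIZE)]
    for future in outers:
        try:
            future.result()
        except TimeoutError:
            print("outer task starved waiting for a free pool thread")

Raising the maximum to 100 makes this kind of starvation much less likely but does not remove the underlying circular wait, which is why the commit message presents it as buying time.
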
On 07/07/2015 04:18 PM, Blake Rouse wrote:
> Remember last time we increased this we hit database connection issues. I
> don't know if increasing the number of threads is a good idea. That is why
> we lowered it back down.
>
> Does Gavin's branch not fix the deadlock issue? Is there another deadlock
> that needs this increase?

Preview Diff

=== modified file 'src/maasserver/api/nodes.py'
--- src/maasserver/api/nodes.py 2015-06-25 10:26:40 +0000
+++ src/maasserver/api/nodes.py 2015-07-07 10:22:52 +0000
@@ -76,6 +76,7 @@
 from maasserver.utils import find_nodegroup
 from maasserver.utils.orm import get_first
 from piston.utils import rc
+from provisioningserver.drivers.power import POWER_QUERY_TIMEOUT
 from provisioningserver.power.poweraction import (
     PowerActionFail,
     UnknownPowerType,
@@ -690,9 +691,7 @@
             power_type=power_info.power_type,
             context=power_info.power_parameters)
         try:
-            # Allow 30 seconds for the power query max as we're holding
-            # up an appserver thread here.
-            state = call.wait(30)
+            state = call.wait(POWER_QUERY_TIMEOUT)
         except crochet.TimeoutError:
             maaslog.error(
                 "%s: Timed out waiting for power response in Node.power_state",
 
=== modified file 'src/maasserver/plugin.py'
--- src/maasserver/plugin.py 2015-05-07 18:14:38 +0000
+++ src/maasserver/plugin.py 2015-07-07 10:22:52 +0000
@@ -24,6 +24,7 @@
 from twisted.plugin import IPlugin
 from twisted.python import usage
 from zope.interface import implementer
+from twisted.internet import reactor
 
 
 def serverFromString(description):
@@ -45,6 +46,14 @@
     ]
 
 
+# The maximum number of threads used by the default twisted thread pool.
+# This value is a trade-off between a small value (such as the default: 10)
+# which can create deadlocks (see 1470013) and a huge value which can cause
+# MAAS to hit other limitations such as the number of open files or the
+# number of concurrent database connexions.
+MAX_THREADS = 100
+
+
 @implementer(IServiceMaker, IPlugin)
 class RegionServiceMaker:
     """Create a service for the Twisted plugin."""
@@ -76,6 +85,10 @@
         import crochet
         crochet.no_setup()
 
+    def _configurePoolSize(self):
+        threadpool = reactor.getThreadPool()
+        threadpool.adjustPoolsize(10, MAX_THREADS)
+
     def _makeIntrospectionService(self, endpoint):
         from provisioningserver.utils import introspect
         introspect_service = (
@@ -91,6 +104,7 @@
         self._configureLogging()
         self._configureDjango()
         self._configureCrochet()
+        self._configurePoolSize()
 
         # Populate the region's event-loop with services.
         from maasserver import eventloop
 
=== modified file 'src/maasserver/tests/test_plugin.py'
--- src/maasserver/tests/test_plugin.py 2015-05-07 18:14:38 +0000
+++ src/maasserver/tests/test_plugin.py 2015-07-07 10:22:52 +0000
@@ -17,6 +17,7 @@
 import crochet
 from maasserver import eventloop
 from maasserver.plugin import (
+    MAX_THREADS,
     Options,
     RegionServiceMaker,
 )
@@ -24,11 +25,13 @@
 from maastesting.testcase import MAASTestCase
 from provisioningserver import logger
 from provisioningserver.utils.twisted import asynchronous
+from testtools.matchers import GreaterThan
 from twisted.application.service import MultiService
+from twisted.internet import reactor
 
 
 class TestOptions(MAASTestCase):
-    """Tests for `provisioningserver.plugin.Options`."""
+    """Tests for `maasserver.plugin.Options`."""
 
     def test_defaults(self):
         options = Options()
@@ -42,7 +45,7 @@
 
 
 class TestRegionServiceMaker(MAASTestCase):
-    """Tests for `provisioningserver.plugin.RegionServiceMaker`."""
+    """Tests for `maasserver.plugin.RegionServiceMaker`."""
 
     def setUp(self):
         super(TestRegionServiceMaker, self).setUp()
@@ -78,3 +81,12 @@
             "Not all services are named.")
         self.assertThat(logger.basicConfig, MockCalledOnceWith())
         self.assertThat(crochet.no_setup, MockCalledOnceWith())
+
+    @asynchronous(timeout=5)
+    def test__sets_pool_size(self):
+        service_maker = RegionServiceMaker("Harry", "Hill")
+        service_maker.makeService(Options())
+        threadpool = reactor.getThreadPool()
+        self.assertEqual(MAX_THREADS, threadpool.max)
+        # Max threads is reasonable.
+        self.assertThat(threadpool.max, GreaterThan(50))
 
=== modified file 'src/maasserver/websockets/handlers/node.py'
--- src/maasserver/websockets/handlers/node.py 2015-05-28 15:07:01 +0000
+++ src/maasserver/websockets/handlers/node.py 2015-07-07 10:22:52 +0000
@@ -48,6 +48,7 @@
 )
 from metadataserver.enum import RESULT_TYPE
 from metadataserver.models import NodeResult
+from provisioningserver.drivers.power import POWER_QUERY_TIMEOUT
 from provisioningserver.logger import get_maas_logger
 from provisioningserver.power.poweraction import (
     PowerActionFail,
@@ -547,9 +548,7 @@
             power_type=power_info.power_type,
             context=power_info.power_parameters)
         try:
-            # Allow 15 seconds for the power query max as we're holding
-            # up a thread waiting.
-            state = call.wait(15)['state']
+            state = call.wait(POWER_QUERY_TIMEOUT)['state']
         except crochet.TimeoutError:
             maaslog.error(
                 "%s: Timed out waiting for power response in "
@@ -569,5 +568,5 @@
         # sure the change is committed and retried if required. Not pushing
         # this to another thread, would result in the entire power query being
         # performed again.
-        update_power_state(state).wait(15)
+        update_power_state(state).wait(POWER_QUERY_TIMEOUT)
         return state
 
=== modified file 'src/provisioningserver/drivers/power/__init__.py'
--- src/provisioningserver/drivers/power/__init__.py 2015-06-23 16:45:25 +0000
+++ src/provisioningserver/drivers/power/__init__.py 2015-07-07 10:22:52 +0000
@@ -13,6 +13,7 @@
 
 __metaclass__ = type
 __all__ = [
+    "POWER_QUERY_TIMEOUT",
     "PowerActionError",
     "PowerAuthError",
     "PowerConnError",
@@ -29,6 +30,7 @@
     abstractmethod,
     abstractproperty,
 )
+from datetime import timedelta
 
 from jsonschema import validate
 from provisioningserver.drivers import (
@@ -52,6 +54,13 @@
 }
 
 
+# Timeout for the power query action. We might be holding up a thread for that
+# long but some BMCs (notably seamicro) can take a long time to respond to
+# a power query request.
+# This should be configurable per-BMC.
+POWER_QUERY_TIMEOUT = timedelta(seconds=45).total_seconds()
+
+
 class PowerError(Exception):
     """Base error for all power driver failure commands."""
 
 
=== modified file 'src/provisioningserver/rpc/power.py'
--- src/provisioningserver/rpc/power.py 2015-06-10 16:45:23 +0000
+++ src/provisioningserver/rpc/power.py 2015-07-07 10:22:52 +0000
@@ -79,7 +79,8 @@
 
 
 # Timeout for change_power_state(). We set it to 5 minutes by default,
-# but it would be lovely if this was configurable.
+# but it would be lovely if this was configurable. This is only a backstop
+# meant to cope with broken BMCs.
 CHANGE_POWER_STATE_TIMEOUT = timedelta(minutes=5).total_seconds()
 
 
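A side note on the Twisted API used in the plugin.py hunk above: adjustPoolsize(min, max) only changes the pool's bounds; worker threads are still created on demand, up to the maximum, as work arrives. A minimal standalone sketch (not MAAS code) of the same call:

    # Standalone illustration of the thread-pool API used above (not MAAS code).
    from twisted.python.threadpool import ThreadPool

    def task():
        print("hello from a pool thread")

    pool = ThreadPool(minthreads=5, maxthreads=10, name="demo")
    pool.adjustPoolsize(minthreads=10, maxthreads=100)  # same shape as _configurePoolSize
    assert pool.min == 10 and pool.max == 100

    pool.start()              # workers are spawned lazily, up to pool.max
    pool.callInThread(task)
    pool.stop()               # join workers and shut the pool down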