Merge lp:~rvb/maas/tweak-timeouts into lp:~maas-committers/maas/trunk

Proposed by Raphaël Badin
Status: Merged
Approved by: Raphaël Badin
Approved revision: no longer in the source branch.
Merged at revision: 4072
Proposed branch: lp:~rvb/maas/tweak-timeouts
Merge into: lp:~maas-committers/maas/trunk
Diff against target: 193 lines (+44/-10)
6 files modified
src/maasserver/api/nodes.py (+2/-3)
src/maasserver/plugin.py (+14/-0)
src/maasserver/tests/test_plugin.py (+14/-2)
src/maasserver/websockets/handlers/node.py (+3/-4)
src/provisioningserver/drivers/power/__init__.py (+9/-0)
src/provisioningserver/rpc/power.py (+2/-1)
To merge this branch: bzr merge lp:~rvb/maas/tweak-timeouts
Reviewer Review Type Date Requested Status
Gavin Panella (community) Approve
Review via email: mp+264018@code.launchpad.net

Commit message

Set the maximum thread pool size to 100: this is meant as a workaround for deadlocks such as the one described in the linked bug. It's not meant to be a permanent solution but rather to buy us time to come up with a proper one (like an "intelligent thread pool manager"). Unify the power query timeout: use a single timeout across the board, with a value large enough to accommodate SeaMicro's tendency to take 30 seconds to reply to a power query.
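
The two knobs described above can be sketched as follows (illustrative only, not MAAS code; the constants mirror the branch, but the stdlib-only form is an assumption — in the branch the pool bound is applied via Twisted's `reactor.getThreadPool().adjustPoolsize(10, MAX_THREADS)`):

```python
from datetime import timedelta

# A single power-query timeout used everywhere: generous enough for
# BMCs (e.g. SeaMicro) that can take ~30s to answer a power query.
POWER_QUERY_TIMEOUT = timedelta(seconds=45).total_seconds()

# The proposed ceiling for the default Twisted thread pool: large
# enough to avoid the known deadlock, small enough to stay clear of
# open-file and database-connection limits.
MAX_THREADS = 100

print(POWER_QUERY_TIMEOUT)  # 45.0
```

Expressing the timeout as a `timedelta` keeps the unit explicit while callers such as `call.wait(POWER_QUERY_TIMEOUT)` still receive a plain float of seconds.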

Revision history for this message
Gavin Panella (allenap) :
review: Approve
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Remember, last time we increased this we hit database connection issues; that is why we lowered it back down. I don't know whether increasing the number of threads is a good idea.

Does Gavin's branch not fix the deadlock issue? Is there another deadlock
that needs this increase?
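
The connection-limit concern can be made concrete with a stdlib sketch (hypothetical illustration, not MAAS code; the names and the limit value are assumptions):

```python
import threading

# If every worker thread can hold a database connection, the usable
# thread pool size is effectively capped by the database server's
# connection limit (e.g. PostgreSQL's max_connections).
MAX_DB_CONNECTIONS = 100  # assumed server-side limit

# A bounded semaphore makes the cap explicit: a thread past the limit
# blocks for a slot instead of failing with "too many connections".
db_slots = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)

def with_db_slot(work):
    """Run `work` while holding one of the limited connection slots."""
    with db_slots:
        return work()
```

This is why simply raising the thread count trades one failure mode (deadlock) for another (connection exhaustion) unless the pool is kept at or below the database's limit.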
On Jul 7, 2015 6:27 AM, "Raphaël Badin" <email address hidden> wrote:

> Raphaël Badin has proposed merging lp:~rvb/maas/tweak-timeouts into
> lp:maas.
>
> Commit message:
> Set the maximum thread pool size to 100: this is meant as a workaround for
> deadlocks such as the one described in the linked bug. It's not meant to be
> a permanent solution but rather to buy us time to come up with a proper
> solution (like an "intelligent thread pool manager). Unify the power query
> timeout: use a unique timeout across the board and use a value that can
> accommodate seamicro's tendency to take 30s to reply to a power query.
>
> Requested reviews:
> MAAS Maintainers (maas-maintainers)
> Related bugs:
> Bug #1470013 in MAAS: "Failed power off/status for multiple nodes within
> a SM15K chassis"
> https://bugs.launchpad.net/maas/+bug/1470013
>
> For more details, see:
> https://code.launchpad.net/~rvb/maas/tweak-timeouts/+merge/264018
> --
> Your team MAAS Maintainers is requested to review the proposed merge of
> lp:~rvb/maas/tweak-timeouts into lp:maas.
>

Revision history for this message
Raphaël Badin (rvb) wrote :

You're right, and that's why I used a conservative number (100). This
is meant to be temporary until we come up with a more intelligent thread
pool.

There is indeed another deadlock: the DNS consolidator (see my comments
on the related bug).
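
This kind of pool-exhaustion deadlock can be reproduced in miniature with a stdlib executor (illustrative only — MAAS uses Twisted's thread pool, and the names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def query(pool_size):
    """Submit a task that itself waits on another task in the same pool.

    With a pool of 1, the outer task occupies the only worker, so the
    inner task can never start: a classic pool-exhaustion deadlock,
    surfaced here as a timeout instead of an indefinite hang.
    """
    pool = ThreadPoolExecutor(max_workers=pool_size)

    def outer():
        inner = pool.submit(lambda: "done")  # queued behind outer
        return inner.result(timeout=0.5)     # waits on the same pool

    future = pool.submit(outer)
    try:
        return future.result(timeout=2)
    except TimeoutError:
        return "deadlocked"
```

With `pool_size=1` the call times out; with `pool_size=2` it returns "done". Raising the maximum only makes exhaustion less likely, which is why this is a stop-gap rather than a fix.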


Preview Diff

=== modified file 'src/maasserver/api/nodes.py'
--- src/maasserver/api/nodes.py	2015-06-25 10:26:40 +0000
+++ src/maasserver/api/nodes.py	2015-07-07 10:22:52 +0000
@@ -76,6 +76,7 @@
 from maasserver.utils import find_nodegroup
 from maasserver.utils.orm import get_first
 from piston.utils import rc
+from provisioningserver.drivers.power import POWER_QUERY_TIMEOUT
 from provisioningserver.power.poweraction import (
     PowerActionFail,
     UnknownPowerType,
@@ -690,9 +691,7 @@
             power_type=power_info.power_type,
             context=power_info.power_parameters)
         try:
-            # Allow 30 seconds for the power query max as we're holding
-            # up an appserver thread here.
-            state = call.wait(30)
+            state = call.wait(POWER_QUERY_TIMEOUT)
         except crochet.TimeoutError:
             maaslog.error(
                 "%s: Timed out waiting for power response in Node.power_state",

=== modified file 'src/maasserver/plugin.py'
--- src/maasserver/plugin.py	2015-05-07 18:14:38 +0000
+++ src/maasserver/plugin.py	2015-07-07 10:22:52 +0000
@@ -24,6 +24,7 @@
 from twisted.plugin import IPlugin
 from twisted.python import usage
 from zope.interface import implementer
+from twisted.internet import reactor


 def serverFromString(description):
@@ -45,6 +46,14 @@
     ]


+# The maximum number of threads used by the default twisted thread pool.
+# This value is a trade-off between a small value (such as the default: 10)
+# which can create deadlocks (see 1470013) and a huge value which can cause
+# MAAS to hit other limitations such as the number of open files or the
+# number of concurrent database connexions.
+MAX_THREADS = 100
+
+
 @implementer(IServiceMaker, IPlugin)
 class RegionServiceMaker:
     """Create a service for the Twisted plugin."""
@@ -76,6 +85,10 @@
         import crochet
         crochet.no_setup()

+    def _configurePoolSize(self):
+        threadpool = reactor.getThreadPool()
+        threadpool.adjustPoolsize(10, MAX_THREADS)
+
     def _makeIntrospectionService(self, endpoint):
         from provisioningserver.utils import introspect
         introspect_service = (
@@ -91,6 +104,7 @@
         self._configureLogging()
         self._configureDjango()
         self._configureCrochet()
+        self._configurePoolSize()

         # Populate the region's event-loop with services.
         from maasserver import eventloop

=== modified file 'src/maasserver/tests/test_plugin.py'
--- src/maasserver/tests/test_plugin.py	2015-05-07 18:14:38 +0000
+++ src/maasserver/tests/test_plugin.py	2015-07-07 10:22:52 +0000
@@ -17,6 +17,7 @@
 import crochet
 from maasserver import eventloop
 from maasserver.plugin import (
+    MAX_THREADS,
     Options,
     RegionServiceMaker,
 )
@@ -24,11 +25,13 @@
 from maastesting.testcase import MAASTestCase
 from provisioningserver import logger
 from provisioningserver.utils.twisted import asynchronous
+from testtools.matchers import GreaterThan
 from twisted.application.service import MultiService
+from twisted.internet import reactor


 class TestOptions(MAASTestCase):
-    """Tests for `provisioningserver.plugin.Options`."""
+    """Tests for `maasserver.plugin.Options`."""

     def test_defaults(self):
         options = Options()
@@ -42,7 +45,7 @@


 class TestRegionServiceMaker(MAASTestCase):
-    """Tests for `provisioningserver.plugin.RegionServiceMaker`."""
+    """Tests for `maasserver.plugin.RegionServiceMaker`."""

     def setUp(self):
         super(TestRegionServiceMaker, self).setUp()
@@ -78,3 +81,12 @@
             "Not all services are named.")
         self.assertThat(logger.basicConfig, MockCalledOnceWith())
         self.assertThat(crochet.no_setup, MockCalledOnceWith())
+
+    @asynchronous(timeout=5)
+    def test__sets_pool_size(self):
+        service_maker = RegionServiceMaker("Harry", "Hill")
+        service_maker.makeService(Options())
+        threadpool = reactor.getThreadPool()
+        self.assertEqual(MAX_THREADS, threadpool.max)
+        # Max threads is reasonable.
+        self.assertThat(threadpool.max, GreaterThan(50))

=== modified file 'src/maasserver/websockets/handlers/node.py'
--- src/maasserver/websockets/handlers/node.py	2015-05-28 15:07:01 +0000
+++ src/maasserver/websockets/handlers/node.py	2015-07-07 10:22:52 +0000
@@ -48,6 +48,7 @@
     )
 from metadataserver.enum import RESULT_TYPE
 from metadataserver.models import NodeResult
+from provisioningserver.drivers.power import POWER_QUERY_TIMEOUT
 from provisioningserver.logger import get_maas_logger
 from provisioningserver.power.poweraction import (
     PowerActionFail,
@@ -547,9 +548,7 @@
             power_type=power_info.power_type,
             context=power_info.power_parameters)
         try:
-            # Allow 15 seconds for the power query max as we're holding
-            # up a thread waiting.
-            state = call.wait(15)['state']
+            state = call.wait(POWER_QUERY_TIMEOUT)['state']
         except crochet.TimeoutError:
             maaslog.error(
                 "%s: Timed out waiting for power response in "
@@ -569,5 +568,5 @@
     # sure the change is committed and retried if required. Not pushing
     # this to another thread, would result in the entire power query being
     # performed again.
-    update_power_state(state).wait(15)
+    update_power_state(state).wait(POWER_QUERY_TIMEOUT)
     return state

=== modified file 'src/provisioningserver/drivers/power/__init__.py'
--- src/provisioningserver/drivers/power/__init__.py	2015-06-23 16:45:25 +0000
+++ src/provisioningserver/drivers/power/__init__.py	2015-07-07 10:22:52 +0000
@@ -13,6 +13,7 @@

 __metaclass__ = type
 __all__ = [
+    "POWER_QUERY_TIMEOUT",
     "PowerActionError",
     "PowerAuthError",
     "PowerConnError",
@@ -29,6 +30,7 @@
     abstractmethod,
     abstractproperty,
     )
+from datetime import timedelta

 from jsonschema import validate
 from provisioningserver.drivers import (
@@ -52,6 +54,13 @@
 }


+# Timeout for the power query action. We might be holding up a thread for that
+# long but some BMCs (notably seamicro) can take a long time to respond to
+# a power query request.
+# This should be configurable per-BMC.
+POWER_QUERY_TIMEOUT = timedelta(seconds=45).total_seconds()
+
+
 class PowerError(Exception):
     """Base error for all power driver failure commands."""


=== modified file 'src/provisioningserver/rpc/power.py'
--- src/provisioningserver/rpc/power.py	2015-06-10 16:45:23 +0000
+++ src/provisioningserver/rpc/power.py	2015-07-07 10:22:52 +0000
@@ -79,7 +79,8 @@


 # Timeout for change_power_state(). We set it to 5 minutes by default,
-# but it would be lovely if this was configurable.
+# but it would be lovely if this was configurable. This is only a backstop
+# meant to cope with broken BMCs.
 CHANGE_POWER_STATE_TIMEOUT = timedelta(minutes=5).total_seconds()
