Merge lp:~rvb/maas/bug-1384001-redux into lp:~maas-committers/maas/trunk

Proposed by Raphaël Badin
Status: Merged
Approved by: Raphaël Badin
Approved revision: no longer in the source branch.
Merged at revision: 3304
Proposed branch: lp:~rvb/maas/bug-1384001-redux
Merge into: lp:~maas-committers/maas/trunk
Diff against target: 96 lines (+28/-12)
3 files modified
src/maasserver/api/nodes.py (+2/-8)
src/maasserver/middleware.py (+10/-2)
src/maasserver/tests/test_middleware.py (+16/-2)
To merge this branch: bzr merge lp:~rvb/maas/bug-1384001-redux
Reviewer Review Type Date Requested Status
Christian Reis (community) Approve
Julian Edwards (community) Approve
Graham Binns (community) Approve
Review via email: mp+239722@code.launchpad.net

Commit message

Return 503 response for PowerActionAlreadyInProgress exceptions. Add 'Retry-after' header for 503 responses: this will allow clients to retry failed requests.

Description of the change

The first change is simply a cleanup: we now use the middleware instead of raising an ad-hoc exception (the test for this already exists).

The second change is a bit more profound: it adds a default Retry-After header for all 503 responses. This will allow API clients such as Juju to retry failed requests.

The related change to gomaasapi is implemented in https://code.launchpad.net/~rvb/gomaasapi/gomaasapi-bug-1384001/+merge/239738.

To post a comment you must log in.
Revision history for this message
Graham Binns (gmb) :
review: Approve
Revision history for this message
Julian Edwards (julian-edwards) wrote :

I appreciate what you're doing but I don't like the direction the code is taking here.

Firstly, IMO mass catch-all middleware is dangerous as developers will be unknowingly shielded from potentially immense cock-ups. Developers need to be aware of all the errors/exceptions coming out of a function all, and deal with them *as soon as possible*. This is a basic tenet of software development.

Secondly, I believe the original middleware fix did not work because the bug that I fixed in the change you're reverting was to stop a *500* error, not a 409 (conflict), which is what should have happened if that middleware was catching the PowerActionAlreadyInProgress correctly.

Thirdly, I don't think a global retry-after value is appropriate, it should be set by each site that raises the error, depending on the situation.

I'm happy to work with you on this if you want to chat about it.

review: Disapprove
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Ok I just had a long chat with Raph about this and the upshot is:

1. We came to some middle (haha) ground with the middleware. Raph was referring to general exceptions that can get thrown from anywhere, such as NoConnectionsAvailable. I was referring to exceptions that are declared in the AMP message itself, which are *known* (or should be) to the call sites as being specific to that call. So the upshot is that I am happy for middleware to catch general errors but I still think call sites are responsible for catching known exceptions, because they are best-placed to deal with them. For example, in the case of PowerOn, if we get a PowerActionAlreadyInProgress error, because that code still deals with potentially many nodes at once, it needs to catch and deal with the error because otherwise you'd rollback the entire transaction and potentially leave nodes started up when they should not be. However we decided to compromise on this for now and leave this caught in the middleware, and let Graham look at how to deal with it when he gets around to finishing the work on making node operations work on one at a time.

2. My second and third points above still stand. I convinced Raph that the third needs callsite-specific values with the example of AMT vs IPMI failures; AMT is very much slower at everything and needs a slower retry period.

Did I cover everything Raph? :)

review: Approve
Revision history for this message
Raphaël Badin (rvb) wrote :

> Ok I just had a long chat with Raph about this and the upshot is:
>
> 1. We came to some middle (haha) ground with the middleware. Raph was
> referring to general exceptions that can get thrown from anywhere, such as
> NoConnectionsAvailable. I was referring to exceptions that are declared in
> the AMP message itself, which are *known* (or should be) to the call sites as
> being specific to that call. So the upshot is that I am happy for middleware
> to catch general errors but I still think call sites are responsible for
> catching known exceptions, because they are best-placed to deal with them.
> For example, in the case of PowerOn, if we get a PowerActionAlreadyInProgress
> error, because that code still deals with potentially many nodes at once, it
> needs to catch and deal with the error because otherwise you'd rollback the
> entire transaction and potentially leave nodes started up when they should not
> be. However we decided to compromise on this for now and leave this caught in
> the middleware, and let Graham look at how to deal with it when he gets around
> to finishing the work on making node operations work on one at a time.
>
> 2. My second and third points above still stand. I convinced Raph that the
> third needs callsite-specific values with the example of AMT vs IPMI failures;
> AMT is very much slower at everything and needs a slower retry period.
>
> Did I cover everything Raph? :)

Yep, I think that's all; my side of the story:

- point 1 is valid but with the code that we have now it's much simpler to let the middlware handle the errors directly; we might want to revisit this when the refactoring gmb is working on lands.

- point 2 will be resolved when a fix for bug 1386549 lands.

- point 3 is worth updating this branch, I agree, I'll work on it now.

Revision history for this message
Raphaël Badin (rvb) wrote :

One more thing: the work I'm planning to do to address point #3 will be done in another branch. This is because it's not entirely sure that we will be able to backport it to 1.7. This gives us the option to only backport this branch to 1.7 if we need to.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Tuesday 28 Oct 2014 10:52:27 you wrote:
> One more thing: the work I'm planning to do to address point #3 will be done
> in another branch. This is because it's not entirely sure that we will be
> able to backport it to 1.7. This gives us the option to only backport this
> branch to 1.7 if we need to.

+1!

Revision history for this message
Christian Reis (kiko) wrote :

This looks good for trunk and 1.7, minor comments below:

review: Approve
Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Tuesday 28 Oct 2014 15:38:18 you wrote:
> > + RETRY_AFTER_SERVICE_UNAVAILABLE = 5
>
> Can the comment say this is seconds?

Better still, use:

RETRY_AFTER_SERVICE_UNAVAILABLE = timdelta(seconds=5).total_seconds()

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'src/maasserver/api/nodes.py'
--- src/maasserver/api/nodes.py 2014-10-23 05:18:42 +0000
+++ src/maasserver/api/nodes.py 2014-10-27 13:22:20 +0000
@@ -82,10 +82,7 @@
82 )82 )
83from provisioningserver.power_schema import UNKNOWN_POWER_TYPE83from provisioningserver.power_schema import UNKNOWN_POWER_TYPE
84from provisioningserver.rpc.cluster import PowerQuery84from provisioningserver.rpc.cluster import PowerQuery
85from provisioningserver.rpc.exceptions import (85from provisioningserver.rpc.exceptions import NoConnectionsAvailable
86 NoConnectionsAvailable,
87 PowerActionAlreadyInProgress,
88 )
89import simplejson as json86import simplejson as json
9087
91# Node's fields exposed on the API.88# Node's fields exposed on the API.
@@ -268,10 +265,7 @@
268 node = Node.objects.get_node_or_404(265 node = Node.objects.get_node_or_404(
269 system_id=system_id, user=request.user,266 system_id=system_id, user=request.user,
270 perm=NODE_PERMISSION.EDIT)267 perm=NODE_PERMISSION.EDIT)
271 try:268 power_action_sent = node.stop(request.user, stop_mode=stop_mode)
272 power_action_sent = node.stop(request.user, stop_mode=stop_mode)
273 except PowerActionAlreadyInProgress as e:
274 raise PowerProblem(e)
275 if power_action_sent:269 if power_action_sent:
276 return node270 return node
277 else:271 else:
278272
=== modified file 'src/maasserver/middleware.py'
--- src/maasserver/middleware.py 2014-10-10 11:03:50 +0000
+++ src/maasserver/middleware.py 2014-10-27 13:22:20 +0000
@@ -350,11 +350,15 @@
350350
351 handled_exceptions = {351 handled_exceptions = {
352 NoConnectionsAvailable: httplib.SERVICE_UNAVAILABLE,352 NoConnectionsAvailable: httplib.SERVICE_UNAVAILABLE,
353 PowerActionAlreadyInProgress: httplib.CONFLICT,353 PowerActionAlreadyInProgress: httplib.SERVICE_UNAVAILABLE,
354 MultipleFailures: httplib.INTERNAL_SERVER_ERROR,354 MultipleFailures: httplib.INTERNAL_SERVER_ERROR,
355 TimeoutError: httplib.GATEWAY_TIMEOUT,355 TimeoutError: httplib.GATEWAY_TIMEOUT,
356 }356 }
357357
358 # Default 'Retry-After' header sent for httplib.SERVICE_UNAVAILABLE
359 # responses.
360 RETRY_AFTER_SERVICE_UNAVAILABLE = 5
361
358 def process_exception(self, request, exception):362 def process_exception(self, request, exception):
359 path_matcher = re.compile(settings.API_URL_REGEXP)363 path_matcher = re.compile(settings.API_URL_REGEXP)
360 if not path_matcher.match(get_relative_path(request.path)):364 if not path_matcher.match(get_relative_path(request.path)):
@@ -380,6 +384,10 @@
380 error_message = get_error_message_for_exception(exception)384 error_message = get_error_message_for_exception(exception)
381385
382 encoding = b'utf-8'386 encoding = b'utf-8'
383 return HttpResponse(387 response = HttpResponse(
384 content=error_message.encode(encoding), status=status,388 content=error_message.encode(encoding), status=status,
385 mimetype=b"text/plain; charset=%s" % encoding)389 mimetype=b"text/plain; charset=%s" % encoding)
390 if status == httplib.SERVICE_UNAVAILABLE:
391 response['Retry-After'] = (
392 self.RETRY_AFTER_SERVICE_UNAVAILABLE)
393 return response
386394
=== modified file 'src/maasserver/tests/test_middleware.py'
--- src/maasserver/tests/test_middleware.py 2014-10-10 13:25:21 +0000
+++ src/maasserver/tests/test_middleware.py 2014-10-27 13:22:20 +0000
@@ -446,7 +446,21 @@
446 (httplib.SERVICE_UNAVAILABLE, error_message),446 (httplib.SERVICE_UNAVAILABLE, error_message),
447 (response.status_code, response.content))447 (response.status_code, response.content))
448448
449 def test_power_action_already_in_progress_returned_as_409(self):449 def test_503_response_includes_retry_after_header_by_default(self):
450 middleware = APIRPCErrorsMiddleware()
451 request = factory.make_fake_request(
452 "/api/1.0/" + factory.make_string(), 'POST')
453 error = NoConnectionsAvailable(factory.make_name())
454 response = middleware.process_exception(request, error)
455
456 self.assertEqual(
457 (
458 httplib.SERVICE_UNAVAILABLE,
459 '%s' % middleware.RETRY_AFTER_SERVICE_UNAVAILABLE,
460 ),
461 (response.status_code, response['Retry-after']))
462
463 def test_power_action_already_in_progress_returned_as_503(self):
450 middleware = APIRPCErrorsMiddleware()464 middleware = APIRPCErrorsMiddleware()
451 request = factory.make_fake_request(465 request = factory.make_fake_request(
452 "/api/1.0/" + factory.make_string(), 'POST')466 "/api/1.0/" + factory.make_string(), 'POST')
@@ -457,7 +471,7 @@
457 response = middleware.process_exception(request, error)471 response = middleware.process_exception(request, error)
458472
459 self.assertEqual(473 self.assertEqual(
460 (httplib.CONFLICT, error_message),474 (httplib.SERVICE_UNAVAILABLE, error_message),
461 (response.status_code, response.content))475 (response.status_code, response.content))
462476
463 def test_multiple_failures_returned_as_500(self):477 def test_multiple_failures_returned_as_500(self):