Merge lp:~rvb/maas/bug-1384001-redux into lp:maas/trunk

Proposed by Raphaël Badin on 2014-10-27
Status: Merged
Approved by: Raphaël Badin on 2014-10-28
Approved revision: 3303
Merged at revision: 3304
Proposed branch: lp:~rvb/maas/bug-1384001-redux
Merge into: lp:maas/trunk
Diff against target: 96 lines (+28/-12)
3 files modified
src/maasserver/api/nodes.py (+2/-8)
src/maasserver/middleware.py (+10/-2)
src/maasserver/tests/test_middleware.py (+16/-2)
To merge this branch: bzr merge lp:~rvb/maas/bug-1384001-redux
Reviewer Review Type Date Requested Status
Christian Reis Approve on 2014-10-28
Julian Edwards (community) Approve on 2014-10-28
Graham Binns (community) 2014-10-27 Approve on 2014-10-27
Review via email: mp+239722@code.launchpad.net

Commit message

Return 503 response for PowerActionAlreadyInProgress exceptions. Add 'Retry-after' header for 503 responses: this will allow clients to retry failed requests.

Description of the change

The first change is simply a cleanup: we now use the middleware instead of raising an ad-hoc exception (the test for this already exists).

The second change is a bit more profound: it adds a default Retry-After header for all 503 responses. This will allow API clients such as Juju to retry failed requests.

The related change to gomaasapi is implemented in https://code.launchpad.net/~rvb/gomaasapi/gomaasapi-bug-1384001/+merge/239738.

To post a comment you must log in.
Graham Binns (gmb) :
review: Approve
Julian Edwards (julian-edwards) wrote :

I appreciate what you're doing but I don't like the direction the code is taking here.

Firstly, IMO mass catch-all middleware is dangerous as developers will be unknowingly shielded from potentially immense cock-ups. Developers need to be aware of all the errors/exceptions coming out of a function all, and deal with them *as soon as possible*. This is a basic tenet of software development.

Secondly, I believe the original middleware fix did not work because the bug that I fixed in the change you're reverting was to stop a *500* error, not a 409 (conflict), which is what should have happened if that middleware was catching the PowerActionAlreadyInProgress correctly.

Thirdly, I don't think a global retry-after value is appropriate, it should be set by each site that raises the error, depending on the situation.

I'm happy to work with you on this if you want to chat about it.

review: Disapprove
Julian Edwards (julian-edwards) wrote :

Ok I just had a long chat with Raph about this and the upshot is:

1. We came to some middle (haha) ground with the middleware. Raph was referring to general exceptions that can get thrown from anywhere, such as NoConnectionsAvailable. I was referring to exceptions that are declared in the AMP message itself, which are *known* (or should be) to the call sites as being specific to that call. So the upshot is that I am happy for middleware to catch general errors but I still think call sites are responsible for catching known exceptions, because they are best-placed to deal with them. For example, in the case of PowerOn, if we get a PowerActionAlreadyInProgress error, because that code still deals with potentially many nodes at once, it needs to catch and deal with the error because otherwise you'd rollback the entire transaction and potentially leave nodes started up when they should not be. However we decided to compromise on this for now and leave this caught in the middleware, and let Graham look at how to deal with it when he gets around to finishing the work on making node operations work on one at a time.

2. My second and third points above still stand. I convinced Raph that the third needs callsite-specific values with the example of AMT vs IPMI failures; AMT is very much slower at everything and needs a slower retry period.

Did I cover everything Raph? :)

review: Approve
Raphaël Badin (rvb) wrote :

> Ok I just had a long chat with Raph about this and the upshot is:
>
> 1. We came to some middle (haha) ground with the middleware. Raph was
> referring to general exceptions that can get thrown from anywhere, such as
> NoConnectionsAvailable. I was referring to exceptions that are declared in
> the AMP message itself, which are *known* (or should be) to the call sites as
> being specific to that call. So the upshot is that I am happy for middleware
> to catch general errors but I still think call sites are responsible for
> catching known exceptions, because they are best-placed to deal with them.
> For example, in the case of PowerOn, if we get a PowerActionAlreadyInProgress
> error, because that code still deals with potentially many nodes at once, it
> needs to catch and deal with the error because otherwise you'd rollback the
> entire transaction and potentially leave nodes started up when they should not
> be. However we decided to compromise on this for now and leave this caught in
> the middleware, and let Graham look at how to deal with it when he gets around
> to finishing the work on making node operations work on one at a time.
>
> 2. My second and third points above still stand. I convinced Raph that the
> third needs callsite-specific values with the example of AMT vs IPMI failures;
> AMT is very much slower at everything and needs a slower retry period.
>
> Did I cover everything Raph? :)

Yep, I think that's all; my side of the story:

- point 1 is valid but with the code that we have now it's much simpler to let the middlware handle the errors directly; we might want to revisit this when the refactoring gmb is working on lands.

- point 2 will be resolved when a fix for bug 1386549 lands.

- point 3 is worth updating this branch, I agree, I'll work on it now.

Raphaël Badin (rvb) wrote :

One more thing: the work I'm planning to do to address point #3 will be done in another branch. This is because it's not entirely sure that we will be able to backport it to 1.7. This gives us the option to only backport this branch to 1.7 if we need to.

Julian Edwards (julian-edwards) wrote :

On Tuesday 28 Oct 2014 10:52:27 you wrote:
> One more thing: the work I'm planning to do to address point #3 will be done
> in another branch. This is because it's not entirely sure that we will be
> able to backport it to 1.7. This gives us the option to only backport this
> branch to 1.7 if we need to.

+1!

Christian Reis (kiko) wrote :

This looks good for trunk and 1.7, minor comments below:

review: Approve
Julian Edwards (julian-edwards) wrote :

On Tuesday 28 Oct 2014 15:38:18 you wrote:
> > + RETRY_AFTER_SERVICE_UNAVAILABLE = 5
>
> Can the comment say this is seconds?

Better still, use:

RETRY_AFTER_SERVICE_UNAVAILABLE = timdelta(seconds=5).total_seconds()

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'src/maasserver/api/nodes.py'
2--- src/maasserver/api/nodes.py 2014-10-23 05:18:42 +0000
3+++ src/maasserver/api/nodes.py 2014-10-27 13:22:20 +0000
4@@ -82,10 +82,7 @@
5 )
6 from provisioningserver.power_schema import UNKNOWN_POWER_TYPE
7 from provisioningserver.rpc.cluster import PowerQuery
8-from provisioningserver.rpc.exceptions import (
9- NoConnectionsAvailable,
10- PowerActionAlreadyInProgress,
11- )
12+from provisioningserver.rpc.exceptions import NoConnectionsAvailable
13 import simplejson as json
14
15 # Node's fields exposed on the API.
16@@ -268,10 +265,7 @@
17 node = Node.objects.get_node_or_404(
18 system_id=system_id, user=request.user,
19 perm=NODE_PERMISSION.EDIT)
20- try:
21- power_action_sent = node.stop(request.user, stop_mode=stop_mode)
22- except PowerActionAlreadyInProgress as e:
23- raise PowerProblem(e)
24+ power_action_sent = node.stop(request.user, stop_mode=stop_mode)
25 if power_action_sent:
26 return node
27 else:
28
29=== modified file 'src/maasserver/middleware.py'
30--- src/maasserver/middleware.py 2014-10-10 11:03:50 +0000
31+++ src/maasserver/middleware.py 2014-10-27 13:22:20 +0000
32@@ -350,11 +350,15 @@
33
34 handled_exceptions = {
35 NoConnectionsAvailable: httplib.SERVICE_UNAVAILABLE,
36- PowerActionAlreadyInProgress: httplib.CONFLICT,
37+ PowerActionAlreadyInProgress: httplib.SERVICE_UNAVAILABLE,
38 MultipleFailures: httplib.INTERNAL_SERVER_ERROR,
39 TimeoutError: httplib.GATEWAY_TIMEOUT,
40 }
41
42+ # Default 'Retry-After' header sent for httplib.SERVICE_UNAVAILABLE
43+ # responses.
44+ RETRY_AFTER_SERVICE_UNAVAILABLE = 5
45+
46 def process_exception(self, request, exception):
47 path_matcher = re.compile(settings.API_URL_REGEXP)
48 if not path_matcher.match(get_relative_path(request.path)):
49@@ -380,6 +384,10 @@
50 error_message = get_error_message_for_exception(exception)
51
52 encoding = b'utf-8'
53- return HttpResponse(
54+ response = HttpResponse(
55 content=error_message.encode(encoding), status=status,
56 mimetype=b"text/plain; charset=%s" % encoding)
57+ if status == httplib.SERVICE_UNAVAILABLE:
58+ response['Retry-After'] = (
59+ self.RETRY_AFTER_SERVICE_UNAVAILABLE)
60+ return response
61
62=== modified file 'src/maasserver/tests/test_middleware.py'
63--- src/maasserver/tests/test_middleware.py 2014-10-10 13:25:21 +0000
64+++ src/maasserver/tests/test_middleware.py 2014-10-27 13:22:20 +0000
65@@ -446,7 +446,21 @@
66 (httplib.SERVICE_UNAVAILABLE, error_message),
67 (response.status_code, response.content))
68
69- def test_power_action_already_in_progress_returned_as_409(self):
70+ def test_503_response_includes_retry_after_header_by_default(self):
71+ middleware = APIRPCErrorsMiddleware()
72+ request = factory.make_fake_request(
73+ "/api/1.0/" + factory.make_string(), 'POST')
74+ error = NoConnectionsAvailable(factory.make_name())
75+ response = middleware.process_exception(request, error)
76+
77+ self.assertEqual(
78+ (
79+ httplib.SERVICE_UNAVAILABLE,
80+ '%s' % middleware.RETRY_AFTER_SERVICE_UNAVAILABLE,
81+ ),
82+ (response.status_code, response['Retry-after']))
83+
84+ def test_power_action_already_in_progress_returned_as_503(self):
85 middleware = APIRPCErrorsMiddleware()
86 request = factory.make_fake_request(
87 "/api/1.0/" + factory.make_string(), 'POST')
88@@ -457,7 +471,7 @@
89 response = middleware.process_exception(request, error)
90
91 self.assertEqual(
92- (httplib.CONFLICT, error_message),
93+ (httplib.SERVICE_UNAVAILABLE, error_message),
94 (response.status_code, response.content))
95
96 def test_multiple_failures_returned_as_500(self):