Merge lp:~wallyworld/python-jujuclient/retry-on-upgrade into lp:python-jujuclient

Proposed by Ian Booth on 2015-05-31
Status: Merged
Merged at revision: 59
Proposed branch: lp:~wallyworld/python-jujuclient/retry-on-upgrade
Merge into: lp:python-jujuclient
Diff against target: 101 lines (+47/-8)
2 files modified
jujuclient.py (+27/-7)
test_jujuclient.py (+20/-1)
To merge this branch: bzr merge lp:~wallyworld/python-jujuclient/retry-on-upgrade
Reviewer Review Type Date Requested Status
Adam Collard (community) Needs Fixing on 2015-06-01
Kapil Thangavelu 2015-05-31 Approve on 2015-06-01
John A Meinel (community) Approve on 2015-06-01
Review via email: mp+260658@code.launchpad.net

Commit Message

A small moment of time after a freshly bootstrapped Juju environment accepts connections, it may be in an upgrading state.

This branch changes the RPC calls so that if an "upgrade in progress" error is reported, then then RPC call is retried. The retries occur every 1 second, up to a minute.

This change allows the deployer to be more robust to Juju upgrades, since it will now not simply error out.

Description of the Change

Fixes bug bug 146017

A small moment of time after a freshly bootstrapped Juju environment accepts connections, it may be in an upgrading state.

This branch changes the RPC calls so that if an "upgrade in progress" error is reported, then then RPC call is retried. The retries occur every 1 second, up to a minute.

This change allows the deployer to be more robust to Juju upgrades, since it will now not simply error out.

To post a comment you must log in.
60. By Ian Booth on 2015-06-01

Fix test

John A Meinel (jameinel) wrote :

Overall I think LGTM. I do have a couple comments.

John A Meinel (jameinel) :
review: Approve
61. By Ian Booth on 2015-06-01

Fix formatting

Kapil Thangavelu (hazmat) wrote :

i'd suggest UPPER CASING CONSTANTS, the block their being inserted into is full of instance variables where as these are not.

How long is the upgrade expected to last? On new systems 5-10s depending on machine, and extant envs o(n) size of env?

Kapil Thangavelu (hazmat) wrote :

lgtm

review: Approve
review: Needs Fixing
62. By Ian Booth on 2015-06-01

Remove stupid ++

63. By Ian Booth on 2015-06-01

Tweak while loop for rpc retries

Adam Collard (adam-collard) wrote :

I don't think the proposed change will fix the bug.

The error in the linked bug occurred when deployer tried to look at the status (called .get_stat()), which made a new watcher and (tried to) started it. That's when the error that Juju was upgrading occurred.

This branch would make that RPC call to start a new watcher get retried, whilst Juju is upgrading. However, once the upgrade is complete, IIUC one would need to re-authenticate to Juju (and probably renegotiate facade versioning - we've just got a new version of Juju!).

In other words, the retry is at the wrong level.

Ian Booth (wallyworld) wrote :

> I don't think the proposed change will fix the bug.
>
> The error in the linked bug occurred when deployer tried to look at the status
> (called .get_stat()), which made a new watcher and (tried to) started it.
> That's when the error that Juju was upgrading occurred.
>
> This branch would make that RPC call to start a new watcher get retried,
> whilst Juju is upgrading. However, once the upgrade is complete, IIUC one
> would need to re-authenticate to Juju (and probably renegotiate facade
> versioning - we've just got a new version of Juju!).
>
> In other words, the retry is at the wrong level.

There's appears to be a little misunderstanding of what this fix does. It does not handle the case where Juju disconnects clients when the agents reboot after an agent upgrade.

What the fix does is this. When the Juju agent starts up, the API is in a limited "upgrading" mode while Juju a) performs any version upgrade steps, and b) checks if any agent upgrades are required. Until both of these complete, any attempts to do much more than get status will result in an "upgrade in progress" error. There are no agent restarts during any of this. So the deployer will not become disconnected. But it will get confused in the short window where an "upgrade in progress" error is reported, so this MP teaches it how to deal with that. It simply retries until the full API becomes available, which is what happens 99.9999% of the time when no agent upgrades are necessary.

Ian Booth (wallyworld) wrote :

TL;DR; this MP fixes bug 1460171 not bug 1455260

Ian Booth (wallyworld) wrote :

> i'd suggest UPPER CASING CONSTANTS, the block their being inserted into is
> full of instance variables where as these are not.
>
> How long is the upgrade expected to last? On new systems 5-10s depending on
> machine, and extant envs o(n) size of env?

I didn't treat those values as constants, just member variables with defaults. I left it open for them to be changed by set_reconnect_params() if required later.

I have no exact figures on how long upgrade steps (not agent upgrades) will take to run. Typically seconds from what has been observed so far. So 1 minute seems like more than enough. Note this is upgrade steps which run once on the state server, not agent upgrades which requires each agent to restart.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'jujuclient.py'
2--- jujuclient.py 2015-05-19 01:17:21 +0000
3+++ jujuclient.py 2015-06-01 14:14:11 +0000
4@@ -263,11 +263,13 @@
5
6
7 class RPC(object):
8-
9+ _upgrade_retry_delay_secs = 1
10+ _upgrade_retry_count = 60
11 _auth = False
12 _request_id = 0
13 _debug = False
14 _reconnect_params = None
15+
16 conn = None
17
18 def _rpc(self, op):
19@@ -277,6 +279,29 @@
20 op['Params'] = {}
21 op['RequestId'] = self._request_id
22 self._request_id += 1
23+ result = self._rpc_retry_if_upgrading(op)
24+ if 'Error' in result:
25+ # The backend disconnects us on err, bug: http://pad.lv/1160971
26+ self.conn.connected = False
27+ raise EnvError(result)
28+ return result['Response']
29+
30+ def _rpc_retry_if_upgrading(self, op):
31+ """If Juju is upgrading when the specified rpc call is made,
32+ retry the call."""
33+ retry_count = 0
34+ result = {'Response': ''}
35+ while retry_count <= self._upgrade_retry_count:
36+ result = self._send_request(op)
37+ if 'Error' in result and 'upgrade in progress' in result['Error']:
38+ log.info("Juju upgrade in progress...")
39+ retry_count += 1
40+ time.sleep(self._upgrade_retry_delay_secs)
41+ continue
42+ break
43+ return result
44+
45+ def _send_request(self, op):
46 if self._debug:
47 log.debug("rpc request:\n%s" % (json.dumps(op, indent=2)))
48 self.conn.send(json.dumps(op))
49@@ -284,12 +309,7 @@
50 result = json.loads(raw)
51 if self._debug:
52 log.debug("rpc response:\n%s" % (json.dumps(result, indent=2)))
53-
54- if 'Error' in result:
55- # The backend disconnects us on err, bug: http://pad.lv/1160971
56- self.conn.connected = False
57- raise EnvError(result)
58- return result['Response']
59+ return result
60
61 def login(self, password, user="user-admin"):
62 """Login gets shared to watchers for reconnect."""
63
64=== modified file 'test_jujuclient.py'
65--- test_jujuclient.py 2015-05-18 23:26:08 +0000
66+++ test_jujuclient.py 2015-06-01 14:14:11 +0000
67@@ -10,7 +10,7 @@
68 import yaml
69
70 from jujuclient import (
71- Environment, Connector,
72+ Environment, EnvError, Connector, RPC,
73 # Watch wrappers
74 WaitForNoMachines,
75 # Facades
76@@ -34,6 +34,25 @@
77 WaitForNoMachines(watch).run()
78
79
80+class ClientRPCTest(unittest.TestCase):
81+
82+ @mock.patch('jujuclient.RPC._send_request')
83+ @mock.patch('jujuclient.RPC._upgrade_retry_delay_secs', 0.001)
84+ def test_retry_on_upgrade_error(self, send_request):
85+ send_request.return_value = {"Error": "upgrade in progress"}
86+ rpc_client = RPC()
87+ rpc_client.conn = mock.Mock()
88+ self.assertRaises(EnvError, rpc_client.login, "password")
89+ self.assertEquals(send_request.call_count, 61)
90+
91+ @mock.patch('jujuclient.RPC._send_request')
92+ def test_no_retry_required(self, send_request):
93+ send_request.return_value = {"Error": "some other error"}
94+ rpc_client = RPC()
95+ rpc_client.conn = mock.Mock()
96+ self.assertRaises(EnvError, rpc_client.login, "password")
97+ self.assertEquals(send_request.call_count, 1)
98+
99 class ClientConnectorTest(unittest.TestCase):
100
101 @mock.patch('jujuclient.websocket')

Subscribers

People subscribed via source and target branches

to all changes: