Merge lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Proposed by Chris Glass
Status: Work in progress
Proposed branch: lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster
Merge into: lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
Diff against target: 204 lines (+52/-39)
4 files modified
Makefile (+1/-1)
hooks/rabbit_utils.py (+23/-23)
hooks/rabbitmq_server_relations.py (+19/-6)
unit_tests/test_rabbit_utils.py (+9/-9)
To merge this branch: bzr merge lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster
Reviewer Review Type Date Requested Status
Ryan Beisner (community) Needs Information
Geoff Teale (community) Approve
OpenStack Charmers Pending
Review via email: mp+274109@code.launchpad.net

Description of the change

This branch addresses review comments on a previously merged branch (https://code.launchpad.net/~thedac/charms/trusty/rabbitmq-server/le-ignore-min-cluster/+merge/273474); since that branch was already merged, the fixes needed to land as a new branch.

List of changes:
- Changed retry logic to actually work.
- Changed leader_node() to return a single node or None, rather than a list that always contained exactly one element
- Changed a few things to use more pythonic idioms.
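The fixed retry logic described above can be sketched as a bounded loop with an explicit failure at exhaustion. This is a minimal sketch, not the charm's actual code: `cluster_with_retries`, `join_cluster`, and the injectable `sleep` parameter are hypothetical names standing in for the rabbitmqctl stop_app/join_cluster/start_app sequence.

```python
import random
import time


def cluster_with_retries(join_cluster, node, max_attempts, sleep=time.sleep):
    """Bounded-retry sketch: try to cluster with `node` up to max_attempts
    times. `join_cluster` is a hypothetical callable that raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        # A random wait lowers the odds of two peers clustering at the same
        # time -- the same stop-gap the charm uses pending
        # charmhelpers.coordinator.
        sleep(random.random() * 100)
        try:
            join_cluster(node)
        except Exception as exc:
            print('Attempt %d/%d with %s failed: %s'
                  % (attempt, max_attempts, node, exc))
        else:
            print('Clustered with %s on attempt %d/%d'
                  % (node, attempt, max_attempts))
            return True
    # Unlike the old code, running out of attempts is now an explicit error
    # rather than a silent fall-through.
    raise Exception('Maximum number of attempts (%d) exhausted' % max_attempts)
```

The key difference from the pre-fix code is that the attempt counter actually bounds the loop, and exhausting it raises instead of silently returning.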

REVIEWERS: Please add your comments and approval/rejection but let me merge the branch myself (I have a sister branch targeting trunk I want to keep in sync)

Revision history for this message
Chris Glass (tribaal) wrote :

Added some inline notes to reviewers.

Revision history for this message
Geoff Teale (tealeg) wrote :

+1 a few small notes below.

review: Approve
121. By Chris Glass

Added more logging and slightly clearer logic.

122. By Chris Glass

Fix typo.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11710 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11710/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10896 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10896/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7277 rabbitmq-server-next for tribaal mp274109
    AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
2015-10-12 14:53:35,977 configure_rmq_ssl_on DEBUG: Setting ssl charm config option: on
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12763702/
Build: http://10.245.162.77:8080/job/charm_amulet_test/7277/

123. By Chris Glass

Empty commit to kick off OSCI again.

124. By Chris Glass

Added missing unit tests dependency while we're at it.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11720 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11720/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10903 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10903/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7287 rabbitmq-server-next for tribaal mp274109
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7287/

Revision history for this message
David Ames (thedac) wrote :

My 2 cents. I am not rejecting this MP as the retry logic was certainly broken (non-existent) and misleading.

However, the ultimate goal for this charm should be: the non-leader nodes take turns clustering with the leader. Only that will guarantee successful, repeatable, testable clustering. Stuart Bishop's charmhelpers.coordinator will allow us to do this, and the plan is to carve out time after 15.10 to implement it.

The very concept of retry logic is brittle, and it is extremely difficult to test. A single amulet success is insufficient to prove functionality. Ryan and I have been running iterative tests over various series, with upwards of 50 iterations, to "prove" things work.

So this MP, like mine that preceded it, is a stop-gap measure until we can implement a complete solution.

Revision history for this message
Chris Glass (tribaal) wrote :

> So this MP like mine that preceded it is a stop-gap measure, until we can
> implement a complete solution.

Agreed 100% - it is a stop-gap measure, as you said. It simply feels a little less brittle, since I believe it makes the code easier to read and more formally correct.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for your work on this.

The proposed new amulet test file 'tests/05-check-leader-election-clustering' will only exercise the tests on one Ubuntu:OpenStack release combo (Trusty-Kilo). Having just done a major refactor of a half-dozen such tests in order to cover our release matrix, I have to nack it in its current form.

The guidance for oscharm amulet test scenarios continues to be: Tests should be added to basic_deployment.py in a new test_ method, unless they require a different topology, or a config option which would not be compatible with the other existing tests. In those cases, the test scenario is a candidate for an openstack mojo spec test rather than an amulet test.

A distant second alternative would be to exercise 05-check-leader-election-clustering against Precise-Icehouse, Trusty-Icehouse, Trusty-Juno, Trusty-Kilo, Vivid-Kilo (and soon Trusty-Liberty and Wily-Liberty), in the same way that basic_deployment.py is exercised. That would double our deploy count and test time for rmq (already at ~1.5 hrs per test pipeline), so we should discuss further on the os-xteam call before anyone invests in that path.

review: Disapprove
125. By Chris Glass

Removed introduced test.

Revision history for this message
Chris Glass (tribaal) wrote :

Removed the introduced test, since I understand the entry barrier for new tests is full matrix-testability, and this was not the case (not to mention it would need another matrix dimension - Juju versions - to be effective).

The rest of the proposed changes are still valid and relevant (broken code is fixed).

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11778 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11778/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10957 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10957/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The added test coverage is appreciated; however, we do need it to fit the methodology described. To be clear, I'm not advocating for reducing test coverage.

To summarize the current behavior (and what to test):

"When min-cluster-size is set to a value which is higher than the actual number of rmq nodes:

Clustering is expected to fail when LE is *not* available (min not met);

Clustering is expected to succeed when LE is available (min ignored)."

^ This is indeed tricky, as all uosci tests run with LE available, and specifying the Juju version is outside the scope of amulet. If that is indeed our desired behavior, I think a periodic mojo spec regression test is the path.

...

That said...

Given (config.yaml snippet):

  min-cluster-size:
    type: int
    default:
    description: |
      Minimum number of units expected to exist before charm will attempt to
      form a rabbitmq cluster

I'm not a fan of the existing behavior. If I am the user and I tell the charm to have a minimum cluster size of 9, but only feed it 3 units, I would expect mayhem/failure (or to be blocked at the very least). IMO, we are being kind at the expense of trust in a config option and the underlying hooks.

This behavior is what would make sense to me:

with LE, with status, when min > actual:
BLOCKED Require X units, only have N units

w/o LE, with status, when min > actual:
BLOCKED Require X units, only have N units

w/o LE, w/o status, when min > actual:
try, try, try, and eventually fail a hook
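The blocked-status behavior proposed above reduces to a single status computation. The following is a hypothetical sketch, not actual charm code: the helper name and return shape are invented, mirroring the "Require X units, only have N units" message; a real charm would feed the result to charmhelpers' status reporting.

```python
def cluster_readiness_status(min_size, unit_count):
    """Compute a (workload_status, message) pair for the proposed behavior:
    block whenever fewer units exist than min-cluster-size demands.

    Hypothetical sketch only -- not the charm's actual API.
    """
    if min_size is not None and unit_count < min_size:
        return ('blocked',
                'Require %d units, only have %d units' % (min_size, unit_count))
    return ('active', 'Unit is ready')
```

Keeping the decision in a pure function like this also makes the behavior trivially unit-testable, unlike retry loops that depend on hook timing.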

review: Needs Information
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7297 rabbitmq-server-next for tribaal mp274109
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7297/

Unmerged revisions

125. By Chris Glass

Removed introduced test.

124. By Chris Glass

Added missing unit tests dependency while we're at it.

123. By Chris Glass

Empty commit to kick off OSCI again.

122. By Chris Glass

Fix typo.

121. By Chris Glass

Added more logging and slightly clearer logic.

120. By Chris Glass

Add Amulet test for trusty/kilo/leader-election to ensure previously introduced
behavior actually works.

Some refactoring around unnecessary looping.

119. By Adam Collard

Address the review comments I made on stable branch
 * leader_node() returns something or None
 * Fix the retry logic which was crazy bonkers wrong
 * Use better Python idioms

Preview Diff

=== modified file 'Makefile'
--- Makefile 2015-10-06 18:43:11 +0000
+++ Makefile 2015-10-13 13:56:36 +0000
@@ -11,7 +11,7 @@
 	(which dh_clean && dh_clean) || true
 
 .venv:
-	sudo apt-get install -y gcc python-dev python-virtualenv python-apt
+	sudo apt-get install -y gcc python-dev python-virtualenv python-apt python-netifaces
 	virtualenv .venv --system-site-packages
 	.venv/bin/pip install -I -r test-requirements.txt
 

=== modified file 'hooks/rabbit_utils.py'
--- hooks/rabbit_utils.py 2015-10-06 21:52:17 +0000
+++ hooks/rabbit_utils.py 2015-10-13 13:56:36 +0000
@@ -303,22 +303,24 @@
         return False
 
     # check the leader and try to cluster with it
-    if len(leader_node()) == 0:
+    node = leader_node()
+    if node is None:
         log('No nodes available to cluster with')
         return False
 
-    num_tries = 0
-    for node in leader_node():
-        if node in running_nodes():
-            log('Host already clustered with %s.' % node)
-            return False
-        log('Clustering with remote rabbit host (%s).' % node)
+    if node in running_nodes():
+        log('Host already clustered with %s.' % node)
+        return False
+    max_attempts = config('max-cluster-tries')
+    for attempt_number in range(1, max_attempts + 1):
         # NOTE: The primary problem rabbitmq has clustering is when
         # more than one node attempts to cluster at the same time.
         # The asynchronous nature of hook firing nearly guarantees
         # this. Using random time wait is a hack until we can
         # implement charmhelpers.coordinator.
-        time.sleep(random.random()*100)
+        time.sleep(random.random() * 100)
+        log('Clustering attempt %d/%d with remote rabbit host (%s).' % (
+            attempt_number, max_attempts, node))
         try:
             cmd = [RABBITMQ_CTL, 'stop_app']
             subprocess.check_call(cmd)
@@ -326,20 +328,20 @@
             subprocess.check_output(cmd, stderr=subprocess.STDOUT)
             cmd = [RABBITMQ_CTL, 'start_app']
             subprocess.check_call(cmd)
-            log('Host clustered with %s.' % node)
+        except subprocess.CalledProcessError as e:
+            log('Failed to cluster on attempt %d/%d with %s. Exception: %s' % (
+                attempt_number, max_attempts, node, e))
+            cmd = [RABBITMQ_CTL, 'start_app']
+            subprocess.check_call(cmd)
+        else:
+            log('Host clustered on attempt %d/%d with %s.' % (
+                attempt_number, max_attempts, node))
             return True
-        except subprocess.CalledProcessError as e:
-            log('Failed to cluster with %s. Exception: %s'
-                % (node, e))
-            cmd = [RABBITMQ_CTL, 'start_app']
-            subprocess.check_call(cmd)
-            # continue to the next node
-            num_tries += 1
-            if num_tries > config('max-cluster-tries'):
-                log('Max tries number exhausted, exiting', level=ERROR)
-                raise
 
-    return False
+    error_message = "Maximum number of attempts (%d) exhausted, exiting" % (
+        max_attempts)
+    log(error_message, level=ERROR)
+    raise Exception(error_message)
 
 
 def break_cluster():
@@ -609,9 +611,7 @@
     # to avoid split-brain clusters.
     leader_node_ip = peer_retrieve('leader_node_ip')
     if leader_node_ip:
-        return ["rabbit@" + get_node_hostname(leader_node_ip)]
-    else:
-        return []
+        return "rabbit@" + get_node_hostname(leader_node_ip)
 
 
 def get_node_hostname(address):

=== modified file 'hooks/rabbitmq_server_relations.py'
--- hooks/rabbitmq_server_relations.py 2015-10-06 18:55:33 +0000
+++ hooks/rabbitmq_server_relations.py 2015-10-13 13:56:36 +0000
@@ -259,23 +259,36 @@
     number of peers to proceed with creating rabbitmq cluster.
     """
     min_size = config('min-cluster-size')
+    leader_election_available = True
+    try:
+        is_leader()
+    except NotImplementedError:
+        leader_election_available = False
+
     if min_size:
-        # Ignore min-cluster-size if juju has leadership election
-        try:
-            is_leader()
-            return True
-        except NotImplementedError:
+        # Use min-cluster-size if we don't have Juju leader election.
+        if not leader_election_available:
+            log("Waiting for minimum of %d peer units since there's no Juju "
+                "leader election" % (min_size))
             size = 0
             for rid in relation_ids('cluster'):
                 size = len(related_units(rid))
 
             # Include this unit
             size += 1
-            if min_size > size:
+            if size < min_size:
                 log("Insufficient number of peer units to form cluster "
                     "(expected=%s, got=%s)" % (min_size, size), level=INFO)
                 return False
+        else:
+            log("Ignoring min-cluster-size in favour of Juju leader election")
+            return True
 
+    if leader_election_available:
+        log("min-cluster-size is not defined, using juju leader-election.")
+    else:
+        log("min-cluster-size is not defined and juju leader election is not "
+            "available!", level="WARNING")
     return True
 

=== modified file 'unit_tests/test_rabbit_utils.py'
--- unit_tests/test_rabbit_utils.py 2015-10-06 22:49:14 +0000
+++ unit_tests/test_rabbit_utils.py 2015-10-13 13:56:36 +0000
@@ -70,8 +70,6 @@
 
 
 class UtilsTests(unittest.TestCase):
-    def setUp(self):
-        super(UtilsTests, self).setUp()
 
     @mock.patch("rabbit_utils.log")
     def test_update_empty_hosts_file(self, mock_log):
@@ -171,7 +169,7 @@
         mock_peer_retrieve.return_value = '192.168.20.50'
         mock_get_node_hostname.return_value = 'juju-devel3-machine-15'
         self.assertEqual(rabbit_utils.leader_node(),
-                         ['rabbit@juju-devel3-machine-15'])
+                         'rabbit@juju-devel3-machine-15')
 
     @mock.patch('rabbit_utils.subprocess.check_call')
     @mock.patch('rabbit_utils.subprocess.check_output')
@@ -180,13 +178,15 @@
     @mock.patch('rabbit_utils.leader_node')
     @mock.patch('rabbit_utils.clustered')
     @mock.patch('rabbit_utils.cmp_pkgrevno')
-    def test_cluster_with_not_clustered(self, mock_cmp_pkgrevno,
+    @mock.patch('rabbit_utils.config')
+    def test_cluster_with_not_clustered(self, mock_config, mock_cmp_pkgrevno,
                                         mock_clustered, mock_leader_node,
                                         mock_running_nodes, mock_time,
                                         mock_check_output, mock_check_call):
+        mock_config.return_value = 3
         mock_cmp_pkgrevno.return_value = True
         mock_clustered.return_value = False
-        mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
+        mock_leader_node.return_value = 'rabbit@juju-devel7-machine-11'
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
         rabbit_utils.cluster_with()
         mock_check_output.assert_called_with([rabbit_utils.RABBITMQ_CTL,
@@ -206,11 +206,11 @@
                                         mock_time, mock_check_output,
                                         mock_check_call):
         mock_clustered.return_value = True
-        mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
+        mock_leader_node.return_value = 'rabbit@juju-devel7-machine-11'
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19',
                                            'rabbit@juju-devel7-machine-11']
         rabbit_utils.cluster_with()
-        assert not mock_check_output.called
+        self.assertEqual(0, mock_check_output.call_count)
 
     @mock.patch('rabbit_utils.subprocess.check_call')
     @mock.patch('rabbit_utils.subprocess.check_output')
@@ -224,7 +224,7 @@
                                         mock_time, mock_check_output,
                                         mock_check_call):
         mock_clustered.return_value = False
-        mock_leader_node.return_value = []
+        mock_leader_node.return_value = None
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
         rabbit_utils.cluster_with()
-        assert not mock_check_output.called
+        self.assertEqual(0, mock_check_output.call_count)
