Merge lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Proposed by Chris Glass
Status: Work in progress
Proposed branch: lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster
Merge into: lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
Diff against target: 204 lines (+52/-39)
4 files modified
Makefile (+1/-1)
hooks/rabbit_utils.py (+23/-23)
hooks/rabbitmq_server_relations.py (+19/-6)
unit_tests/test_rabbit_utils.py (+9/-9)
To merge this branch: bzr merge lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster
Reviewer Review Type Date Requested Status
Ryan Beisner (community) Needs Information
Geoff Teale (community) Approve
OpenStack Charmers Pending
Review via email: mp+274109@code.launchpad.net

Description of the change

This branch addresses review comments on a previously merged branch (https://code.launchpad.net/~thedac/charms/trusty/rabbitmq-server/le-ignore-min-cluster/+merge/273474); since that branch had already been merged, the fixes had to be proposed as a new branch.

List of changes:
- Fixed the retry logic so that it actually retries (see the sketch below).
- Changed leader_node() to return a single node or None, instead of a list that always contained exactly one element.
- Switched a few constructs to more Pythonic idioms.
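
A simplified, standalone sketch of the new retry shape (cluster_with_retries is a hypothetical wrapper name; the real change is in hooks/rabbit_utils.py, see the diff below):

    import random
    import subprocess
    import time

    RABBITMQ_CTL = '/usr/sbin/rabbitmqctl'  # path as used by the charm

    def cluster_with_retries(node, max_attempts, log):
        # Bounded retry loop using try/except/else, replacing the old
        # loop over a single-element list that could never retry.
        for attempt_number in range(1, max_attempts + 1):
            # The random wait remains a hack to avoid several units
            # joining at once, until charmhelpers.coordinator lands.
            time.sleep(random.random() * 100)
            log('Clustering attempt %d/%d with remote rabbit host (%s).'
                % (attempt_number, max_attempts, node))
            try:
                subprocess.check_call([RABBITMQ_CTL, 'stop_app'])
                subprocess.check_output([RABBITMQ_CTL, 'join_cluster', node],
                                        stderr=subprocess.STDOUT)
                subprocess.check_call([RABBITMQ_CTL, 'start_app'])
            except subprocess.CalledProcessError as e:
                log('Failed attempt %d/%d: %s'
                    % (attempt_number, max_attempts, e))
                subprocess.check_call([RABBITMQ_CTL, 'start_app'])
            else:
                return True
        raise Exception('Maximum number of attempts (%d) exhausted'
                        % max_attempts)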

REVIEWERS: Please add your comments and approval/rejection, but let me merge the branch myself (I have a sister branch targeting trunk that I want to keep in sync).

Revision history for this message
Chris Glass (tribaal) wrote :

Added some inline notes to reviewers.

Revision history for this message
Geoff Teale (tealeg) wrote :

+1 a few small notes below.

review: Approve
121. By Chris Glass

Added more logging and slightly clearer logic.

122. By Chris Glass

Fix typo.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11710 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11710/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10896 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10896/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7277 rabbitmq-server-next for tribaal mp274109
    AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
2015-10-12 14:53:35,977 configure_rmq_ssl_on DEBUG: Setting ssl charm config option: on
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12763702/
Build: http://10.245.162.77:8080/job/charm_amulet_test/7277/

123. By Chris Glass

Empty commit to kick off OSCI again.

124. By Chris Glass

Added missing unit test dependency while we're at it.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11720 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11720/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10903 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10903/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7287 rabbitmq-server-next for tribaal mp274109
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7287/

Revision history for this message
David Ames (thedac) wrote :

My 2 cents. I am not rejecting this MP, as the retry logic was certainly broken (non-existent) and misleading.

However, the ultimate goal for this charm should be that the non-leader nodes take turns clustering with the leader. Only that will guarantee successful, repeatable, testable clustering. Stuart Bishop's charmhelpers.coordinator will allow us to do this, and the plan is to carve out time to implement it after 15.10.
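
For the record, a rough sketch of what that could look like (the Serial lock usage - acquire()/granted() - is assumed from the charmhelpers.coordinator docstring and has not been verified against this charm):

    # Hypothetical: serialize cluster joins so only one non-leader unit
    # attempts to join at a time; the leader grants the lock in turn.
    from charmhelpers.coordinator import Serial

    coordinator = Serial()

    def maybe_cluster(node):
        if coordinator.granted('cluster') or coordinator.acquire('cluster'):
            join_cluster(node)  # hypothetical helper around rabbitmqctl
        # Otherwise the request is queued and a later hook run retries.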

The very concept of retry logic is brittle, and it is extremely difficult to test. A single amulet success is insufficient to prove functionality. Ryan and I have been running iterative tests across various series, with upwards of 50 iterations, to "prove" things work.

So this MP, like mine that preceded it, is a stop-gap measure until we can implement a complete solution.

Revision history for this message
Chris Glass (tribaal) wrote :

> So this MP like mine that preceded it is a stop-gap measure, until we can
> implement a complete solution.

Agreed 100% - it is a stop-gap measure, as you said. It simply feels a little less brittle, since I believe it makes the code easier to read and more formally correct.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for your work on this.

The proposed new amulet test file 'tests/05-check-leader-election-clustering' will only exercise the tests on one Ubuntu:OpenStack release combo (Trusty-Kilo). Having just done a major refactor of half a dozen such tests in order to cover our release matrix, I have to NACK it in its current form.

The guidance for oscharm amulet test scenarios continues to be: tests should be added to basic_deployment.py in a new test_ method, unless they require a different topology or a config option that would not be compatible with the other existing tests. In those cases, the test scenario is a candidate for an OpenStack mojo spec test rather than an amulet test.
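
To illustrate (hypothetical method name and helper; the real fixtures in basic_deployment.py may differ), such a scenario would land as an extra test_ method on the existing deployment class:

    # Hypothetical addition to tests/basic_deployment.py, so the scenario
    # runs across the full Ubuntu:OpenStack release matrix using the
    # shared topology. Uses amulet's Deployment.configure/sentry.wait.
    def test_910_leader_election_ignores_min_cluster_size(self):
        """Cluster still forms when min-cluster-size exceeds unit count."""
        self.d.configure('rabbitmq-server', {'min-cluster-size': 9})
        self.d.sentry.wait()
        self.assertTrue(self._units_clustered())  # hypothetical helper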

A distant second alternative would be to exercise 05-check-leader-election-clustering against Precise-Icehouse, Trusty-Icehouse, Trusty-Juno, Trusty-Kilo and Vivid-Kilo (and soon Trusty-Liberty and Wily-Liberty), in the same way that basic_deployment.py is exercised. That would double our deploy count and test time for rmq (already at ~1.5 hrs per test pipeline), so we should discuss further on the os-xteam call before anyone invests in that path.

review: Disapprove
125. By Chris Glass

Removed introduced test.

Revision history for this message
Chris Glass (tribaal) wrote :

Removed the introduced test: I understand the entry barrier for a new test is that it must be fully matrix-testable, and that was not the case here (not to mention it would need another matrix dimension - Juju versions - to be effective).

The rest of the proposed changes are still valid and relevant (broken code is fixed).

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11778 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11778/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10957 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10957/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The added test coverage is appreciated; however, we do need it to fit the methodology described above. To be clear, I'm not advocating for reducing test coverage.

To summarize the current behavior (and what to test):

"When min-cluster-size is set to a value which is higher than the actual number of rmq nodes:

Clustering is expected to fail when LE is *not* available (min not met);

Clustering is expected to succeed when LE is available (min ignored)."

^ This is indeed tricky, as all uosci tests are run with LE available, and specifying the Juju version is outside the scope of amulet. I think (if that is indeed our desired behavior) a periodic mojo spec regression test is the path.
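
For reference, the behavior described above condenses to this hypothetical standalone predicate (the real check is in hooks/rabbitmq_server_relations.py; see the diff below):

    # min-cluster-size only gates clustering when Juju leader election
    # is unavailable; with LE available the configured minimum is ignored.
    def may_form_cluster(min_size, leader_election_available, unit_count):
        if min_size and not leader_election_available:
            return unit_count >= min_size
        return True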

...

That said...

Given (config.yaml snippet):

  min-cluster-size:
    type: int
    default:
    description: |
      Minimum number of units expected to exist before charm will attempt to
      form a rabbitmq cluster

I'm not a fan of the existing behavior. If I, as the user, tell the charm to enforce a minimum cluster size of 9 and only feed it 3 units, I would expect mayhem/failure (or to be blocked, at the very least). IMO, we are being kind at the expense of trust in the config option and the underlying hooks.

This behavior is what would make sense to me (see the sketch below the list):

with LE, with status, when min > actual:
BLOCKED Require X units, only have N units

w/o LE, with status, when min > actual:
BLOCKED Require X units, only have N units

w/o LE, w/o status, when min > actual:
try, try, try, and eventually fail a hook
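
A minimal sketch of that blocked path (status_set and config are from charmhelpers.core.hookenv; peer_count is a hypothetical stand-in for the relation-counting code):

    from charmhelpers.core.hookenv import config, status_set

    def assess_cluster_readiness(peer_count):
        # Block, rather than silently ignore, an unmet min-cluster-size.
        min_size = config('min-cluster-size')
        if min_size and peer_count < min_size:
            status_set('blocked', 'Require %d units, only have %d units'
                       % (min_size, peer_count))
            return False
        return True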

review: Needs Information
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7297 rabbitmq-server-next for tribaal mp274109
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7297/

Unmerged revisions

125. By Chris Glass

Removed introduced test.

124. By Chris Glass

Added missing unit test dependency while we're at it.

123. By Chris Glass

Empty commit to kick off OSCI again.

122. By Chris Glass

Fix typo.

121. By Chris Glass

Added more logging and slightly clearer logic.

120. By Chris Glass

Add Amulet test for trusty/kilo/leader-election to ensure previously introduced
behavior actually works.

Some refactoring around unnecessary looping.

119. By Adam Collard

Address the review comments I made on the stable branch:
 * leader_node() returns a single node or None
 * Fix the retry logic, which was crazy bonkers wrong
 * Use better Python idioms

Preview Diff

=== modified file 'Makefile'
--- Makefile 2015-10-06 18:43:11 +0000
+++ Makefile 2015-10-13 13:56:36 +0000
@@ -11,7 +11,7 @@
 	(which dh_clean && dh_clean) || true
 
 .venv:
-	sudo apt-get install -y gcc python-dev python-virtualenv python-apt
+	sudo apt-get install -y gcc python-dev python-virtualenv python-apt python-netifaces
 	virtualenv .venv --system-site-packages
 	.venv/bin/pip install -I -r test-requirements.txt
 
=== modified file 'hooks/rabbit_utils.py'
--- hooks/rabbit_utils.py 2015-10-06 21:52:17 +0000
+++ hooks/rabbit_utils.py 2015-10-13 13:56:36 +0000
@@ -303,22 +303,24 @@
         return False
 
     # check the leader and try to cluster with it
-    if len(leader_node()) == 0:
+    node = leader_node()
+    if node is None:
         log('No nodes available to cluster with')
         return False
 
-    num_tries = 0
-    for node in leader_node():
-        if node in running_nodes():
-            log('Host already clustered with %s.' % node)
-            return False
-        log('Clustering with remote rabbit host (%s).' % node)
+    if node in running_nodes():
+        log('Host already clustered with %s.' % node)
+        return False
+    max_attempts = config('max-cluster-tries')
+    for attempt_number in range(1, max_attempts + 1):
         # NOTE: The primary problem rabbitmq has clustering is when
         # more than one node attempts to cluster at the same time.
         # The asynchronous nature of hook firing nearly guarantees
         # this. Using random time wait is a hack until we can
         # implement charmhelpers.coordinator.
-        time.sleep(random.random()*100)
+        time.sleep(random.random() * 100)
+        log('Clustering attempt %d/%d with remote rabbit host (%s).' % (
+            attempt_number, max_attempts, node))
         try:
             cmd = [RABBITMQ_CTL, 'stop_app']
             subprocess.check_call(cmd)
@@ -326,20 +328,20 @@
             subprocess.check_output(cmd, stderr=subprocess.STDOUT)
             cmd = [RABBITMQ_CTL, 'start_app']
             subprocess.check_call(cmd)
-            log('Host clustered with %s.' % node)
+        except subprocess.CalledProcessError as e:
+            log('Failed to cluster on attempt %d/%d with %s. Exception: %s' % (
+                attempt_number, max_attempts, node, e))
+            cmd = [RABBITMQ_CTL, 'start_app']
+            subprocess.check_call(cmd)
+        else:
+            log('Host clustered on attempt %d/%d with %s.' % (
+                attempt_number, max_attempts, node))
             return True
-        except subprocess.CalledProcessError as e:
-            log('Failed to cluster with %s. Exception: %s'
-                % (node, e))
-            cmd = [RABBITMQ_CTL, 'start_app']
-            subprocess.check_call(cmd)
-            # continue to the next node
-            num_tries += 1
-            if num_tries > config('max-cluster-tries'):
-                log('Max tries number exhausted, exiting', level=ERROR)
-                raise
 
-    return False
+    error_message = "Maximum number of attempts (%d) exhausted, exiting" % (
+        max_attempts)
+    log(error_message, level=ERROR)
+    raise Exception(error_message)
 
 
 def break_cluster():
@@ -609,9 +611,7 @@
     # to avoid split-brain clusters.
     leader_node_ip = peer_retrieve('leader_node_ip')
     if leader_node_ip:
-        return ["rabbit@" + get_node_hostname(leader_node_ip)]
-    else:
-        return []
+        return "rabbit@" + get_node_hostname(leader_node_ip)
 
 
 def get_node_hostname(address):
 
=== modified file 'hooks/rabbitmq_server_relations.py'
--- hooks/rabbitmq_server_relations.py 2015-10-06 18:55:33 +0000
+++ hooks/rabbitmq_server_relations.py 2015-10-13 13:56:36 +0000
@@ -259,23 +259,36 @@
     number of peers to proceed with creating rabbitmq cluster.
     """
     min_size = config('min-cluster-size')
+    leader_election_available = True
+    try:
+        is_leader()
+    except NotImplementedError:
+        leader_election_available = False
+
     if min_size:
-        # Ignore min-cluster-size if juju has leadership election
-        try:
-            is_leader()
-            return True
-        except NotImplementedError:
+        # Use min-cluster-size if we don't have Juju leader election.
+        if not leader_election_available:
+            log("Waiting for minimum of %d peer units since there's no Juju "
+                "leader election" % (min_size))
             size = 0
             for rid in relation_ids('cluster'):
                 size = len(related_units(rid))
 
             # Include this unit
             size += 1
-            if min_size > size:
+            if size < min_size:
                 log("Insufficient number of peer units to form cluster "
                     "(expected=%s, got=%s)" % (min_size, size), level=INFO)
                 return False
+        else:
+            log("Ignoring min-cluster-size in favour of Juju leader election")
+        return True
 
+    if leader_election_available:
+        log("min-cluster-size is not defined, using juju leader-election.")
+    else:
+        log("min-cluster-size is not defined and juju leader election is not "
+            "available!", level="WARNING")
     return True
 
 
=== modified file 'unit_tests/test_rabbit_utils.py'
--- unit_tests/test_rabbit_utils.py 2015-10-06 22:49:14 +0000
+++ unit_tests/test_rabbit_utils.py 2015-10-13 13:56:36 +0000
@@ -70,8 +70,6 @@
 
 
 class UtilsTests(unittest.TestCase):
-    def setUp(self):
-        super(UtilsTests, self).setUp()
 
     @mock.patch("rabbit_utils.log")
     def test_update_empty_hosts_file(self, mock_log):
@@ -171,7 +169,7 @@
         mock_peer_retrieve.return_value = '192.168.20.50'
         mock_get_node_hostname.return_value = 'juju-devel3-machine-15'
         self.assertEqual(rabbit_utils.leader_node(),
-                         ['rabbit@juju-devel3-machine-15'])
+                         'rabbit@juju-devel3-machine-15')
 
     @mock.patch('rabbit_utils.subprocess.check_call')
     @mock.patch('rabbit_utils.subprocess.check_output')
@@ -180,13 +178,15 @@
     @mock.patch('rabbit_utils.leader_node')
     @mock.patch('rabbit_utils.clustered')
     @mock.patch('rabbit_utils.cmp_pkgrevno')
-    def test_cluster_with_not_clustered(self, mock_cmp_pkgrevno,
+    @mock.patch('rabbit_utils.config')
+    def test_cluster_with_not_clustered(self, mock_config, mock_cmp_pkgrevno,
                                         mock_clustered, mock_leader_node,
                                         mock_running_nodes, mock_time,
                                         mock_check_output, mock_check_call):
+        mock_config.return_value = 3
         mock_cmp_pkgrevno.return_value = True
         mock_clustered.return_value = False
-        mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
+        mock_leader_node.return_value = 'rabbit@juju-devel7-machine-11'
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
         rabbit_utils.cluster_with()
         mock_check_output.assert_called_with([rabbit_utils.RABBITMQ_CTL,
@@ -206,11 +206,11 @@
                                         mock_time, mock_check_output,
                                         mock_check_call):
         mock_clustered.return_value = True
-        mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
+        mock_leader_node.return_value = 'rabbit@juju-devel7-machine-11'
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19',
                                            'rabbit@juju-devel7-machine-11']
         rabbit_utils.cluster_with()
-        assert not mock_check_output.called
+        self.assertEqual(0, mock_check_output.call_count)
 
     @mock.patch('rabbit_utils.subprocess.check_call')
     @mock.patch('rabbit_utils.subprocess.check_output')
@@ -224,7 +224,7 @@
                                         mock_time, mock_check_output,
                                         mock_check_call):
         mock_clustered.return_value = False
-        mock_leader_node.return_value = []
+        mock_leader_node.return_value = None
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
         rabbit_utils.cluster_with()
-        assert not mock_check_output.called
+        self.assertEqual(0, mock_check_output.call_count)
