Merge lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Proposed by Chris Glass
Status: Work in progress
Proposed branch: lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster
Merge into: lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
Diff against target: 204 lines (+52/-39)
4 files modified
Makefile (+1/-1)
hooks/rabbit_utils.py (+23/-23)
hooks/rabbitmq_server_relations.py (+19/-6)
unit_tests/test_rabbit_utils.py (+9/-9)
To merge this branch: bzr merge lp:~tribaal/charms/trusty/rabbitmq-server/fix-le-ignore-min-cluster
Reviewer Review Type Date Requested Status
Ryan Beisner (community) Needs Information
Geoff Teale (community) Approve
OpenStack Charmers Pending
Review via email: mp+274109@code.launchpad.net

Description of the change

This branch addresses review comments on a previously merged branch (https://code.launchpad.net/~thedac/charms/trusty/rabbitmq-server/le-ignore-min-cluster/+merge/273474); since that branch was already merged, the fixes needed to land as a new branch.

List of changes:
- Changed retry logic to actually work.
- Changed leader_node() to return a single node or None, rather than a list that always contained exactly one element
- Changed a few things to use more pythonic idioms.
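The fixed retry logic described above can be sketched as a bounded loop with an explicit failure at exhaustion. This is a minimal sketch, not the charm's actual code: `cluster_with_retries`, `join_cluster`, and the injectable `sleep` parameter are hypothetical names standing in for the rabbitmqctl stop_app/join_cluster/start_app sequence.

```python
import random
import time


def cluster_with_retries(join_cluster, node, max_attempts, sleep=time.sleep):
    """Bounded-retry sketch: try to cluster with `node` up to max_attempts
    times. `join_cluster` is a hypothetical callable that raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        # A random wait lowers the odds of two peers clustering at the same
        # time -- the same stop-gap the charm uses pending
        # charmhelpers.coordinator.
        sleep(random.random() * 100)
        try:
            join_cluster(node)
        except Exception as exc:
            print('Attempt %d/%d with %s failed: %s'
                  % (attempt, max_attempts, node, exc))
        else:
            print('Clustered with %s on attempt %d/%d'
                  % (node, attempt, max_attempts))
            return True
    # Unlike the old code, running out of attempts is now an explicit error
    # rather than a silent fall-through.
    raise Exception('Maximum number of attempts (%d) exhausted' % max_attempts)
```

The key difference from the pre-fix code is that the attempt counter actually bounds the loop, and exhausting it raises instead of silently returning.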

REVIEWERS: Please add your comments and approval/rejection but let me merge the branch myself (I have a sister branch targeting trunk I want to keep in sync)

Revision history for this message
Chris Glass (tribaal) wrote :

Added some inline notes to reviewers.

Revision history for this message
Geoff Teale (tealeg) wrote :

+1 a few small notes below.

review: Approve
121. By Chris Glass

Added more logging and slightly clearer logic.

122. By Chris Glass

Fix typo.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11710 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11710/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10896 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10896/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7277 rabbitmq-server-next for tribaal mp274109
    AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
2015-10-12 14:53:35,977 configure_rmq_ssl_on DEBUG: Setting ssl charm config option: on
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12763702/
Build: http://10.245.162.77:8080/job/charm_amulet_test/7277/

123. By Chris Glass

Empty commit to kick off OSCI again.

124. By Chris Glass

Added missing unit tests dependency while we're at it.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11720 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11720/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10903 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10903/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7287 rabbitmq-server-next for tribaal mp274109
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7287/

Revision history for this message
David Ames (thedac) wrote :

My 2 cents. I am not rejecting this MP as the retry logic was certainly broken (non-existent) and misleading.

However, the ultimate goal for this charm should be: the non-leader nodes take turns clustering with the leader. Only that will guarantee successful, repeatable, testable clustering. Stuart Bishop's charmhelpers.coordinator will allow us to do this, and the plan is to carve out time after 15.10 to implement it.

The very concept of retry logic is brittle, and it is extremely difficult to test. A single amulet success is insufficient to prove functionality. Ryan and I have been running iterative tests over various series, with upwards of 50 iterations, to "prove" things work.

So this MP, like mine that preceded it, is a stop-gap measure until we can implement a complete solution.

Revision history for this message
Chris Glass (tribaal) wrote :

> So this MP like mine that preceded it is a stop-gap measure, until we can
> implement a complete solution.

Agreed 100% - it is a stop-gap measure, as you said. It simply feels a little less brittle, since I believe it makes the code easier to read and more formally correct.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for your work on this.

The proposed new amulet test file 'tests/05-check-leader-election-clustering' will only exercise the tests on one Ubuntu:OpenStack release combo (Trusty-Kilo). Having just done a major refactor of a half-dozen such tests in order to cover our release matrix, I have to nack it in its current form.

The guidance for oscharm amulet test scenarios continues to be: Tests should be added to basic_deployment.py in a new test_ method, unless they require a different topology, or a config option which would not be compatible with the other existing tests. In those cases, the test scenario is a candidate for an openstack mojo spec test rather than an amulet test.

A distant second alternative would be to exercise 05-check-leader-election-clustering against Precise-Icehouse, Trusty-Icehouse, Trusty-Juno, Trusty-Kilo, Vivid-Kilo (and soon Trusty-Liberty and Wily-Liberty), in the same way that basic_deployment.py is exercised. That would double our deploy count and test time for rmq (already at ~1.5 hrs per test pipeline), so we should discuss further on the os-xteam call before anyone invests in that path.

review: Disapprove
125. By Chris Glass

Removed introduced test.

Revision history for this message
Chris Glass (tribaal) wrote :

Removed the introduced test, since I understand the entry barrier for new tests is full matrix-testability, and this was not the case (not to mention it would need another matrix dimension - Juju versions - to be effective).

The rest of the proposed changes are still valid and relevant (broken code is fixed).

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11778 rabbitmq-server-next for tribaal mp274109
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11778/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10957 rabbitmq-server-next for tribaal mp274109
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10957/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The added test coverage is appreciated; however, we do need it to fit the methodology described. To be clear, I'm not advocating for reducing test coverage.

To summarize the current behavior (and what to test):

"When min-cluster-size is set to a value which is higher than the actual number of rmq nodes:

Clustering is expected to fail when LE is *not* available (min not met);

Clustering is expected to succeed when LE is available (min ignored)."

^ This is indeed tricky, as all uosci tests run with LE available, and specifying the Juju version is outside the scope of amulet. If that is indeed our desired behavior, I think a periodic mojo spec regression test is the path.

...

That said...

Given (config.yaml snippet):

  min-cluster-size:
    type: int
    default:
    description: |
      Minimum number of units expected to exist before charm will attempt to
      form a rabbitmq cluster

I'm not a fan of the existing behavior. If I am the user and I tell the charm to have a minimum cluster size of 9, but only feed it 3 units, I would expect mayhem/failure (or to be blocked at the very least). IMO, we are being kind at the expense of trust in a config option and the underlying hooks.

This behavior is what would make sense to me:

with LE, with status, when min > actual:
BLOCKED Require X units, only have N units

w/o LE, with status, when min > actual:
BLOCKED Require X units, only have N units

w/o LE, w/o status, when min > actual:
try, try, try, and eventually fail a hook
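The blocked-status behavior proposed above reduces to a single status computation. The following is a hypothetical sketch, not actual charm code: the helper name and return shape are invented, mirroring the "Require X units, only have N units" message; a real charm would feed the result to charmhelpers' status reporting.

```python
def cluster_readiness_status(min_size, unit_count):
    """Compute a (workload_status, message) pair for the proposed behavior:
    block whenever fewer units exist than min-cluster-size demands.

    Hypothetical sketch only -- not the charm's actual API.
    """
    if min_size is not None and unit_count < min_size:
        return ('blocked',
                'Require %d units, only have %d units' % (min_size, unit_count))
    return ('active', 'Unit is ready')
```

Keeping the decision in a pure function like this also makes the behavior trivially unit-testable, unlike retry loops that depend on hook timing.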

review: Needs Information
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7297 rabbitmq-server-next for tribaal mp274109
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7297/

Unmerged revisions

125. By Chris Glass

Removed introduced test.

124. By Chris Glass

Added missing unit tests dependency while we're at it.

123. By Chris Glass

Empty commit to kick off OSCI again.

122. By Chris Glass

Fix typo.

121. By Chris Glass

Added more logging and slightly clearer logic.

120. By Chris Glass

Add Amulet test for trusty/kilo/leader-election to ensure previously introduced
behavior actually works.

Some refactoring around unnecessary looping.

119. By Adam Collard

Address the review comments I made on stable branch
 * leader_node() returns something or None
 * Fix the retry logic which was crazy bonkers wrong
 * Use better Python idioms

Preview Diff

=== modified file 'Makefile'
--- Makefile 2015-10-06 18:43:11 +0000
+++ Makefile 2015-10-13 13:56:36 +0000
@@ -11,7 +11,7 @@
 	(which dh_clean && dh_clean) || true
 
 .venv:
-	sudo apt-get install -y gcc python-dev python-virtualenv python-apt
+	sudo apt-get install -y gcc python-dev python-virtualenv python-apt python-netifaces
 	virtualenv .venv --system-site-packages
 	.venv/bin/pip install -I -r test-requirements.txt
 

=== modified file 'hooks/rabbit_utils.py'
--- hooks/rabbit_utils.py 2015-10-06 21:52:17 +0000
+++ hooks/rabbit_utils.py 2015-10-13 13:56:36 +0000
@@ -303,22 +303,24 @@
         return False
 
     # check the leader and try to cluster with it
-    if len(leader_node()) == 0:
+    node = leader_node()
+    if node is None:
         log('No nodes available to cluster with')
         return False
 
-    num_tries = 0
-    for node in leader_node():
-        if node in running_nodes():
-            log('Host already clustered with %s.' % node)
-            return False
-        log('Clustering with remote rabbit host (%s).' % node)
+    if node in running_nodes():
+        log('Host already clustered with %s.' % node)
+        return False
+    max_attempts = config('max-cluster-tries')
+    for attempt_number in range(1, max_attempts + 1):
         # NOTE: The primary problem rabbitmq has clustering is when
         # more than one node attempts to cluster at the same time.
         # The asynchronous nature of hook firing nearly guarantees
         # this. Using random time wait is a hack until we can
         # implement charmhelpers.coordinator.
-        time.sleep(random.random()*100)
+        time.sleep(random.random() * 100)
+        log('Clustering attempt %d/%d with remote rabbit host (%s).' % (
+            attempt_number, max_attempts, node))
         try:
             cmd = [RABBITMQ_CTL, 'stop_app']
             subprocess.check_call(cmd)
@@ -326,20 +328,20 @@
             subprocess.check_output(cmd, stderr=subprocess.STDOUT)
             cmd = [RABBITMQ_CTL, 'start_app']
             subprocess.check_call(cmd)
-            log('Host clustered with %s.' % node)
+        except subprocess.CalledProcessError as e:
+            log('Failed to cluster on attempt %d/%d with %s. Exception: %s' % (
+                attempt_number, max_attempts, node, e))
+            cmd = [RABBITMQ_CTL, 'start_app']
+            subprocess.check_call(cmd)
+        else:
+            log('Host clustered on attempt %d/%d with %s.' % (
+                attempt_number, max_attempts, node))
             return True
-        except subprocess.CalledProcessError as e:
-            log('Failed to cluster with %s. Exception: %s'
-                % (node, e))
-            cmd = [RABBITMQ_CTL, 'start_app']
-            subprocess.check_call(cmd)
-            # continue to the next node
-            num_tries += 1
-            if num_tries > config('max-cluster-tries'):
-                log('Max tries number exhausted, exiting', level=ERROR)
-                raise
 
-    return False
+    error_message = "Maximum number of attempts (%d) exhausted, exiting" % (
+        max_attempts)
+    log(error_message, level=ERROR)
+    raise Exception(error_message)
 
 
 def break_cluster():
@@ -609,9 +611,7 @@
     # to avoid split-brain clusters.
     leader_node_ip = peer_retrieve('leader_node_ip')
     if leader_node_ip:
-        return ["rabbit@" + get_node_hostname(leader_node_ip)]
-    else:
-        return []
+        return "rabbit@" + get_node_hostname(leader_node_ip)
 
 
 def get_node_hostname(address):

=== modified file 'hooks/rabbitmq_server_relations.py'
--- hooks/rabbitmq_server_relations.py 2015-10-06 18:55:33 +0000
+++ hooks/rabbitmq_server_relations.py 2015-10-13 13:56:36 +0000
@@ -259,23 +259,36 @@
     number of peers to proceed with creating rabbitmq cluster.
     """
     min_size = config('min-cluster-size')
+    leader_election_available = True
+    try:
+        is_leader()
+    except NotImplementedError:
+        leader_election_available = False
+
     if min_size:
-        # Ignore min-cluster-size if juju has leadership election
-        try:
-            is_leader()
-            return True
-        except NotImplementedError:
+        # Use min-cluster-size if we don't have Juju leader election.
+        if not leader_election_available:
+            log("Waiting for minimum of %d peer units since there's no Juju "
+                "leader election" % (min_size))
             size = 0
             for rid in relation_ids('cluster'):
                 size = len(related_units(rid))
 
             # Include this unit
             size += 1
-            if min_size > size:
+            if size < min_size:
                 log("Insufficient number of peer units to form cluster "
                     "(expected=%s, got=%s)" % (min_size, size), level=INFO)
                 return False
+        else:
+            log("Ignoring min-cluster-size in favour of Juju leader election")
+            return True
 
+    if leader_election_available:
+        log("min-cluster-size is not defined, using juju leader-election.")
+    else:
+        log("min-cluster-size is not defined and juju leader election is not "
+            "available!", level="WARNING")
     return True
 

=== modified file 'unit_tests/test_rabbit_utils.py'
--- unit_tests/test_rabbit_utils.py 2015-10-06 22:49:14 +0000
+++ unit_tests/test_rabbit_utils.py 2015-10-13 13:56:36 +0000
@@ -70,8 +70,6 @@
 
 
 class UtilsTests(unittest.TestCase):
-    def setUp(self):
-        super(UtilsTests, self).setUp()
 
     @mock.patch("rabbit_utils.log")
     def test_update_empty_hosts_file(self, mock_log):
@@ -171,7 +169,7 @@
         mock_peer_retrieve.return_value = '192.168.20.50'
         mock_get_node_hostname.return_value = 'juju-devel3-machine-15'
         self.assertEqual(rabbit_utils.leader_node(),
-                         ['rabbit@juju-devel3-machine-15'])
+                         'rabbit@juju-devel3-machine-15')
 
     @mock.patch('rabbit_utils.subprocess.check_call')
     @mock.patch('rabbit_utils.subprocess.check_output')
@@ -180,13 +178,15 @@
     @mock.patch('rabbit_utils.leader_node')
     @mock.patch('rabbit_utils.clustered')
     @mock.patch('rabbit_utils.cmp_pkgrevno')
-    def test_cluster_with_not_clustered(self, mock_cmp_pkgrevno,
+    @mock.patch('rabbit_utils.config')
+    def test_cluster_with_not_clustered(self, mock_config, mock_cmp_pkgrevno,
                                         mock_clustered, mock_leader_node,
                                         mock_running_nodes, mock_time,
                                         mock_check_output, mock_check_call):
+        mock_config.return_value = 3
         mock_cmp_pkgrevno.return_value = True
         mock_clustered.return_value = False
-        mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
+        mock_leader_node.return_value = 'rabbit@juju-devel7-machine-11'
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
         rabbit_utils.cluster_with()
         mock_check_output.assert_called_with([rabbit_utils.RABBITMQ_CTL,
@@ -206,11 +206,11 @@
                                         mock_time, mock_check_output,
                                         mock_check_call):
         mock_clustered.return_value = True
-        mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
+        mock_leader_node.return_value = 'rabbit@juju-devel7-machine-11'
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19',
                                            'rabbit@juju-devel7-machine-11']
         rabbit_utils.cluster_with()
-        assert not mock_check_output.called
+        self.assertEqual(0, mock_check_output.call_count)
 
     @mock.patch('rabbit_utils.subprocess.check_call')
     @mock.patch('rabbit_utils.subprocess.check_output')
@@ -224,7 +224,7 @@
                                         mock_time, mock_check_output,
                                         mock_check_call):
         mock_clustered.return_value = False
-        mock_leader_node.return_value = []
+        mock_leader_node.return_value = None
         mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
         rabbit_utils.cluster_with()
-        assert not mock_check_output.called
+        self.assertEqual(0, mock_check_output.call_count)
