Merge lp:~thedac/charms/trusty/rabbitmq-server/le-ignore-min-cluster into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Proposed by David Ames
Status: Merged
Merged at revision: 117
Proposed branch: lp:~thedac/charms/trusty/rabbitmq-server/le-ignore-min-cluster
Merge into: lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
Diff against target: 278 lines (+134/-49)
3 files modified
hooks/rabbit_utils.py (+46/-36)
hooks/rabbitmq_server_relations.py (+17/-13)
unit_tests/test_rabbit_utils.py (+71/-0)
To merge this branch: bzr merge lp:~thedac/charms/trusty/rabbitmq-server/le-ignore-min-cluster
Reviewer Review Type Date Requested Status
Billy Olsen Approve
Review via email: mp+273474@code.launchpad.net

Description of the change

Ignore min-cluster-size when juju has leadership election LP:1500204

This MP also addresses two major hurdles to rabbitmq clustering.

1) When more than one node has run stop_app at the same time nodes cannot join the cluster.

The ultimate solution to this is using charmhelpers.coordinator. However, this will require a significant design change. In the meantime, inserting a random wait period before attempting join the cluster is effective in resolving the race.

2) When using juju leadership election, if the elected leader is the third or greater node (i.e. rabbitmq/2 +), the leader would never be joined to the cluster.

Changing the clustering algorithm for each non-leader to join_cluster with the leader resolves this. Also by clustering with the leader rather than non-leader nodes we avoid split-brain clusters.

To post a comment you must log in.
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10526 rabbitmq-server-next for thedac mp273474
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10526/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11332 rabbitmq-server-next for thedac mp273474
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11332/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7125 rabbitmq-server-next for thedac mp273474
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7125/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Loooookin good....

 :)

Revision history for this message
Chris Glass (tribaal) wrote :

This looks great!

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11427 rabbitmq-server-next for thedac mp273474
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11427/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10618 rabbitmq-server-next for thedac mp273474
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10618/

122. By David Ames

Return an empty list if a leader has not yet been elected

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11428 rabbitmq-server-next for thedac mp273474
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11428/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10619 rabbitmq-server-next for thedac mp273474
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10619/

Revision history for this message
Billy Olsen (billy-olsen) :
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Per our discussion on IRC, please add unit tests.

review: Needs Fixing
123. By David Ames

Fix comments

124. By David Ames

Unit tests

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #11429 rabbitmq-server-next for thedac mp273474
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/11429/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #10620 rabbitmq-server-next for thedac mp273474
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/10620/

Revision history for this message
Billy Olsen (billy-olsen) wrote :

LGTM, just waiting on the amulet result

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #7171 rabbitmq-server-next for thedac mp273474
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/7171/

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Approved

review: Approve
Revision history for this message
Chris Glass (tribaal) wrote :

I realize this is already merged, but I'll do my comments anyway.

It would be good to add an amulet test to the branch asserting that a cluster forms with only i.e. 2 units (one less than min-cluster-size) if leader election is available.

Two inline comments.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'hooks/rabbit_utils.py'
2--- hooks/rabbit_utils.py 2015-10-01 09:29:50 +0000
3+++ hooks/rabbit_utils.py 2015-10-06 22:49:58 +0000
4@@ -4,6 +4,8 @@
5 import subprocess
6 import glob
7 import tempfile
8+import random
9+import time
10
11 from rabbitmq_context import (
12 RabbitMQSSLContext,
13@@ -19,7 +21,6 @@
14 from charmhelpers.core.hookenv import (
15 config,
16 relation_ids,
17- relation_get,
18 related_units,
19 log, ERROR,
20 INFO,
21@@ -301,54 +302,37 @@
22 log('Node is already clustered, skipping')
23 return False
24
25- # check all peers and try to cluster with them
26- available_nodes = []
27- for r_id in relation_ids('cluster'):
28- for unit in related_units(r_id):
29- if config('prefer-ipv6'):
30- address = relation_get('hostname',
31- rid=r_id, unit=unit)
32- else:
33- address = relation_get('private-address',
34- rid=r_id, unit=unit)
35- if address is not None:
36- node = get_hostname(address, fqdn=False)
37- if node:
38- available_nodes.append(node)
39- else:
40- log('Cannot resolve hostname for {} '
41- 'using DNS servers'.format(address),
42- level='WARNING')
43-
44- if len(available_nodes) == 0:
45+ # check the leader and try to cluster with it
46+ if len(leader_node()) == 0:
47 log('No nodes available to cluster with')
48 return False
49
50- # iterate over all the nodes, join to the first available
51- active_nodes = running_nodes()
52 num_tries = 0
53- for node in available_nodes:
54- log('Clustering with remote rabbit host (%s).' % node)
55- if node in active_nodes:
56+ for node in leader_node():
57+ if node in running_nodes():
58 log('Host already clustered with %s.' % node)
59 return False
60-
61+ log('Clustering with remote rabbit host (%s).' % node)
62+ # NOTE: The primary problem rabbitmq has clustering is when
63+ # more than one node attempts to cluster at the same time.
64+ # The asynchronous nature of hook firing nearly guarantees
65+ # this. Using random time wait is a hack until we can
66+ # implement charmhelpers.coordinator.
67+ time.sleep(random.random()*100)
68 try:
69 cmd = [RABBITMQ_CTL, 'stop_app']
70 subprocess.check_call(cmd)
71- cmd = [RABBITMQ_CTL, cluster_cmd, 'rabbit@%s' % node]
72- try:
73- subprocess.check_output(cmd, stderr=subprocess.STDOUT)
74- except subprocess.CalledProcessError as e:
75- if not e.returncode == 2 or \
76- "{ok,already_member}" not in e.output:
77- raise e
78+ cmd = [RABBITMQ_CTL, cluster_cmd, node]
79+ subprocess.check_output(cmd, stderr=subprocess.STDOUT)
80 cmd = [RABBITMQ_CTL, 'start_app']
81 subprocess.check_call(cmd)
82 log('Host clustered with %s.' % node)
83 return True
84- except:
85- log('Failed to cluster with %s.' % node)
86+ except subprocess.CalledProcessError as e:
87+ log('Failed to cluster with %s. Exception: %s'
88+ % (node, e))
89+ cmd = [RABBITMQ_CTL, 'start_app']
90+ subprocess.check_call(cmd)
91 # continue to the next node
92 num_tries += 1
93 if num_tries > config('max-cluster-tries'):
94@@ -619,8 +603,34 @@
95
96
97 @cached
98+def leader_node():
99+ ''' Provide the leader node for clustering '''
100+ # Each rabbitmq node should join_cluster with the leader
101+ # to avoid split-brain clusters.
102+ leader_node_ip = peer_retrieve('leader_node_ip')
103+ if leader_node_ip:
104+ return ["rabbit@" + get_node_hostname(leader_node_ip)]
105+ else:
106+ return []
107+
108+
109+def get_node_hostname(address):
110+ ''' Resolve IP address to hostname for nodes '''
111+ node = get_hostname(address, fqdn=False)
112+ if node:
113+ return node
114+ else:
115+ log('Cannot resolve hostname for {} using DNS servers'.format(address),
116+ level='WARNING')
117+ return None
118+
119+
120+@cached
121 def clustered():
122 ''' Determine whether local rabbitmq-server is clustered '''
123+ # NOTE: A rabbitmq node can only join a cluster once.
124+ # Simply checking for more than one running node tells us
125+ # if this unit is in a cluster.
126 if len(running_nodes()) > 1:
127 return True
128 else:
129
130=== modified file 'hooks/rabbitmq_server_relations.py'
131--- hooks/rabbitmq_server_relations.py 2015-10-01 09:41:38 +0000
132+++ hooks/rabbitmq_server_relations.py 2015-10-06 22:49:58 +0000
133@@ -66,6 +66,7 @@
134 is_leader,
135 charm_dir,
136 status_set,
137+ unit_private_ip,
138 )
139 from charmhelpers.core.host import (
140 cmp_pkgrevno,
141@@ -74,7 +75,6 @@
142 service_stop,
143 service_restart,
144 write_file,
145- mkdir,
146 )
147 from charmhelpers.contrib.charmsupport import nrpe
148
149@@ -260,16 +260,21 @@
150 """
151 min_size = config('min-cluster-size')
152 if min_size:
153- size = 0
154- for rid in relation_ids('cluster'):
155- size = len(related_units(rid))
156+ # Ignore min-cluster-size if juju has leadership election
157+ try:
158+ is_leader()
159+ return True
160+ except NotImplementedError:
161+ size = 0
162+ for rid in relation_ids('cluster'):
163+ size = len(related_units(rid))
164
165- # Include this unit
166- size += 1
167- if min_size > size:
168- log("Insufficient number of peer units to form cluster "
169- "(expected=%s, got=%s)" % (min_size, size), level=INFO)
170- return False
171+ # Include this unit
172+ size += 1
173+ if min_size > size:
174+ log("Insufficient number of peer units to form cluster "
175+ "(expected=%s, got=%s)" % (min_size, size), level=INFO)
176+ return False
177
178 return True
179
180@@ -318,6 +323,7 @@
181 log('Leader peer_storing cookie', level=INFO)
182 cookie = open(rabbit.COOKIE_PATH, 'r').read().strip()
183 peer_store('cookie', cookie)
184+ peer_store('leader_node_ip', unit_private_ip())
185
186
187 @hooks.hook('cluster-relation-changed')
188@@ -343,10 +349,8 @@
189 rabbit.update_hosts_file({private_address: hostname})
190
191 if not is_sufficient_peers():
192- # Stop rabbit until leader has finished configuring
193- log('Not enough peers, stopping until leader is configured',
194+ log('Not enough peers, waiting until leader is configured',
195 level=INFO)
196- service_stop('rabbitmq-server')
197 return
198
199 # sync the cookie with peers if necessary
200
201=== modified file 'unit_tests/test_rabbit_utils.py'
202--- unit_tests/test_rabbit_utils.py 2015-10-01 09:27:20 +0000
203+++ unit_tests/test_rabbit_utils.py 2015-10-06 22:49:58 +0000
204@@ -157,3 +157,74 @@
205 RABBITMQCTL_CLUSTERSTATUS_SOLO
206 self.assertEqual(rabbit_utils.running_nodes(),
207 ['rabbit@juju-devel3-machine-14'])
208+
209+ @mock.patch('rabbit_utils.get_hostname')
210+ def test_get_node_hostname(self, mock_get_hostname):
211+ mock_get_hostname.return_value = 'juju-devel3-machine-13'
212+ self.assertEqual(rabbit_utils.get_node_hostname('192.168.20.50'),
213+ 'juju-devel3-machine-13')
214+ mock_get_hostname.assert_called_with('192.168.20.50', fqdn=False)
215+
216+ @mock.patch('rabbit_utils.get_node_hostname')
217+ @mock.patch('rabbit_utils.peer_retrieve')
218+ def test_leader_node(self, mock_peer_retrieve, mock_get_node_hostname):
219+ mock_peer_retrieve.return_value = '192.168.20.50'
220+ mock_get_node_hostname.return_value = 'juju-devel3-machine-15'
221+ self.assertEqual(rabbit_utils.leader_node(),
222+ ['rabbit@juju-devel3-machine-15'])
223+
224+ @mock.patch('rabbit_utils.subprocess.check_call')
225+ @mock.patch('rabbit_utils.subprocess.check_output')
226+ @mock.patch('rabbit_utils.time')
227+ @mock.patch('rabbit_utils.running_nodes')
228+ @mock.patch('rabbit_utils.leader_node')
229+ @mock.patch('rabbit_utils.clustered')
230+ @mock.patch('rabbit_utils.cmp_pkgrevno')
231+ def test_cluster_with_not_clustered(self, mock_cmp_pkgrevno,
232+ mock_clustered, mock_leader_node,
233+ mock_running_nodes, mock_time,
234+ mock_check_output, mock_check_call):
235+ mock_cmp_pkgrevno.return_value = True
236+ mock_clustered.return_value = False
237+ mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
238+ mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
239+ rabbit_utils.cluster_with()
240+ mock_check_output.assert_called_with([rabbit_utils.RABBITMQ_CTL,
241+ 'join_cluster',
242+ 'rabbit@juju-devel7-machine-11'],
243+ stderr=-2)
244+
245+ @mock.patch('rabbit_utils.subprocess.check_call')
246+ @mock.patch('rabbit_utils.subprocess.check_output')
247+ @mock.patch('rabbit_utils.time')
248+ @mock.patch('rabbit_utils.running_nodes')
249+ @mock.patch('rabbit_utils.leader_node')
250+ @mock.patch('rabbit_utils.clustered')
251+ @mock.patch('rabbit_utils.cmp_pkgrevno')
252+ def test_cluster_with_clustered(self, mock_cmp_pkgrevno, mock_clustered,
253+ mock_leader_node, mock_running_nodes,
254+ mock_time, mock_check_output,
255+ mock_check_call):
256+ mock_clustered.return_value = True
257+ mock_leader_node.return_value = ['rabbit@juju-devel7-machine-11']
258+ mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19',
259+ 'rabbit@juju-devel7-machine-11']
260+ rabbit_utils.cluster_with()
261+ assert not mock_check_output.called
262+
263+ @mock.patch('rabbit_utils.subprocess.check_call')
264+ @mock.patch('rabbit_utils.subprocess.check_output')
265+ @mock.patch('rabbit_utils.time')
266+ @mock.patch('rabbit_utils.running_nodes')
267+ @mock.patch('rabbit_utils.leader_node')
268+ @mock.patch('rabbit_utils.clustered')
269+ @mock.patch('rabbit_utils.cmp_pkgrevno')
270+ def test_cluster_with_no_leader(self, mock_cmp_pkgrevno, mock_clustered,
271+ mock_leader_node, mock_running_nodes,
272+ mock_time, mock_check_output,
273+ mock_check_call):
274+ mock_clustered.return_value = False
275+ mock_leader_node.return_value = []
276+ mock_running_nodes.return_value = ['rabbit@juju-devel5-machine-19']
277+ rabbit_utils.cluster_with()
278+ assert not mock_check_output.called

Subscribers

People subscribed via source and target branches