Merge ~vultaire/charm-elasticsearch:lp1835410 into charm-elasticsearch:master

Proposed by Paul Goins
Status: Merged
Approved by: Jeremy Lounder
Approved revision: 78825614a890118d762f755ca2a32985afc9375f
Merged at revision: 3cb5d16d4a183b190a39f000c0c25e81b9d2921d
Proposed branch: ~vultaire/charm-elasticsearch:lp1835410
Merge into: charm-elasticsearch:master
Diff against target: 152 lines (+99/-37)
2 files modified
bin/wait_for_peer.py (+97/-0)
tasks/peer-relations.yml (+2/-37)
Reviewer                      Review Type    Date Requested    Status
Jeremy Lounder (community)                                     Approve
Review via email: mp+386119@code.launchpad.net

Commit message

Made initial peer connection code more forgiving
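
As a rough illustration of what "more forgiving" means here, the sketch below shows a deadline-bounded retry loop of the kind bin/wait_for_peer.py introduces: check the node count, and if the unit is still alone, run a recovery step and try again until a timeout expires. The names `wait_for_peers`, `get_count`, `recover`, `sleep` and `clock` are hypothetical and injected so the loop can be exercised without Elasticsearch; this is a sketch under those assumptions, not the charm's code.

```python
import time


def wait_for_peers(get_count, recover, timeout=600, delay=5,
                   sleep=time.sleep, clock=time.monotonic):
    """Return 0 once get_count() exceeds 1, or 1 if `timeout` seconds pass.

    All callables are injected (hypothetical names) so the loop can be
    tested without a real Elasticsearch service.
    """
    start = clock()
    while True:
        if get_count() > 1:
            return 0  # at least one peer has joined
        if clock() - start > timeout:
            return 1  # gave up waiting
        recover()  # e.g. restart the service and wait for its port
        sleep(delay)
```

Returning an exit status rather than raising keeps the function easy to wire into `sys.exit()`, which is how the script in this proposal reports failure to the task that runs it.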

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Jeremy Lounder (jldev) :
review: Approve
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 3cb5d16d4a183b190a39f000c0c25e81b9d2921d

Preview Diff

diff --git a/bin/wait_for_peer.py b/bin/wait_for_peer.py
new file mode 100755
index 0000000..bffd549
--- /dev/null
+++ b/bin/wait_for_peer.py
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+
+import logging
+import os
+import socket
+import sys
+import time
+import traceback
+
+logging.basicConfig(
+    stream=sys.stderr,
+    level=logging.DEBUG,
+    format="%(asctime)s [%(levelname)s] %(message)s",
+)
+
+import requests
+
+sys.path.append(os.path.abspath("charm-helpers"))
+from charmhelpers.core.host import service_restart
+
+
+CONNECTION_TIMEOUT = 5
+POLL_DELAY = 5
+SCRIPT_TIMEOUT = 600
+
+
+def main():
+    start_time = time.time()
+    while True:
+        node_count = get_node_count()
+        if node_count > 1:
+            logging.info("Multiple ES nodes detected; success.")
+            return 0
+        if time.time() - start_time > SCRIPT_TIMEOUT:
+            logging.error("Timed out while waiting for peers")
+            return 1
+        logging.info("Node count is not greater than 1, will restart and retry")
+        logging.info("Restarting elasticsearch...")
+        restart_elasticsearch()
+        logging.info("Waiting for elasticsearch port to open...")
+        wait_for_es_port()
+        logging.info("Waiting {} seconds...".format(POLL_DELAY))
+        time.sleep(POLL_DELAY)
+
+
+def get_node_count():
+    # NOTE: 503 errors have been intermittently sighted here, so we'll retry
+    # in case of errors.
+    cluster_health_url = "http://localhost:9200/_cluster/health"
+    max_attempts = 5
+    for i in range(max_attempts):
+        try:
+            return requests.get(cluster_health_url, timeout=CONNECTION_TIMEOUT).json()[
+                "number_of_nodes"
+            ]
+        except Exception:
+            if i < max_attempts - 1:
+                logging.info(
+                    "Error occurred while polling {}; retrying".format(
+                        cluster_health_url
+                    )
+                )
+                time.sleep(POLL_DELAY)
+            else:
+                logging.warning(traceback.format_exc())
+    else:
+        logging.warning(
+            "Error occurred while polling {}; giving up".format(cluster_health_url)
+        )
+        return 0
+
+
+def restart_elasticsearch():
+    service_restart("elasticsearch")
+
+
+def wait_for_es_port(timeout=60):
+    start_time = time.time()
+    while time.time() - start_time < timeout:
+        try:
+            _test_socket_connection("localhost", 9200)
+        except Exception:
+            time.sleep(POLL_DELAY)
+        else:
+            return
+    else:
+        # Last try; allow exceptions to bubble up
+        _test_socket_connection("localhost", 9200)
+
+
+def _test_socket_connection(host, port):
+    s = socket.create_connection((host, port), timeout=CONNECTION_TIMEOUT)
+    s.close()
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/tasks/peer-relations.yml b/tasks/peer-relations.yml
index a9b5bc7..d50dd2e 100644
--- a/tasks/peer-relations.yml
+++ b/tasks/peer-relations.yml
@@ -19,42 +19,7 @@
     - peer-relation-changed
   wait_for: port=9200

-- name: Record current cluster health
-  tags:
-    - peer-relation-joined
-    - peer-relation-changed
-  uri: url=http://localhost:9200/_cluster/health return_content=yes
-  register: cluster_health
-
-- name: Restart if not part of cluster
-  tags:
-    - peer-relation-joined
-    - peer-relation-changed
-  service: name=elasticsearch state=restarted
-  when: cluster_health.json.number_of_nodes == 1
-
-- name: Wait until the local service is available after restart
-  tags:
-    - peer-relation-joined
-    - peer-relation-changed
-  wait_for: port=9200
-  when: cluster_health.json.number_of_nodes == 1
-
-- name: Pause to ensure that after restart unit has time to join.
-  tags:
-    - peer-relation-changed
-  pause: seconds=30
-  when: cluster_health.json.number_of_nodes == 1
-
-- name: Record cluster health after restart
-  tags:
-    - peer-relation-changed
-  uri: url=http://localhost:9200/_cluster/health return_content=yes
-  register: cluster_health_after_restart
-  when: cluster_health.json.number_of_nodes == 1
-
-- name: Fail if unit is still not part of cluster
+- name: Wait for the cluster to increase in size
   tags:
     - peer-relation-changed
-  fail: msg="Unit failed to join cluster after peer-relation-changed"
-  when: cluster_health.json.number_of_nodes == 1 and cluster_health_after_restart.json.number_of_nodes == 1
+  command: bin/wait_for_peer.py
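
The script's get_node_count() retries the health request several times and reports 0 nodes if every attempt fails, so that the outer loop simply tries again. A minimal sketch of that retry-then-fall-back idea follows; `node_count_with_retries` and the `fetch_health` callable are hypothetical names standing in for the HTTP request, so the example needs no network access and should not be read as the charm's implementation.

```python
import logging
import time


def node_count_with_retries(fetch_health, max_attempts=5, delay=5,
                            sleep=time.sleep):
    """Return number_of_nodes from fetch_health(), or 0 if every attempt fails.

    fetch_health is a hypothetical callable returning the decoded JSON body
    of /_cluster/health; it may raise on errors such as a 503 response.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_health()["number_of_nodes"]
        except Exception:
            logging.info("health poll failed (attempt %d/%d)",
                         attempt, max_attempts)
            if attempt < max_attempts:
                sleep(delay)
    # Treat "could not determine" as "no peers yet" so the caller retries.
    return 0
```

Falling back to 0 rather than raising means a transient error is handled the same way as "still alone", which matches the behaviour visible in the diff above.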
