Merge ~raychan96/charm-graylog:1970964_and_1835156 into charm-graylog:master

Proposed by Chi Wai CHAN
Status: Merged
Approved by: Eric Chen
Approved revision: c4365e49ac50e96d4af4d1cd57ddc6ca3166d5d2
Merged at revision: 4f92d5ba4a8fe0f32622190d9214300e8dc3281e
Proposed branch: ~raychan96/charm-graylog:1970964_and_1835156
Merge into: charm-graylog:master
Diff against target: 485 lines (+159/-90)
7 files modified
src/lib/charms/layer/elasticsearch/api.py (+29/-16)
src/lib/charms/layer/graylog/__init__.py (+1/-1)
src/lib/charms/layer/graylog/api.py (+45/-12)
src/lib/charms/layer/graylog/constants.py (+0/-1)
src/reactive/graylog.py (+53/-29)
src/tests/unit/requirements.txt (+1/-0)
src/tests/unit/test_es_api.py (+30/-31)
Reviewer Review Type Date Requested Status
JamesLin Approve
🤖 prod-jenkaas-bootstack (community) continuous-integration Approve
Eric Chen Approve
BootStack Reviewers Pending
BootStack Reviewers Pending
Review via email: mp+434900@code.launchpad.net

Commit message

This patch fixes #1970964 and #1835156 (the two are related).

Description of the change

* Properly handle non-healthy (not green) elasticsearch cluster with extra hooks. (addressed #1970964)

* Add health check to graylog api and properly handle the cases when graylog api is not ready.

* Installation of content packs should only be done once; calling it multiple times will lead to HTTP 400 bad request.

To post a comment you must log in.
Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
JamesLin (jneo8) wrote :

Retry should based on we know which exception to handle.

Revision history for this message
Chi Wai CHAN (raychan96) :
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Eric Chen (eric-chen) wrote :

Compared the solution from Sudeep. It seems very similiar.
One is included in ElasticSearchApi , another one is an adaptor when using ElasticSearchApi

https://code.launchpad.net/~sudeephb/charm-graylog/+git/charm-graylog/+merge/433720

The common problem: it may block the whole juju agent process.

Is it possible that
- when connection (elasticsearch) fail or healthy check fail, graylog's status will become error
- juju agent still can handle other event, eg: config-change
- recover when the connection okay

review: Needs Fixing
Revision history for this message
JamesLin (jneo8) wrote :

It'a a little bit grey area here because there will have multiple status of ElasticSearch.

1. ElasticSearch is not health, something went wrong.
2. ElasticSearch is still initiating, just need more time.
3. All nodes are health
3. Some nodes are not health

Can we just set graylog charm to BlockState(waiting for ElasticSearch healthy) instead of raise error? Error status may not be precise here.

Also another topic is: The green/yellow/red lights in elasticsearch are on each node. But our checking logic seems don't handle the details like "how many percentage of nodes are healthy then we can use it."

Also the checking logic should be include in update_status hook(This should be another task)

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Eric Chen (eric-chen) wrote :

LGTM, wait for jenkins CI pass

review: Approve
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Chi Wai CHAN (raychan96) wrote :

Interesting pattern when comparing one of the previous successful CI run and current failed CI: they both Error at checking mongodb health for gl2 tests (graylog 2, bionic tests). But the successful CI run passed the test after 3 retries.; while the current failed CI failed because there's no retry now. I am adding back the retry logic in the request function.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
JamesLin (jneo8) :
review: Approve
Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 4f92d5ba4a8fe0f32622190d9214300e8dc3281e

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/src/lib/charms/layer/elasticsearch/api.py b/src/lib/charms/layer/elasticsearch/api.py
2index 811432b..b78064c 100644
3--- a/src/lib/charms/layer/elasticsearch/api.py
4+++ b/src/lib/charms/layer/elasticsearch/api.py
5@@ -1,30 +1,43 @@
6 """Elastic search API connector."""
7
8 import json
9+from urllib.parse import urljoin
10
11 from charmhelpers.core import hookenv
12
13 import requests
14
15
16+class ElasticSearchNotHealthy(Exception):
17+ pass
18+
19+
20 class ElasticSearchApi(object):
21 def __init__(self, endpoints):
22- self.endpoints = endpoints
23- self.reachable_ep = None
24- for ep in self.endpoints:
25- try:
26- hookenv.log("Trying to connect to endpoint {}".format(ep))
27- requests.get(ep)
28- self.reachable_ep = ep
29- break
30- except Exception:
31- hookenv.log("Endpoint {} is not a valid endpoint".format(ep))
32-
33- if self.reachable_ep is not None:
34- hookenv.log("Connected to ES endpoint: {}".format(self.reachable_ep))
35+ # any endpoint will do
36+ self.endpoint = endpoints[0]
37+
38+ def health_check(self):
39+ response = None
40+ health_endpoint = urljoin(
41+ self.endpoint, "/_cluster/health?wait_for_status=green&timeout=10s"
42+ )
43+ try:
44+ hookenv.log("Verifying ES cluster is healthy (green).")
45+ response = requests.get(health_endpoint)
46+ except Exception:
47+ message = "Endpoint {} is not a valid endpoint".format(health_endpoint)
48+ hookenv.log(message)
49+ raise ConnectionError(message)
50 else:
51- hookenv.log("Can not connect to ES API")
52- raise ConnectionError
53+ result = response.json()
54+ message = "ES cluster is healthy (green)."
55+ if result.get("status") == "green":
56+ hookenv.log(message)
57+ else:
58+ message = "ES cluster is not healthy or cannot connect to ES."
59+ hookenv.log(message)
60+ raise ElasticSearchNotHealthy(message)
61
62 # NOTE(erlon): As per Graylog's docs, graylog can inadvertently create a
63 # index called 'graylog_deflector' on elastic search. This index conflicts
64@@ -36,5 +49,5 @@ class ElasticSearchApi(object):
65 "persistent": {"action.auto_create_index": "-graylog_deflector,+*"}
66 }
67 api_params = json.dumps(api_params)
68- url = "%s%s" % (self.reachable_ep, api_url)
69+ url = urljoin(self.endpoint, api_url)
70 requests.put(url, data=api_params)
71diff --git a/src/lib/charms/layer/graylog/__init__.py b/src/lib/charms/layer/graylog/__init__.py
72index 5bab227..fa74c7d 100644
73--- a/src/lib/charms/layer/graylog/__init__.py
74+++ b/src/lib/charms/layer/graylog/__init__.py
75@@ -1,6 +1,6 @@
76 """Graylog library."""
77
78-from .api import GraylogApi # noqa: F401
79+from .api import ApiTimeout, GraylogApi # noqa: F401
80 from .logextract import ( # noqa: F401
81 GraylogPipelines,
82 GraylogRules,
83diff --git a/src/lib/charms/layer/graylog/api.py b/src/lib/charms/layer/graylog/api.py
84index d93e741..cdda27b 100644
85--- a/src/lib/charms/layer/graylog/api.py
86+++ b/src/lib/charms/layer/graylog/api.py
87@@ -4,6 +4,11 @@ import os
88
89 import requests
90
91+from tenacity import retry
92+from tenacity.retry import retry_if_exception_type
93+from tenacity.stop import stop_after_delay
94+from tenacity.wait import wait_fixed
95+
96
97 # When using 'certifi' from the virtualenv, the system-wide certificates store
98 # is not used, so installed certificates won't be used to validate hosts.
99@@ -11,6 +16,7 @@ import requests
100 # https://git.launchpad.net/ubuntu/+source/python-certifi/tree/debian/patches/0001-Use-Debian-provided-etc-ssl-certs-ca-certificates.cr.patch
101 SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"
102 DEFAULT_BACKEND_USER_ROLE = "Reader"
103+DEFAULT_REST_API_TIMEOUT = 10
104
105 # We are in a charm environment
106 charm = False
107@@ -25,6 +31,12 @@ def get_ignore_indexer_failures_file(): # noqa: D103
108 return "/usr/local/lib/nagios/plugins/ignore_indexer_failures.timestamp"
109
110
111+class ApiTimeout(Exception):
112+ """Unable to restart Graylog in a timely manner."""
113+
114+ pass
115+
116+
117 class GraylogApi:
118 """Manage Graylog via its API."""
119
120@@ -41,10 +53,23 @@ class GraylogApi:
121 self.token_name = token_name
122 self.token = None
123 self.input_types = None
124- self.req_timeout = 3
125 self.req_retries = 4
126+ self.req_timeout = 3
127 self.verify = verify
128
129+ @retry(
130+ wait=wait_fixed(5),
131+ stop=stop_after_delay(DEFAULT_REST_API_TIMEOUT),
132+ retry=retry_if_exception_type(ApiTimeout),
133+ reraise=True,
134+ )
135+ def health_check(self):
136+ health_check_endpoint = ""
137+ if not self.request(health_check_endpoint, prime_token=False):
138+ raise ApiTimeout(
139+ "Timeout waiting for graylog api; will retry after update-status."
140+ )
141+
142 def request( # noqa: C901
143 self, path, method="GET", data={}, params=None, prime_token=True
144 ):
145@@ -65,11 +90,13 @@ class GraylogApi:
146 }
147 if data:
148 data = json.dumps(data, indent=True)
149+
150 tries = 0
151- while tries < self.req_retries:
152+ result = False
153+ while tries < self.req_retries and not result:
154 tries += 1
155 try:
156- resp = requests.api.request(
157+ resp = requests.request(
158 method,
159 url,
160 auth=self.auth,
161@@ -79,24 +106,30 @@ class GraylogApi:
162 timeout=self.req_timeout,
163 verify=SYSTEM_CA_BUNDLE if self.verify is None else self.verify,
164 )
165+ except Exception as ex:
166+ msg = "Error calling graylog api: {}".format(ex)
167+ if charm:
168+ log(msg)
169+ else:
170+ print(msg)
171+ result = False
172+ else:
173 if resp.ok:
174 if method == "DELETE":
175- return True
176+ result = True
177 if resp.content:
178- return resp.json()
179+ result = resp.json()
180 else:
181- msg = "API error code: {}".format(resp.status_code)
182+ msg = "{}: response of graylog api is not okay. Reason: {}".format(
183+ resp.status_code, resp.reason
184+ )
185 if charm:
186 log(msg)
187 hookenv.status_set("blocked", msg)
188 else:
189 print(msg)
190- except Exception as ex:
191- msg = "Error calling graylog api: {}".format(ex)
192- if charm:
193- log(msg)
194- else:
195- print(msg)
196+ result = False
197+ return result
198
199 def token_get(self, token_name=None, halt=False):
200 """Return a token."""
201diff --git a/src/lib/charms/layer/graylog/constants.py b/src/lib/charms/layer/graylog/constants.py
202index 4340dc8..2c5f733 100644
203--- a/src/lib/charms/layer/graylog/constants.py
204+++ b/src/lib/charms/layer/graylog/constants.py
205@@ -14,7 +14,6 @@ SHIPPED_SNAP_SERVER_DEFAULT_CONF_FILE = (
206 SERVER_DEFAULT_CONF_FILE = "/var/snap/graylog/current/default-graylog-server"
207 ELASTICSEARCH_DISCOVERY_PORT = "9300"
208 SERVICE_NAME = "snap.graylog.graylog"
209-DEFAULT_REST_API_TIMEOUT = 120
210 NAGIOS_USERNAME = "nagios"
211 CERT_PATH = os.path.join(SNAP_COMMON_DIR, "server.crt")
212 CERT_KEY_PATH = os.path.join(SNAP_COMMON_DIR, "server.key")
213diff --git a/src/reactive/graylog.py b/src/reactive/graylog.py
214index b6f4313..ffd17ad 100644
215--- a/src/reactive/graylog.py
216+++ b/src/reactive/graylog.py
217@@ -17,6 +17,7 @@ import charms.leadership
218 from charms.layer import snap, tls_client
219 from charms.layer.elasticsearch import api as esapi
220 from charms.layer.graylog import (
221+ ApiTimeout,
222 GraylogApi,
223 LogExtractPipeline,
224 create_or_update_ldap_backend,
225@@ -32,7 +33,6 @@ from charms.layer.graylog.constants import (
226 CERT_PATH,
227 CONF_FILE,
228 CONTENT_PACKS_PATH,
229- DEFAULT_REST_API_TIMEOUT,
230 ELASTICSEARCH_DISCOVERY_PORT,
231 NAGIOS_USERNAME,
232 SERVER_DEFAULT_CONF_FILE,
233@@ -60,12 +60,6 @@ from charms.reactive.helpers import data_changed
234 import yaml
235
236
237-class ApiTimeout(Exception):
238- """Unable to restart Graylog in a timely manner."""
239-
240- pass
241-
242-
243 @hook("upgrade-charm")
244 def upgrade_charm():
245 """Reconfigure Graylog upon Juju charm upgrade."""
246@@ -326,6 +320,7 @@ def report_status(): # noqa: C901
247 beats_available = is_state("beat.setup")
248 es_connected = is_state("elasticsearch.connected")
249 es_available = is_state("elasticsearch.available")
250+ es_ready = is_state("elasticsearch.ready")
251 mongodb_connected = is_state("mongodb.connected")
252 mongodb_available = is_state("mongodb.available")
253 requested_certs = is_state("graylog.certificates.configured")
254@@ -389,7 +384,7 @@ def report_status(): # noqa: C901
255 # Elasticsearch
256 if es_connected and not es_available:
257 waiting_apps.append("elasticsearch")
258- elif es_available:
259+ elif es_available and es_ready:
260 ready_apps.append("elasticsearch")
261
262 # MongoDB
263@@ -400,7 +395,7 @@ def report_status(): # noqa: C901
264
265 # Graylog REST API
266 try:
267- _verify_rest_api_is_alive(timeout=5)
268+ _verify_rest_api_is_alive()
269 except ApiTimeout:
270 waiting_apps.append("REST API")
271
272@@ -415,6 +410,15 @@ def report_status(): # noqa: C901
273
274
275 @when("graylog.configured")
276+@when("graylog_api.configured")
277+@when_not("graylog.content_pack_installed")
278+def configure_content_packs():
279+ """Install graylog content packs."""
280+ install_content_packs()
281+ set_state("graylog.content_pack_installed")
282+
283+
284+@when("graylog.configured")
285 @when("mongodb_config.set")
286 @when("elasticsearch_config.set")
287 @when_not("graylog.needs_restart")
288@@ -426,9 +430,8 @@ def configure_graylog_api(*discard):
289 except ApiTimeout:
290 # Corner case: ES/Mongo are up, but REST API is not up yet.
291 # Just wait (status already set in report_status()) and try again next time.
292- pass
293+ remove_state("graylog_api.configured")
294 else:
295- install_content_packs()
296 remove_state("beat.setup")
297 remove_state("graylog_index_sets.configured")
298 remove_state("graylog_inputs.configured")
299@@ -723,7 +726,29 @@ def configure_inputs(*discard):
300 set_state("graylog_inputs.configured")
301
302
303+@when_not("elasticsearch.ready")
304+@when("elasticsearch.available")
305+def check_elasticsearch_health(elasticsearch):
306+ """Check if ES cluster is in green state."""
307+ http_hosts = [
308+ "http://{}:{}".format(unit["host"], unit["port"])
309+ for unit in elasticsearch.list_unit_data()
310+ ]
311+ es = esapi.ElasticSearchApi(http_hosts)
312+ try:
313+ es.health_check()
314+ except ConnectionError as conn_err:
315+ remove_state("elasticsearch.ready")
316+ hookenv.log("ES cluster not ready: {}".format(conn_err))
317+ except esapi.ElasticSearchNotHealthy as not_healthy_err:
318+ remove_state("elasticsearch.ready")
319+ hookenv.log("ES cluster not ready: {}".format(not_healthy_err))
320+ else:
321+ set_state("elasticsearch.ready")
322+
323+
324 @when("graylog.configured")
325+@when("elasticsearch.ready")
326 @when("elasticsearch.available")
327 def configure_elasticsearch(elasticsearch):
328 """Configure ES parameters in Graylog's configuration file."""
329@@ -940,13 +965,13 @@ def restart_service(service=SERVICE_NAME):
330 This handles situations when relations are adding quick enough to trigger
331 systemd's standard protection against frequent restarts.
332 """
333- host.service_restart(service)
334+ num_retries = 2
335 remove_state("graylog.needs_restart")
336- if host.service_running(service):
337- return
338-
339- time.sleep(15)
340- host.service_restart(service)
341+ for i in range(num_retries):
342+ host.service_restart(service)
343+ if host.service_running(service):
344+ return
345+ time.sleep(15)
346
347
348 def set_conf(key, value, conf_path=CONF_FILE):
349@@ -1197,6 +1222,7 @@ def set_up_nagios_user():
350
351 @when("leadership.is_leader")
352 @when("leadership.set.nagios_password")
353+@when("graylog_api.configured")
354 @when_not("leadership.set.nagios_token")
355 def set_up_nagios_token():
356 """Configure a token for Nagios to use."""
357@@ -1376,7 +1402,7 @@ def trigger_restart_after_tls_cert_update():
358 set_state("graylog.initial_certs_received")
359 remove_state("tls_client.certs.saved")
360 remove_state("tls_client.server.certs.changed")
361- set_state("graylog.needs_restart")
362+ flag_restart_and_api_reconfigure_needed()
363
364
365 @when("graylog.configured")
366@@ -1449,19 +1475,17 @@ def get_default_graylog_client(): # noqa: D103
367 )
368
369
370-def _verify_rest_api_is_alive(timeout=DEFAULT_REST_API_TIMEOUT):
371+def _verify_rest_api_is_alive():
372 hookenv.log("Verifying REST API is alive...")
373 g = get_default_graylog_client()
374- url = "" # Will query using the base URL of the client, i.e. /api/
375- resp = g.request(url)
376- start_ts = time.time()
377- while resp is None:
378- time.sleep(5)
379- hookenv.log("Retrying REST API check...")
380- resp = g.request(url)
381- if time.time() - start_ts > timeout:
382- raise ApiTimeout()
383- hookenv.log("REST API is up")
384+ try:
385+ g.health_check()
386+ except ApiTimeout as err:
387+ hookenv.log("REST API is not up")
388+ hookenv.status_set("blocked", str(err))
389+ raise ApiTimeout(err)
390+ else:
391+ hookenv.log("REST API is up")
392
393
394 def _maybe_install_ca_certificates_hook():
395diff --git a/src/tests/unit/requirements.txt b/src/tests/unit/requirements.txt
396index c465d62..fbcf43e 100644
397--- a/src/tests/unit/requirements.txt
398+++ b/src/tests/unit/requirements.txt
399@@ -5,3 +5,4 @@ netifaces
400 pytest
401 pytest-cov
402 requests
403+tenacity
404diff --git a/src/tests/unit/test_es_api.py b/src/tests/unit/test_es_api.py
405index 8032de5..c4959d0 100644
406--- a/src/tests/unit/test_es_api.py
407+++ b/src/tests/unit/test_es_api.py
408@@ -8,40 +8,35 @@ import charms.layer.elasticsearch.api as api
409
410 class TestESAPI(unittest.TestCase):
411 @mock.patch("requests.get")
412- def test_init_class(self, req_get):
413- def _side_effect_generator(endpoint):
414- if "unreachable" in endpoint:
415- raise ConnectionError
416+ def test_init_class_all_okay(self, req_get):
417+ fake_endpoints = ["http://reachable.host1:9000", "http://reachable.host2:9000"]
418+ try:
419+ mock_resp = mock.Mock()
420+ mock_resp.json.return_value = {"status": "green"}
421+ req_get.return_value = mock_resp
422+ es = api.ElasticSearchApi(fake_endpoints)
423+ es.health_check()
424+ except Exception:
425+ self.fail("Elasticsearch should be in green state.")
426
427- # All reachable
428+ @mock.patch("requests.get")
429+ def test_init_class_connection_error(self, req_get):
430 fake_endpoints = ["http://reachable.host1:9000", "http://reachable.host2:9000"]
431- es = api.ElasticSearchApi(fake_endpoints)
432- self.assertTrue(es.reachable_ep in fake_endpoints[0])
433-
434- fake_endpoints = [
435- "http://unreachable.host1:9000",
436- "http://unreachable.host2:9000",
437- "http://host2:9000",
438- ]
439- req_get.side_effect = _side_effect_generator
440- es = api.ElasticSearchApi(fake_endpoints)
441- self.assertTrue(es.reachable_ep in fake_endpoints[2])
442-
443- fake_endpoints = [
444- "http://reachable.host1:9000",
445- "http://unreachable.host1:9000",
446- "http://unreachable.host2:9000",
447- ]
448- req_get.side_effect = _side_effect_generator
449- es = api.ElasticSearchApi(fake_endpoints)
450- self.assertTrue(es.reachable_ep in fake_endpoints[0])
451+ req_get.reset_mock()
452+ req_get.side_effect = ConnectionError
453+ with self.assertRaises(ConnectionError):
454+ es = api.ElasticSearchApi(fake_endpoints)
455+ es.health_check()
456
457- fake_endpoints = [
458- "http://unreachable.host1:9000",
459- "http://unreachable.host2:9000",
460- ]
461- req_get.side_effect = _side_effect_generator
462- self.assertRaises(ConnectionError, api.ElasticSearchApi, fake_endpoints)
463+ @mock.patch("requests.get")
464+ def test_init_class_not_healthy(self, req_get):
465+ fake_endpoints = ["http://reachable.host1:9000", "http://reachable.host2:9000"]
466+ mock_resp = mock.Mock()
467+ mock_resp.json.return_value = {"status": "yellow"}
468+ req_get.return_value = mock_resp
469+ with self.assertRaises(api.ElasticSearchNotHealthy):
470+ es = api.ElasticSearchApi(fake_endpoints)
471+ es.health_check()
472
473 @mock.patch("json.dumps")
474 @mock.patch("requests.put")
475@@ -50,6 +45,10 @@ class TestESAPI(unittest.TestCase):
476 fake_endpoints = ["http://host1:9000", "http://host2:9000"]
477 j_dumps.return_value = '{"fake_data":"data"}'
478
479+ mock_resp = mock.Mock()
480+ mock_resp.json.return_value = {"status": "green"}
481+ req_get.return_value = mock_resp
482+
483 es = api.ElasticSearchApi(fake_endpoints)
484 es.disable_auto_create_index()
485 req_put.assert_called_with(

Subscribers

People subscribed via source and target branches

to all changes: