Merge ~canonical-bootstack/charm-openstack-service-checks:check-loadbalancers into ~llama-charmers/charm-openstack-service-checks:master

Proposed by Jeremy Lounder
Status: Merged
Approved by: Jeremy Lounder
Approved revision: a3bc3c9b2a2072a63a7f7a5d7e9829144f742560
Merged at revision: a3bc3c9b2a2072a63a7f7a5d7e9829144f742560
Proposed branch: ~canonical-bootstack/charm-openstack-service-checks:check-loadbalancers
Merge into: ~llama-charmers/charm-openstack-service-checks:master
Diff against target: 329 lines (+283/-0)
4 files modified
config.yaml (+18/-0)
files/plugins/check_octavia.py (+227/-0)
layer.yaml (+2/-0)
lib/lib_openstack_service_checks.py (+36/-0)
Reviewer Review Type Date Requested Status
Jeremy Lounder (community) Approve
Alvaro Uria (community) Needs Fixing
James Hebden (community) Approve
Jose Guedez (community) Approve
Celia Wang Approve
Joe Guo Pending
Review via email: mp+380860@code.launchpad.net

This proposal supersedes a proposal from 2019-12-20.

Commit message

Add checks for loadbalancers.

To post a comment you must log in.
Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote : Posted in a previous version of this proposal

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
James Hebden (ec0) wrote : Posted in a previous version of this proposal

Please see comments inline.
Overall, looks really cool!

One other check I would suggest, is verifying that the image exists in glance for booting new Amphorae if possible. The tag used by Octavia should be available as the amp-image-tag setting on Octavia, so we should either provide a setting on this charm to configure the tag we will monitor (i.e. a string setting called octavia-image-tag with default of 'amp-image-tag') or get this information from Octavia. We can then check that there is an image with this tag present in Glance, as if the image is deleted, we have a problem, because new amphora can not be booted.

I'd be happy to approve this with the below changes, even without the glance check, and we can handle the glance check as a follow up activity.

Revision history for this message
Alvaro Uria (aluria) wrote : Posted in a previous version of this proposal

I've added a comment on a minor typo. OTOH, the best practice suggested for charms is to have as many things autoconfigured as possible (ie. reduce config parameters). So, I would suggest a new interface between charm-octavia and charm-openstack-service-checks which:
1) It would make "check-loadbalancer" be "True", which would create the nrpe check (as James mentioned, an extra config param could exist in charm-openstack-service-checks to disable certain "modules" (lb, amphorae, pools...)
2) The relation would share the amp-image-tag value automatically (if not set, charm-openstack-service-checks would only check for any image with the "octavia-amphora" tag.
3) Possibly do the same for the flavor id, which is a config param in Octavia (however, I don't know the way to guess if a flavor that will be used by the Amphorae exists - the default is that charm-octavia handles it, so an internal value probably exists)

Thank you for the change. Is there a bug we can link this MP to?

Revision history for this message
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal

Hi,
Thanks for the review and feedback.
I've fixed all, and refactored the code to make it more flexible/reasonable.

Could you please review again when you get the time?

NOTE: for image check, it hardcoded `amp_img_tag` to `octavia-amphora` at the moment, and it's disabled by default. Work in progress to find a proper/minimal way to set it.

Revision history for this message
James Hebden (ec0) wrote : Posted in a previous version of this proposal

Just a couple more comments, around config.yaml and default checks.

Revision history for this message
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal

> Just a couple more comments, around config.yaml and default checks.

Hi James,
I've made the changes, please review again when you get the time. Thanks:)

Revision history for this message
Jose Guedez (jfguedez) wrote : Posted in a previous version of this proposal

comments inline...

Revision history for this message
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal

Hi Jose,
Thanks for the great review, they are all good advice. I will make those changes in my new Engineer rotation. Thanks.

Revision history for this message
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal

Hi Jose,
I've made following changes:
1. split octavia nagios check into 4(loadbalancers|amphorae|pools|image), so I don't need to handle the complex logic of which ERROR to return
2. remove the image tag and days default on check_image function. But I still have them on both cli option and config.yaml since they are different levels.

I didn't import nagios3_plugin module for the status codes, so I can keep this script standalone, and run it from any location.

Another review appreciated:)

Revision history for this message
Jose Guedez (jfguedez) wrote : Posted in a previous version of this proposal

Hi Joe - Good stuff, much cleaner now. Comments inline. Just one real comment with the functional test and the new filenames of the checks.

Revision history for this message
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal

Hi Jose,
Changed and replied, see my inline replies.

BTW: I've verified the new octavia checks are added correctly with openstack-on-lxd env: https://pastebin.canonical.com/p/D26hdX3KDH/

Revision history for this message
Jose Guedez (jfguedez) wrote : Posted in a previous version of this proposal

+1 LGTM

review: Approve
Revision history for this message
Celia Wang (ziyiwang) wrote :

LGTM.
I'm going to change the status to "Approved" for merging.

review: Approve
Revision history for this message
Jose Guedez (jfguedez) wrote :

+1. Had already been approved before being re-submitted.

review: Approve
Revision history for this message
James Hebden (ec0) wrote :

Looks good to me, thanks

review: Approve
Revision history for this message
Alvaro Uria (aluria) wrote :

I haven't reviewed the code since it was already approved, but I have run "make test" and I caught 3 issues on func tests:
1) test_deploy_app relates "keystone" and "openstack-service-checks", but there is ambiguity between two interfaces: identity-credentials and identity-notifications. "keystone:identity-credentials" needs to be used.

2) Xenial test fails because octavia packages can't be located. "check-octavia" default value should be False, or at least False in Xenial. https://pastebin.ubuntu.com/p/gmRRbCsPnr/

3) Bionic tests fail because file_stat fails:
"""
> return json.loads(results['Stdout'])
E KeyError: 'Stdout'
"""

review: Needs Fixing
Revision history for this message
Joe Guo (guoqiao) wrote :

Hi Alvaro,
Thanks for the review.
For 2), I can set it to False.
For 1) 3), they are issues from existing functional test, not related to this patch, maybe we can create a separate bug for that?

Revision history for this message
Joe Guo (guoqiao) wrote :

Hi Alvaro,
For 2), I've add a condition check:

    if host.lsb_release()['DISTRIB_RELEASE'] >= '18.04':
        ...

That will bypass octavia checks on 16.04 even it's enabled.
With this, I can keep check-octavia default to True.

For 1) 3), I have trouble to run functional test on my laptop(can not connect to charm store, not sure why). But I've asked help from Jose to prove that they will also happen on master branch. We will create a bug for that soon.

Revision history for this message
Jeremy Lounder (jldev) :
review: Approve
Revision history for this message
Joe Guo (guoqiao) wrote :

Hi Alvaro,
The bug for functional tests has been created at:

https://bugs.launchpad.net/charm-openstack-service-checks/+bug/1869308

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/config.yaml b/config.yaml
2index 073ba9b..edfda51 100644
3--- a/config.yaml
4+++ b/config.yaml
5@@ -10,6 +10,24 @@ options:
6 description: |
7 Switch to turn on or off neutron agents checks. By default, neutron_agents nrpe check is enabled.
8 If a different SDN (ie. Contrail) is in use, you may want to disable this check.
9+ check-octavia:
10+ default: True
11+ type: boolean
12+ description: |
13+ Switch to turn on or off check for octavia services.
14+ octavia-amp-image-tag:
15+ default: "octavia-amphora"
16+ type: string
17+ description: |
18+ The glance image tag octavia will use to create amphora.
19+ octavia-amp-image-days:
20+ default: 365
21+ type: int
22+ description: |
23+ If latest glance image tagged with above octavia-amp-image-tag is updated more than these days ago,
24+ a Nagios warning will be raised. The version of octavia agent builtin in amphora image must match
25+ version of octavia controller, otherwise octavia will fail to communicate with new amphora,
26+ failover will also fail.
27 check-rally:
28 default: False
29 type: boolean
30diff --git a/files/plugins/check_octavia.py b/files/plugins/check_octavia.py
31new file mode 100755
32index 0000000..09fb6b1
33--- /dev/null
34+++ b/files/plugins/check_octavia.py
35@@ -0,0 +1,227 @@
36+#!/usr/bin/env python3
37+
38+import os
39+import sys
40+import json
41+import argparse
42+import subprocess
43+from datetime import datetime, timedelta
44+import openstack
45+
46+NAGIOS_STATUS_OK = 0
47+NAGIOS_STATUS_WARNING = 1
48+NAGIOS_STATUS_CRITICAL = 2
49+NAGIOS_STATUS_UNKNOWN = 3
50+
51+NAGIOS_STATUS = {
52+ NAGIOS_STATUS_OK: 'OK',
53+ NAGIOS_STATUS_WARNING: 'WARNING',
54+ NAGIOS_STATUS_CRITICAL: 'CRITICAL',
55+ NAGIOS_STATUS_UNKNOWN: 'UNKNOWN',
56+}
57+
58+
59+def nagios_exit(status, message):
60+ assert status in NAGIOS_STATUS, "Invalid Nagios status code"
61+ # prefix status name to message
62+ output = '{}: {}'.format(NAGIOS_STATUS[status], message)
63+ print(output) # nagios requires print to stdout, no stderr
64+ sys.exit(status)
65+
66+
67+def check_loadbalancers(connection):
68+ """check loadbalancers status."""
69+
70+ lb_mgr = connection.load_balancer
71+ lb_all = lb_mgr.load_balancers()
72+
73+ # only check enabled lbs
74+ lb_enabled = [lb for lb in lb_all if lb.is_admin_state_up]
75+
76+ # check provisioning_status is ACTIVE for each lb
77+ bad_lbs = [lb for lb in lb_enabled if lb.provisioning_status != 'ACTIVE']
78+ if bad_lbs:
79+ parts = ['loadbalancer {} provisioning_status is {}'.format(
80+ lb.id, lb.provisioning_status) for lb in bad_lbs]
81+ message = ', '.join(parts)
82+ return NAGIOS_STATUS_CRITICAL, message
83+
84+ # raise WARNING if operating_status is not ONLINE
85+ bad_lbs = [lb for lb in lb_enabled if lb.operating_status != 'ONLINE']
86+ if bad_lbs:
87+ parts = ['loadbalancer {} operating_status is {}'.format(
88+ lb.id, lb.operating_status) for lb in bad_lbs]
89+ message = ', '.join(parts)
90+ return NAGIOS_STATUS_CRITICAL, message
91+
92+ net_mgr = connection.network
93+ # check vip port exists for each lb
94+ bad_lbs = []
95+ for lb in lb_enabled:
96+ try:
97+ net_mgr.get_port(lb.vip_port_id)
98+ except openstack.exceptions.NotFoundException:
99+ bad_lbs.append(lb)
100+ if bad_lbs:
101+ parts = ['vip port {} for loadbalancer {} not found'.format(
102+ lb.vip_port_id, lb.id) for lb in bad_lbs]
103+ message = ', '.join(parts)
104+ return NAGIOS_STATUS_CRITICAL, message
105+
106+ # warn about disabled lbs if no other error found
107+ lb_disabled = [lb for lb in lb_all if not lb.is_admin_state_up]
108+ if lb_disabled:
109+ parts = ['loadbalancer {} admin_state_up is False'.format(lb.id)
110+ for lb in lb_disabled]
111+ message = ', '.join(parts)
112+ return NAGIOS_STATUS_WARNING, message
113+
114+ return NAGIOS_STATUS_OK, 'loadbalancers are happy'
115+
116+
117+def check_pools(connection):
118+ """check pools status."""
119+ lb_mgr = connection.load_balancer
120+ pools_all = lb_mgr.pools()
121+ pools_enabled = [pool for pool in pools_all if pool.is_admin_state_up]
122+
123+ # check provisioning_status is ACTIVE for each pool
124+ bad_pools = [pool for pool in pools_enabled if pool.provisioning_status != 'ACTIVE']
125+ if bad_pools:
126+ parts = ['pool {} provisioning_status is {}'.format(
127+ pool.id, pool.provisioning_status) for pool in bad_pools]
128+ message = ', '.join(parts)
129+ return NAGIOS_STATUS_CRITICAL, message
130+
131+ # raise CRITICAL if operating_status is ERROR
132+ bad_pools = [pool for pool in pools_enabled if pool.operating_status == 'ERROR']
133+ if bad_pools:
134+ parts = ['pool {} operating_status is {}'.format(
135+ pool.id, pool.operating_status) for pool in bad_pools]
136+ message = ', '.join(parts)
137+ return NAGIOS_STATUS_CRITICAL, message
138+
139+ # raise WARNING if operating_status is NO_MONITOR
140+ bad_pools = [pool for pool in pools_enabled if pool.operating_status == 'NO_MONITOR']
141+ if bad_pools:
142+ parts = ['pool {} operating_status is {}'.format(
143+ pool.id, pool.operating_status) for pool in bad_pools]
144+ message = ', '.join(parts)
145+ return NAGIOS_STATUS_WARNING, message
146+
147+ return NAGIOS_STATUS_OK, 'pools are happy'
148+
149+
150+def check_amphorae(connection):
151+ """check amphorae status."""
152+
153+ lb_mgr = connection.load_balancer
154+
155+ resp = lb_mgr.get('/v2/octavia/amphorae')
156+ # python api is not available yet, use url
157+ if resp.status_code != 200:
158+ return NAGIOS_STATUS_WARNING, 'amphorae api not working'
159+
160+ data = json.loads(resp.content)
161+ # ouput is like {"amphorae": [{...}, {...}, ...]}
162+ items = data.get('amphorae', [])
163+
164+ # raise CRITICAL for ERROR status
165+ bad_status_list = ('ERROR',)
166+ bad_items = [item for item in items if item['status'] in bad_status_list]
167+ if bad_items:
168+ parts = [
169+ 'amphora {} status is {}'.format(item['id'], item['status'])
170+ for item in bad_items]
171+ message = ', '.join(parts)
172+ return NAGIOS_STATUS_CRITICAL, message
173+
174+ # raise WARNING for these status
175+ bad_status_list = (
176+ 'PENDING_CREATE', 'PENDING_UPDATE', 'PENDING_DELETE', 'BOOTING')
177+ bad_items = [item for item in items if item['status'] in bad_status_list]
178+ if bad_items:
179+ parts = [
180+ 'amphora {} status is {}'.format(item['id'], item['status'])
181+ for item in bad_items]
182+ message = ', '.join(parts)
183+ return NAGIOS_STATUS_WARNING, message
184+
185+ return NAGIOS_STATUS_OK, 'amphorae are happy'
186+
187+
188+def check_image(connection, tag, days):
189+ img_mgr = connection.image
190+ images = list(img_mgr.images(tag=tag))
191+
192+ if not images:
193+ message = ('Octavia requires image with tag {} to create amphora, '
194+ 'but none exist').format(tag)
195+ return NAGIOS_STATUS_CRITICAL, message
196+
197+ active_images = [image for image in images if image.status == 'active']
198+ if not active_images:
199+ parts = ['{}({})'.format(image.name, image.id) for image in images]
200+ message = ('Octavia requires image with tag {} to create amphora, '
201+ 'but none is active: {}').format(tag, ', '.join(parts))
202+ return NAGIOS_STATUS_CRITICAL, message
203+
204+ # raise WARNING if image is too old
205+ when = (datetime.now() - timedelta(days=days)).isoformat()
206+ # updated_at str format: '2019-12-05T18:21:25Z'
207+ fresh_images = [image for image in active_images if image.updated_at > when]
208+ if not fresh_images:
209+ message = ('Octavia requires image with tag {} to create amphora, '
210+ 'but it is older than {} days').format(tag, days)
211+ return NAGIOS_STATUS_WARNING, message
212+
213+ return NAGIOS_STATUS_OK, 'image is ready'
214+
215+
216+if __name__ == '__main__':
217+ parser = argparse.ArgumentParser(
218+ description='Check Octavia status',
219+ formatter_class=argparse.ArgumentDefaultsHelpFormatter,
220+ )
221+ parser.add_argument(
222+ '--env', dest='env', default='/var/lib/nagios/nagios.novarc',
223+ help='Novarc file to use for this check')
224+
225+ check_choices = ['loadbalancers', 'amphorae', 'pools', 'image']
226+ parser.add_argument(
227+ '--check', dest='check', metavar='|'.join(check_choices),
228+ type=str, choices=check_choices,
229+ default=check_choices[0],
230+ help='which check to run')
231+
232+ parser.add_argument(
233+ '--amp-image-tag', dest='amp_image_tag', default='octavia-amphora',
234+ help='amphora image tag for image check')
235+
236+ parser.add_argument(
237+ '--amp-image-days', dest='amp_image_days', type=int, default=365,
238+ help='raise warning if amphora image is older than these days')
239+
240+ args = parser.parse_args()
241+ # source environment vars
242+ command = ['/bin/bash', '-c', 'source {} && env'.format(args.env)]
243+ proc = subprocess.Popen(command, stdout=subprocess.PIPE)
244+ for line in proc.stdout:
245+ (key, _, value) = line.partition(b'=')
246+ os.environ[key.decode('utf-8')] = value.rstrip().decode('utf-8')
247+ proc.communicate()
248+
249+ # use closure to make all checks have same signature
250+ # so we can handle them in same way
251+ def _check_image(connection):
252+ return check_image(connection, args.amp_image_tag, args.amp_image_days)
253+
254+ checks = {
255+ 'loadbalancers': check_loadbalancers,
256+ 'amphorae': check_amphorae,
257+ 'pools': check_pools,
258+ 'image': _check_image,
259+ }
260+
261+ connection = openstack.connect(cloud='envvars')
262+ nagios_exit(*checks[args.check](connection))
263diff --git a/layer.yaml b/layer.yaml
264index 4053919..0606d45 100644
265--- a/layer.yaml
266+++ b/layer.yaml
267@@ -20,6 +20,8 @@ options:
268 - python3-keystoneclient
269 - python3-openstackclient
270 - python-openstackclient
271+ - python3-octaviaclient
272+ - python-octaviaclient
273 snap:
274 fcbtest:
275 channel: stable
276diff --git a/lib/lib_openstack_service_checks.py b/lib/lib_openstack_service_checks.py
277index 716512b..3111b92 100644
278--- a/lib/lib_openstack_service_checks.py
279+++ b/lib/lib_openstack_service_checks.py
280@@ -65,6 +65,18 @@ class OSCHelper():
281 return self.charm_config['check-neutron-agents']
282
283 @property
284+ def is_octavia_check_enabled(self):
285+ return self.charm_config['check-octavia']
286+
287+ @property
288+ def octavia_amp_image_tag(self):
289+ return self.charm_config['octavia-amp-image-tag']
290+
291+ @property
292+ def octavia_amp_image_days(self):
293+ return self.charm_config['octavia-amp-image-days']
294+
295+ @property
296 def skipped_rally_checks(self):
297 skipped_os_components = self.charm_config['skip-rally'].strip()
298 if not skipped_os_components:
299@@ -186,6 +198,30 @@ class OSCHelper():
300 else:
301 nrpe.remove_check(shortname='neutron_agents')
302
303+ # only care about octavia after 18.04
304+ if host.lsb_release()['DISTRIB_RELEASE'] >= '18.04':
305+ if self.is_octavia_check_enabled:
306+ script = os.path.join(self.plugins_dir, 'check_octavia.py')
307+
308+ for check in ('loadbalancers', 'amphorae', 'pools'):
309+ nrpe.add_check(
310+ shortname='octavia_{}'.format(check),
311+ description='Check octavia {} status'.format(check),
312+ check_cmd='{} --check {}'.format(script, check),
313+ )
314+
315+ # image check has extra args, add it separately
316+ check = 'image'
317+ nrpe.add_check(
318+ shortname='octavia_{}'.format(check),
319+ description='Check octavia {} status'.format(check),
320+ check_cmd='{} --check {} --amp-image-tag {} --amp-image-days {}'.format(
321+ script, check, self.octavia_amp_image_tag, self.octavia_amp_image_days),
322+ )
323+ else:
324+ for check in ('loadbalancers', 'amphorae', 'pools', 'image'):
325+ nrpe.remove_check(shortname='octavia_{}'.format(check))
326+
327 if self.contrail_analytics_vip:
328 contrail_check_command = '{} --host {}'.format(
329 os.path.join(self.plugins_dir, 'check_contrail_analytics_alarms.py'),

Subscribers

People subscribed via source and target branches