Merge ~canonical-bootstack/charm-openstack-service-checks:check-loadbalancers into ~llama-charmers/charm-openstack-service-checks:master
- Git
- lp:~canonical-bootstack/charm-openstack-service-checks
- check-loadbalancers
- Merge into master
Status: | Merged | ||||
---|---|---|---|---|---|
Approved by: | Jeremy Lounder | ||||
Approved revision: | a3bc3c9b2a2072a63a7f7a5d7e9829144f742560 | ||||
Merged at revision: | a3bc3c9b2a2072a63a7f7a5d7e9829144f742560 | ||||
Proposed branch: | ~canonical-bootstack/charm-openstack-service-checks:check-loadbalancers | ||||
Merge into: | ~llama-charmers/charm-openstack-service-checks:master | ||||
Diff against target: |
329 lines (+283/-0) 4 files modified
config.yaml (+18/-0) files/plugins/check_octavia.py (+227/-0) layer.yaml (+2/-0) lib/lib_openstack_service_checks.py (+36/-0) |
||||
Related bugs: |
|
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Jeremy Lounder (community) | Approve | ||
Alvaro Uria (community) | Needs Fixing | ||
James Hebden (community) | Approve | ||
Jose Guedez (community) | Approve | ||
Celia Wang | Approve | ||
Joe Guo | Pending | ||
Review via email: mp+380860@code.launchpad.net |
This proposal supersedes a proposal from 2019-12-20.
Commit message
Add checks for loadbalancers.
Description of the change
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote : Posted in a previous version of this proposal | # |
James Hebden (ec0) wrote : Posted in a previous version of this proposal | # |
Please see comments inline.
Overall, looks really cool!
One other check I would suggest, is verifying that the image exists in glance for booting new Amphorae if possible. The tag used by Octavia should be available as the amp-image-tag setting on Octavia, so we should either provide a setting on this charm to configure the tag we will monitor (i.e. a string setting called octavia-image-tag with default of 'amp-image-tag') or get this information from Octavia. We can then check that there is an image with this tag present in Glance, as if the image is deleted, we have a problem, because new amphora can not be booted.
I'd be happy to approve this with the below changes, even without the glance check, and we can handle the glance check as a follow up activity.
Alvaro Uria (aluria) wrote : Posted in a previous version of this proposal | # |
I've added a comment on a minor typo. OTOH, the best practice suggested for charms is to have as many things autoconfigured as possible (ie. reduce config parameters). So, I would suggest a new interface between charm-octavia and charm-openstack
1) It would make "check-
2) The relation would share the amp-image-tag value automatically (if not set, charm-openstack
3) Possibly do the same for the flavor id, which is a config param in Octavia (however, I don't know the way to guess if a flavor that will be used by the Amphorae exists - the default is that charm-octavia handles it, so an internal value probably exists)
Thank you for the change. Is there a bug we can link this MP to?
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal | # |
Hi,
Thanks for the review and feedback.
I've fixed all, and refactored the code to make it more flexible/
Could you please review again when you get the time?
NOTE: for image check, it hardcoded `amp_img_tag` to `octavia-amphora` at the moment, and it's disabled by default. Work in progress to find a proper/minimal way to set it.
James Hebden (ec0) wrote : Posted in a previous version of this proposal | # |
Just a couple more comments, around config.yaml and default checks.
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal | # |
> Just a couple more comments, around config.yaml and default checks.
Hi James,
I've made the changes, please review again when you get the time. Thanks:)
Jose Guedez (jfguedez) wrote : Posted in a previous version of this proposal | # |
comments inline...
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal | # |
Hi Jose,
Thanks for the great review, they are all good advice. I will make those changes in my new Engineer rotation. Thanks.
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal | # |
Hi Jose,
I've made following changes:
1. split octavia nagios check into 4(loadbalancers
2. remove the image tag and days default on check_image function. But I still have them on both cli option and config.yaml since they are different levels.
I didn't import nagios3_plugin module for the status codes, so I can keep this script standalone, and run it from any location.
Another review appreciated:)
Jose Guedez (jfguedez) wrote : Posted in a previous version of this proposal | # |
Hi Joe - Good stuff, much cleaner now. Comments inline. Just one real comment with the functional test and the new filenames of the checks.
Joe Guo (guoqiao) wrote : Posted in a previous version of this proposal | # |
Hi Jose,
Changed and replied, see my inline replies.
BTW: I've verified the new octavia checks are added correctly with openstack-on-lxd env: https:/
Jose Guedez (jfguedez) wrote : Posted in a previous version of this proposal | # |
+1 LGTM
Celia Wang (ziyiwang) wrote : | # |
LGTM.
I'm going to change the status to "Approved" for merging.
Jose Guedez (jfguedez) wrote : | # |
+1. Had already been approved before being re-submitted.
James Hebden (ec0) wrote : | # |
Looks good to me, thanks
Alvaro Uria (aluria) wrote : | # |
I haven't reviewed the code since it was already approved, but I have run "make test" and I caught 3 issues on func tests:
1) test_deploy_app relates "keystone" and "openstack-
2) Xenial test fails because octavia packages can't be located. "check-octavia" default value should be False, or at least False in Xenial. https:/
3) Bionic tests fail because file_stat fails:
"""
> return json.loads(
E KeyError: 'Stdout'
"""
Joe Guo (guoqiao) wrote : | # |
Hi Alvaro,
Thanks for the review.
For 2), I can set it to False.
For 1) 3), they are issues from existing functional test, not related to this patch, maybe we can create a separate bug for that?
Joe Guo (guoqiao) wrote : | # |
Hi Alvaro,
For 2), I've add a condition check:
if host.lsb_
...
That will bypass octavia checks on 16.04 even it's enabled.
With this, I can keep check-octavia default to True.
For 1) 3), I have trouble to run functional test on my laptop(can not connect to charm store, not sure why). But I've asked help from Jose to prove that they will also happen on master branch. We will create a bug for that soon.
Jeremy Lounder (jldev) : | # |
Joe Guo (guoqiao) wrote : | # |
Hi Alvaro,
The bug for functional tests has been created at:
https:/
Preview Diff
1 | diff --git a/config.yaml b/config.yaml |
2 | index 073ba9b..edfda51 100644 |
3 | --- a/config.yaml |
4 | +++ b/config.yaml |
5 | @@ -10,6 +10,24 @@ options: |
6 | description: | |
7 | Switch to turn on or off neutron agents checks. By default, neutron_agents nrpe check is enabled. |
8 | If a different SDN (ie. Contrail) is in use, you may want to disable this check. |
9 | + check-octavia: |
10 | + default: True |
11 | + type: boolean |
12 | + description: | |
13 | + Switch to turn on or off check for octavia services. |
14 | + octavia-amp-image-tag: |
15 | + default: "octavia-amphora" |
16 | + type: string |
17 | + description: | |
18 | + The glance image tag octavia will use to create amphora. |
19 | + octavia-amp-image-days: |
20 | + default: 365 |
21 | + type: int |
22 | + description: | |
23 | + If latest glance image tagged with above octavia-amp-image-tag is updated more than these days ago, |
24 | + a Nagios warning will be raised. The version of octavia agent builtin in amphora image must match |
25 | + version of octavia controller, otherwise octavia will fail to communicate with new amphora, |
26 | + failover will also fail. |
27 | check-rally: |
28 | default: False |
29 | type: boolean |
30 | diff --git a/files/plugins/check_octavia.py b/files/plugins/check_octavia.py |
31 | new file mode 100755 |
32 | index 0000000..09fb6b1 |
33 | --- /dev/null |
34 | +++ b/files/plugins/check_octavia.py |
35 | @@ -0,0 +1,227 @@ |
36 | +#!/usr/bin/env python3 |
37 | + |
38 | +import os |
39 | +import sys |
40 | +import json |
41 | +import argparse |
42 | +import subprocess |
43 | +from datetime import datetime, timedelta |
44 | +import openstack |
45 | + |
46 | +NAGIOS_STATUS_OK = 0 |
47 | +NAGIOS_STATUS_WARNING = 1 |
48 | +NAGIOS_STATUS_CRITICAL = 2 |
49 | +NAGIOS_STATUS_UNKNOWN = 3 |
50 | + |
51 | +NAGIOS_STATUS = { |
52 | + NAGIOS_STATUS_OK: 'OK', |
53 | + NAGIOS_STATUS_WARNING: 'WARNING', |
54 | + NAGIOS_STATUS_CRITICAL: 'CRITICAL', |
55 | + NAGIOS_STATUS_UNKNOWN: 'UNKNOWN', |
56 | +} |
57 | + |
58 | + |
59 | +def nagios_exit(status, message): |
60 | + assert status in NAGIOS_STATUS, "Invalid Nagios status code" |
61 | + # prefix status name to message |
62 | + output = '{}: {}'.format(NAGIOS_STATUS[status], message) |
63 | + print(output) # nagios requires print to stdout, no stderr |
64 | + sys.exit(status) |
65 | + |
66 | + |
67 | +def check_loadbalancers(connection): |
68 | + """check loadbalancers status.""" |
69 | + |
70 | + lb_mgr = connection.load_balancer |
71 | + lb_all = lb_mgr.load_balancers() |
72 | + |
73 | + # only check enabled lbs |
74 | + lb_enabled = [lb for lb in lb_all if lb.is_admin_state_up] |
75 | + |
76 | + # check provisioning_status is ACTIVE for each lb |
77 | + bad_lbs = [lb for lb in lb_enabled if lb.provisioning_status != 'ACTIVE'] |
78 | + if bad_lbs: |
79 | + parts = ['loadbalancer {} provisioning_status is {}'.format( |
80 | + lb.id, lb.provisioning_status) for lb in bad_lbs] |
81 | + message = ', '.join(parts) |
82 | + return NAGIOS_STATUS_CRITICAL, message |
83 | + |
84 | + # raise WARNING if operating_status is not ONLINE |
85 | + bad_lbs = [lb for lb in lb_enabled if lb.operating_status != 'ONLINE'] |
86 | + if bad_lbs: |
87 | + parts = ['loadbalancer {} operating_status is {}'.format( |
88 | + lb.id, lb.operating_status) for lb in bad_lbs] |
89 | + message = ', '.join(parts) |
90 | + return NAGIOS_STATUS_CRITICAL, message |
91 | + |
92 | + net_mgr = connection.network |
93 | + # check vip port exists for each lb |
94 | + bad_lbs = [] |
95 | + for lb in lb_enabled: |
96 | + try: |
97 | + net_mgr.get_port(lb.vip_port_id) |
98 | + except openstack.exceptions.NotFoundException: |
99 | + bad_lbs.append(lb) |
100 | + if bad_lbs: |
101 | + parts = ['vip port {} for loadbalancer {} not found'.format( |
102 | + lb.vip_port_id, lb.id) for lb in bad_lbs] |
103 | + message = ', '.join(parts) |
104 | + return NAGIOS_STATUS_CRITICAL, message |
105 | + |
106 | + # warn about disabled lbs if no other error found |
107 | + lb_disabled = [lb for lb in lb_all if not lb.is_admin_state_up] |
108 | + if lb_disabled: |
109 | + parts = ['loadbalancer {} admin_state_up is False'.format(lb.id) |
110 | + for lb in lb_disabled] |
111 | + message = ', '.join(parts) |
112 | + return NAGIOS_STATUS_WARNING, message |
113 | + |
114 | + return NAGIOS_STATUS_OK, 'loadbalancers are happy' |
115 | + |
116 | + |
117 | +def check_pools(connection): |
118 | + """check pools status.""" |
119 | + lb_mgr = connection.load_balancer |
120 | + pools_all = lb_mgr.pools() |
121 | + pools_enabled = [pool for pool in pools_all if pool.is_admin_state_up] |
122 | + |
123 | + # check provisioning_status is ACTIVE for each pool |
124 | + bad_pools = [pool for pool in pools_enabled if pool.provisioning_status != 'ACTIVE'] |
125 | + if bad_pools: |
126 | + parts = ['pool {} provisioning_status is {}'.format( |
127 | + pool.id, pool.provisioning_status) for pool in bad_pools] |
128 | + message = ', '.join(parts) |
129 | + return NAGIOS_STATUS_CRITICAL, message |
130 | + |
131 | + # raise CRITICAL if operating_status is ERROR |
132 | + bad_pools = [pool for pool in pools_enabled if pool.operating_status == 'ERROR'] |
133 | + if bad_pools: |
134 | + parts = ['pool {} operating_status is {}'.format( |
135 | + pool.id, pool.operating_status) for pool in bad_pools] |
136 | + message = ', '.join(parts) |
137 | + return NAGIOS_STATUS_CRITICAL, message |
138 | + |
139 | + # raise WARNING if operating_status is NO_MONITOR |
140 | + bad_pools = [pool for pool in pools_enabled if pool.operating_status == 'NO_MONITOR'] |
141 | + if bad_pools: |
142 | + parts = ['pool {} operating_status is {}'.format( |
143 | + pool.id, pool.operating_status) for pool in bad_pools] |
144 | + message = ', '.join(parts) |
145 | + return NAGIOS_STATUS_WARNING, message |
146 | + |
147 | + return NAGIOS_STATUS_OK, 'pools are happy' |
148 | + |
149 | + |
150 | +def check_amphorae(connection): |
151 | + """check amphorae status.""" |
152 | + |
153 | + lb_mgr = connection.load_balancer |
154 | + |
155 | + resp = lb_mgr.get('/v2/octavia/amphorae') |
156 | + # python api is not available yet, use url |
157 | + if resp.status_code != 200: |
158 | + return NAGIOS_STATUS_WARNING, 'amphorae api not working' |
159 | + |
160 | + data = json.loads(resp.content) |
161 | + # ouput is like {"amphorae": [{...}, {...}, ...]} |
162 | + items = data.get('amphorae', []) |
163 | + |
164 | + # raise CRITICAL for ERROR status |
165 | + bad_status_list = ('ERROR',) |
166 | + bad_items = [item for item in items if item['status'] in bad_status_list] |
167 | + if bad_items: |
168 | + parts = [ |
169 | + 'amphora {} status is {}'.format(item['id'], item['status']) |
170 | + for item in bad_items] |
171 | + message = ', '.join(parts) |
172 | + return NAGIOS_STATUS_CRITICAL, message |
173 | + |
174 | + # raise WARNING for these status |
175 | + bad_status_list = ( |
176 | + 'PENDING_CREATE', 'PENDING_UPDATE', 'PENDING_DELETE', 'BOOTING') |
177 | + bad_items = [item for item in items if item['status'] in bad_status_list] |
178 | + if bad_items: |
179 | + parts = [ |
180 | + 'amphora {} status is {}'.format(item['id'], item['status']) |
181 | + for item in bad_items] |
182 | + message = ', '.join(parts) |
183 | + return NAGIOS_STATUS_WARNING, message |
184 | + |
185 | + return NAGIOS_STATUS_OK, 'amphorae are happy' |
186 | + |
187 | + |
188 | +def check_image(connection, tag, days): |
189 | + img_mgr = connection.image |
190 | + images = list(img_mgr.images(tag=tag)) |
191 | + |
192 | + if not images: |
193 | + message = ('Octavia requires image with tag {} to create amphora, ' |
194 | + 'but none exist').format(tag) |
195 | + return NAGIOS_STATUS_CRITICAL, message |
196 | + |
197 | + active_images = [image for image in images if image.status == 'active'] |
198 | + if not active_images: |
199 | + parts = ['{}({})'.format(image.name, image.id) for image in images] |
200 | + message = ('Octavia requires image with tag {} to create amphora, ' |
201 | + 'but none is active: {}').format(tag, ', '.join(parts)) |
202 | + return NAGIOS_STATUS_CRITICAL, message |
203 | + |
204 | + # raise WARNING if image is too old |
205 | + when = (datetime.now() - timedelta(days=days)).isoformat() |
206 | + # updated_at str format: '2019-12-05T18:21:25Z' |
207 | + fresh_images = [image for image in active_images if image.updated_at > when] |
208 | + if not fresh_images: |
209 | + message = ('Octavia requires image with tag {} to create amphora, ' |
210 | + 'but it is older than {} days').format(tag, days) |
211 | + return NAGIOS_STATUS_WARNING, message |
212 | + |
213 | + return NAGIOS_STATUS_OK, 'image is ready' |
214 | + |
215 | + |
216 | +if __name__ == '__main__': |
217 | + parser = argparse.ArgumentParser( |
218 | + description='Check Octavia status', |
219 | + formatter_class=argparse.ArgumentDefaultsHelpFormatter, |
220 | + ) |
221 | + parser.add_argument( |
222 | + '--env', dest='env', default='/var/lib/nagios/nagios.novarc', |
223 | + help='Novarc file to use for this check') |
224 | + |
225 | + check_choices = ['loadbalancers', 'amphorae', 'pools', 'image'] |
226 | + parser.add_argument( |
227 | + '--check', dest='check', metavar='|'.join(check_choices), |
228 | + type=str, choices=check_choices, |
229 | + default=check_choices[0], |
230 | + help='which check to run') |
231 | + |
232 | + parser.add_argument( |
233 | + '--amp-image-tag', dest='amp_image_tag', default='octavia-amphora', |
234 | + help='amphora image tag for image check') |
235 | + |
236 | + parser.add_argument( |
237 | + '--amp-image-days', dest='amp_image_days', type=int, default=365, |
238 | + help='raise warning if amphora image is older than these days') |
239 | + |
240 | + args = parser.parse_args() |
241 | + # source environment vars |
242 | + command = ['/bin/bash', '-c', 'source {} && env'.format(args.env)] |
243 | + proc = subprocess.Popen(command, stdout=subprocess.PIPE) |
244 | + for line in proc.stdout: |
245 | + (key, _, value) = line.partition(b'=') |
246 | + os.environ[key.decode('utf-8')] = value.rstrip().decode('utf-8') |
247 | + proc.communicate() |
248 | + |
249 | + # use closure to make all checks have same signature |
250 | + # so we can handle them in same way |
251 | + def _check_image(connection): |
252 | + return check_image(connection, args.amp_image_tag, args.amp_image_days) |
253 | + |
254 | + checks = { |
255 | + 'loadbalancers': check_loadbalancers, |
256 | + 'amphorae': check_amphorae, |
257 | + 'pools': check_pools, |
258 | + 'image': _check_image, |
259 | + } |
260 | + |
261 | + connection = openstack.connect(cloud='envvars') |
262 | + nagios_exit(*checks[args.check](connection)) |
263 | diff --git a/layer.yaml b/layer.yaml |
264 | index 4053919..0606d45 100644 |
265 | --- a/layer.yaml |
266 | +++ b/layer.yaml |
267 | @@ -20,6 +20,8 @@ options: |
268 | - python3-keystoneclient |
269 | - python3-openstackclient |
270 | - python-openstackclient |
271 | + - python3-octaviaclient |
272 | + - python-octaviaclient |
273 | snap: |
274 | fcbtest: |
275 | channel: stable |
276 | diff --git a/lib/lib_openstack_service_checks.py b/lib/lib_openstack_service_checks.py |
277 | index 716512b..3111b92 100644 |
278 | --- a/lib/lib_openstack_service_checks.py |
279 | +++ b/lib/lib_openstack_service_checks.py |
280 | @@ -65,6 +65,18 @@ class OSCHelper(): |
281 | return self.charm_config['check-neutron-agents'] |
282 | |
283 | @property |
284 | + def is_octavia_check_enabled(self): |
285 | + return self.charm_config['check-octavia'] |
286 | + |
287 | + @property |
288 | + def octavia_amp_image_tag(self): |
289 | + return self.charm_config['octavia-amp-image-tag'] |
290 | + |
291 | + @property |
292 | + def octavia_amp_image_days(self): |
293 | + return self.charm_config['octavia-amp-image-days'] |
294 | + |
295 | + @property |
296 | def skipped_rally_checks(self): |
297 | skipped_os_components = self.charm_config['skip-rally'].strip() |
298 | if not skipped_os_components: |
299 | @@ -186,6 +198,30 @@ class OSCHelper(): |
300 | else: |
301 | nrpe.remove_check(shortname='neutron_agents') |
302 | |
303 | + # only care about octavia after 18.04 |
304 | + if host.lsb_release()['DISTRIB_RELEASE'] >= '18.04': |
305 | + if self.is_octavia_check_enabled: |
306 | + script = os.path.join(self.plugins_dir, 'check_octavia.py') |
307 | + |
308 | + for check in ('loadbalancers', 'amphorae', 'pools'): |
309 | + nrpe.add_check( |
310 | + shortname='octavia_{}'.format(check), |
311 | + description='Check octavia {} status'.format(check), |
312 | + check_cmd='{} --check {}'.format(script, check), |
313 | + ) |
314 | + |
315 | + # image check has extra args, add it separately |
316 | + check = 'image' |
317 | + nrpe.add_check( |
318 | + shortname='octavia_{}'.format(check), |
319 | + description='Check octavia {} status'.format(check), |
320 | + check_cmd='{} --check {} --amp-image-tag {} --amp-image-days {}'.format( |
321 | + script, check, self.octavia_amp_image_tag, self.octavia_amp_image_days), |
322 | + ) |
323 | + else: |
324 | + for check in ('loadbalancers', 'amphorae', 'pools', 'image'): |
325 | + nrpe.remove_check(shortname='octavia_{}'.format(check)) |
326 | + |
327 | if self.contrail_analytics_vip: |
328 | contrail_check_command = '{} --host {}'.format( |
329 | os.path.join(self.plugins_dir, 'check_contrail_analytics_alarms.py'), |
This merge proposal is being monitored by mergebot. Change the status to Approved to merge.