Merge lp:~brad-marshall/charms/trusty/ceph-osd/add-nrpe-checks into lp:~openstack-charmers-archive/charms/trusty/ceph-osd/trunk

Proposed by Brad Marshall
Status: Merged
Merged at revision: 34
Proposed branch: lp:~brad-marshall/charms/trusty/ceph-osd/add-nrpe-checks
Merge into: lp:~openstack-charmers-archive/charms/trusty/ceph-osd/trunk
Diff against target: 566 lines (+488/-0)
8 files modified
charm-helpers-hooks.yaml (+1/-0)
config.yaml (+11/-0)
files/nagios/check_ceph_status.py (+44/-0)
files/nagios/collect_ceph_status.sh (+18/-0)
hooks/charmhelpers/contrib/charmsupport/nrpe.py (+222/-0)
hooks/charmhelpers/contrib/charmsupport/volumes.py (+156/-0)
hooks/hooks.py (+32/-0)
metadata.yaml (+4/-0)
To merge this branch: bzr merge lp:~brad-marshall/charms/trusty/ceph-osd/add-nrpe-checks
Reviewer: Liam Young (community)
Review status: Disapprove
Review via email: mp+241496@code.launchpad.net

Description of the change

Adds nrpe-external-master interface and adds basic nrpe checks.
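
For context, here is a minimal sketch of what the basic check amounts to. It paraphrases the NRPE and Check classes in the nrpe.py diff of this proposal. The function names are hypothetical and are not charm-helpers API. The plugin path passed in the example is an assumption about where check_upstart_job is installed.

```python
def nagios_hostname(nagios_context, unit_name):
    """Mirror NRPE.__init__: '<context>-<unit>' with '/' replaced by '-'."""
    return "{}-{}".format(nagios_context, unit_name.replace("/", "-"))


def nrpe_command_stanza(shortname, check_cmd):
    """Mirror Check.write: the text written to
    /etc/nagios/nrpe.d/check_<shortname>.cfg for one check."""
    command = "check_{}".format(shortname)
    return "# check {}\ncommand[{}]={}\n".format(shortname, command, check_cmd)


if __name__ == "__main__":
    # Assumed plugin location; the charm resolves it via a search path.
    cmd = "/usr/lib/nagios/plugins/check_upstart_job ceph-osd"
    print(nagios_hostname("juju", "ceph-osd/0"))
    print(nrpe_command_stanza("ceph-osd", cmd), end="")
```

With the default nagios_context of "juju", unit ceph-osd/0 is exported to Nagios as juju-ceph-osd-0, which matches the example given in the config.yaml description.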

35. By Brad Marshall

[bradm] Fixes from pep8 run

36. By Brad Marshall

[bradm] Removed nagios check files that were moved to nrpe-external-master charm

Liam Young (gnuoy) wrote :

Thanks for the MP. The new nrpe support is very gratefully received!

I've taken this branch and centralised the common code between this and the other nrpe branches and moved it to charm-helpers. To land it I created a new branch from this one which has now been merged into the 'next' charm. The 'next' charms will overwrite the stable ones in a couple of weeks.

review: Disapprove

Preview Diff

1=== modified file 'charm-helpers-hooks.yaml'
2--- charm-helpers-hooks.yaml 2014-09-27 02:28:51 +0000
3+++ charm-helpers-hooks.yaml 2014-11-18 01:06:55 +0000
4@@ -7,3 +7,4 @@
5 - utils
6 - contrib.openstack.alternatives
7 - contrib.network.ip
8+ - contrib.charmsupport
9
10=== modified file 'config.yaml'
11--- config.yaml 2014-10-06 22:11:14 +0000
12+++ config.yaml 2014-11-18 01:06:55 +0000
13@@ -121,3 +121,14 @@
14 order for this charm to function correctly, the privacy extension must be
15 disabled and a non-temporary address must be configured/available on
16 your network interface.
17+ nagios_context:
18+ default: "juju"
19+ type: string
20+ description: |
21+ Used by the nrpe-external-master subordinate charm.
22+ A string that will be prepended to instance name to set the host name
23+ in nagios. So for instance the hostname would be something like:
24+ juju-myservice-0
25+ If you're running multiple environments with the same services in them
26+ this allows you to differentiate between them.
27+
28
29=== added directory 'files/nagios'
30=== added file 'files/nagios/check_ceph_status.py'
31--- files/nagios/check_ceph_status.py 1970-01-01 00:00:00 +0000
32+++ files/nagios/check_ceph_status.py 2014-11-18 01:06:55 +0000
33@@ -0,0 +1,44 @@
34+#!/usr/bin/env python
35+
36+# Copyright (C) 2014 Canonical
37+# All Rights Reserved
38+# Author: Jacek Nykis <jacek.nykis@canonical.com>
39+
40+import re
41+import argparse
42+import subprocess
43+import nagios_plugin
44+
45+
46+def check_ceph_status(args):
47+ if args.status_file:
48+ nagios_plugin.check_file_freshness(args.status_file, 3600)
49+ with open(args.status_file, "r") as f:
50+ lines = f.readlines()
51+ status_data = dict(l.strip().split(' ', 1) for l in lines if len(l) > 1)
52+ else:
53+ lines = subprocess.check_output(["ceph", "status"]).split('\n')
54+ status_data = dict(l.strip().split(' ', 1) for l in lines if len(l) > 1)
55+
56+ if ('health' not in status_data
57+ or 'monmap' not in status_data
58+ or 'osdmap' not in status_data):
59+ raise nagios_plugin.UnknownError('UNKNOWN: status data is incomplete')
60+
61+ if status_data['health'] != 'HEALTH_OK':
62+ msg = 'CRITICAL: ceph health status: "{}"'.format(status_data['health'])
63+ raise nagios_plugin.CriticalError(msg)
64+ osds = re.search("^.*: (\d+) osds: (\d+) up, (\d+) in", status_data['osdmap'])
65+ if int(osds.group(1)) > int(osds.group(2)): # not all OSDs are "up"
66+ msg = 'CRITICAL: Some OSDs are not up. Total: {}, up: {}'.format(
67+ osds.group(1), osds.group(2))
68+ raise nagios_plugin.CriticalError(msg)
69+ print "All OK"
70+
71+
72+if __name__ == '__main__':
73+ parser = argparse.ArgumentParser(description='Check ceph status')
74+ parser.add_argument('-f', '--file', dest='status_file',
75+ default=False, help='Optional file with "ceph status" output')
76+ args = parser.parse_args()
77+ nagios_plugin.try_check(check_ceph_status, args)
78
79=== added file 'files/nagios/collect_ceph_status.sh'
80--- files/nagios/collect_ceph_status.sh 1970-01-01 00:00:00 +0000
81+++ files/nagios/collect_ceph_status.sh 2014-11-18 01:06:55 +0000
82@@ -0,0 +1,18 @@
83+#!/bin/bash
84+# Copyright (C) 2014 Canonical
85+# All Rights Reserved
86+# Author: Jacek Nykis <jacek.nykis@canonical.com>
87+
88+LOCK=/var/lock/ceph-status.lock
89+lockfile-create -r2 --lock-name $LOCK > /dev/null 2>&1
90+if [ $? -ne 0 ]; then
91+ exit 1
92+fi
93+trap "rm -f $LOCK > /dev/null 2>&1" exit
94+
95+DATA_DIR="/var/lib/nagios"
96+if [ ! -d $DATA_DIR ]; then
97+ mkdir -p $DATA_DIR
98+fi
99+
100+ceph status >${DATA_DIR}/cat-ceph-status.txt
101
102=== added directory 'hooks/charmhelpers/contrib/charmsupport'
103=== added file 'hooks/charmhelpers/contrib/charmsupport/__init__.py'
104=== added file 'hooks/charmhelpers/contrib/charmsupport/nrpe.py'
105--- hooks/charmhelpers/contrib/charmsupport/nrpe.py 1970-01-01 00:00:00 +0000
106+++ hooks/charmhelpers/contrib/charmsupport/nrpe.py 2014-11-18 01:06:55 +0000
107@@ -0,0 +1,222 @@
108+"""Compatibility with the nrpe-external-master charm"""
109+# Copyright 2012 Canonical Ltd.
110+#
111+# Authors:
112+# Matthew Wedgwood <matthew.wedgwood@canonical.com>
113+
114+import subprocess
115+import pwd
116+import grp
117+import os
118+import re
119+import shlex
120+import yaml
121+
122+from charmhelpers.core.hookenv import (
123+ config,
124+ local_unit,
125+ log,
126+ relation_ids,
127+ relation_set,
128+)
129+
130+from charmhelpers.core.host import service
131+
132+# This module adds compatibility with the nrpe-external-master and plain nrpe
133+# subordinate charms. To use it in your charm:
134+#
135+# 1. Update metadata.yaml
136+#
137+# provides:
138+# (...)
139+# nrpe-external-master:
140+# interface: nrpe-external-master
141+# scope: container
142+#
143+# and/or
144+#
145+# provides:
146+# (...)
147+# local-monitors:
148+# interface: local-monitors
149+# scope: container
150+
151+#
152+# 2. Add the following to config.yaml
153+#
154+# nagios_context:
155+# default: "juju"
156+# type: string
157+# description: |
158+# Used by the nrpe subordinate charms.
159+# A string that will be prepended to instance name to set the host name
160+# in nagios. So for instance the hostname would be something like:
161+# juju-myservice-0
162+# If you're running multiple environments with the same services in them
163+# this allows you to differentiate between them.
164+#
165+# 3. Add custom checks (Nagios plugins) to files/nrpe-external-master
166+#
167+# 4. Update your hooks.py with something like this:
168+#
169+# from charmsupport.nrpe import NRPE
170+# (...)
171+# def update_nrpe_config():
172+# nrpe_compat = NRPE()
173+# nrpe_compat.add_check(
174+# shortname = "myservice",
175+# description = "Check MyService",
176+# check_cmd = "check_http -w 2 -c 10 http://localhost"
177+# )
178+# nrpe_compat.add_check(
179+# "myservice_other",
180+# "Check for widget failures",
181+# check_cmd = "/srv/myapp/scripts/widget_check"
182+# )
183+# nrpe_compat.write()
184+#
185+# def config_changed():
186+# (...)
187+# update_nrpe_config()
188+#
189+# def nrpe_external_master_relation_changed():
190+# update_nrpe_config()
191+#
192+# def local_monitors_relation_changed():
193+# update_nrpe_config()
194+#
195+# 5. ln -s hooks.py nrpe-external-master-relation-changed
196+# ln -s hooks.py local-monitors-relation-changed
197+
198+
199+class CheckException(Exception):
200+ pass
201+
202+
203+class Check(object):
204+ shortname_re = '[A-Za-z0-9-_]+$'
205+ service_template = ("""
206+#---------------------------------------------------
207+# This file is Juju managed
208+#---------------------------------------------------
209+define service {{
210+ use active-service
211+ host_name {nagios_hostname}
212+ service_description {nagios_hostname}[{shortname}] """
213+ """{description}
214+ check_command check_nrpe!{command}
215+ servicegroups {nagios_servicegroup}
216+}}
217+""")
218+
219+ def __init__(self, shortname, description, check_cmd):
220+ super(Check, self).__init__()
221+ # XXX: could be better to calculate this from the service name
222+ if not re.match(self.shortname_re, shortname):
223+ raise CheckException("shortname must match {}".format(
224+ Check.shortname_re))
225+ self.shortname = shortname
226+ self.command = "check_{}".format(shortname)
227+ # Note: a set of invalid characters is defined by the
228+ # Nagios server config
229+ # The default is: illegal_object_name_chars=`~!$%^&*"|'<>?,()=
230+ self.description = description
231+ self.check_cmd = self._locate_cmd(check_cmd)
232+
233+ def _locate_cmd(self, check_cmd):
234+ search_path = (
235+ '/',
236+ os.path.join(os.environ['CHARM_DIR'],
237+ 'files/nrpe-external-master'),
238+ '/usr/lib/nagios/plugins',
239+ '/usr/local/lib/nagios/plugins',
240+ )
241+ parts = shlex.split(check_cmd)
242+ for path in search_path:
243+ if os.path.exists(os.path.join(path, parts[0])):
244+ command = os.path.join(path, parts[0])
245+ if len(parts) > 1:
246+ command += " " + " ".join(parts[1:])
247+ return command
248+ log('Check command not found: {}'.format(parts[0]))
249+ return ''
250+
251+ def write(self, nagios_context, hostname):
252+ nrpe_check_file = '/etc/nagios/nrpe.d/{}.cfg'.format(
253+ self.command)
254+ with open(nrpe_check_file, 'w') as nrpe_check_config:
255+ nrpe_check_config.write("# check {}\n".format(self.shortname))
256+ nrpe_check_config.write("command[{}]={}\n".format(
257+ self.command, self.check_cmd))
258+
259+ if not os.path.exists(NRPE.nagios_exportdir):
260+ log('Not writing service config as {} is not accessible'.format(
261+ NRPE.nagios_exportdir))
262+ else:
263+ self.write_service_config(nagios_context, hostname)
264+
265+ def write_service_config(self, nagios_context, hostname):
266+ for f in os.listdir(NRPE.nagios_exportdir):
267+ if re.search('.*{}.cfg'.format(self.command), f):
268+ os.remove(os.path.join(NRPE.nagios_exportdir, f))
269+
270+ templ_vars = {
271+ 'nagios_hostname': hostname,
272+ 'nagios_servicegroup': nagios_context,
273+ 'description': self.description,
274+ 'shortname': self.shortname,
275+ 'command': self.command,
276+ }
277+ nrpe_service_text = Check.service_template.format(**templ_vars)
278+ nrpe_service_file = '{}/service__{}_{}.cfg'.format(
279+ NRPE.nagios_exportdir, hostname, self.command)
280+ with open(nrpe_service_file, 'w') as nrpe_service_config:
281+ nrpe_service_config.write(str(nrpe_service_text))
282+
283+ def run(self):
284+ subprocess.call(self.check_cmd)
285+
286+
287+class NRPE(object):
288+ nagios_logdir = '/var/log/nagios'
289+ nagios_exportdir = '/var/lib/nagios/export'
290+ nrpe_confdir = '/etc/nagios/nrpe.d'
291+
292+ def __init__(self, hostname=None):
293+ super(NRPE, self).__init__()
294+ self.config = config()
295+ self.nagios_context = self.config['nagios_context']
296+ self.unit_name = local_unit().replace('/', '-')
297+ if hostname:
298+ self.hostname = hostname
299+ else:
300+ self.hostname = "{}-{}".format(self.nagios_context, self.unit_name)
301+ self.checks = []
302+
303+ def add_check(self, *args, **kwargs):
304+ self.checks.append(Check(*args, **kwargs))
305+
306+ def write(self):
307+ try:
308+ nagios_uid = pwd.getpwnam('nagios').pw_uid
309+ nagios_gid = grp.getgrnam('nagios').gr_gid
310+ except:
311+ log("Nagios user not set up, nrpe checks not updated")
312+ return
313+
314+ if not os.path.exists(NRPE.nagios_logdir):
315+ os.mkdir(NRPE.nagios_logdir)
316+ os.chown(NRPE.nagios_logdir, nagios_uid, nagios_gid)
317+
318+ nrpe_monitors = {}
319+ monitors = {"monitors": {"remote": {"nrpe": nrpe_monitors}}}
320+ for nrpecheck in self.checks:
321+ nrpecheck.write(self.nagios_context, self.hostname)
322+ nrpe_monitors[nrpecheck.shortname] = {
323+ "command": nrpecheck.command,
324+ }
325+
326+ service('restart', 'nagios-nrpe-server')
327+
328+ for rid in relation_ids("local-monitors"):
329+ relation_set(relation_id=rid, monitors=yaml.dump(monitors))
330
331=== added file 'hooks/charmhelpers/contrib/charmsupport/volumes.py'
332--- hooks/charmhelpers/contrib/charmsupport/volumes.py 1970-01-01 00:00:00 +0000
333+++ hooks/charmhelpers/contrib/charmsupport/volumes.py 2014-11-18 01:06:55 +0000
334@@ -0,0 +1,156 @@
335+'''
336+Functions for managing volumes in juju units. One volume is supported per unit.
337+Subordinates may have their own storage, provided it is on its own partition.
338+
339+Configuration stanzas:
340+ volume-ephemeral:
341+ type: boolean
342+ default: true
343+ description: >
344+ If false, a volume is mounted as specified in "volume-map"
345+ If true, ephemeral storage will be used, meaning that log data
346+ will only exist as long as the machine. YOU HAVE BEEN WARNED.
347+ volume-map:
348+ type: string
349+ default: {}
350+ description: >
351+ YAML map of units to device names, e.g:
352+ "{ rsyslog/0: /dev/vdb, rsyslog/1: /dev/vdb }"
353+ Service units will raise a configure-error if volume-ephemeral
354+ is 'false' and no volume-map value is set. Use 'juju set' to set a
355+ value and 'juju resolved' to complete configuration.
356+
357+Usage:
358+ from charmsupport.volumes import configure_volume, VolumeConfigurationError
359+ from charmsupport.hookenv import log, ERROR
360+ def pre_mount_hook():
361+ stop_service('myservice')
362+ def post_mount_hook():
363+ start_service('myservice')
364+
365+ if __name__ == '__main__':
366+ try:
367+ configure_volume(before_change=pre_mount_hook,
368+ after_change=post_mount_hook)
369+ except VolumeConfigurationError:
370+ log('Storage could not be configured', ERROR)
371+'''
372+
373+# XXX: Known limitations
374+# - fstab is neither consulted nor updated
375+
376+import os
377+from charmhelpers.core import hookenv
378+from charmhelpers.core import host
379+import yaml
380+
381+
382+MOUNT_BASE = '/srv/juju/volumes'
383+
384+
385+class VolumeConfigurationError(Exception):
386+ '''Volume configuration data is missing or invalid'''
387+ pass
388+
389+
390+def get_config():
391+ '''Gather and sanity-check volume configuration data'''
392+ volume_config = {}
393+ config = hookenv.config()
394+
395+ errors = False
396+
397+ if config.get('volume-ephemeral') in (True, 'True', 'true', 'Yes', 'yes'):
398+ volume_config['ephemeral'] = True
399+ else:
400+ volume_config['ephemeral'] = False
401+
402+ try:
403+ volume_map = yaml.safe_load(config.get('volume-map', '{}'))
404+ except yaml.YAMLError as e:
405+ hookenv.log("Error parsing YAML volume-map: {}".format(e),
406+ hookenv.ERROR)
407+ errors = True
408+ if volume_map is None:
409+ # probably an empty string
410+ volume_map = {}
411+ elif not isinstance(volume_map, dict):
412+ hookenv.log("Volume-map should be a dictionary, not {}".format(
413+ type(volume_map)))
414+ errors = True
415+
416+ volume_config['device'] = volume_map.get(os.environ['JUJU_UNIT_NAME'])
417+ if volume_config['device'] and volume_config['ephemeral']:
418+ # asked for ephemeral storage but also defined a volume ID
419+ hookenv.log('A volume is defined for this unit, but ephemeral '
420+ 'storage was requested', hookenv.ERROR)
421+ errors = True
422+ elif not volume_config['device'] and not volume_config['ephemeral']:
423+ # asked for permanent storage but did not define volume ID
424+ hookenv.log('Persistent storage was requested, but there is no volume '
425+ 'defined for this unit.', hookenv.ERROR)
426+ errors = True
427+
428+ unit_mount_name = hookenv.local_unit().replace('/', '-')
429+ volume_config['mountpoint'] = os.path.join(MOUNT_BASE, unit_mount_name)
430+
431+ if errors:
432+ return None
433+ return volume_config
434+
435+
436+def mount_volume(config):
437+ if os.path.exists(config['mountpoint']):
438+ if not os.path.isdir(config['mountpoint']):
439+ hookenv.log('Not a directory: {}'.format(config['mountpoint']))
440+ raise VolumeConfigurationError()
441+ else:
442+ host.mkdir(config['mountpoint'])
443+ if os.path.ismount(config['mountpoint']):
444+ unmount_volume(config)
445+ if not host.mount(config['device'], config['mountpoint'], persist=True):
446+ raise VolumeConfigurationError()
447+
448+
449+def unmount_volume(config):
450+ if os.path.ismount(config['mountpoint']):
451+ if not host.umount(config['mountpoint'], persist=True):
452+ raise VolumeConfigurationError()
453+
454+
455+def managed_mounts():
456+ '''List of all mounted managed volumes'''
457+ return filter(lambda mount: mount[0].startswith(MOUNT_BASE), host.mounts())
458+
459+
460+def configure_volume(before_change=lambda: None, after_change=lambda: None):
461+ '''Set up storage (or don't) according to the charm's volume configuration.
462+ Returns the mount point or "ephemeral". before_change and after_change
463+ are optional functions to be called if the volume configuration changes.
464+ '''
465+
466+ config = get_config()
467+ if not config:
468+ hookenv.log('Failed to read volume configuration', hookenv.CRITICAL)
469+ raise VolumeConfigurationError()
470+
471+ if config['ephemeral']:
472+ if os.path.ismount(config['mountpoint']):
473+ before_change()
474+ unmount_volume(config)
475+ after_change()
476+ return 'ephemeral'
477+ else:
478+ # persistent storage
479+ if os.path.ismount(config['mountpoint']):
480+ mounts = dict(managed_mounts())
481+ if mounts.get(config['mountpoint']) != config['device']:
482+ before_change()
483+ unmount_volume(config)
484+ mount_volume(config)
485+ after_change()
486+ else:
487+ before_change()
488+ mount_volume(config)
489+ after_change()
490+ return config['mountpoint']
491
492=== modified file 'hooks/hooks.py'
493--- hooks/hooks.py 2014-09-30 03:41:06 +0000
494+++ hooks/hooks.py 2014-11-18 01:06:55 +0000
495@@ -20,6 +20,8 @@
496 relation_ids,
497 related_units,
498 relation_get,
499+ relations_of_type,
500+ local_unit,
501 Hooks,
502 UnregisteredHookError,
503 service_name
504@@ -48,6 +50,8 @@
505 format_ipv6_addr
506 )
507
508+from charmhelpers.contrib.charmsupport.nrpe import NRPE
509+
510 hooks = Hooks()
511
512
513@@ -203,6 +207,34 @@
514 fatal=True)
515
516
517+@hooks.hook('nrpe-external-master-relation-joined',
518+ 'nrpe-external-master-relation-changed')
519+def update_nrpe_config():
520+ # Find out if nrpe set nagios_hostname
521+ hostname = None
522+ host_context = None
523+ for rel in relations_of_type('nrpe-external-master'):
524+ if 'nagios_hostname' in rel:
525+ hostname = rel['nagios_hostname']
526+ host_context = rel['nagios_host_context']
527+ break
528+ nrpe = NRPE(hostname=hostname)
529+ apt_install('python-dbus')
530+
531+ if host_context:
532+ current_unit = "%s:%s" % (host_context, local_unit())
533+ else:
534+ current_unit = local_unit()
535+
536+ nrpe.add_check(
537+ shortname='ceph-osd',
538+ description='process check {%s}' % current_unit,
539+ check_cmd='check_upstart_job ceph-osd',
540+ )
541+
542+ nrpe.write()
543+
544+
545 if __name__ == '__main__':
546 try:
547 hooks.execute(sys.argv)
548
549=== added symlink 'hooks/nrpe-external-master-relation-changed'
550=== target is u'hooks.py'
551=== added symlink 'hooks/nrpe-external-master-relation-joined'
552=== target is u'hooks.py'
553=== modified file 'metadata.yaml'
554--- metadata.yaml 2014-10-06 22:11:14 +0000
555+++ metadata.yaml 2014-11-18 01:06:55 +0000
556@@ -1,6 +1,10 @@
557 name: ceph-osd
558 summary: Highly scalable distributed storage - Ceph OSD storage
559 maintainer: James Page <james.page@ubuntu.com>
560+provides:
561+ nrpe-external-master:
562+ interface: nrpe-external-master
563+ scope: container
564 categories:
565 - misc
566 description: |
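
The OSD check in files/nagios/check_ceph_status.py extracts the total, up and in counts from the osdmap line with a regular expression. The sketch below isolates that step and converts the captured groups to integers before comparing them, since regex groups are strings and "10" > "9" is false when compared as strings. The function names and the sample osdmap values are hypothetical; the regular expression is the one used in the diff.

```python
import re

# Same pattern as in check_ceph_status.py.
OSDMAP_RE = re.compile(r"^.*: (\d+) osds: (\d+) up, (\d+) in")


def osd_counts(osdmap_value):
    """Return (total, up, in) as integers from an osdmap status value."""
    match = OSDMAP_RE.search(osdmap_value)
    if match is None:
        raise ValueError("unrecognised osdmap line: {!r}".format(osdmap_value))
    total, up, in_count = (int(group) for group in match.groups())
    return total, up, in_count


def all_osds_up(osdmap_value):
    """True when every OSD known to the cluster is reported as up."""
    total, up, _ = osd_counts(osdmap_value)
    return up >= total


if __name__ == "__main__":
    # Hypothetical sample values, shaped like the text the regex expects.
    for sample in ("e42: 3 osds: 3 up, 3 in", "e42: 10 osds: 9 up, 10 in"):
        print(sample, "->", "OK" if all_osds_up(sample) else "CRITICAL")
```

Raising on a non-matching line, rather than dereferencing a failed match, lets a caller map unparsable output to an UNKNOWN result.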
