Merge lp:~jacekn/charms/precise/rabbitmq-server/queue-monitoring into lp:charms/rabbitmq-server
Status: | Merged |
---|---|
Merged at revision: | 94 |
Proposed branch: | lp:~jacekn/charms/precise/rabbitmq-server/queue-monitoring |
Merge into: | lp:charms/rabbitmq-server |
Prerequisite: | lp:~jacekn/charms/precise/rabbitmq-server/nrpe-fix |
Diff against target: | 271 lines (+194/-5), 5 files modified: config.yaml (+14/-0), hooks/rabbitmq_server_relations.py (+31/-4), revision (+1/-1), scripts/check_rabbitmq_queues.py (+99/-0), scripts/collect_rabbitmq_stats.sh (+49/-0) |
To merge this branch: | bzr merge lp:~jacekn/charms/precise/rabbitmq-server/queue-monitoring |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Stuart Bishop (community) | Needs Fixing | ||
Matt Bruzek (community) | Needs Fixing | ||
Jacek Nykis (community) | Needs Resubmitting | ||
Review via email: mp+218580@code.launchpad.net |
Commit message
Description of the change
This change adds support for queue monitoring by nagios.
- 58. By Jacek Nykis
  Add quotes to the rabbitmq thresholds to allow wildcards
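A minimal Python sketch of why the quotes in this revision matter, assuming the generated check command is eventually run through a shell (the proposal does not say how NRPE executes it). The helper name `build_threshold_args` is hypothetical; the format string mirrors the one used in the hook in the diff.

```python
import shlex


def build_threshold_args(thresholds):
    """Build the repeated -c arguments, quoting vhost and queue patterns.

    Hypothetical helper: it mirrors the format string used in the hook,
    ' -c "{}" "{}" {} {}', so that wildcards such as * are not expanded
    by a shell before they reach the check script.
    """
    args = ""
    for vhost, queue, warn, crit in thresholds:
        args += ' -c "{}" "{}" {} {}'.format(vhost, queue, warn, crit)
    return args


args = build_threshold_args([["*", "*", 15, 30]])
print(args)
# shlex.split tokenizes like a POSIX shell but performs no globbing,
# so the quoted patterns come out as literal '*' tokens.
print(shlex.split(args))  # ['-c', '*', '*', '15', '30']
```

Without the quotes, an unescaped `*` in a shell command line could be replaced by the names of files in the working directory.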
Matt Bruzek (mbruzek) wrote:
Matt Bruzek (mbruzek):
- 59. By Jacek Nykis
  Added default value (empty string) to stats_cron_schedule and queue_thresholds config options as per MP comment
Jacek Nykis (jacekn) wrote:
Hi Matt,
Thank you for reviewing the charm. I added default values (empty strings) and tested everything again.
If you want to verify the queue check, you can use the sample data below. These lines were generated by the same script as the one in the charm. You can also modify the 5th column to simulate queues filling up.
#Vhost Name Messages_ready Messages_
nagios-
openstack ceilometer.
openstack ceilometer.
openstack cert 0 0 0 1 13984 1401271802
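A small Python sketch of how one line of this data can be read and how the 5th column can be changed to simulate a full queue. The column names are taken from the header line that collect_rabbitmq_stats.sh writes; the helper names `parse_stats_line` and `with_message_count` are hypothetical and not part of the charm.

```python
# Column order as written by collect_rabbitmq_stats.sh in its header line.
COLUMNS = ("vhost", "name", "messages_ready", "messages_unacknowledged",
           "messages", "consumers", "memory", "time")


def parse_stats_line(line):
    """Split one whitespace-separated stats line into a dict."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(COLUMNS), len(fields)))
    record = dict(zip(COLUMNS, fields))
    # Everything after vhost and queue name is numeric.
    for key in COLUMNS[2:]:
        record[key] = int(record[key])
    return record


def with_message_count(line, count):
    """Return the line with its 5th column (total messages) replaced."""
    fields = line.split()
    fields[4] = str(count)
    return " ".join(fields)


sample = "openstack cert 0 0 0 1 13984 1401271802"
print(parse_stats_line(sample)["messages"])
print(with_message_count(sample, 50))
```

The 5th column is the total message count, which is the only value the queue check compares against the warning and critical thresholds.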
Matt Bruzek (mbruzek) wrote:
Jacek,
Thank you for resubmitting this proposal with those fixes!
I deployed the charms as listed above and added the text (from your last comment) to the juju-hp-
sudo vi juju-hp-
The check_rabbitmq_
ubuntu@
CRITICALS: No Queues Found in No Vhosts Found has None messages
Also there are 2 new keys that need default: in config.yaml.
$ charm proof
W: config.yaml: option key does not have the keys: default
W: config.yaml: option source does not have the keys: default
These configuration options need the “default:” keyword in config.yaml. My understanding is that leaving the value blank gives the option a None value, which is what I recommend for these options.
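To illustrate what the warnings above report, here is a short Python sketch that lists options lacking a default key. This is not how charm proof is implemented; `options_missing_default` is a hypothetical helper, and the dictionary stands in for a parsed config.yaml with made-up descriptions.

```python
def options_missing_default(options):
    """Return the names of options that have no 'default' key at all."""
    return sorted(name for name, spec in options.items()
                  if "default" not in spec)


# Stand-in for the parsed 'options' mapping of a config.yaml (assumed shape).
options = {
    "key": {"type": "string", "description": "apt key id"},
    "source": {"type": "string", "description": "apt source"},
    "stats_cron_schedule": {"type": "string", "default": "",
                            "description": "cron schedule"},
    # A blank 'default:' in YAML loads as None, but the key is still present.
    "queue_thresholds": {"type": "string", "default": None,
                         "description": "thresholds"},
}

print(options_missing_default(options))  # ['key', 'source']
```

Note that an explicit empty value still counts as having the key, which is why both an empty string and a blank value would satisfy this check.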
The rabbitmq queue has not filled up on any of the tests with the rabbitmq-server and nrpe-external-
Putting this proposal in "Needs Fixing". Once you have addressed the issues please click on the “Request another review” link on this merge proposal. That way it will be added to the review queue properly.
If you have any questions/
Alexander List (alexlist) wrote:
Please consider setting up an openstack trusty/icehouse deployment, which requires rabbitmq-server for collecting stats for ceilometer, or any other service that uses AMQP.
The two empty ("") default values for "key" and "source" are missing in cs:precise/
Alexander List (alexlist) wrote:
We'll try to come up with a scenario that's testable, and update the MP to include these defaults.
Stuart Bishop (stub) wrote:
Per mbruzek's review
José Antonio Rey (jose) wrote:
Hey Jacek,
Based on other charmers' comments, I'm marking this as WIP. Make sure to move it to Needs Review and ask for another review from the ~charmers team once it's ready!
Thanks for all your efforts on making this charm better :)
Preview Diff
=== modified file 'config.yaml'
--- config.yaml 2014-03-26 10:23:01 +0000
+++ config.yaml 2014-05-28 10:08:46 +0000
@@ -133,3 +133,17 @@
     description: |
       Key ID to import to the apt keyring to support use with arbitary source
       configuration from outside of Launchpad archives or PPA's.
+  stats_cron_schedule:
+    type: string
+    default: ""
+    description: |
+      Cron schedule used to generate rabbitmq stats. If unset
+      no stats will be generated
+  queue_thresholds:
+    type: string
+    default: ""
+    description: |
+      List of RabbitMQ queue size check thresholds. Interpreted as YAML
+      in format [<vhost>, <queue>, <warn>, <crit>]
+      - ['/', 'queue1', 10, 20]
+      - ['/', 'queue2', 200, 300]

=== modified file 'hooks/rabbitmq_server_relations.py'
--- hooks/rabbitmq_server_relations.py 2014-05-02 14:18:00 +0000
+++ hooks/rabbitmq_server_relations.py 2014-05-28 10:08:46 +0000
@@ -5,6 +5,7 @@
 import sys
 import subprocess
 import glob
+import yaml

 import rabbit_utils as rabbit
 from lib.utils import (
@@ -41,7 +42,7 @@
     UnregisteredHookError
 )
 from charmhelpers.core.host import (
-    rsync, service_stop, service_restart
+    rsync, service_stop, service_restart, write_file
 )
 from charmhelpers.contrib.charmsupport.nrpe import NRPE
 from charmhelpers.contrib.ssl.service import ServiceCA
@@ -60,6 +61,10 @@
 RABBIT_USER = 'rabbitmq'
 RABBIT_GROUP = 'rabbitmq'
 NAGIOS_PLUGINS = '/usr/local/lib/nagios/plugins'
+SCRIPTS_DIR = '/usr/local/bin'
+STATS_CRONFILE = '/etc/cron.d/rabbitmq-stats'
+STATS_DATAFILE = os.path.join(RABBIT_DIR, 'data',
+    subprocess.check_output(['hostname', '-s']).strip() + '_queue_stats.dat')


 @hooks.hook('install')
@@ -334,10 +339,10 @@
             rbd_img=rbd_img, sizemb=sizemb,
             fstype='ext4', mount_point=RABBIT_DIR,
             blk_device=blk_device,
-            system_services=['rabbitmq-server'])#,
+            system_services=['rabbitmq-server']) # ,
             #rbd_pool_replicas=rbd_pool_rep_count)
         subprocess.check_call(['chown', '-R', '%s:%s' %
-            (RABBIT_USER,RABBIT_GROUP), RABBIT_DIR])
+            (RABBIT_USER, RABBIT_GROUP), RABBIT_DIR])
     else:
         log('This is not the peer leader. Not configuring RBD.')
         log('Stopping rabbitmq-server.')
@@ -360,9 +365,20 @@
     rsync(os.path.join(os.getenv('CHARM_DIR'), 'scripts',
                        'check_rabbitmq.py'),
           os.path.join(NAGIOS_PLUGINS, 'check_rabbitmq.py'))
+    rsync(os.path.join(os.getenv('CHARM_DIR'), 'scripts',
+                       'check_rabbitmq_queues.py'),
+          os.path.join(NAGIOS_PLUGINS, 'check_rabbitmq_queues.py'))
+    if config('stats_cron_schedule'):
+        script = os.path.join(SCRIPTS_DIR, 'collect_rabbitmq_stats.sh')
+        cronjob = "{} root {}\n".format(config('stats_cron_schedule'), script)
+        rsync(os.path.join(os.getenv('CHARM_DIR'), 'scripts',
+                           'collect_rabbitmq_stats.sh'), script)
+        write_file(STATS_CRONFILE, cronjob)
+    elif os.path.isfile(STATS_CRONFILE):
+        os.remove(STATS_CRONFILE)

     # Find out if nrpe set nagios_hostname
-    hostname=None
+    hostname = None
     for rel in relations_of_type('nrpe-external-master'):
         if 'nagios_hostname' in rel:
             hostname = rel['nagios_hostname']
@@ -384,6 +400,17 @@
         check_cmd='{}/check_rabbitmq.py --user {} --password {} --vhost {}'
                   ''.format(NAGIOS_PLUGINS, user, password, vhost)
     )
+    if config('queue_thresholds'):
+        cmd = ""
+        # If value of queue_thresholds is incorrect we want the hook to fail
+        for item in yaml.safe_load(config('queue_thresholds')):
+            cmd += ' -c "{}" "{}" {} {}'.format(*item)
+        nrpe_compat.add_check(
+            shortname=rabbit.RABBIT_USER + '_queue',
+            description='Check RabbitMQ Queues',
+            check_cmd='{}/check_rabbitmq_queues.py{} {}'.format(
+                NAGIOS_PLUGINS, cmd, STATS_DATAFILE)
+        )
     nrpe_compat.write()



=== modified file 'revision'
--- revision 2014-05-02 14:18:00 +0000
+++ revision 2014-05-28 10:08:46 +0000
@@ -1,1 +1,1 @@
-128
+150

=== added file 'scripts/check_rabbitmq_queues.py'
--- scripts/check_rabbitmq_queues.py 1970-01-01 00:00:00 +0000
+++ scripts/check_rabbitmq_queues.py 2014-05-28 10:08:46 +0000
@@ -0,0 +1,99 @@
+#!/usr/bin/python
+
+# Copyright (C) 2011, 2012, 2014 Canonical
+# All Rights Reserved
+# Author: Liam Young, Jacek Nykis
+
+from collections import defaultdict
+from fnmatch import fnmatchcase
+from itertools import chain
+import argparse
+import sys
+
+def gen_data_lines(filename):
+    with open(filename, "rb") as fin:
+        for line in fin:
+            if not line.startswith("#"):
+                yield line
+
+
+def gen_stats(data_lines):
+    for line in data_lines:
+        try:
+            vhost, queue, _, _, m_all, _ = line.split(None, 5)
+        except ValueError:
+            print "ERROR: problem parsing the stats file"
+            sys.exit(2)
+        assert m_all.isdigit(), "Message count is not a number: %r" % m_all
+        yield vhost, queue, int(m_all)
+
+
+def collate_stats(stats, limits):
+    # Create a dict with stats collated according to the definitions in the
+    # limits file. If none of the definitions in the limits file is matched,
+    # store the stat without collating.
+    collated = defaultdict(lambda: 0)
+    for vhost, queue, m_all in stats:
+        for l_vhost, l_queue, _, _ in limits:
+            if fnmatchcase(vhost, l_vhost) and fnmatchcase(queue, l_queue):
+                collated[l_vhost, l_queue] += m_all
+                break
+        else:
+            collated[vhost, queue] += m_all
+    return collated
+
+
+def check_stats(stats_collated, limits):
+    # Create a limits lookup dict with keys of the form (vhost, queue).
+    limits_lookup = dict(
+        ((l_vhost, l_queue), (int(t_warning), int(t_critical)))
+        for l_vhost, l_queue, t_warning, t_critical in limits)
+    if not (stats_collated):
+        yield 'No Queues Found', 'No Vhosts Found', None, "CRIT"
+    # Go through the stats and compare again limits, if any.
+    for l_vhost, l_queue in sorted(stats_collated):
+        m_all = stats_collated[l_vhost, l_queue]
+        try:
+            t_warning, t_critical = limits_lookup[l_vhost, l_queue]
+        except KeyError:
+            yield l_queue, l_vhost, m_all, "UNKNOWN"
+        else:
+            if m_all >= t_critical:
+                yield l_queue, l_vhost, m_all, "CRIT"
+            elif m_all >= t_warning:
+                yield l_queue, l_vhost, m_all, "WARN"
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='RabbitMQ queue size nagios check.')
+    parser.add_argument('-c', nargs=4, action='append', required=True,
+                        metavar=('vhost', 'queue', 'warn', 'crit'),
+                        help=('Vhost and queue to check. Can be used multiple times'))
+    parser.add_argument('stats_file', nargs='*', type=str, help='file containing queue stats')
+    args = parser.parse_args()
+
+    # Start generating stats from all files given on the command line.
+    stats = gen_stats(
+        chain.from_iterable(
+            gen_data_lines(filename) for filename in args.stats_file))
+    # Collate stats according to limit definitions and check.
+    stats_collated = collate_stats(stats, args.c)
+    stats_checked = check_stats(stats_collated, args.c)
+    criticals, warnings = [], []
+    for queue, vhost, message_no, status in stats_checked:
+        if status == "CRIT":
+            criticals.append(
+                "%s in %s has %s messages" % (queue, vhost, message_no))
+        elif status == "WARN":
+            warnings.append(
+                "%s in %s has %s messages" % (queue, vhost, message_no))
+    if len(criticals) > 0:
+        print "CRITICALS: %s" % ", ".join(criticals)
+        sys.exit(2)
+    # XXX: No warnings if there are criticals?
+    elif len(warnings) > 0:
+        print "WARNINGS: %s" % ", ".join(warnings)
+        sys.exit(1)
+    else:
+        print "OK"
+        sys.exit(0)

=== added file 'scripts/collect_rabbitmq_stats.sh'
--- scripts/collect_rabbitmq_stats.sh 1970-01-01 00:00:00 +0000
+++ scripts/collect_rabbitmq_stats.sh 2014-05-28 10:08:46 +0000
@@ -0,0 +1,49 @@
+#!/bin/bash
+# Copyright (C) 2011, 2014 Canonical
+# All Rights Reserved
+# Author: Liam Young, Jacek Nykis
+
+# Produce a queue data for a given vhost. Useful for graphing and Nagios checks
+LOCK=/var/lock/rabbitmq-gather-metrics.lock
+# Check for a lock file and if not, create one
+lockfile-create -r2 --lock-name $LOCK > /dev/null 2>&1
+if [ $? -ne 0 ]; then
+    exit 1
+fi
+trap "rm -f $LOCK > /dev/null 2>&1" exit
+
+# Required to fix the bug about start-stop-daemon not being found in
+# rabbitmq-server 2.7.1-0ubuntu4.
+# '/usr/sbin/rabbitmqctl: 33: /usr/sbin/rabbitmqctl: start-stop-daemon: not found'
+export PATH=${PATH}:/sbin/
+
+if [ -f /var/lib/rabbitmq/pids ]; then
+    RABBIT_PID=$(grep "{rabbit\@${HOSTNAME}," /var/lib/rabbitmq/pids | sed -e 's!^.*,\([0-9]*\).*!\1!')
+elif [ -f /var/run/rabbitmq/pid ]; then
+    RABBIT_PID=$(cat /var/run/rabbitmq/pid)
+else
+    echo "No PID file found"
+    exit 3
+fi
+DATA_DIR="/var/lib/rabbitmq/data"
+DATA_FILE="${DATA_DIR}/$(hostname -s)_queue_stats.dat"
+LOG_DIR="/var/lib/rabbitmq/logs"
+RABBIT_STATS_DATA_FILE="${DATA_DIR}/$(hostname -s)_general_stats.dat"
+NOW=$(date +'%s')
+HOSTNAME=$(hostname -s)
+MNESIA_DB_SIZE=$(du -sm /var/lib/rabbitmq/mnesia | cut -f1)
+RABBIT_RSS=$(ps -p $RABBIT_PID -o rss=)
+if [ ! -d $DATA_DIR ]; then
+    mkdir -p $DATA_DIR
+fi
+if [ ! -d $LOG_DIR ]; then
+    mkdir -p $LOG_DIR
+fi
+echo "#Vhost Name Messages_ready Messages_unacknowledged Messages Consumers Memory Time" > $DATA_FILE
+/usr/sbin/rabbitmqctl -q list_vhosts | \
+while read VHOST; do
+    /usr/sbin/rabbitmqctl -q list_queues -p $VHOST name messages_ready messages_unacknowledged messages consumers memory | \
+    awk "{print \"$VHOST \" \$0 \" $(date +'%s') \"}" >> $DATA_FILE 2>${LOG_DIR}/list_queues.log
+done
+echo "mnesia_size: ${MNESIA_DB_SIZE}@$NOW" > $RABBIT_STATS_DATA_FILE
+echo "rss_size: ${RABBIT_RSS}@$NOW" >> $RABBIT_STATS_DATA_FILE
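The wildcard collation in collate_stats from the diff above can be tried in isolation. The following is a Python 3 rendering of the same loop, offered as a sketch; the sample vhost and queue names and the message counts are made up for illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def collate_stats(stats, limits):
    """Sum message counts under the first matching (vhost, queue) pattern.

    Queues that match no pattern are kept under their own names.
    """
    collated = defaultdict(int)
    for vhost, queue, m_all in stats:
        for l_vhost, l_queue, _warn, _crit in limits:
            if fnmatchcase(vhost, l_vhost) and fnmatchcase(queue, l_queue):
                collated[l_vhost, l_queue] += m_all
                break
        else:
            # The for/else branch runs only when no pattern matched.
            collated[vhost, queue] += m_all
    return dict(collated)


# Hypothetical sample data: (vhost, queue, total messages).
stats = [
    ("openstack", "queue_a", 3),
    ("openstack", "queue_b", 4),
    ("other", "jobs", 7),
]
limits = [("openstack", "*", 15, 30)]

print(collate_stats(stats, limits))
```

With a pattern such as `*`, the thresholds apply to the sum of all matching queues rather than to each queue individually, which is worth keeping in mind when choosing warning and critical values.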
Jacek,
Thanks for this work on the rabbitmq-server! I realize monitoring is important for production environments and am looking forward to getting the monitoring code in the charm.
When I run charm proof on the merged code there are two new Warning messages:
$ charm proof
W: config.yaml: option stats_cron_schedule does not have the keys: default
W: config.yaml: option queue_thresholds does not have the keys: default
…
These new configuration values should have default values in the config.yaml file. You can use a default of “”, or (I believe) a NoneType value is also valid.
Thanks for chatting with me on IRC. The steps to test this were as follows:
juju deploy --repository=. local:precise/rabbitmq-server
juju deploy nrpe-external-master
juju add-relation rabbitmq-server nrpe-external-master
juju set rabbitmq-server stats_cron_schedule="*/5 * * * *"
juju set rabbitmq-server queue_thresholds="[['\*', '\*', 15, 30]]"
(Wait 5 minutes for the queue data to be generated by cron)
juju ssh rabbitmq-server/0
$ /usr/local/lib/nagios/plugins/check_rabbitmq_queues.py -c \* \* 15 30 /var/lib/rabbitmq/data/*_queue_stats.dat
The results of this command were:
ubuntu@mbruzek-local-machine-1:/var/lib/rabbitmq/data$ /usr/local/lib/nagios/plugins/check_rabbitmq_queues.py -c \* \* 15 30 /var/lib/rabbitmq/data/mbruzek-local-machine-1_queue_stats.dat
CRITICALS: No Queues Found in No Vhosts Found has None messages
Normally this result would indicate a failure, but you mentioned on IRC that this is to be expected because nothing was connected to rabbitmq-server to generate messages. I was unable to figure out how to generate queued messages in rabbitmq-server for further testing.
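A minimal Python 3 sketch of why an empty stats file produces that exact message, assuming the logic of check_stats and the main block in scripts/check_rabbitmq_queues.py. The function `nagios_summary` is a hypothetical wrapper that returns the exit code and text instead of printing and exiting.

```python
def check_stats(stats_collated, limits):
    """Simplified version of the charm's check_stats generator."""
    limits_lookup = {(v, q): (int(w), int(c)) for v, q, w, c in limits}
    if not stats_collated:
        # No data lines at all: report a single critical placeholder entry.
        yield "No Queues Found", "No Vhosts Found", None, "CRIT"
    for vhost, queue in sorted(stats_collated):
        m_all = stats_collated[vhost, queue]
        if (vhost, queue) not in limits_lookup:
            # The charm yields "UNKNOWN" here, which its main block ignores.
            continue
        t_warning, t_critical = limits_lookup[vhost, queue]
        if m_all >= t_critical:
            yield queue, vhost, m_all, "CRIT"
        elif m_all >= t_warning:
            yield queue, vhost, m_all, "WARN"


def nagios_summary(stats_collated, limits):
    """Return (exit_code, message) the way the check script reports it."""
    criticals, warnings = [], []
    for queue, vhost, count, status in check_stats(stats_collated, limits):
        text = "%s in %s has %s messages" % (queue, vhost, count)
        if status == "CRIT":
            criticals.append(text)
        elif status == "WARN":
            warnings.append(text)
    if criticals:
        return 2, "CRITICALS: %s" % ", ".join(criticals)
    if warnings:
        return 1, "WARNINGS: %s" % ", ".join(warnings)
    return 0, "OK"


limits = [("*", "*", "15", "30")]
print(nagios_summary({}, limits))
print(nagios_summary({("*", "*"): 20}, limits))
```

The placeholder tuple carries None as its message count, which is where the "has None messages" wording comes from.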
Please provide default values for the two new configuration options and give me some more details on how to generate some RabbitMQ traffic so we can see the command run successfully.
Thank you again for the submission! I am going to set this MP to "Needs Fixing".
Once the problem has been addressed, click on the “Request another review” link on this merge proposal. That way it will be added to the review queue properly.
If you have any questions/comments/concerns about the review, contact mbruzek on IRC. You can find the rest of the team in #juju on irc.freenode.net or email the mailing list <email address hidden>