Merge ~xavpaice/bootstack-specs:ssl-nagios-checks into bootstack-specs:master

Proposed by Xav Paice
Status: Rejected
Rejected by: Haw Loeung
Proposed branch: ~xavpaice/bootstack-specs:ssl-nagios-checks
Merge into: bootstack-specs:master
Diff against target: 192 lines (+186/-0)
1 file modified
backlog/ssl-nagios-checks.rst (+186/-0)
Reviewer: Jeremy Lounder (review status: Pending)
Review via email: mp+361470@code.launchpad.net

Unmerged commits

72edf30... by Xav Paice

ssl-nagios-checks

Preview Diff

1diff --git a/backlog/ssl-nagios-checks.rst b/backlog/ssl-nagios-checks.rst
2new file mode 100644
3index 0000000..dc768df
4--- /dev/null
5+++ b/backlog/ssl-nagios-checks.rst
6@@ -0,0 +1,186 @@
7+..
8+ Copyright 2018 Canonical
9+
10+ This work is licensed under a Creative Commons Attribution 3.0
11+ Unported License.
12+ http://creativecommons.org/licenses/by/3.0/legalcode
13+
14+============================================
15+Nagios monitoring for SSL certificate expiry
16+============================================
17+
18+SSL certificates, either self-signed, corporate CA, or signed by a public
19+authority, are used by various charms to provide https encryption for API
20+endpoints. These certs are added to the charm configuration, along with a CA
21+certificate if the cert is signed by a private authority. Those certificates
22+must be replaced at, or preferably before, the expiry date of either the SSL
23+certificate itself or the CA cert.
24+
25+Problem Description
26+===================
27+
28+Currently there is no monitoring or alerting to warn operators when the
29+certificates provided are nearing expiry. Certificates can therefore expire
30+without notice, making API services unavailable with no useful alert to
31+explain why.
32+
33+Proposed Change
34+===============
35+
36+The goal is to add an nrpe monitor that will check the expiry date of OpenStack
37+API certificates and alert when the expiry is a set number of days away or
38+less.
39+
40+First stage of change
41+---------------------
42+
43+Currently the Foundation Cloud build includes the openstack-service-checks
44+charm, which monitors a small number of items in Nova and Neutron. The
45+proposed change is to extend that charm to loop through the Keystone catalog as
46+provided by the output of `openstack endpoint list`, and add a check_http check
47+for each entry. The exact url should be determined by the endpoint provided,
48+alongside a translation dictionary so that the 'healthcheck' url can be added
49+if one exists for that service. If the url is https, then add -C 30,14 to the
50+check_http check so that it goes to warning at 30 days, and critical at 14
51+days, before certificate expiry.
52+
53+The format of the new Nagios service provided for each service catalog entry
54+would be::
55+
56+ nagios_context-unitname-$service-$type
57+
58+E.g. to check the keystone public URL::
59+
60+ bootstack-customer-os-service-checks-keystone-public
61+
62+Limitations of the proposed change:
63+
64+- Only certificates for OpenStack API endpoints are checked, if they're in the
65+ Keystone catalog. That means non-OpenStack services such as Kubernetes, or
66+ other Apache services, are not monitored by this additional feature.
67+- Only the endpoint URL is checked. For a cluster of multiple units behind the
68+ hacluster charm we will only check one unit per nrpe check run. This may
69+ miss a unit should a certificate update fail at some stage.
70+
71+Second stage of the change
72+--------------------------
73+
74+Create a Nagios plugin to take a certificate file and check its expiry date.
75+
76+Add the new plugin to charmhelpers, similar to how check_haproxy.sh is added to
77+./charm-helpers/charmhelpers/contrib/openstack/files/check_haproxy.sh.
78+
79+Add a method to charmhelpers.contrib.ssl which takes arguments for a
80+certificate file path, plus the actual certificate, and stores the cert where
81+asked. If this method is called and there's an nrpe relation, the method
82+should also trigger a call to add an nrpe check for certificate expiry.
83+
84+Add a new method to charmhelpers.contrib.nrpe which adds checks for TLS
85+certificates defined by ssl_cert and ssl_ca.
86+
87+For each non-reactive charm that uses charmhelpers, if ssl_cert and/or ssl_ca
88+is set and there's an nrpe relation, use the method added above to add nrpe
89+checks for certificate expiry. It may be preferable to move the code for
90+storing the certificate out of the individual charms.
91+
92+For reactive charms, update the charm-layer-openstack-api code to run the new
93+charmhelpers methods to store ssl_cert and/or ssl_ca.
94+
95+Charm Relation changes
96+----------------------
97+
98+The number of nrpe checks exported by openstack-service-checks will increase.
99+No changes to the relation itself are required.
100+
101+Charm config changes
102+--------------------
103+
104+New config options for openstack-service-checks:
105+
106+* Option for warning and critical number of days for certificate expiry
107+* Option to disable endpoint types - public, internal and admin, in case the
108+ network access to those endpoints is not available from the
109+ openstack-service-checks unit. The default should be to enable all; this
110+ option is really just to allow existing clouds with limited networks to
111+ function until such time as network access is provided.
112+
113+
114+Charm upgrade risks
115+-------------------
116+
117+None.
118+
119+
120+Alternatives
121+------------
122+
123+- Add a specific check to each openstack charm, to check the certificate
124+ validity by URL or by file. This will mean that the checks are available
125+ without needing openstack-service-checks at all. However, this will also
126+ mean all charms need updating, and in the case of checking by URL we will
127+ only see a result from the host which receives the request via haproxy.
128+- Provide URLs to nagios to check ssl expiry dates. This requires network
129+ access from the Nagios host to all the URLs which we cannot guarantee will be
130+ the case at every site.
131+- Create a subordinate charm to add an nrpe check to test certificate validity
132+ on each unit of an application. This would allow fast deployment across
133+ multiple applications, but at the cost of an extra Juju agent per unit.
134+- Extend the NRPE charm to provide a new plugin to check a certificate file or
135+ base64 string for expiry date. Add logic to each charm that has the
136+ nrpe-external-master interface to add nrpe checks for the cert if it's
137+ configured. Again, this means changing each charm in turn. If certs are
138+ provided by Easyrsa or some other mechanism no checks would be provided.
139+
140+Implementation
141+==============
142+
143+Assignee(s)
144+-----------
145+
146+Primary assignee:
147+ xavpaice
148+
149+
150+Gerrit Topic
151+------------
152+
153+N/A, this is not an OpenStack project and does not use Gerrit.
154+
155+Work Items
156+----------
157+
158+1. Evaluate the healthcheck urls available for each project, and whether they
159+ differ from release to release (Mitaka to Rocky).
160+2. Add logic to the openstack-service-checks charm to add a check for each
161+ endpoint url provided by Keystone. If there's a healthcheck URL, use it,
162+ else just check for a valid response from the base url.
163+3. Add unit testing plus functional testing to the openstack-service-checks
164+ charm (currently entirely missing).
165+
166+Repositories
167+------------
168+
169+No new repositories; we will perform the work entirely within
170+https://launchpad.net/charm-openstack-service-checks.
171+
172+Documentation
173+-------------
174+
175+Documentation for the charm is located in README.md. The new functionality
176+will need to be described in that document.
177+
178+Security
179+--------
180+
181+This change does not alter the security profile of the charm.
182+
183+Testing
184+-------
185+
186+The charm currently has no unit testing or functional testing at all. This gap
187+should be addressed while working on this change.
188+
189+Dependencies
190+============
191+
192+None
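
The first stage described in the spec (one check_http check per Keystone catalog entry, with -C added for https endpoints) could be sketched roughly as below. This is not code from the charm: the function name `build_check`, the `HEALTHCHECK_PATHS` dictionary and its contents, and the default context and unit names are all hypothetical. The endpoint record is assumed to use the keys printed by `openstack endpoint list -f json` ("Service Name", "Interface", "URL").

```python
from urllib.parse import urlparse

# Hypothetical translation dictionary: service name -> healthcheck path.
# The real paths would come out of work item 1 in the spec.
HEALTHCHECK_PATHS = {"keystone": "/healthcheck"}


def build_check(endpoint, context="bootstack-customer",
                unit="os-service-checks", warn_days=30, crit_days=14):
    """Return (nagios_service_name, check_http argv) for one catalog entry."""
    url = urlparse(endpoint["URL"])
    service = endpoint["Service Name"]
    # Prefer the healthcheck path if one is known, else the endpoint's own path.
    path = HEALTHCHECK_PATHS.get(service, url.path or "/")
    port = url.port or (443 if url.scheme == "https" else 80)
    args = ["check_http", "-H", url.hostname, "-p", str(port), "-u", path]
    if url.scheme == "https":
        # -S enables TLS; -C warn,crit sets the certificate expiry thresholds.
        args += ["-S", "-C", "{},{}".format(warn_days, crit_days)]
    # Service name format: nagios_context-unitname-$service-$type
    name = "-".join([context, unit, service, endpoint["Interface"]])
    return name, args
```

Called with a keystone public endpoint at `https://keystone.example.com:5000/v3`, this yields the service name `bootstack-customer-os-service-checks-keystone-public` from the spec's example. One thing worth verifying during implementation: depending on the check_http version, passing -C may make the plugin check only the certificate and skip the HTTP response, in which case a separate check per endpoint might be needed.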

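For the second stage, a plugin that checks a certificate file's expiry date might look something like the following sketch. It assumes the `openssl` binary is available on the unit and shells out to `openssl x509 -noout -enddate`; the function names and the 30/14 day defaults are illustrative only, and a real plugin would also need argument parsing and an UNKNOWN path for unreadable files.

```python
import ssl
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_enddate(line):
    """Convert openssl's 'notAfter=Jan  5 09:34:43 2018 GMT' to epoch seconds."""
    _, _, timestamp = line.strip().partition("=")
    return ssl.cert_time_to_seconds(timestamp)


def cert_not_after(path):
    """Read the expiry time of the certificate stored at path (needs openssl)."""
    out = subprocess.check_output(
        ["openssl", "x509", "-noout", "-enddate", "-in", path],
        universal_newlines=True,
    )
    return parse_enddate(out)


def classify(not_after, now, warn_days=30, crit_days=14):
    """Map the time left before expiry to a Nagios status and message."""
    seconds_left = not_after - now
    days_left = int(seconds_left // 86400)
    if seconds_left <= 0:
        return CRITICAL, "CRITICAL: certificate has expired"
    if seconds_left <= crit_days * 86400:
        return CRITICAL, "CRITICAL: certificate expires in %d days" % days_left
    if seconds_left <= warn_days * 86400:
        return WARNING, "WARNING: certificate expires in %d days" % days_left
    return OK, "OK: certificate expires in %d days" % days_left
```

A wrapper script would call `classify(cert_not_after(path), time.time())`, print the message and exit with the returned status. Keeping the threshold logic separate from the file parsing makes it straightforward to cover with the unit tests the spec asks for.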