Merge ~barryprice/postgresql-charm/+git/postgresql-charm:master into postgresql-charm:master
- Git
- lp:~barryprice/postgresql-charm/+git/postgresql-charm
- master
- Merge into master
Status: | Merged | ||||
---|---|---|---|---|---|
Approved by: | Barry Price | ||||
Approved revision: | 9288a97b426f35850d7d05cf8d5320d7f31e9abe | ||||
Merged at revision: | 8be5b9ae4eab4d43a90a4f036e3a040120929ca5 | ||||
Proposed branch: | ~barryprice/postgresql-charm/+git/postgresql-charm:master | ||||
Merge into: | postgresql-charm:master | ||||
Diff against target: |
315 lines (+235/-5) 6 files modified
config.yaml (+16/-4) reactive/postgresql/nagios.py (+50/-1) scripts/check_latest_ready_wal.py (+54/-0) scripts/find_latest_ready_wal.py (+27/-0) unit_tests/test_check_latest_ready_wal.py (+49/-0) unit_tests/test_find_latest_ready_wal.py (+39/-0) |
||||
Related bugs: |
|
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Stuart Bishop (community) | Approve | ||
Review via email: mp+362732@code.launchpad.net |
Commit message
Add new scripts and config options to warn the operator about any backlog in WAL-E backups - see LP:1814523
Description of the change
Stuart Bishop (stub) wrote : | # |
Code changes are good
Stuart Bishop (stub) wrote : | # |
Unit tests look fine.
The best way of testing stuff under reactive/ is to move the code elsewhere. This needs to improve, but isn't your problem. Best way for now is to move testable chunks of code elsewhere.
The main() can be testable, but needs a mock for the print function. Mocking print and dealing with sys.exit are not obvious, so it might be better to avoid using them (using 'return' instead of sys.exit, leaving it to the __main__ section to sys.exit(
Start by changing the main() signature into main(args=
Next, move the 'max_age_filename' constant definition from inside main() to module scope (and in CAPS per PEP8 naming conventions). This allows your test to override it (using a try: finally: block to put back the original value, or setUp/tearDown methods to do the same). Alternatively, make it a function def max_age_filename(): return '/var/whatsit'. This function can then be mocked by the test to return a different result. A third alternative is pass it in as yet another parameter to main(), but that trick gets old fast.
Now your test can look something like:
import check_latest_
@patch(
def test_main(self, mock_print):
with tempfile.
try:
with self.assertRais
finally:
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote : | # |
This merge proposal is being monitored by mergebot. Change the status to Approved to merge.
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote : | # |
Change successfully merged at revision 8be5b9ae4eab4d4
Preview Diff
1 | diff --git a/config.yaml b/config.yaml | |||
2 | index b70b53f..94daff9 100644 | |||
3 | --- a/config.yaml | |||
4 | +++ b/config.yaml | |||
5 | @@ -170,6 +170,18 @@ options: | |||
6 | 170 | You need enough space for this many backups plus one more, as | 170 | You need enough space for this many backups plus one more, as |
7 | 171 | an old backup will only be removed after a new one has been | 171 | an old backup will only be removed after a new one has been |
8 | 172 | successfully made to replace it. | 172 | successfully made to replace it. |
9 | 173 | wal_e_backup_stale_warn_threshold: | ||
10 | 174 | type: int | ||
11 | 175 | default: 300 | ||
12 | 176 | description: > | ||
13 | 177 | How many seconds the oldest un-uploaded WAL database backup can be | ||
14 | 178 | before the Nagios check will issue a warning. | ||
15 | 179 | wal_e_backup_stale_warn_threshold: | ||
16 | 180 | type: int | ||
17 | 181 | default: 600 | ||
18 | 182 | description: > | ||
19 | 183 | How many seconds the oldest un-uploaded WAL database backup can be | ||
20 | 184 | before the Nagios check will issue a critical error. | ||
21 | 173 | streaming_replication: | 185 | streaming_replication: |
22 | 174 | type: boolean | 186 | type: boolean |
23 | 175 | default: true | 187 | default: true |
24 | @@ -308,7 +320,7 @@ options: | |||
25 | 308 | type: float | 320 | type: float |
26 | 309 | description: DEPRECATED. Use extra_pg_conf. | 321 | description: DEPRECATED. Use extra_pg_conf. |
27 | 310 | autovacuum_analyze_scale_factor: | 322 | autovacuum_analyze_scale_factor: |
29 | 311 | default: 0.1 | 323 | default: 0.1 |
30 | 312 | type: float | 324 | type: float |
31 | 313 | description: DEPRECATED. Use extra_pg_conf. | 325 | description: DEPRECATED. Use extra_pg_conf. |
32 | 314 | autovacuum_vacuum_cost_delay: | 326 | autovacuum_vacuum_cost_delay: |
33 | @@ -366,7 +378,7 @@ options: | |||
34 | 366 | kernel_shmall: | 378 | kernel_shmall: |
35 | 367 | default: 0 | 379 | default: 0 |
36 | 368 | type: int | 380 | type: int |
38 | 369 | description: DEPRECATED and ignored. | 381 | description: DEPRECATED and ignored. |
39 | 370 | kernel_shmmax: | 382 | kernel_shmmax: |
40 | 371 | default: 0 | 383 | default: 0 |
41 | 372 | type: int | 384 | type: int |
42 | @@ -384,7 +396,7 @@ options: | |||
43 | 384 | type: int | 396 | type: int |
44 | 385 | description: DEPRECATED. Use extra_pg_conf. | 397 | description: DEPRECATED. Use extra_pg_conf. |
45 | 386 | collapse_limit: | 398 | collapse_limit: |
47 | 387 | default: -1 | 399 | default: -1 |
48 | 388 | type: int | 400 | type: int |
49 | 389 | description: DEPRECATED. Use extra_pg_conf. | 401 | description: DEPRECATED. Use extra_pg_conf. |
50 | 390 | temp_buffers: | 402 | temp_buffers: |
51 | @@ -396,7 +408,7 @@ options: | |||
52 | 396 | type: string | 408 | type: string |
53 | 397 | description: DEPRECATED. Use extra_pg_conf. | 409 | description: DEPRECATED. Use extra_pg_conf. |
54 | 398 | checkpoint_segments: | 410 | checkpoint_segments: |
56 | 399 | default: 10 | 411 | default: 10 |
57 | 400 | type: int | 412 | type: int |
58 | 401 | description: DEPRECATED. Use extra_pg_conf. | 413 | description: DEPRECATED. Use extra_pg_conf. |
59 | 402 | checkpoint_completion_target: | 414 | checkpoint_completion_target: |
60 | diff --git a/reactive/postgresql/nagios.py b/reactive/postgresql/nagios.py | |||
61 | index 23c4626..8893460 100644 | |||
62 | --- a/reactive/postgresql/nagios.py | |||
63 | +++ b/reactive/postgresql/nagios.py | |||
64 | @@ -110,8 +110,57 @@ def update_nrpe_config(): | |||
65 | 110 | description='Check pgsql', | 110 | description='Check pgsql', |
66 | 111 | check_cmd='check_pgsql -P {} -l {}'.format(port, user)) | 111 | check_cmd='check_pgsql -P {} -l {}'.format(port, user)) |
67 | 112 | 112 | ||
68 | 113 | # copy the check script which will run cronned as postgres user | ||
69 | 114 | with open('scripts/find_latest_ready_wal.py.py') as fh: | ||
70 | 115 | check_script = fh.read() | ||
71 | 116 | |||
72 | 117 | check_script_path = '{}/{}'.format( | ||
73 | 118 | helpers.scripts_dir(), | ||
74 | 119 | 'find_latest_ready_wal.py.py' | ||
75 | 120 | ) | ||
76 | 121 | helpers.write(check_script_path, check_script, mode=0o755) | ||
77 | 122 | |||
78 | 123 | # create an (empty) file with appropriate permissions for the above | ||
79 | 124 | check_output_path = '/var/lib/nagios/postgres-wal-e-max-age.txt' | ||
80 | 125 | helpers.write(check_output_path, '', mode=0o644, | ||
81 | 126 | user='postgres', group='postgres') | ||
82 | 127 | |||
83 | 128 | # retrieve the threshold values from the charm config | ||
84 | 129 | check_warn_threshold = helpers.config_yaml.get( | ||
85 | 130 | 'wal_e_backup_stale_warn_threshold' | ||
86 | 131 | ) | ||
87 | 132 | check_crit_threshold = helpers.config_yaml.get( | ||
88 | 133 | 'wal_e_backup_stale_crit_threshold' | ||
89 | 134 | ) | ||
90 | 135 | |||
91 | 136 | # create the cron job to run the above | ||
92 | 137 | check_cron = "*/2 * * * * postgres {}".format(check_script_path) | ||
93 | 138 | check_cron_path = '/etc/cron.d/postgres-stale-wal-e-check' | ||
94 | 139 | helpers.write(check_cron_path, check_cron, mode=0o644, | ||
95 | 140 | user='root', group='root') | ||
96 | 141 | |||
97 | 142 | # copy the nagios plugin which will check the cronned output | ||
98 | 143 | with open('scripts/check_latest_ready_wal.py') as fh: | ||
99 | 144 | check_script = fh.read() | ||
100 | 145 | |||
101 | 146 | check_script_path = '{}/{}'.format( | ||
102 | 147 | '/usr/local/lib/nagios/plugins', | ||
103 | 148 | 'check_latest_ready_wal.py' | ||
104 | 149 | ) | ||
105 | 150 | helpers.write(check_script_path, check_script, mode=0o755, | ||
106 | 151 | user='nagios', group='nagios') | ||
107 | 152 | |||
108 | 153 | # write the nagios check definition | ||
109 | 154 | nrpe.add_check(shortname='pgsql_stale_wal', | ||
110 | 155 | description='Check for stale WAL backups', | ||
111 | 156 | check_cmd='{} {} {}'.format( | ||
112 | 157 | check_script_path, | ||
113 | 158 | check_warn_threshold, | ||
114 | 159 | check_crit_threshold, | ||
115 | 160 | )) | ||
116 | 161 | |||
117 | 113 | if reactive.is_state('postgresql.replication.is_master'): | 162 | if reactive.is_state('postgresql.replication.is_master'): |
119 | 114 | # TODO: These should be calcualted from the backup schedule, | 163 | # TODO: These should be calculated from the backup schedule, |
120 | 115 | # which is difficult since that is specified in crontab format. | 164 | # which is difficult since that is specified in crontab format. |
121 | 116 | warn_age = 172800 | 165 | warn_age = 172800 |
122 | 117 | crit_age = 194400 | 166 | crit_age = 194400 |
123 | diff --git a/scripts/check_latest_ready_wal.py b/scripts/check_latest_ready_wal.py | |||
124 | 118 | new file mode 100755 | 167 | new file mode 100755 |
125 | index 0000000..87f4b30 | |||
126 | --- /dev/null | |||
127 | +++ b/scripts/check_latest_ready_wal.py | |||
128 | @@ -0,0 +1,54 @@ | |||
129 | 1 | #!/usr/bin/python3 | ||
130 | 2 | |||
131 | 3 | import sys | ||
132 | 4 | |||
133 | 5 | MAX_AGE_FILENAME = '/var/lib/nagios/postgres-wal-e-max-age.txt' | ||
134 | 6 | |||
135 | 7 | |||
136 | 8 | def get_val_from_file(filename): | ||
137 | 9 | with open(filename) as the_file: | ||
138 | 10 | content = the_file.read().strip() | ||
139 | 11 | return int(content) | ||
140 | 12 | |||
141 | 13 | |||
142 | 14 | def make_nice_age(seconds): | ||
143 | 15 | minutes, seconds = divmod(int(seconds), 60) | ||
144 | 16 | hours, minutes = divmod(minutes, 60) | ||
145 | 17 | days, hours = divmod(hours, 24) | ||
146 | 18 | return "{} days, {} hours, {} minutes and {} seconds".format(days, hours, minutes, seconds) | ||
147 | 19 | |||
148 | 20 | |||
149 | 21 | def main(args=sys.argv): | ||
150 | 22 | NAGIOS_OK = 0 | ||
151 | 23 | NAGIOS_WARN = 1 | ||
152 | 24 | NAGIOS_FAIL = 2 | ||
153 | 25 | |||
154 | 26 | try: | ||
155 | 27 | warn_threshold = int(args[1]) | ||
156 | 28 | except IndexError: | ||
157 | 29 | print("Syntax error: not enough arguments given") | ||
158 | 30 | return NAGIOS_FAIL | ||
159 | 31 | |||
160 | 32 | try: | ||
161 | 33 | crit_threshold = int(args[2]) | ||
162 | 34 | except IndexError: | ||
163 | 35 | print("Syntax error: not enough arguments given") | ||
164 | 36 | return NAGIOS_FAIL | ||
165 | 37 | |||
166 | 38 | max_age = get_val_from_file(MAX_AGE_FILENAME) | ||
167 | 39 | nice_age = make_nice_age(max_age) | ||
168 | 40 | ret_val = NAGIOS_OK | ||
169 | 41 | |||
170 | 42 | if max_age > crit_threshold: | ||
171 | 43 | print("CRITICAL: Last WAL-E backup was {} ago".format(nice_age)) | ||
172 | 44 | ret_val = NAGIOS_FAIL | ||
173 | 45 | elif max_age > warn_threshold: | ||
174 | 46 | print("WARNING: Last WAL-E backup was {} ago".format(nice_age)) | ||
175 | 47 | ret_val = NAGIOS_WARN | ||
176 | 48 | else: | ||
177 | 49 | print("OK: No stale WAL-E backups found") | ||
178 | 50 | return ret_val | ||
179 | 51 | |||
180 | 52 | |||
181 | 53 | if __name__ == '__main__': | ||
182 | 54 | sys.exit(main(sys.argv)) | ||
183 | diff --git a/scripts/find_latest_ready_wal.py b/scripts/find_latest_ready_wal.py | |||
184 | 0 | new file mode 100755 | 55 | new file mode 100755 |
185 | index 0000000..6727be1 | |||
186 | --- /dev/null | |||
187 | +++ b/scripts/find_latest_ready_wal.py | |||
188 | @@ -0,0 +1,27 @@ | |||
189 | 1 | #!/usr/bin/python3 | ||
190 | 2 | |||
191 | 3 | import glob | ||
192 | 4 | import os | ||
193 | 5 | import time | ||
194 | 6 | |||
195 | 7 | |||
196 | 8 | def file_age(filepath): | ||
197 | 9 | return time.time() - os.path.getmtime(filepath) | ||
198 | 10 | |||
199 | 11 | |||
200 | 12 | def main(): | ||
201 | 13 | max_seen_age = 0 | ||
202 | 14 | ready_files = glob.glob('/var/lib/postgresql/*/*/pg_{xlog,wal}/archive_status/*.ready') | ||
203 | 15 | max_age_filename = '/var/lib/nagios/postgres-wal-e-max-age.txt' | ||
204 | 16 | |||
205 | 17 | for ready_file in ready_files: | ||
206 | 18 | this_age = file_age(ready_file) | ||
207 | 19 | if this_age > max_seen_age: | ||
208 | 20 | max_seen_age = this_age | ||
209 | 21 | |||
210 | 22 | with open(max_age_filename, 'w') as max_age_file: | ||
211 | 23 | max_age_file.write('{}\n'.format(max_seen_age)) | ||
212 | 24 | |||
213 | 25 | |||
214 | 26 | if __name__ == '__main__': | ||
215 | 27 | main() | ||
216 | diff --git a/unit_tests/test_check_latest_ready_wal.py b/unit_tests/test_check_latest_ready_wal.py | |||
217 | 0 | new file mode 100644 | 28 | new file mode 100644 |
218 | index 0000000..7acb4ba | |||
219 | --- /dev/null | |||
220 | +++ b/unit_tests/test_check_latest_ready_wal.py | |||
221 | @@ -0,0 +1,49 @@ | |||
222 | 1 | # Copyright 2019 Canonical Ltd. | ||
223 | 2 | # | ||
224 | 3 | # This file is part of the PostgreSQL Charm for Juju. | ||
225 | 4 | # | ||
226 | 5 | # This program is free software: you can redistribute it and/or modify | ||
227 | 6 | # it under the terms of the GNU General Public License version 3, as | ||
228 | 7 | # published by the Free Software Foundation. | ||
229 | 8 | # | ||
230 | 9 | # This program is distributed in the hope that it will be useful, but | ||
231 | 10 | # WITHOUT ANY WARRANTY; without even the implied warranties of | ||
232 | 11 | # MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR | ||
233 | 12 | # PURPOSE. See the GNU General Public License for more details. | ||
234 | 13 | # | ||
235 | 14 | # You should have received a copy of the GNU General Public License | ||
236 | 15 | # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
237 | 16 | |||
238 | 17 | import os.path | ||
239 | 18 | import sys | ||
240 | 19 | import tempfile | ||
241 | 20 | import unittest | ||
242 | 21 | |||
243 | 22 | ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)) | ||
244 | 23 | sys.path.insert(1, os.path.join(ROOT, 'scripts')) | ||
245 | 24 | |||
246 | 25 | from check_latest_ready_wal import get_val_from_file, make_nice_age | ||
247 | 26 | |||
248 | 27 | |||
249 | 28 | class TestCheckLatestReadyWal(unittest.TestCase): | ||
250 | 29 | |||
251 | 30 | def test_get_val_from_file(self): | ||
252 | 31 | fp = tempfile.NamedTemporaryFile() | ||
253 | 32 | fp.write(b'42\n') | ||
254 | 33 | fp.flush() | ||
255 | 34 | self.assertEqual( | ||
256 | 35 | get_val_from_file(fp.name), | ||
257 | 36 | 42 | ||
258 | 37 | ) | ||
259 | 38 | fp.close() | ||
260 | 39 | |||
261 | 40 | def test_make_nice_age(self): | ||
262 | 41 | seconds = 123456 | ||
263 | 42 | self.assertEqual( | ||
264 | 43 | make_nice_age(seconds), | ||
265 | 44 | '1 days, 10 hours, 17 minutes and 36 seconds' | ||
266 | 45 | ) | ||
267 | 46 | |||
268 | 47 | |||
269 | 48 | if __name__ == '__main__': | ||
270 | 49 | unittest.main() | ||
271 | diff --git a/unit_tests/test_find_latest_ready_wal.py b/unit_tests/test_find_latest_ready_wal.py | |||
272 | 0 | new file mode 100644 | 50 | new file mode 100644 |
273 | index 0000000..07eea66 | |||
274 | --- /dev/null | |||
275 | +++ b/unit_tests/test_find_latest_ready_wal.py | |||
276 | @@ -0,0 +1,39 @@ | |||
277 | 1 | # Copyright 2019 Canonical Ltd. | ||
278 | 2 | # | ||
279 | 3 | # This file is part of the PostgreSQL Charm for Juju. | ||
280 | 4 | # | ||
281 | 5 | # This program is free software: you can redistribute it and/or modify | ||
282 | 6 | # it under the terms of the GNU General Public License version 3, as | ||
283 | 7 | # published by the Free Software Foundation. | ||
284 | 8 | # | ||
285 | 9 | # This program is distributed in the hope that it will be useful, but | ||
286 | 10 | # WITHOUT ANY WARRANTY; without even the implied warranties of | ||
287 | 11 | # MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR | ||
288 | 12 | # PURPOSE. See the GNU General Public License for more details. | ||
289 | 13 | # | ||
290 | 14 | # You should have received a copy of the GNU General Public License | ||
291 | 15 | # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
292 | 16 | |||
293 | 17 | import os.path | ||
294 | 18 | import sys | ||
295 | 19 | import tempfile | ||
296 | 20 | import unittest | ||
297 | 21 | |||
298 | 22 | ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)) | ||
299 | 23 | sys.path.insert(1, os.path.join(ROOT, 'scripts')) | ||
300 | 24 | |||
301 | 25 | from find_latest_ready_wal import file_age | ||
302 | 26 | |||
303 | 27 | |||
304 | 28 | class TestFileAge(unittest.TestCase): | ||
305 | 29 | def test_file_age(self): | ||
306 | 30 | fp = tempfile.NamedTemporaryFile() | ||
307 | 31 | self.assertEqual( | ||
308 | 32 | int(file_age(fp.name)), | ||
309 | 33 | 0 | ||
310 | 34 | ) | ||
311 | 35 | fp.close() | ||
312 | 36 | |||
313 | 37 | |||
314 | 38 | if __name__ == '__main__': | ||
315 | 39 | unittest.main() |
Coding seems fine.
Only running the cron job on master units is arguable. On one hand, running it everywhere is wasteful. On the other, it is only a quick script and disk write every minute or two.
Some int() casts are missing, which need to be fixed because this isn't Perl.