Merge ~barryprice/postgresql-charm/+git/postgresql-charm:master into postgresql-charm:master

Proposed by Barry Price
Status: Merged
Approved by: Barry Price
Approved revision: 9288a97b426f35850d7d05cf8d5320d7f31e9abe
Merged at revision: 8be5b9ae4eab4d43a90a4f036e3a040120929ca5
Proposed branch: ~barryprice/postgresql-charm/+git/postgresql-charm:master
Merge into: postgresql-charm:master
Diff against target: 315 lines (+235/-5)
6 files modified
config.yaml (+16/-4)
reactive/postgresql/nagios.py (+50/-1)
scripts/check_latest_ready_wal.py (+54/-0)
scripts/find_latest_ready_wal.py (+27/-0)
unit_tests/test_check_latest_ready_wal.py (+49/-0)
unit_tests/test_find_latest_ready_wal.py (+39/-0)
Reviewer Review Type Date Requested Status
Stuart Bishop (community) Approve
Review via email: mp+362732@code.launchpad.net

Commit message

Add new scripts and config options to warn the operator about any backlog in WAL-E backups - see LP:1814523

To post a comment you must log in.
Revision history for this message
Stuart Bishop (stub) wrote :

Coding seems fine.

Only running the cron job on master units is arguable. On one hand, running it everywhere is wasteful. On the other, it is only a quick script and disk write every minute or two.

Some int() casts are missing, which need to be fixed because this isn't Perl.

review: Needs Fixing
Revision history for this message
Stuart Bishop (stub) wrote :

Code changes are good

review: Approve
Revision history for this message
Stuart Bishop (stub) wrote :

Unit tests look fine.

The best way of testing stuff under reactive/ is to move the code elsewhere. This needs to improve, but isn't your problem. Best way for now is to move testable chunks of code elsewhere.

The main() can be testable, but needs a mock for the print function. Mocking print and dealing with sys.exit are not obvious, so it might be better to avoid using them (using 'return' instead of sys.exit, leaving it to the __main__ section to sys.exit(main(sys.argv)) instead of just main(sys.argv), and defining an easily mocked wrapper around print and using it instead.

Start by changing the main() signature into main(args=sys.argv), which allows you to pass in alternative arguments giving you control.

Next, move the 'max_age_filename' constant definition from inside main() to module scope (and in CAPS per PEP8 naming conventions). This allows your test to override it (using a try: finally: block to put back the original value, or setUp/tearDown methods to do the same). Alternatively, make it a function def max_age_filename(): return '/var/whatsit'. This function can then be mocked by the test to return a different result. A third alternative is pass it in as yet another parameter to main(), but that trick gets old fast.

Now your test can look something like:

import check_latest_ready_wal

@patch('check_latest_ready_wal.print')
def test_main(self, mock_print):
    with tempfile.TemporaryFile() as max_age_f:
        max_age_f.write(b'42\n')
        max_age_f.flush()
        org_max_age_path = check_latest_ready_wal.MAX_AGE_FILENAME
        try:
            check_latest_ready_wal.MAX_AGE_FILENAME = max_age_f.name
            check_latest_ready_wal([]) # overriding sys.argv
            with self.assertRaises(SystemExit) as ex:
                main([])
            self.assertEqual(ex.code, 0)
            mock_print.assert_called_once_with('OK: Expected message')
        finally:
            check_latest_ready_wal.MAX_AGE_FILENAME = org_max_age_path

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 8be5b9ae4eab4d43a90a4f036e3a040120929ca5

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/config.yaml b/config.yaml
2index b70b53f..94daff9 100644
3--- a/config.yaml
4+++ b/config.yaml
5@@ -170,6 +170,18 @@ options:
6 You need enough space for this many backups plus one more, as
7 an old backup will only be removed after a new one has been
8 successfully made to replace it.
9+ wal_e_backup_stale_warn_threshold:
10+ type: int
11+ default: 300
12+ description: >
13+ How many seconds the oldest un-uploaded WAL database backup can be
14+ before the Nagios check will issue a warning.
15+ wal_e_backup_stale_warn_threshold:
16+ type: int
17+ default: 600
18+ description: >
19+ How many seconds the oldest un-uploaded WAL database backup can be
20+ before the Nagios check will issue a critical error.
21 streaming_replication:
22 type: boolean
23 default: true
24@@ -308,7 +320,7 @@ options:
25 type: float
26 description: DEPRECATED. Use extra_pg_conf.
27 autovacuum_analyze_scale_factor:
28- default: 0.1
29+ default: 0.1
30 type: float
31 description: DEPRECATED. Use extra_pg_conf.
32 autovacuum_vacuum_cost_delay:
33@@ -366,7 +378,7 @@ options:
34 kernel_shmall:
35 default: 0
36 type: int
37- description: DEPRECATED and ignored.
38+ description: DEPRECATED and ignored.
39 kernel_shmmax:
40 default: 0
41 type: int
42@@ -384,7 +396,7 @@ options:
43 type: int
44 description: DEPRECATED. Use extra_pg_conf.
45 collapse_limit:
46- default: -1
47+ default: -1
48 type: int
49 description: DEPRECATED. Use extra_pg_conf.
50 temp_buffers:
51@@ -396,7 +408,7 @@ options:
52 type: string
53 description: DEPRECATED. Use extra_pg_conf.
54 checkpoint_segments:
55- default: 10
56+ default: 10
57 type: int
58 description: DEPRECATED. Use extra_pg_conf.
59 checkpoint_completion_target:
60diff --git a/reactive/postgresql/nagios.py b/reactive/postgresql/nagios.py
61index 23c4626..8893460 100644
62--- a/reactive/postgresql/nagios.py
63+++ b/reactive/postgresql/nagios.py
64@@ -110,8 +110,57 @@ def update_nrpe_config():
65 description='Check pgsql',
66 check_cmd='check_pgsql -P {} -l {}'.format(port, user))
67
68+ # copy the check script which will run cronned as postgres user
69+ with open('scripts/find_latest_ready_wal.py.py') as fh:
70+ check_script = fh.read()
71+
72+ check_script_path = '{}/{}'.format(
73+ helpers.scripts_dir(),
74+ 'find_latest_ready_wal.py.py'
75+ )
76+ helpers.write(check_script_path, check_script, mode=0o755)
77+
78+ # create an (empty) file with appropriate permissions for the above
79+ check_output_path = '/var/lib/nagios/postgres-wal-e-max-age.txt'
80+ helpers.write(check_output_path, '', mode=0o644,
81+ user='postgres', group='postgres')
82+
83+ # retrieve the threshold values from the charm config
84+ check_warn_threshold = helpers.config_yaml.get(
85+ 'wal_e_backup_stale_warn_threshold'
86+ )
87+ check_crit_threshold = helpers.config_yaml.get(
88+ 'wal_e_backup_stale_crit_threshold'
89+ )
90+
91+ # create the cron job to run the above
92+ check_cron = "*/2 * * * * postgres {}".format(check_script_path)
93+ check_cron_path = '/etc/cron.d/postgres-stale-wal-e-check'
94+ helpers.write(check_cron_path, check_cron, mode=0o644,
95+ user='root', group='root')
96+
97+ # copy the nagios plugin which will check the cronned output
98+ with open('scripts/check_latest_ready_wal.py') as fh:
99+ check_script = fh.read()
100+
101+ check_script_path = '{}/{}'.format(
102+ '/usr/local/lib/nagios/plugins',
103+ 'check_latest_ready_wal.py'
104+ )
105+ helpers.write(check_script_path, check_script, mode=0o755,
106+ user='nagios', group='nagios')
107+
108+ # write the nagios check definition
109+ nrpe.add_check(shortname='pgsql_stale_wal',
110+ description='Check for stale WAL backups',
111+ check_cmd='{} {} {}'.format(
112+ check_script_path,
113+ check_warn_threshold,
114+ check_crit_threshold,
115+ ))
116+
117 if reactive.is_state('postgresql.replication.is_master'):
118- # TODO: These should be calcualted from the backup schedule,
119+ # TODO: These should be calculated from the backup schedule,
120 # which is difficult since that is specified in crontab format.
121 warn_age = 172800
122 crit_age = 194400
123diff --git a/scripts/check_latest_ready_wal.py b/scripts/check_latest_ready_wal.py
124new file mode 100755
125index 0000000..87f4b30
126--- /dev/null
127+++ b/scripts/check_latest_ready_wal.py
128@@ -0,0 +1,54 @@
129+#!/usr/bin/python3
130+
131+import sys
132+
133+MAX_AGE_FILENAME = '/var/lib/nagios/postgres-wal-e-max-age.txt'
134+
135+
136+def get_val_from_file(filename):
137+ with open(filename) as the_file:
138+ content = the_file.read().strip()
139+ return int(content)
140+
141+
142+def make_nice_age(seconds):
143+ minutes, seconds = divmod(int(seconds), 60)
144+ hours, minutes = divmod(minutes, 60)
145+ days, hours = divmod(hours, 24)
146+ return "{} days, {} hours, {} minutes and {} seconds".format(days, hours, minutes, seconds)
147+
148+
149+def main(args=sys.argv):
150+ NAGIOS_OK = 0
151+ NAGIOS_WARN = 1
152+ NAGIOS_FAIL = 2
153+
154+ try:
155+ warn_threshold = int(args[1])
156+ except IndexError:
157+ print("Syntax error: not enough arguments given")
158+ return NAGIOS_FAIL
159+
160+ try:
161+ crit_threshold = int(args[2])
162+ except IndexError:
163+ print("Syntax error: not enough arguments given")
164+ return NAGIOS_FAIL
165+
166+ max_age = get_val_from_file(MAX_AGE_FILENAME)
167+ nice_age = make_nice_age(max_age)
168+ ret_val = NAGIOS_OK
169+
170+ if max_age > crit_threshold:
171+ print("CRITICAL: Last WAL-E backup was {} ago".format(nice_age))
172+ ret_val = NAGIOS_FAIL
173+ elif max_age > warn_threshold:
174+ print("WARNING: Last WAL-E backup was {} ago".format(nice_age))
175+ ret_val = NAGIOS_WARN
176+ else:
177+ print("OK: No stale WAL-E backups found")
178+ return ret_val
179+
180+
181+if __name__ == '__main__':
182+ sys.exit(main(sys.argv))
183diff --git a/scripts/find_latest_ready_wal.py b/scripts/find_latest_ready_wal.py
184new file mode 100755
185index 0000000..6727be1
186--- /dev/null
187+++ b/scripts/find_latest_ready_wal.py
188@@ -0,0 +1,27 @@
189+#!/usr/bin/python3
190+
191+import glob
192+import os
193+import time
194+
195+
196+def file_age(filepath):
197+ return time.time() - os.path.getmtime(filepath)
198+
199+
200+def main():
201+ max_seen_age = 0
202+ ready_files = glob.glob('/var/lib/postgresql/*/*/pg_{xlog,wal}/archive_status/*.ready')
203+ max_age_filename = '/var/lib/nagios/postgres-wal-e-max-age.txt'
204+
205+ for ready_file in ready_files:
206+ this_age = file_age(ready_file)
207+ if this_age > max_seen_age:
208+ max_seen_age = this_age
209+
210+ with open(max_age_filename, 'w') as max_age_file:
211+ max_age_file.write('{}\n'.format(max_seen_age))
212+
213+
214+if __name__ == '__main__':
215+ main()
216diff --git a/unit_tests/test_check_latest_ready_wal.py b/unit_tests/test_check_latest_ready_wal.py
217new file mode 100644
218index 0000000..7acb4ba
219--- /dev/null
220+++ b/unit_tests/test_check_latest_ready_wal.py
221@@ -0,0 +1,49 @@
222+# Copyright 2019 Canonical Ltd.
223+#
224+# This file is part of the PostgreSQL Charm for Juju.
225+#
226+# This program is free software: you can redistribute it and/or modify
227+# it under the terms of the GNU General Public License version 3, as
228+# published by the Free Software Foundation.
229+#
230+# This program is distributed in the hope that it will be useful, but
231+# WITHOUT ANY WARRANTY; without even the implied warranties of
232+# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
233+# PURPOSE. See the GNU General Public License for more details.
234+#
235+# You should have received a copy of the GNU General Public License
236+# along with this program. If not, see <http://www.gnu.org/licenses/>.
237+
238+import os.path
239+import sys
240+import tempfile
241+import unittest
242+
243+ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
244+sys.path.insert(1, os.path.join(ROOT, 'scripts'))
245+
246+from check_latest_ready_wal import get_val_from_file, make_nice_age
247+
248+
249+class TestCheckLatestReadyWal(unittest.TestCase):
250+
251+ def test_get_val_from_file(self):
252+ fp = tempfile.NamedTemporaryFile()
253+ fp.write(b'42\n')
254+ fp.flush()
255+ self.assertEqual(
256+ get_val_from_file(fp.name),
257+ 42
258+ )
259+ fp.close()
260+
261+ def test_make_nice_age(self):
262+ seconds = 123456
263+ self.assertEqual(
264+ make_nice_age(seconds),
265+ '1 days, 10 hours, 17 minutes and 36 seconds'
266+ )
267+
268+
269+if __name__ == '__main__':
270+ unittest.main()
271diff --git a/unit_tests/test_find_latest_ready_wal.py b/unit_tests/test_find_latest_ready_wal.py
272new file mode 100644
273index 0000000..07eea66
274--- /dev/null
275+++ b/unit_tests/test_find_latest_ready_wal.py
276@@ -0,0 +1,39 @@
277+# Copyright 2019 Canonical Ltd.
278+#
279+# This file is part of the PostgreSQL Charm for Juju.
280+#
281+# This program is free software: you can redistribute it and/or modify
282+# it under the terms of the GNU General Public License version 3, as
283+# published by the Free Software Foundation.
284+#
285+# This program is distributed in the hope that it will be useful, but
286+# WITHOUT ANY WARRANTY; without even the implied warranties of
287+# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
288+# PURPOSE. See the GNU General Public License for more details.
289+#
290+# You should have received a copy of the GNU General Public License
291+# along with this program. If not, see <http://www.gnu.org/licenses/>.
292+
293+import os.path
294+import sys
295+import tempfile
296+import unittest
297+
298+ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
299+sys.path.insert(1, os.path.join(ROOT, 'scripts'))
300+
301+from find_latest_ready_wal import file_age
302+
303+
304+class TestFileAge(unittest.TestCase):
305+ def test_file_age(self):
306+ fp = tempfile.NamedTemporaryFile()
307+ self.assertEqual(
308+ int(file_age(fp.name)),
309+ 0
310+ )
311+ fp.close()
312+
313+
314+if __name__ == '__main__':
315+ unittest.main()

Subscribers

People subscribed via source and target branches

to all changes: