Merge ~barryprice/postgresql-charm/+git/postgresql-charm:master into postgresql-charm:master

Proposed by Barry Price
Status: Merged
Approved by: Barry Price
Approved revision: 9288a97b426f35850d7d05cf8d5320d7f31e9abe
Merged at revision: 8be5b9ae4eab4d43a90a4f036e3a040120929ca5
Proposed branch: ~barryprice/postgresql-charm/+git/postgresql-charm:master
Merge into: postgresql-charm:master
Diff against target: 315 lines (+235/-5)
6 files modified
config.yaml (+16/-4)
reactive/postgresql/nagios.py (+50/-1)
scripts/check_latest_ready_wal.py (+54/-0)
scripts/find_latest_ready_wal.py (+27/-0)
unit_tests/test_check_latest_ready_wal.py (+49/-0)
unit_tests/test_find_latest_ready_wal.py (+39/-0)
Reviewer Review Type Date Requested Status
Stuart Bishop (community) Approve
Review via email: mp+362732@code.launchpad.net

Commit message

Add new scripts and config options to warn the operator about any backlog in WAL-E backups - see LP:1814523

To post a comment you must log in.
Revision history for this message
Stuart Bishop (stub) wrote :

Coding seems fine.

Only running the cron job on master units is arguable. On one hand, running it everywhere is wasteful. On the other, it is only a quick script and disk write every minute or two.

Some int() casts are missing, which need to be fixed because this isn't Perl.

review: Needs Fixing
Revision history for this message
Stuart Bishop (stub) wrote :

Code changes are good

review: Approve
Revision history for this message
Stuart Bishop (stub) wrote :

Unit tests look fine.

The best way of testing stuff under reactive/ is to move the code elsewhere. This needs to improve, but isn't your problem. Best way for now is to move testable chunks of code elsewhere.

The main() can be testable, but needs a mock for the print function. Mocking print and dealing with sys.exit are not obvious, so it might be better to avoid using them (using 'return' instead of sys.exit, leaving it to the __main__ section to sys.exit(main(sys.argv)) instead of just main(sys.argv), and defining an easily mocked wrapper around print and using it instead.

Start by changing the main() signature into main(args=sys.argv), which allows you to pass in alternative arguments giving you control.

Next, move the 'max_age_filename' constant definition from inside main() to module scope (and in CAPS per PEP8 naming conventions). This allows your test to override it (using a try: finally: block to put back the original value, or setUp/tearDown methods to do the same). Alternatively, make it a function def max_age_filename(): return '/var/whatsit'. This function can then be mocked by the test to return a different result. A third alternative is pass it in as yet another parameter to main(), but that trick gets old fast.

Now your test can look something like:

import check_latest_ready_wal

@patch('check_latest_ready_wal.print')
def test_main(self, mock_print):
    with tempfile.TemporaryFile() as max_age_f:
        max_age_f.write(b'42\n')
        max_age_f.flush()
        org_max_age_path = check_latest_ready_wal.MAX_AGE_FILENAME
        try:
            check_latest_ready_wal.MAX_AGE_FILENAME = max_age_f.name
            check_latest_ready_wal([]) # overriding sys.argv
            with self.assertRaises(SystemExit) as ex:
                main([])
            self.assertEqual(ex.code, 0)
            mock_print.assert_called_once_with('OK: Expected message')
        finally:
            check_latest_ready_wal.MAX_AGE_FILENAME = org_max_age_path

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 8be5b9ae4eab4d43a90a4f036e3a040120929ca5

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
diff --git a/config.yaml b/config.yaml
index b70b53f..94daff9 100644
--- a/config.yaml
+++ b/config.yaml
@@ -170,6 +170,18 @@ options:
170 You need enough space for this many backups plus one more, as170 You need enough space for this many backups plus one more, as
171 an old backup will only be removed after a new one has been171 an old backup will only be removed after a new one has been
172 successfully made to replace it.172 successfully made to replace it.
173 wal_e_backup_stale_warn_threshold:
174 type: int
175 default: 300
176 description: >
177 How many seconds the oldest un-uploaded WAL database backup can be
178 before the Nagios check will issue a warning.
179 wal_e_backup_stale_warn_threshold:
180 type: int
181 default: 600
182 description: >
183 How many seconds the oldest un-uploaded WAL database backup can be
184 before the Nagios check will issue a critical error.
173 streaming_replication:185 streaming_replication:
174 type: boolean186 type: boolean
175 default: true187 default: true
@@ -308,7 +320,7 @@ options:
308 type: float320 type: float
309 description: DEPRECATED. Use extra_pg_conf.321 description: DEPRECATED. Use extra_pg_conf.
310 autovacuum_analyze_scale_factor:322 autovacuum_analyze_scale_factor:
311 default: 0.1 323 default: 0.1
312 type: float324 type: float
313 description: DEPRECATED. Use extra_pg_conf.325 description: DEPRECATED. Use extra_pg_conf.
314 autovacuum_vacuum_cost_delay:326 autovacuum_vacuum_cost_delay:
@@ -366,7 +378,7 @@ options:
366 kernel_shmall:378 kernel_shmall:
367 default: 0379 default: 0
368 type: int380 type: int
369 description: DEPRECATED and ignored. 381 description: DEPRECATED and ignored.
370 kernel_shmmax:382 kernel_shmmax:
371 default: 0383 default: 0
372 type: int384 type: int
@@ -384,7 +396,7 @@ options:
384 type: int396 type: int
385 description: DEPRECATED. Use extra_pg_conf.397 description: DEPRECATED. Use extra_pg_conf.
386 collapse_limit:398 collapse_limit:
387 default: -1 399 default: -1
388 type: int400 type: int
389 description: DEPRECATED. Use extra_pg_conf.401 description: DEPRECATED. Use extra_pg_conf.
390 temp_buffers:402 temp_buffers:
@@ -396,7 +408,7 @@ options:
396 type: string408 type: string
397 description: DEPRECATED. Use extra_pg_conf.409 description: DEPRECATED. Use extra_pg_conf.
398 checkpoint_segments:410 checkpoint_segments:
399 default: 10 411 default: 10
400 type: int412 type: int
401 description: DEPRECATED. Use extra_pg_conf.413 description: DEPRECATED. Use extra_pg_conf.
402 checkpoint_completion_target:414 checkpoint_completion_target:
diff --git a/reactive/postgresql/nagios.py b/reactive/postgresql/nagios.py
index 23c4626..8893460 100644
--- a/reactive/postgresql/nagios.py
+++ b/reactive/postgresql/nagios.py
@@ -110,8 +110,57 @@ def update_nrpe_config():
110 description='Check pgsql',110 description='Check pgsql',
111 check_cmd='check_pgsql -P {} -l {}'.format(port, user))111 check_cmd='check_pgsql -P {} -l {}'.format(port, user))
112112
113 # copy the check script which will run cronned as postgres user
114 with open('scripts/find_latest_ready_wal.py.py') as fh:
115 check_script = fh.read()
116
117 check_script_path = '{}/{}'.format(
118 helpers.scripts_dir(),
119 'find_latest_ready_wal.py.py'
120 )
121 helpers.write(check_script_path, check_script, mode=0o755)
122
123 # create an (empty) file with appropriate permissions for the above
124 check_output_path = '/var/lib/nagios/postgres-wal-e-max-age.txt'
125 helpers.write(check_output_path, '', mode=0o644,
126 user='postgres', group='postgres')
127
128 # retrieve the threshold values from the charm config
129 check_warn_threshold = helpers.config_yaml.get(
130 'wal_e_backup_stale_warn_threshold'
131 )
132 check_crit_threshold = helpers.config_yaml.get(
133 'wal_e_backup_stale_crit_threshold'
134 )
135
136 # create the cron job to run the above
137 check_cron = "*/2 * * * * postgres {}".format(check_script_path)
138 check_cron_path = '/etc/cron.d/postgres-stale-wal-e-check'
139 helpers.write(check_cron_path, check_cron, mode=0o644,
140 user='root', group='root')
141
142 # copy the nagios plugin which will check the cronned output
143 with open('scripts/check_latest_ready_wal.py') as fh:
144 check_script = fh.read()
145
146 check_script_path = '{}/{}'.format(
147 '/usr/local/lib/nagios/plugins',
148 'check_latest_ready_wal.py'
149 )
150 helpers.write(check_script_path, check_script, mode=0o755,
151 user='nagios', group='nagios')
152
153 # write the nagios check definition
154 nrpe.add_check(shortname='pgsql_stale_wal',
155 description='Check for stale WAL backups',
156 check_cmd='{} {} {}'.format(
157 check_script_path,
158 check_warn_threshold,
159 check_crit_threshold,
160 ))
161
113 if reactive.is_state('postgresql.replication.is_master'):162 if reactive.is_state('postgresql.replication.is_master'):
114 # TODO: These should be calcualted from the backup schedule,163 # TODO: These should be calculated from the backup schedule,
115 # which is difficult since that is specified in crontab format.164 # which is difficult since that is specified in crontab format.
116 warn_age = 172800165 warn_age = 172800
117 crit_age = 194400166 crit_age = 194400
diff --git a/scripts/check_latest_ready_wal.py b/scripts/check_latest_ready_wal.py
118new file mode 100755167new file mode 100755
index 0000000..87f4b30
--- /dev/null
+++ b/scripts/check_latest_ready_wal.py
@@ -0,0 +1,54 @@
1#!/usr/bin/python3
2
3import sys
4
5MAX_AGE_FILENAME = '/var/lib/nagios/postgres-wal-e-max-age.txt'
6
7
8def get_val_from_file(filename):
9 with open(filename) as the_file:
10 content = the_file.read().strip()
11 return int(content)
12
13
14def make_nice_age(seconds):
15 minutes, seconds = divmod(int(seconds), 60)
16 hours, minutes = divmod(minutes, 60)
17 days, hours = divmod(hours, 24)
18 return "{} days, {} hours, {} minutes and {} seconds".format(days, hours, minutes, seconds)
19
20
21def main(args=sys.argv):
22 NAGIOS_OK = 0
23 NAGIOS_WARN = 1
24 NAGIOS_FAIL = 2
25
26 try:
27 warn_threshold = int(args[1])
28 except IndexError:
29 print("Syntax error: not enough arguments given")
30 return NAGIOS_FAIL
31
32 try:
33 crit_threshold = int(args[2])
34 except IndexError:
35 print("Syntax error: not enough arguments given")
36 return NAGIOS_FAIL
37
38 max_age = get_val_from_file(MAX_AGE_FILENAME)
39 nice_age = make_nice_age(max_age)
40 ret_val = NAGIOS_OK
41
42 if max_age > crit_threshold:
43 print("CRITICAL: Last WAL-E backup was {} ago".format(nice_age))
44 ret_val = NAGIOS_FAIL
45 elif max_age > warn_threshold:
46 print("WARNING: Last WAL-E backup was {} ago".format(nice_age))
47 ret_val = NAGIOS_WARN
48 else:
49 print("OK: No stale WAL-E backups found")
50 return ret_val
51
52
53if __name__ == '__main__':
54 sys.exit(main(sys.argv))
diff --git a/scripts/find_latest_ready_wal.py b/scripts/find_latest_ready_wal.py
0new file mode 10075555new file mode 100755
index 0000000..6727be1
--- /dev/null
+++ b/scripts/find_latest_ready_wal.py
@@ -0,0 +1,27 @@
1#!/usr/bin/python3
2
3import glob
4import os
5import time
6
7
8def file_age(filepath):
9 return time.time() - os.path.getmtime(filepath)
10
11
12def main():
13 max_seen_age = 0
14 ready_files = glob.glob('/var/lib/postgresql/*/*/pg_{xlog,wal}/archive_status/*.ready')
15 max_age_filename = '/var/lib/nagios/postgres-wal-e-max-age.txt'
16
17 for ready_file in ready_files:
18 this_age = file_age(ready_file)
19 if this_age > max_seen_age:
20 max_seen_age = this_age
21
22 with open(max_age_filename, 'w') as max_age_file:
23 max_age_file.write('{}\n'.format(max_seen_age))
24
25
26if __name__ == '__main__':
27 main()
diff --git a/unit_tests/test_check_latest_ready_wal.py b/unit_tests/test_check_latest_ready_wal.py
0new file mode 10064428new file mode 100644
index 0000000..7acb4ba
--- /dev/null
+++ b/unit_tests/test_check_latest_ready_wal.py
@@ -0,0 +1,49 @@
1# Copyright 2019 Canonical Ltd.
2#
3# This file is part of the PostgreSQL Charm for Juju.
4#
5# This program is free software: you can redistribute it and/or modify
6# it under the terms of the GNU General Public License version 3, as
7# published by the Free Software Foundation.
8#
9# This program is distributed in the hope that it will be useful, but
10# WITHOUT ANY WARRANTY; without even the implied warranties of
11# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
12# PURPOSE. See the GNU General Public License for more details.
13#
14# You should have received a copy of the GNU General Public License
15# along with this program. If not, see <http://www.gnu.org/licenses/>.
16
17import os.path
18import sys
19import tempfile
20import unittest
21
22ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
23sys.path.insert(1, os.path.join(ROOT, 'scripts'))
24
25from check_latest_ready_wal import get_val_from_file, make_nice_age
26
27
28class TestCheckLatestReadyWal(unittest.TestCase):
29
30 def test_get_val_from_file(self):
31 fp = tempfile.NamedTemporaryFile()
32 fp.write(b'42\n')
33 fp.flush()
34 self.assertEqual(
35 get_val_from_file(fp.name),
36 42
37 )
38 fp.close()
39
40 def test_make_nice_age(self):
41 seconds = 123456
42 self.assertEqual(
43 make_nice_age(seconds),
44 '1 days, 10 hours, 17 minutes and 36 seconds'
45 )
46
47
48if __name__ == '__main__':
49 unittest.main()
diff --git a/unit_tests/test_find_latest_ready_wal.py b/unit_tests/test_find_latest_ready_wal.py
0new file mode 10064450new file mode 100644
index 0000000..07eea66
--- /dev/null
+++ b/unit_tests/test_find_latest_ready_wal.py
@@ -0,0 +1,39 @@
1# Copyright 2019 Canonical Ltd.
2#
3# This file is part of the PostgreSQL Charm for Juju.
4#
5# This program is free software: you can redistribute it and/or modify
6# it under the terms of the GNU General Public License version 3, as
7# published by the Free Software Foundation.
8#
9# This program is distributed in the hope that it will be useful, but
10# WITHOUT ANY WARRANTY; without even the implied warranties of
11# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
12# PURPOSE. See the GNU General Public License for more details.
13#
14# You should have received a copy of the GNU General Public License
15# along with this program. If not, see <http://www.gnu.org/licenses/>.
16
17import os.path
18import sys
19import tempfile
20import unittest
21
22ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
23sys.path.insert(1, os.path.join(ROOT, 'scripts'))
24
25from find_latest_ready_wal import file_age
26
27
28class TestFileAge(unittest.TestCase):
29 def test_file_age(self):
30 fp = tempfile.NamedTemporaryFile()
31 self.assertEqual(
32 int(file_age(fp.name)),
33 0
34 )
35 fp.close()
36
37
38if __name__ == '__main__':
39 unittest.main()

Subscribers

People subscribed via source and target branches

to all changes: