Merge lp:~tcuthbert/turku/turku-api into lp:turku/turku-api

Proposed by Thomas Cuthbert
Status: Merged
Approved by: Thomas Cuthbert
Approved revision: 68
Merged at revision: 67
Proposed branch: lp:~tcuthbert/turku/turku-api
Merge into: lp:turku/turku-api
Diff against target: 86 lines (+23/-17)
1 file modified
scripts/turku_sick_sources (+23/-17)
To merge this branch: bzr merge lp:~tcuthbert/turku/turku-api
Reviewer: Junien F
Review status: Approve
Review via email: mp+397467@code.launchpad.net

Commit message

Refactor the metrics based on prometheus querying.

Description of the change

This change refactors how we export the backup health metrics, based on my findings with prometheus querying. The required changes are outlined below:

* Relying on the influx line protocol "timestamp" element is fragile: influx expects nanosecond precision while telegraf defaults to 1s. So just abandon the whole idea and let telegraf handle the metric timestamp for us.
* Instead of exporting the metrics as either healthy/unhealthy (0/1), encode the health as a label which we can use to filter. The metric value is just the date last backed up as a unix timestamp. We can then use the prometheus time() function and subtract the date_last_backed_up value to work out how many days a backup is out of date.
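
As a rough illustration of the second point, the sketch below builds one line-protocol record with the health encoded as a tag and no trailing timestamp. It is not the script's code: `format_source_metric`, `escape_tag` and the `turku_sick` measurement name are assumptions made for the example (the diff does not show the value of MEASUREMENT_NAME), and the tag escaping is an extra step that does not appear in the diff.

```python
# Assumed value; the real constant is defined outside the diff shown here.
MEASUREMENT_NAME = "turku_sick"

# Health is exported as a "yes"/"no" tag rather than a boolean field.
HEALTH_LABEL = {False: "no", True: "yes"}


def escape_tag(value):
    """Escape commas, equals signs and spaces, which are special in tag values."""
    return value.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def format_source_metric(machine_unit_name, name, healthy, last_backup_epoch):
    """Return one line-protocol record; the timestamp is left for telegraf to add."""
    # A source that has never been backed up is reported as epoch 0.
    value = 0.0 if last_backup_epoch is None else float(last_backup_epoch)
    return (
        "{measurement},source_machine_name={machine},source_name={name},healthy={health} "
        "date_last_backed_up={value}\n"
    ).format(
        measurement=MEASUREMENT_NAME,
        machine=escape_tag(machine_unit_name),
        name=escape_tag(name),
        health=HEALTH_LABEL[bool(healthy)],
        value=value,
    )


if __name__ == "__main__":
    print(format_source_metric("web/0", "etc", False, 1612422231.0), end="")
```

Because the record ends after the field set, whatever collects it is free to stamp it at its own precision, which is the behaviour the first bullet asks for.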

So to figure out what hasn't been backed up in 90 days we can use an expression like `turku_sick_date_last_backed_up != 0 and (time () - turku_sick_date_last_backed_up{source_machine_name=~"hosts_i_care_about"}) / (60 * 60 * 24) > 90`.
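
The arithmetic in that expression can be checked outside prometheus. A minimal sketch, assuming hypothetical helper names `days_since_backup` and `is_stale`, with the current time passed in explicitly so the result is reproducible:

```python
SECONDS_PER_DAY = 60 * 60 * 24


def days_since_backup(last_backup_epoch, now):
    """Age of the last backup in days, i.e. (time() - value) / (60 * 60 * 24)."""
    return (now - last_backup_epoch) / SECONDS_PER_DAY


def is_stale(last_backup_epoch, now, max_age_days=90):
    """Mirror the query: ignore sources reported as 0, then compare the age."""
    if last_backup_epoch == 0:
        # Same effect as the `!= 0` filter: never-backed-up sources are excluded.
        return False
    return days_since_backup(last_backup_epoch, now) > max_age_days


if __name__ == "__main__":
    now = 1700000000
    print(is_stale(now - 91 * SECONDS_PER_DAY, now))  # older than 90 days
    print(is_stale(now - 89 * SECONDS_PER_DAY, now))  # still within 90 days
```

Sources exported with a value of 0 would otherwise appear to be decades out of date, which is why the query filters them out before doing the subtraction.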

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

lp:~tcuthbert/turku/turku-api updated
68. By Thomas Cuthbert

Refactor the metrics based on findings with prometheus querying.

Junien F (axino) wrote :

+1

review: Approve
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 67

Preview Diff

=== modified file 'scripts/turku_sick_sources'
--- scripts/turku_sick_sources 2021-02-04 05:08:58 +0000
+++ scripts/turku_sick_sources 2021-02-04 07:23:51 +0000
@@ -19,7 +19,7 @@
 import os
 import sys

-from time import gmtime, mktime, time
+from time import mktime

 BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 sys.path.append(BASE_DIR)
@@ -52,6 +52,10 @@
     sick_sources.out
     """

+    # Telegraf doesn't like boolean metrics so use this primitive enum-like object
+    # for readability.
+    _health_enum = {0: "no", 1: "yes"}
+
     def run(self, options):
         sources_total = 0
         sources_sick = 0
@@ -63,45 +67,47 @@
         ):
             sources_total += 1
             if not source.healthy():
-                continue
-
-            sources_sick += 1
-            # Rather than mixing types, initialise timestamp as epoch.
-            timestamp = mktime(gmtime(0))
+                sources_sick += 1
+            # Rather than mixing types, initialise timestamp as epoch at default precision.
+            date_last_backed_up = 0.0
             if source.date_last_backed_up is not None:
-                timestamp = mktime(source.date_last_backed_up.timetuple())
-            source = (source.machine.unit_name, source.name, timestamp)
+                date_last_backed_up = mktime(source.date_last_backed_up.timetuple())
+            source = (
+                source.machine.unit_name,
+                source.name,
+                self._health_enum[int(source.healthy())],
+                date_last_backed_up,
+            )
             sick_objects.append(source)

         if sources_sick == 0:
             sys.stdout.write("Nothing to do, all turku sources are healthy\n")
             return

-        timestamp = int(time() * 1000) # milliseconds
         totals = (
-            "{measurement} sources_unhealthy={sources_sick},sources_total={sources_total} "
-            "{timestamp}\n"
+            "{measurement} sources_unhealthy={sources_sick},sources_total={sources_total}\n"
         ).format(
             measurement=MEASUREMENT_NAME,
             sources_sick=sources_sick,
             sources_total=sources_total,
-            timestamp=timestamp,
         )
         data.append(totals)

         sys.stdout.write(totals)

         for source in sick_objects:
-            machine_unit_name, name, date_last_backed_up = source
+            machine_unit_name, name, health, date_last_backed_up = source
+            # The metrics we are exporting is the unix timestamp of date last backed up. Include a healthy label
+            # for concise query filtering.
             metric = (
-                "{measurement},source_machine_name={machine_unit_name},source_name={name},"
-                "date_last_backed_up={date_last_backed_up} unhealthy=1 {timestamp}\n"
+                "{measurement},source_machine_name={machine_unit_name},source_name={name},healthy={health} "
+                "date_last_backed_up={date_last_backed_up}\n"
             ).format(
                 measurement=MEASUREMENT_NAME,
-                date_last_backed_up=date_last_backed_up,
                 machine_unit_name=machine_unit_name,
                 name=name,
-                timestamp=timestamp,
+                health=health,
+                date_last_backed_up=date_last_backed_up,
             )
             data.append(metric)

