Merge lp:~tcuthbert/turku/turku-api into lp:turku/turku-api

Proposed by Thomas Cuthbert
Status: Merged
Approved by: Thomas Cuthbert
Approved revision: 68
Merged at revision: 67
Proposed branch: lp:~tcuthbert/turku/turku-api
Merge into: lp:turku/turku-api
Diff against target: 86 lines (+23/-17)
1 file modified
scripts/turku_sick_sources (+23/-17)
To merge this branch: bzr merge lp:~tcuthbert/turku/turku-api
Reviewer: Junien Fridrick
Review status: Approve
Review via email: mp+397467@code.launchpad.net

Commit message

Refactor the metrics based on prometheus querying.

Description of the change

This change refactors how we export the backup health metrics, based on my findings with prometheus querying. The required changes are outlined below:

* Relying on the influx line protocol "timestamp" element is flimsy: influx expects nanosecond precision while telegraf defaults to 1s. So abandon the idea and let telegraf set the metric timestamp for us.
* Instead of exporting the metrics as either healthy/unhealthy (0/1), encode the health as a label which we can use to filter. The data is just the date last backed up as a unix timestamp. We can then use the prometheus time() function and subtract the date_last_backed_up value to work out how many days a backup is out of date.
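
As an illustration of the two points above, here is a minimal sketch of how one per-source line could be built: the health goes into a tag, the only field is the unix timestamp of the last backup, and no trailing timestamp is written so that telegraf assigns one. The measurement name `turku_sick` is an assumption (the diff only shows `MEASUREMENT_NAME`), and `escape_tag` and `format_source_metric` are hypothetical helpers, not part of the merged script, which does not escape tag values.

```python
MEASUREMENT_NAME = "turku_sick"  # assumption: the real value is not shown in the diff

# Same idea as _health_enum in the diff: a string label instead of a boolean.
HEALTH_LABEL = {False: "no", True: "yes"}


def escape_tag(value):
    """Escape commas, equals signs and spaces, which are special in tag values."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def format_source_metric(machine_unit_name, source_name, healthy, last_backed_up=None):
    """Return one line-protocol line with no trailing timestamp."""
    # 0.0 stands in for "never backed up", as in the diff.
    epoch = 0.0 if last_backed_up is None else float(last_backed_up)
    return (
        "{measurement},source_machine_name={machine},source_name={name},healthy={health} "
        "date_last_backed_up={epoch}\n"
    ).format(
        measurement=MEASUREMENT_NAME,
        machine=escape_tag(machine_unit_name),
        name=escape_tag(source_name),
        health=HEALTH_LABEL[bool(healthy)],
        epoch=epoch,
    )


if __name__ == "__main__":
    print(format_source_metric("db/0", "etc", False, 1612422231), end="")
```

Leaving the timestamp off sidesteps the precision mismatch entirely, at the cost of the sample time being whenever telegraf collects the output rather than when the script ran.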

So to find what hasn't been backed up in 90 days, we can use an expression like `turku_sick_date_last_backed_up != 0 and (time() - turku_sick_date_last_backed_up{source_machine_name=~"hosts_i_care_about"}) / (60 * 60 * 24) > 90`. The `!= 0` clause excludes sources that have never been backed up, which are exported with a value of 0.
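
To make the arithmetic in that expression concrete, here is a rough Python equivalent of what it computes. This is only a sketch for illustration, not how prometheus evaluates the query; `stale_sources` and its parameters are hypothetical names.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def stale_sources(samples, max_age_days=90, now=None):
    """Return {source: age_in_days} for sources older than max_age_days.

    samples maps a source name to its date_last_backed_up value (unix seconds).
    A value of 0 means "never backed up" and is skipped, like the `!= 0` clause.
    """
    now = time.time() if now is None else now
    stale = {}
    for name, last_backed_up in samples.items():
        if last_backed_up == 0:
            continue
        age_days = (now - last_backed_up) / SECONDS_PER_DAY
        if age_days > max_age_days:
            stale[name] = age_days
    return stale


if __name__ == "__main__":
    now = 100 * SECONDS_PER_DAY
    samples = {"never": 0.0, "old": 5 * SECONDS_PER_DAY, "fresh": 95 * SECONDS_PER_DAY}
    print(stale_sources(samples, now=now))
```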

Canonical IS Mergebot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

lp:~tcuthbert/turku/turku-api updated
68. By Thomas Cuthbert

Refactor the metrics based on findings with prometheus querying.

Junien Fridrick (axino) wrote :

+1

review: Approve
Canonical IS Mergebot (canonical-is-mergebot) wrote :

Change successfully merged at revision 67

Preview Diff

=== modified file 'scripts/turku_sick_sources'
--- scripts/turku_sick_sources 2021-02-04 05:08:58 +0000
+++ scripts/turku_sick_sources 2021-02-04 07:23:51 +0000
@@ -19,7 +19,7 @@
 import os
 import sys
 
-from time import gmtime, mktime, time
+from time import mktime
 
 BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 sys.path.append(BASE_DIR)
@@ -52,6 +52,10 @@
         sick_sources.out
     """
 
+    # Telegraf doesn't like boolean metrics so use this primitive enum-like object
+    # for readability.
+    _health_enum = {0: "no", 1: "yes"}
+
     def run(self, options):
         sources_total = 0
         sources_sick = 0
@@ -63,45 +67,47 @@
         ):
             sources_total += 1
             if not source.healthy():
-                continue
-
-            sources_sick += 1
-            # Rather than mixing types, initialise timestamp as epoch.
-            timestamp = mktime(gmtime(0))
+                sources_sick += 1
+            # Rather than mixing types, initialise timestamp as epoch at default precision.
+            date_last_backed_up = 0.0
             if source.date_last_backed_up is not None:
-                timestamp = mktime(source.date_last_backed_up.timetuple())
-            source = (source.machine.unit_name, source.name, timestamp)
+                date_last_backed_up = mktime(source.date_last_backed_up.timetuple())
+            source = (
+                source.machine.unit_name,
+                source.name,
+                self._health_enum[int(source.healthy())],
+                date_last_backed_up,
+            )
             sick_objects.append(source)
 
         if sources_sick == 0:
             sys.stdout.write("Nothing to do, all turku sources are healthy\n")
             return
 
-        timestamp = int(time() * 1000)  # milliseconds
         totals = (
-            "{measurement} sources_unhealthy={sources_sick},sources_total={sources_total} "
-            "{timestamp}\n"
+            "{measurement} sources_unhealthy={sources_sick},sources_total={sources_total}\n"
         ).format(
             measurement=MEASUREMENT_NAME,
             sources_sick=sources_sick,
             sources_total=sources_total,
-            timestamp=timestamp,
         )
         data.append(totals)
 
         sys.stdout.write(totals)
 
         for source in sick_objects:
-            machine_unit_name, name, date_last_backed_up = source
+            machine_unit_name, name, health, date_last_backed_up = source
+            # The metrics we are exporting is the unix timestamp of date last backed up. Include a healthy label
+            # for concise query filtering.
             metric = (
-                "{measurement},source_machine_name={machine_unit_name},source_name={name},"
-                "date_last_backed_up={date_last_backed_up} unhealthy=1 {timestamp}\n"
+                "{measurement},source_machine_name={machine_unit_name},source_name={name},healthy={health} "
+                "date_last_backed_up={date_last_backed_up}\n"
             ).format(
                 measurement=MEASUREMENT_NAME,
-                date_last_backed_up=date_last_backed_up,
                 machine_unit_name=machine_unit_name,
                 name=name,
-                timestamp=timestamp,
+                health=health,
+                date_last_backed_up=date_last_backed_up,
             )
             data.append(metric)
 
