Merge ~hyask/autopkgtest-cloud:skia/fix_metrics into autopkgtest-cloud:master

Proposed by Skia
Status: Merged
Merged at revision: 356ddd15a55b1f6ad657da8bf3ef6e57b9631728
Proposed branch: ~hyask/autopkgtest-cloud:skia/fix_metrics
Merge into: autopkgtest-cloud:master
Diff against target: 58 lines (+13/-2)
1 file modified
charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/metrics (+13/-2)
Reviewer Review Type Date Requested Status
Tim Andersson Approve
Review via email: mp+463596@code.launchpad.net

Description of the change

Two small fixes for issues I've seen on LXD worker.

To post a comment you must log in.
Revision history for this message
Tim Andersson (andersson123) wrote :

in my mp, I amended the script to not run as tuples so I'm confused by your most recent commit. The other commit looks good to me know, although CI has also failed.

review: Needs Fixing
Revision history for this message
Tim Andersson (andersson123) wrote :

but I understand what you mean r.e. the tuple now. Accessing with an index isn't great and I should've caught that, my bad. I'll approve this once CI passes, LGTM!

review: Needs Fixing
Revision history for this message
Skia (hyask) wrote :

CI failure doesn't look related to my changes.

Revision history for this message
Tim Andersson (andersson123) wrote :

CI failure seems to be a tempfail - I've seen it in other recent MP's, but not all. I'll approve this, and let's keep our eyes out for this tempfail again.

Revision history for this message
Tim Andersson (andersson123) :
review: Approve
Revision history for this message
Tim Andersson (andersson123) :
review: Approve
Revision history for this message
Tim Andersson (andersson123) wrote :

I think we can just say measurements sent without having the whole measurements dict in journalctl. This service runs quite often so that'll be a lot of logspam. But this is just my opinion. Otherwise, looks good to me!

review: Needs Fixing
Revision history for this message
Skia (hyask) wrote :

I agree, this would make a lot of noise, that's why I set those logs as `debug`, while the log level is `info`. This just makes it easy to cowboy the debug level when in need, but theses lines are not active on the default version of that script :-)

https://docs.python.org/3/library/logging.html#levels

Revision history for this message
Skia (hyask) wrote :

Here is an example run in debug mode:
```
systemd[1]: Starting Collect & submit metrics...
metrics[3587355]: INFO:__main__:Collecting units metrics
metrics[3587355]: DEBUG:__main__:measurements: [{'measurement': 'autopkgtest_unit_status', 'fields': {'count': 39}, 'tags': {'arch': 'armhf', 'status': 'active', 'instance': 'production'}}, {'measurement': 'autopkgtest_unit_status', 'fields': {'count': 54}, 'tags': {'arch': 'armhf', 'status': 'error', 'instance': 'production'}}]
metrics[3587355]: INFO:__main__:Collecting remotes metrics
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.145.243.125:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.145.243.137:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.145.243.233:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.145.243.237:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.145.243.59:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.44.124.111:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.44.124.162:'])
metrics[3587355]: WARNING:__main__:remote error: CalledProcessError(1, ['lxc', 'list', 'lxd-armhf-10.44.124.24:'])
metrics[3587355]: DEBUG:__main__:measurements: [{'measurement': 'autopkgtest_cluster_status', 'fields': {'count': 0}, 'tags': {'arch': 'armhf', 'status': 'active', 'instance': 'production'}}, {'measurement': 'autopkgtest_cluster_status', 'fields': {'count': 0}, 'tags': {'arch': 'armhf', 'status': 'error', 'instance': 'production'}}]
metrics[3587355]: DEBUG:__main__:measurements: [{'measurement': 'autopkgtest_lxd_status', 'fields': {'count': 24}, 'tags': {'arch': 'armhf', 'status': 'active', 'instance': 'production'}}, {'measurement': 'autopkgtest_lxd_status', 'fields': {'count': 8}, 'tags': {'arch': 'armhf', 'status': 'error', 'instance': 'production'}}]
metrics[3587355]: DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): ubuntu-release-kpi-influx.internal:8086
metrics[3587355]: DEBUG:urllib3.connectionpool:http://ubuntu-release-kpi-influx.internal:8086 "POST /write?db=metrics HTTP/1.1" 204 0
systemd[1]: metrics.service: Succeeded.
systemd[1]: Finished Collect & submit metrics.
```
Default level is even less verbose, so I think we're good, even if this run every 5 minutes.

Revision history for this message
Tim Andersson (andersson123) wrote :

ah I missed that, mb, LGTM

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/metrics b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/metrics
2index 90ca1a3..28c5aa7 100755
3--- a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/metrics
4+++ b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/metrics
5@@ -1,6 +1,7 @@
6 #!/usr/bin/python3
7
8 import json
9+import logging
10 import os
11 import subprocess
12
13@@ -18,6 +19,9 @@ INFLUXDB_PASSWORD = os.environ["INFLUXDB_PASSWORD"]
14 INFLUXDB_PORT = os.environ["INFLUXDB_PORT"]
15 INFLUXDB_USERNAME = os.environ["INFLUXDB_USERNAME"]
16
17+logger = logging.getLogger(__name__)
18+logging.basicConfig(level=logging.INFO)
19+
20
21 def make_submission(counts, measurement):
22 out = []
23@@ -43,10 +47,12 @@ def make_submission(counts, measurement):
24 },
25 }
26 out.append(m)
27+ logger.debug("measurements sent: %s", out)
28 return out
29
30
31 def get_units():
32+ logger.info("Collecting units")
33 counts = {}
34
35 (units,) = SYSTEM_BUS.call_sync(
36@@ -111,6 +117,7 @@ def get_list_of_intended_remote_ips(arch):
37
38
39 def get_remotes():
40+ logger.info("Collecting remotes")
41 cluster_counts = {}
42 noncluster_counts = {}
43 out = subprocess.check_output(
44@@ -155,8 +162,12 @@ def get_remotes():
45 )
46 noncluster_counts[arch][0] += 1
47 functional_ips.append(r.replace(prefix, ""))
48- except subprocess.CalledProcessError:
49- noncluster_counts[arch][0] += 1
50+ except (
51+ subprocess.CalledProcessError,
52+ subprocess.TimeoutExpired,
53+ ) as e:
54+ logger.warning("remote error: %s", repr(e))
55+ noncluster_counts[arch][1] += 1
56
57 cluster_status = make_submission(
58 cluster_counts, "autopkgtest_cluster_status"

Subscribers

People subscribed via source and target branches