Merge ~jfguedez/charm-telegraf:feature/intel-cmt-cat into charm-telegraf:master

Proposed by Jose Guedez
Status: Merged
Approved by: Xav Paice
Approved revision: ed52df526595684011a7df9946549e34519ce560
Merged at revision: fb9446ae1a98b8b25478dfb0048a07ed0ec59bda
Proposed branch: ~jfguedez/charm-telegraf:feature/intel-cmt-cat
Merge into: charm-telegraf:master
Diff against target: 1061 lines (+854/-28)
6 files modified
src/config.yaml (+18/-1)
src/reactive/telegraf.py (+199/-23)
src/templates/base_inputs.conf (+7/-0)
src/templates/dashboards/grafana/IntelRDT.json.j2 (+508/-0)
src/templates/sudoers/telegraf_intel_rdt.tmpl (+4/-0)
src/tests/unit/test_telegraf.py (+118/-4)
Reviewer                   Review Type             Date Requested  Status
🤖 prod-jenkaas-bootstack  continuous-integration  -               Approve
Celia Wang                 -                       -               Approve
Joe Guo (community)        -                       -               Needs Fixing
Junien F                   -                       -               Approve
Edin S (community)         -                       -               Approve
Canonical IS Reviewers     -                       -               Pending
Review via email: mp+405064@code.launchpad.net

Commit message

Add support for Memory Bandwidth Monitoring (Intel RDT)

🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Jose Guedez (jfguedez) wrote :

It seems that the CI might have some issues. I don't think it has had a successful build yet. The failing tests are unrelated to the changes afaict.

When I run the unit tests locally there are no failures before/after the change fwiw - https://pastebin.canonical.com/p/hQ8wMrQ5WV/

James Troup (elmo) wrote :

LGTM, one minor comment inline.

Jose Guedez (jfguedez) wrote :

@James Troup. Thanks, I replied inline and will be adding the comment back

🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Edin S (exsdev) wrote :

LGTM

review: Approve
Jose Guedez (jfguedez) wrote :
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Junien F (axino) wrote :

See below about the sudoers file - thanks !

review: Needs Fixing
Joe Guo (guoqiao) wrote :

For the function `check_valid_intel_rdt_configuration`, a few issues:

1) To report issues, it uses both exceptions and string messages; maybe just use one way. I would prefer exceptions.

2) It returns an empty str (falsy) for the OK case, which may be misleading or misused.

3) For the kernel version comparison, I noticed [0] there are versions like `5.13`?

Can we also use `fetch.apt_pkg.version_compare` to do this?

[0]: https://en.wikipedia.org/wiki/Linux_kernel_version_history

review: Needs Fixing
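
A minimal sketch of the exception-only style suggested in points 1) and 2). The names `InvalidConfigurationError`, `check_configuration` and `describe_status` are hypothetical and not taken from the charm; the list of problems stands in for the real checks.

```python
class InvalidConfigurationError(Exception):
    """Hypothetical error type; the name is illustrative only."""


def check_configuration(problems):
    """Raise on the first problem found; return None when everything is fine.

    `problems` is a plain list of strings standing in for real checks.
    """
    if problems:
        raise InvalidConfigurationError(problems[0])


def describe_status(problems):
    """Caller side: failures arrive as exceptions, never as sentinel strings."""
    try:
        check_configuration(problems)
    except InvalidConfigurationError as error:
        return "blocked: {}".format(error)
    return "active"
```

With this shape the caller never has to remember that an empty string means success.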
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Jose Guedez (jfguedez) wrote :

@axino:

Thanks, in this case the plugin executes the sudo command only once when the telegraf service starts. However, I did add the extra commands to the sudoers file to avoid logging the command. Please take a look again.

🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Jose Guedez (jfguedez) wrote :

@guoqiao

Thanks, please see comments inline. Addressed in the latest push.

> for function `check_valid_intel_rdt_configuration`, a few issues:
>
> 1) to report issues, it used both exception and string message, maybe just use
> one way. I will prefer exception.
>
> 2) it returns empty str (false) as ok, which maybe misleading or be misused.
>

I had originally wanted to use the exception as a separate mechanism, but I can see how it would be confusing. Definitely agree with the empty string, so I switched it to use exceptions.

> 3) for kernel version compare, I noticed[0] there is version like `5.13`?
>
> Can we also use the `fetch.apt_pkg.version_compare` to do this ?
>
> [0]: https://en.wikipedia.org/wiki/Linux_kernel_version_history

According to [0], there is always a third number. However, it seems that at least in Ubuntu it's always zero, so it has no meaning, as it doesn't match the third number from upstream. You can see the full table of Ubuntu/upstream versions here [1]; they all seem to have the third number. The first two numbers (major, minor) always match the kernel version, so I changed the validation to use only those (e.g. 5.4), which should be enough for our purposes here.

As to using the apt_pkg version for this, you could have multiple kernels installed, some good, some bad (for example in bionic you need the HWE kernel) so it is more reliable to use the version of the running kernel.

I believe the comments/changes address the issues you brought up. Please take a look again, thanks.

[0] https://ubuntu.com/kernel
[1] https://people.canonical.com/~kernel/info/kernel-version-map.html
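
A minimal sketch of the (major, minor) check described above, following the same idea as the merged diff (a regex on the running kernel's release string and a tuple comparison). The function names are illustrative, not the charm's.

```python
import platform
import re

MINIMUM_KERNEL_VERSION = (5, 4)  # same threshold as in the merged diff


def parse_kernel_version(release):
    """Return (major, minor) from a release string like '5.4.0-73-generic'."""
    match = re.match(r"^(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unexpected kernel release: {!r}".format(release))
    return tuple(int(part) for part in match.groups())


def kernel_is_supported(release=None):
    """Check the running kernel (not the installed packages) against the minimum."""
    if release is None:
        release = platform.release()
    return parse_kernel_version(release) >= MINIMUM_KERNEL_VERSION
```

Comparing integer tuples orders 5.13 after 5.4, which a plain string comparison of "5.13" and "5.4" would get wrong.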

Junien F (axino) wrote :

Thanks for the sudoers change !

review: Approve
Joe Guo (guoqiao) wrote :

Hi Jose,

Thanks for the quick change, another small question:

In the doc [0], the required minimum pqos version is given as `4.0.0`.
But here in the code we are using `RDT_MINIMUM_PKG_VERSION = "4.1-1ppa3"`.

I understand we have to use a PPA to backport for bionic, but as I understand it, would it be more generic and reliable to use `4.0.0` here?

[0]: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/intel_rdt/README.md
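
To make the question concrete, here is a simplified pure-Python re-implementation of Debian version ordering, for illustration only. The charm itself calls `fetch.apt_pkg.version_compare`; the helper names below are hypothetical and the sketch is not a replacement for python-apt.

```python
import re


def _char_order(ch):
    """Debian ordering: '~' sorts first, then letters, then everything else."""
    if ch == "~":
        return -1
    if ch.isalpha():
        return ord(ch)
    return ord(ch) + 256


def _compare_part(a, b):
    """Compare two upstream or revision strings; return -1, 0 or 1."""
    while a or b:
        # Compare the leading non-digit runs character by character.
        na = re.match(r"\D*", a).group()
        nb = re.match(r"\D*", b).group()
        a, b = a[len(na):], b[len(nb):]
        for i in range(max(len(na), len(nb))):
            ca = _char_order(na[i]) if i < len(na) else 0
            cb = _char_order(nb[i]) if i < len(nb) else 0
            if ca != cb:
                return -1 if ca < cb else 1
        # Compare the leading digit runs numerically (empty counts as 0).
        da = re.match(r"\d*", a).group()
        db = re.match(r"\d*", b).group()
        a, b = a[len(da):], b[len(db):]
        ia, ib = int(da or 0), int(db or 0)
        if ia != ib:
            return -1 if ia < ib else 1
    return 0


def _split(version):
    """Split '[epoch:]upstream[-revision]' into its three parts."""
    epoch = "0"
    if ":" in version:
        epoch, version = version.split(":", 1)
    if "-" in version:
        upstream, revision = version.rsplit("-", 1)
    else:
        upstream, revision = version, "0"
    return int(epoch), upstream, revision


def compare_versions(a, b):
    """Return a negative, zero or positive number, like a three-way compare."""
    epoch_a, upstream_a, revision_a = _split(a)
    epoch_b, upstream_b, revision_b = _split(b)
    if epoch_a != epoch_b:
        return -1 if epoch_a < epoch_b else 1
    return _compare_part(upstream_a, upstream_b) or _compare_part(
        revision_a, revision_b
    )
```

Under these rules `4.0.0` sorts below `4.1-1ppa3`, so a `4.0.0` threshold would accept more package versions than the constant in the patch.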

Joe Guo (guoqiao) wrote :

Jose has explained the version issue in chat. +1.

review: Approve
Joe Guo (guoqiao) wrote :

+1, but worth noting:

According to the doc [0], so far telegraf cannot stop the rdt plugin with sudo=true.

Two potential solutions are suggested there.

Until the final solution on the telegraf side is released, we may need to provide a workaround for the charm to work.

[0]: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/intel_rdt/README.md

review: Approve
Joe Guo (guoqiao) wrote :

Hi Jose:

I am doing some testing with this patch in an lxd container (for the conditions-unmet case), and I noticed that `kernel.modprobe` raises an exception: https://pastebin.canonical.com/p/xYBJXsSqZH/

Instead of creating a new patch, could you please apply the following change to your code and re-push, so we can keep the review history here?

diff --git a/src/reactive/telegraf.py b/src/reactive/telegraf.py
index 0b1e71d..8bfd801 100644
--- a/src/reactive/telegraf.py
+++ b/src/reactive/telegraf.py
@@ -807,7 +807,14 @@ def configure_telegraf(): # noqa: C901
     if config["collect_intel_rdt_metrics"]:
         hookenv.log("Intel RDT enabled, enabling module and running checks")
         # load and persist the required module
-        kernel.modprobe(RDT_KERNEL_MODULE_NAME, persist=True)
+        try:
+            kernel.modprobe(RDT_KERNEL_MODULE_NAME, persist=True)
+        except subprocess.CalledProcessError:
+            error_msg = "modprobe {} failed".format(RDT_KERNEL_MODULE_NAME)
+            hookenv.log(error_msg, level=hookenv.ERROR)
+            hookenv.status_set("blocked", error_msg)
+            return
+
         try:
             check_valid_intel_rdt_configuration()
         except InvalidIntelRDTConfiguration as e:

review: Needs Fixing
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Joe Guo (guoqiao) wrote :

@jfguedez CI failed and there is an unresolved merge conflict in the code.

review: Needs Fixing
Joe Guo (guoqiao) wrote :

Re: the `modprobe msr` failure in lxc/lxd, I am able to reproduce it with:

lxc launch ubuntu:20.04 ubuntu
lxc exec ubuntu -- bash
root@ubuntu:~# modprobe msr
modprobe: FATAL: Module msr not found in directory /lib/modules/5.8.0-59-generic
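
Since the failure above is specific to containers, a charm can refuse early when it detects a container and also catch a failed `modprobe`; the merged diff does both. A minimal sketch under stated assumptions: `prepare_msr_module`, `set_status` and `runner` are hypothetical names, and `modprobe` is called directly through `subprocess` instead of the charmhelpers `kernel.modprobe` wrapper used by the charm.

```python
import subprocess

CONTAINER_MESSAGE = "Intel RDT can not be enabled in container"


def prepare_msr_module(in_container, set_status, runner=subprocess.check_call):
    """Return True when the msr module was loaded, False when blocked.

    `set_status(state, message)` stands in for the charm's status call and
    `runner` is injectable so the failure path can be exercised without root.
    """
    if in_container:
        # Modules cannot be loaded from inside lxc/lxd, so stop early.
        set_status("blocked", CONTAINER_MESSAGE)
        return False
    try:
        runner(["modprobe", "msr"])
    except subprocess.CalledProcessError:
        set_status("blocked", "modprobe msr failed")
        return False
    return True
```

Checking for a container first keeps the status message specific, while the `except` branch still covers hosts where the module is simply missing.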

Celia Wang (ziyiwang) :
review: Needs Fixing
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Joe Guo (guoqiao) wrote :

new changes pushed:

1) rebased against master to resolve conflicts.
2) block charm if collect_intel_rdt_metrics enabled but `is_container` returns true
3) rename rdt option `sudo` to `use_sudo`

for 3), the purpose is to be consistent with the existing plugins.
upstream patch: https://github.com/influxdata/telegraf/pull/9501

new ppa built with above patch:

ppa:guoqiao/telegraf or https://launchpad.net/~guoqiao/+archive/ubuntu/telegraf

to use ppa:

juju config telegraf install_sources="[ppa:guoqiao/telegraf, ppa:canonical-bootstack/public]"

New review appreciated !
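
For point 3), a rough sketch of how the plugin section ends up being rendered with `use_sudo`. It mirrors the approach in the diff (core list taken from the `On-line CPU(s) list:` field of `lscpu --json`), but uses plain string formatting instead of the charm's Jinja template. The function names and the sample JSON are hypothetical, and the exact `lscpu` field layout may differ between util-linux versions.

```python
import json


def online_cpu_list(lscpu_json_text):
    """Extract the on-line CPU range (e.g. '0-23') from `lscpu --json` output."""
    for entry in json.loads(lscpu_json_text)["lscpu"]:
        if entry["field"] == "On-line CPU(s) list:":
            return entry["data"]
    raise ValueError("no on-line CPU list found in lscpu output")


def render_intel_rdt_section(cpu_list):
    """Build the telegraf input section that monitors all detected cores."""
    lines = [
        "[[inputs.intel_rdt]]",
        'cores = ["{}"]'.format(cpu_list),
        "use_sudo = true",
    ]
    return "\n".join(lines)


# Hypothetical sample input, trimmed to two fields.
SAMPLE_LSCPU = json.dumps(
    {
        "lscpu": [
            {"field": "Architecture:", "data": "x86_64"},
            {"field": "On-line CPU(s) list:", "data": "0-23"},
        ]
    }
)

section = render_intel_rdt_section(online_cpu_list(SAMPLE_LSCPU))
```

Passing the whole on-line range as a single entry matches the charm's current behaviour of monitoring all detected cores.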

🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Joe Guo (guoqiao) wrote :

Unit tests work on my local machine but fail on CI.

I have triggered another CI job on master to see how it works:

https://code.launchpad.net/~guoqiao/charm-telegraf/+git/charm-telegraf/+merge/405790

Celia Wang (ziyiwang) wrote :

lgtm

review: Approve
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision fb9446ae1a98b8b25478dfb0048a07ed0ec59bda

Preview Diff

1diff --git a/src/config.yaml b/src/config.yaml
2index e4f6f04..eee3fbe 100644
3--- a/src/config.yaml
4+++ b/src/config.yaml
5@@ -233,4 +233,21 @@ options:
6 description: >
7 Enable the collection of IPMI sensor metrics, using the ipmi sensor telegraf
8 input plugin. Collecting these metrics requires sudo access - enabling
9- this option will install an appropriate, locked-down sudoers file.
10\ No newline at end of file
11+ this option will install an appropriate, locked-down sudoers file.
12+ collect_intel_rdt_metrics:
13+ default: false
14+ type: boolean
15+ description: >
16+ Enable the collection of Intel memory bandwidth metrics, using the
17+ telegraf intel_rdt input plugin. Collecting these metrics requires sudo
18+ access - enabling this option will install an appropriate, locked-down
19+ sudoers file.
20+ .
21+ There are certain requisites to run this plugin, including having a
22+ kernel >= v5.4, Intel RDT tools >= v4.1 available in a repository, and
23+ a supported CPU (as reported by the Intel utility `pqos`)
24+ .
25+ Currently the charm will configure monitoring of all detected cores.
26+ .
27+ See https://github.com/influxdata/telegraf/blob/master/plugins/inputs/intel_rdt/README.md
28+ for info on the telegraf intel_rdt plugin.
29diff --git a/src/reactive/telegraf.py b/src/reactive/telegraf.py
30index ef89a47..7a19d16 100644
31--- a/src/reactive/telegraf.py
32+++ b/src/reactive/telegraf.py
33@@ -22,6 +22,7 @@ import io
34 import ipaddress
35 import json
36 import os
37+import platform
38 import re
39 import socket
40 import subprocess
41@@ -29,9 +30,9 @@ import sys
42 import time
43 from distutils.version import LooseVersion
44
45-from charmhelpers import context
46+from charmhelpers import context, fetch
47 from charmhelpers.contrib.charmsupport import nrpe
48-from charmhelpers.core import hookenv, host, unitdata
49+from charmhelpers.core import hookenv, host, kernel, unitdata
50 from charmhelpers.core.host import is_container
51 from charmhelpers.core.templating import render
52
53@@ -67,9 +68,18 @@ CONFIG_FILE = "telegraf.conf"
54
55 CONFIG_DIR = "telegraf.d"
56
57-GRAFANA_DASHBOARD_TELEGRAF_FILE_NAME = "Telegraf.json.j2"
58-
59-GRAFANA_DASHBOARD_NAME = "telegraf"
60+GRAFANA_DASHBOARD_CONFIG = {
61+ "telegraf": {
62+ "template_file": "Telegraf.json.j2",
63+ "context_vars": {
64+ # TODO: Figure out if metrics exist and then set bools accordingly.
65+ # For now, setting bools to true.
66+ "bonds_enabled": True,
67+ "bcache_enabled": True,
68+ "conntrack_enabled": True,
69+ },
70+ },
71+}
72
73 SNAP_SERVICE = "snap.telegraf.telegraf"
74 DEB_SERVICE = "telegraf"
75@@ -83,6 +93,11 @@ DEB_USER = "telegraf"
76
77 # Utilities #
78
79+# constants related to RDT metrics support
80+RDT_MINIMUM_KERNEL_VERSION = (5, 4)
81+RDT_MINIMUM_PKG_VERSION = "4.1-1ppa3"
82+RDT_KERNEL_MODULE_NAME = "msr"
83+
84
85 class InvalidInstallMethodError(Exception):
86 pass
87@@ -92,6 +107,10 @@ class InvalidPrometheusIPRangeError(Exception):
88 pass
89
90
91+class InvalidIntelRDTConfigurationError(Exception):
92+ pass
93+
94+
95 def write_telegraf_file(path, content):
96 return host.write_file(
97 path,
98@@ -283,6 +302,96 @@ def get_remote_unit_name():
99 return rel["__unit__"]
100
101
102+def check_valid_intel_rdt_configuration():
103+ """
104+ Check that the requirements for RDT are met.
105+
106+ Will raise a InvalidIntelRDTConfigurationError exception when a validation
107+ issue is encountered, otherwise None
108+ """
109+ # check that we meet the minimum kernel version
110+ linux_release = platform.release() # format is like '5.4.0-73-generic'
111+ re_kernel_version = r"^(\d+)\.(\d+)"
112+ match = re.match(re_kernel_version, linux_release)
113+
114+ if match:
115+ current_kernel_version = tuple(int(d) for d in match.groups())
116+ if current_kernel_version < RDT_MINIMUM_KERNEL_VERSION:
117+ raise InvalidIntelRDTConfigurationError(
118+ "unsupported kernel version: {}, need version higher than {}".format(
119+ current_kernel_version, RDT_MINIMUM_KERNEL_VERSION
120+ )
121+ )
122+ else:
123+ raise InvalidIntelRDTConfigurationError(
124+ "Incompatible platform.release output: {}".format(linux_release)
125+ )
126+
127+ # check that package `intel-cmt-cat` is installed
128+ current_pkg_version = fetch.get_installed_version("intel-cmt-cat")
129+ if not current_pkg_version:
130+ raise InvalidIntelRDTConfigurationError(
131+ "package 'intel-cmt-cat' is not installed yet"
132+ )
133+
134+ current_pkg_version_str = current_pkg_version["ver_str"]
135+
136+ # check that package `intel-cmt-cat` is recent enough
137+ if (
138+ fetch.apt_pkg.version_compare(current_pkg_version_str, RDT_MINIMUM_PKG_VERSION)
139+ < 0 # noqa: W503
140+ ):
141+ base_error_msg = "package 'intel-cmt-cat' is older than required"
142+ raise InvalidIntelRDTConfigurationError(
143+ "{}: '{}' (installed '{}')".format(
144+ base_error_msg, RDT_MINIMUM_PKG_VERSION, current_pkg_version_str
145+ )
146+ )
147+
148+ # check that the required module is loaded
149+ if not kernel.is_module_loaded(RDT_KERNEL_MODULE_NAME):
150+ raise InvalidIntelRDTConfigurationError(
151+ "required module '{}' is not loaded".format(RDT_KERNEL_MODULE_NAME)
152+ )
153+
154+ # check that the `pqos` utility reports no issues
155+ # this performs a sanity check on the RDT utility configuration
156+ command = ["sudo", "pqos", "-d"]
157+ try:
158+ subprocess.check_call(command)
159+ # this performs a sanity check on the RDT utility configuration
160+ except subprocess.CalledProcessError as error:
161+ hookenv.log(
162+ "pqos -d call failed:\n{}".format(error.output.decode("utf8")),
163+ level=hookenv.ERROR,
164+ )
165+ raise InvalidIntelRDTConfigurationError("pqos -d failed, see logs for details")
166+
167+ return None
168+
169+
170+def get_cpu_cores():
171+ """Get the list of available cores for the cpu(s)."""
172+ # should return something like ["0-23"]
173+ command = ["lscpu", "--json"]
174+ try:
175+ lscpu_output = subprocess.check_output(command).decode("utf8")
176+ except subprocess.CalledProcessError as error:
177+ hookenv.log(
178+ "lscpu call failed:\n{}".format(error.output.decode("utf8")),
179+ level=hookenv.ERROR,
180+ )
181+ raise error
182+
183+ lscpu_json = json.loads(lscpu_output)
184+
185+ for data_pair in lscpu_json["lscpu"]:
186+ if data_pair["field"] == "On-line CPU(s) list:":
187+ return '["{}"]'.format(data_pair["data"])
188+
189+ raise Exception("Incompatible lscpu output: {}".format(lscpu_output))
190+
191+
192 def get_disabled_plugins():
193 """Return consolidated list of all plugins to be disabled."""
194 config = hookenv.config()
195@@ -324,6 +433,13 @@ def get_base_inputs():
196 ipmi_sensor = config["collect_ipmi_sensor_metrics"]
197 disabled_plugins = get_disabled_plugins()
198
199+ # handle the Intel RDT collection parameters
200+ intel_rdt = config["collect_intel_rdt_metrics"]
201+ if intel_rdt:
202+ intel_rdt_cores = get_cpu_cores()
203+ else:
204+ intel_rdt_cores = None
205+
206 return {
207 "extra_options": extra_options["inputs"],
208 "bcache": is_bcache(),
209@@ -334,6 +450,8 @@ def get_base_inputs():
210 "iptables": iptables,
211 "smart": smart,
212 "ipmi_sensor": ipmi_sensor,
213+ "intel_rdt": intel_rdt,
214+ "intel_rdt_cores": intel_rdt_cores,
215 }
216
217
218@@ -694,6 +812,38 @@ def configure_telegraf(): # noqa: C901
219 else:
220 remove_sudoers_file(sudoers_filename)
221
222+ # handle the configuration of intel_rdt
223+ sudoers_filename = "telegraf_intel_rdt"
224+ if config["collect_intel_rdt_metrics"]:
225+ hookenv.log("Intel RDT enabled, enabling module and running checks")
226+
227+ if is_container():
228+ error_msg = "Intel RDT can not be enabled in container"
229+ hookenv.log(error_msg, level=hookenv.WARNING)
230+ hookenv.status_set("blocked", error_msg)
231+ return
232+
233+ # load and persist the required module
234+ try:
235+ kernel.modprobe(RDT_KERNEL_MODULE_NAME, persist=True)
236+ except subprocess.CalledProcessError:
237+ error_msg = "modprobe {} failed".format(RDT_KERNEL_MODULE_NAME)
238+ hookenv.log(error_msg, level=hookenv.ERROR)
239+ hookenv.status_set("blocked", error_msg)
240+ return
241+
242+ try:
243+ check_valid_intel_rdt_configuration()
244+ except InvalidIntelRDTConfigurationError as e:
245+ # on error we abort configuration and block the charm
246+ error_msg = "Cannot configure Intel RDT: {}".format(e)
247+ hookenv.log(error_msg, level=hookenv.ERROR)
248+ hookenv.status_set("blocked", error_msg)
249+ return
250+ render_sudoers_file(sudoers_filename)
251+ else:
252+ remove_sudoers_file(sudoers_filename)
253+
254 telegraf_exec_metrics = os.path.join(get_files_dir(), "telegraf_exec_metrics.py")
255 cmd = [
256 telegraf_exec_metrics,
257@@ -720,7 +870,12 @@ def configure_telegraf(): # noqa: C901
258 for service in [DEB_SERVICE, SNAP_SERVICE]:
259 if service == get_service():
260 host.service_resume(service)
261- host.service_reload(service)
262+ # skip reload when Intel RDT is enabled, as it stops the plugin from
263+ # publishing data. The service will be restarted via the flag
264+ # "telegraf.needs_reload" on changes later
265+ if not config["collect_intel_rdt_metrics"]:
266+ hookenv.log("reloading service: {}".format(service), level="DEBUG")
267+ host.service_reload(service)
268 else:
269 try:
270 host.service_pause(service)
271@@ -846,6 +1001,13 @@ def handle_config_changes():
272 ):
273 clear_flag("plugins.prometheus-client.configured")
274 clear_flag("prometheus-client.relation.configured")
275+
276+ # handle the Intel RDT/MBM metrics collection
277+ if config.get("collect_intel_rdt_metrics"):
278+ set_flag("telegraf.intel_rdt.enabled")
279+ else:
280+ clear_flag("telegraf.intel_rdt.enabled")
281+
282 clear_flag("telegraf.configured")
283 clear_flag("telegraf.apt.configured")
284 clear_flag("telegraf.snap.configured")
285@@ -1556,32 +1718,40 @@ def prometheus_client_departed():
286 )
287 @when_not("grafana.configured")
288 def register_grafana_dashboard():
289+ config = hookenv.config()
290 grafana = endpoint_from_flag("endpoint.dashboards.joined")
291- hookenv.log("Loading grafana dashboard", level=hookenv.DEBUG)
292- dashboard = _load_grafana_dashboard()
293- digest = hashlib.md5(dashboard.encode("utf8")).hexdigest()
294- dashboard_dict = json.loads(dashboard)
295- dashboard_dict["digest"] = digest
296- hookenv.log(
297- "Rendered dashboard dict:\n{}".format(dashboard_dict), level=hookenv.DEBUG
298- )
299- grafana.register_dashboard(name=GRAFANA_DASHBOARD_NAME, dashboard=dashboard_dict)
300- hookenv.log('Grafana dashboard "{}" registered.'.format(GRAFANA_DASHBOARD_NAME))
301+ grafana_dashboard_config = GRAFANA_DASHBOARD_CONFIG.copy()
302+
303+ # if RDT is enabled inject the relevant dashboard config
304+ if config["collect_intel_rdt_metrics"]:
305+ grafana_dashboard_config["Intel RDT"] = {"template_file": "IntelRDT.json.j2"}
306+
307+ # process all the configured dashboards
308+ for dashboard_name, dashboard_data in grafana_dashboard_config.items():
309+ hookenv.log(
310+ "Loading grafana dashboard: {}".format(dashboard_name), level=hookenv.DEBUG
311+ )
312+ dashboard = _load_grafana_dashboard(dashboard_data)
313+ digest = hashlib.md5(dashboard.encode("utf8")).hexdigest()
314+ dashboard_dict = json.loads(dashboard)
315+ dashboard_dict["digest"] = digest
316+ hookenv.log(
317+ "Rendered dashboard dict:\n{}".format(dashboard_dict), level=hookenv.DEBUG
318+ )
319+ grafana.register_dashboard(name=dashboard_name, dashboard=dashboard_dict)
320+ hookenv.log('Grafana dashboard "{}" registered.'.format(dashboard_name))
321+
322 set_flag("grafana.configured")
323
324
325-def _load_grafana_dashboard():
326+def _load_grafana_dashboard(dashboard_data):
327 prometheus_datasource = "{} - Juju generated source".format(
328 hookenv.config().get("prometheus_datasource", "prometheus")
329 )
330 dashboard_context = dict(datasource=prometheus_datasource)
331- # TODO: Figure out if metrics exist and then set bools accordingly.
332- # For now, setting bools to true.
333- dashboard_context["bonds_enabled"] = True
334- dashboard_context["bcache_enabled"] = True
335- dashboard_context["conntrack_enabled"] = True
336+ dashboard_context.update(dashboard_data.get("context_vars", {}))
337 return render_custom(
338- source=GRAFANA_DASHBOARD_TELEGRAF_FILE_NAME,
339+ source=dashboard_data["template_file"],
340 render_context=dashboard_context,
341 variable_start_string="<<",
342 variable_end_string=">>",
343@@ -1731,3 +1901,9 @@ def configure_nagios(nagios):
344 @when_not("apt.nvme-cli.installed")
345 def install_smart_metrics_packages():
346 apt.queue_install(["smartmontools", "nvme-cli"])
347+
348+
349+@when("telegraf.intel_rdt.enabled")
350+@when_not("apt.installed.intel-cmt-cat")
351+def install_intel_rdt_packages():
352+ apt.queue_install(["intel-cmt-cat"])
353diff --git a/src/templates/base_inputs.conf b/src/templates/base_inputs.conf
354index 7f42543..8feea45 100644
355--- a/src/templates/base_inputs.conf
356+++ b/src/templates/base_inputs.conf
357@@ -170,6 +170,13 @@ use_sudo = true
358 {%- endif %}
359 {% endif %}
360
361+{% if "intel_rdt" not in disabled_plugins %}
362+{% if intel_rdt -%}
363+[[inputs.intel_rdt]]
364+cores = {{ intel_rdt_cores }}
365+use_sudo = true
366+{%- endif %}
367+{% endif %}
368
369 [[inputs.exec]]
370 commands = [
371diff --git a/src/templates/dashboards/grafana/IntelRDT.json.j2 b/src/templates/dashboards/grafana/IntelRDT.json.j2
372new file mode 100644
373index 0000000..5a915c4
374--- /dev/null
375+++ b/src/templates/dashboards/grafana/IntelRDT.json.j2
376@@ -0,0 +1,508 @@
377+{
378+ "annotations": {
379+ "list": [
380+ {
381+ "builtIn": 1,
382+ "datasource": "-- Grafana --",
383+ "enable": true,
384+ "hide": true,
385+ "iconColor": "rgba(0, 211, 255, 1)",
386+ "name": "Annotations & Alerts",
387+ "type": "dashboard"
388+ }
389+ ]
390+ },
391+ "editable": true,
392+ "gnetId": null,
393+ "graphTooltip": 0,
394+ "id": null,
395+ "iteration": 1625126969668,
396+ "links": [],
397+ "panels": [
398+ {
399+ "collapsed": false,
400+ "datasource": null,
401+ "gridPos": {
402+ "h": 1,
403+ "w": 24,
404+ "x": 0,
405+ "y": 0
406+ },
407+ "id": 8,
408+ "panels": [],
409+ "title": "Memory Bandwidth",
410+ "type": "row"
411+ },
412+ {
413+ "datasource": "<< datasource >>",
414+ "fieldConfig": {
415+ "defaults": {
416+ "color": {
417+ "mode": "palette-classic"
418+ },
419+ "custom": {
420+ "axisLabel": "MB/s",
421+ "axisPlacement": "auto",
422+ "barAlignment": 0,
423+ "drawStyle": "line",
424+ "fillOpacity": 0,
425+ "gradientMode": "none",
426+ "hideFrom": {
427+ "legend": false,
428+ "tooltip": false,
429+ "viz": false
430+ },
431+ "lineInterpolation": "linear",
432+ "lineWidth": 1,
433+ "pointSize": 5,
434+ "scaleDistribution": {
435+ "type": "linear"
436+ },
437+ "showPoints": "auto",
438+ "spanNulls": false,
439+ "stacking": {
440+ "group": "A",
441+ "mode": "none"
442+ },
443+ "thresholdsStyle": {
444+ "mode": "off"
445+ }
446+ },
447+ "mappings": [],
448+ "thresholds": {
449+ "mode": "absolute",
450+ "steps": [
451+ {
452+ "color": "green",
453+ "value": null
454+ },
455+ {
456+ "color": "red",
457+ "value": 80
458+ }
459+ ]
460+ }
461+ },
462+ "overrides": []
463+ },
464+ "gridPos": {
465+ "h": 11,
466+ "w": 24,
467+ "x": 0,
468+ "y": 1
469+ },
470+ "id": 2,
471+ "options": {
472+ "legend": {
473+ "calcs": [],
474+ "displayMode": "list",
475+ "placement": "bottom"
476+ },
477+ "tooltip": {
478+ "mode": "single"
479+ }
480+ },
481+ "targets": [
482+ {
483+ "exemplar": true,
484+ "expr": "{name=\"MBL\"}",
485+ "interval": "",
486+ "legendFormat": "{{ host }}",
487+ "queryType": "randomWalk",
488+ "refId": "Memory Bandwidth"
489+ }
490+ ],
491+ "title": "MBL",
492+ "type": "timeseries"
493+ },
494+ {
495+ "datasource": "<< datasource >>",
496+ "fieldConfig": {
497+ "defaults": {
498+ "color": {
499+ "mode": "palette-classic"
500+ },
501+ "custom": {
502+ "axisLabel": "MB/s",
503+ "axisPlacement": "auto",
504+ "barAlignment": 0,
505+ "drawStyle": "line",
506+ "fillOpacity": 0,
507+ "gradientMode": "none",
508+ "hideFrom": {
509+ "legend": false,
510+ "tooltip": false,
511+ "viz": false
512+ },
513+ "lineInterpolation": "linear",
514+ "lineWidth": 1,
515+ "pointSize": 5,
516+ "scaleDistribution": {
517+ "type": "linear"
518+ },
519+ "showPoints": "auto",
520+ "spanNulls": false,
521+ "stacking": {
522+ "group": "A",
523+ "mode": "none"
524+ },
525+ "thresholdsStyle": {
526+ "mode": "off"
527+ }
528+ },
529+ "mappings": [],
530+ "thresholds": {
531+ "mode": "absolute",
532+ "steps": [
533+ {
534+ "color": "green",
535+ "value": null
536+ },
537+ {
538+ "color": "red",
539+ "value": 80
540+ }
541+ ]
542+ }
543+ },
544+ "overrides": []
545+ },
546+ "gridPos": {
547+ "h": 11,
548+ "w": 24,
549+ "x": 0,
550+ "y": 12
551+ },
552+ "id": 11,
553+ "options": {
554+ "legend": {
555+ "calcs": [],
556+ "displayMode": "list",
557+ "placement": "bottom"
558+ },
559+ "tooltip": {
560+ "mode": "single"
561+ }
562+ },
563+ "targets": [
564+ {
565+ "exemplar": true,
566+ "expr": "{name=\"MBR\"}",
567+ "interval": "",
568+ "legendFormat": "{{ host }}",
569+ "queryType": "randomWalk",
570+ "refId": "Memory Bandwidth"
571+ }
572+ ],
573+ "title": "MBR",
574+ "type": "timeseries"
575+ },
576+ {
577+ "datasource": "<< datasource >>",
578+ "fieldConfig": {
579+ "defaults": {
580+ "color": {
581+ "mode": "palette-classic"
582+ },
583+ "custom": {
584+ "axisLabel": "MB/s",
585+ "axisPlacement": "auto",
586+ "barAlignment": 0,
587+ "drawStyle": "line",
588+ "fillOpacity": 0,
589+ "gradientMode": "none",
590+ "hideFrom": {
591+ "legend": false,
592+ "tooltip": false,
593+ "viz": false
594+ },
595+ "lineInterpolation": "linear",
596+ "lineWidth": 1,
597+ "pointSize": 5,
598+ "scaleDistribution": {
599+ "type": "linear"
600+ },
601+ "showPoints": "auto",
602+ "spanNulls": false,
603+ "stacking": {
604+ "group": "A",
605+ "mode": "none"
606+ },
607+ "thresholdsStyle": {
608+ "mode": "off"
609+ }
610+ },
611+ "mappings": [],
612+ "thresholds": {
613+ "mode": "absolute",
614+ "steps": [
615+ {
616+ "color": "green",
617+ "value": null
618+ },
619+ {
620+ "color": "red",
621+ "value": 80
622+ }
623+ ]
624+ }
625+ },
626+ "overrides": []
627+ },
628+ "gridPos": {
629+ "h": 11,
630+ "w": 24,
631+ "x": 0,
632+ "y": 23
633+ },
634+ "id": 10,
635+ "options": {
636+ "legend": {
637+ "calcs": [],
638+ "displayMode": "list",
639+ "placement": "bottom"
640+ },
641+ "tooltip": {
642+ "mode": "single"
643+ }
644+ },
645+ "targets": [
646+ {
647+ "exemplar": true,
648+ "expr": "{name=\"MBT\"}",
649+ "interval": "",
650+ "legendFormat": "{{ host }}",
651+ "queryType": "randomWalk",
652+ "refId": "Memory Bandwidth"
653+ }
654+ ],
655+ "title": "MBT",
656+ "type": "timeseries"
657+ },
658+ {
659+ "collapsed": true,
660+ "datasource": null,
661+ "gridPos": {
662+ "h": 1,
663+ "w": 24,
664+ "x": 0,
665+ "y": 34
666+ },
667+ "id": 6,
668+ "panels": [
669+ {
670+ "datasource": "<< datasource >>",
671+ "fieldConfig": {
672+ "defaults": {
673+ "color": {
674+ "mode": "palette-classic"
675+ },
676+ "custom": {
677+ "axisLabel": "",
678+ "axisPlacement": "auto",
679+ "barAlignment": 0,
680+ "drawStyle": "line",
681+ "fillOpacity": 0,
682+ "gradientMode": "none",
683+ "hideFrom": {
684+ "legend": false,
685+ "tooltip": false,
686+ "viz": false
687+ },
688+ "lineInterpolation": "linear",
689+ "lineWidth": 1,
690+ "pointSize": 5,
691+ "scaleDistribution": {
692+ "type": "linear"
693+ },
694+ "showPoints": "auto",
695+ "spanNulls": false,
696+ "stacking": {
697+ "group": "A",
698+ "mode": "none"
699+ },
700+ "thresholdsStyle": {
701+ "mode": "off"
702+ }
703+ },
704+ "mappings": [],
705+ "thresholds": {
706+ "mode": "absolute",
707+ "steps": [
708+ {
709+ "color": "green",
710+ "value": null
711+ },
712+ {
713+ "color": "red",
714+ "value": 80
715+ }
716+ ]
717+ },
718+ "unit": "deckbytes"
719+ },
720+ "overrides": []
721+ },
722+ "gridPos": {
723+ "h": 11,
724+ "w": 24,
725+ "x": 0,
726+ "y": 2
727+ },
728+ "id": 4,
729+ "options": {
730+ "legend": {
731+ "calcs": [],
732+ "displayMode": "list",
733+ "placement": "bottom"
734+ },
735+ "tooltip": {
736+ "mode": "single"
737+ }
738+ },
739+ "targets": [
740+ {
741+ "exemplar": true,
742+ "expr": "rdt_metric{name=\"LLC\", host=\"$host\"}",
743+ "interval": "",
744+ "legendFormat": "{{ host }}",
745+ "queryType": "randomWalk",
746+ "refId": "A"
747+ }
748+ ],
749+ "title": "LLC",
750+ "type": "timeseries"
751+ },
752+ {
753+ "datasource": "<< datasource >>",
754+ "fieldConfig": {
755+ "defaults": {
756+ "color": {
757+ "mode": "palette-classic"
758+ },
759+ "custom": {
760+ "axisLabel": "",
761+ "axisPlacement": "auto",
762+ "barAlignment": 0,
763+ "drawStyle": "line",
764+ "fillOpacity": 0,
765+ "gradientMode": "none",
766+ "hideFrom": {
767+ "legend": false,
768+ "tooltip": false,
769+ "viz": false
770+ },
771+ "lineInterpolation": "linear",
772+ "lineWidth": 1,
773+ "pointSize": 5,
774+ "scaleDistribution": {
775+ "type": "linear"
776+ },
777+ "showPoints": "auto",
778+ "spanNulls": false,
779+ "stacking": {
780+ "group": "A",
781+ "mode": "none"
782+ },
783+ "thresholdsStyle": {
784+ "mode": "off"
785+ }
786+ },
787+ "mappings": [],
788+ "thresholds": {
789+ "mode": "absolute",
790+ "steps": [
791+ {
792+ "color": "green",
793+ "value": null
794+ },
795+ {
796+ "color": "red",
797+ "value": 80
798+ }
799+ ]
800+ },
801+ "unit": "short"
802+ },
803+ "overrides": []
804+ },
805+ "gridPos": {
806+ "h": 11,
807+ "w": 24,
808+ "x": 0,
809+ "y": 13
810+ },
811+ "id": 9,
812+ "options": {
813+ "legend": {
814+ "calcs": [],
815+ "displayMode": "list",
816+ "placement": "bottom"
817+ },
818+ "tooltip": {
819+ "mode": "single"
820+ }
821+ },
822+ "targets": [
823+ {
824+ "exemplar": true,
825+ "expr": "rdt_metric{name=\"LLC_Misses\", host=\"$host\"}",
826+ "interval": "",
827+ "legendFormat": "{{ host }}",
828+ "queryType": "randomWalk",
829+ "refId": "A"
830+ }
831+ ],
832+ "title": "LLC Misses",
833+ "type": "timeseries"
834+ }
835+ ],
836+ "title": "Cache Occupancy",
837+ "type": "row"
838+ }
839+ ],
840+ "refresh": "",
841+ "schemaVersion": 30,
842+ "style": "dark",
843+ "tags": [],
844+ "templating": {
845+ "list": [
846+ {
847+ "allValue": null,
848+ "current": {
849+ "selected": false,
850+ "text": "controller:ubuntu-1",
851+ "value": "controller:ubuntu-1"
852+ },
853+ "datasource": "<< datasource >>",
854+ "definition": "label_values(host)",
855+ "description": null,
856+ "error": null,
857+ "hide": 0,
858+ "includeAll": false,
859+ "label": null,
860+ "multi": true,
861+ "name": "host",
862+ "options": [],
863+ "query": {
864+ "query": "label_values(host)",
865+ "refId": "StandardVariableQuery"
866+ },
867+ "refresh": 1,
868+ "regex": "",
869+ "skipUrlSync": false,
870+ "sort": 0,
871+ "type": "query"
872+ }
873+ ]
874+ },
875+ "time": {
876+ "from": "now-6h",
877+ "to": "now"
878+ },
879+ "timepicker": {},
880+ "timezone": "utc",
881+ "title": "Intel RDT - Memory Bandwidth Monitoring",
882+ "uid": "GWblKcRnd",
883+ "version": 5
884+}
885diff --git a/src/templates/sudoers/telegraf_intel_rdt.tmpl b/src/templates/sudoers/telegraf_intel_rdt.tmpl
886new file mode 100644
887index 0000000..0324a77
888--- /dev/null
889+++ b/src/templates/sudoers/telegraf_intel_rdt.tmpl
890@@ -0,0 +1,4 @@
891+Cmnd_Alias PQOS = /usr/sbin/pqos -r --iface-os --mon-file-type=csv --mon-interval=*
892+{{ telegraf_user }} ALL=(root) NOPASSWD: PQOS
893+Defaults!PQOS !logfile, !syslog, !pam_session
894+
895diff --git a/src/tests/unit/test_telegraf.py b/src/tests/unit/test_telegraf.py
896index 4178a34..251a9b9 100644
897--- a/src/tests/unit/test_telegraf.py
898+++ b/src/tests/unit/test_telegraf.py
899@@ -20,6 +20,7 @@ import getpass
900 import grp
901 import json
902 import os
903+import platform
904 import shutil
905 import subprocess
906 import sys
907@@ -27,9 +28,11 @@ from textwrap import dedent
908 from unittest import mock
909 from unittest.mock import MagicMock, call, patch
910
911-from charmhelpers.core import host
912+from charmhelpers import fetch
913+from charmhelpers.core import host, kernel
914 from charmhelpers.core.hookenv import Config
915 from charmhelpers.core.templating import render
916+from charmhelpers.fetch import apt_pkg
917
918 import charms
919 from charms.reactive import RelationBase, bus, helpers, set_flag
920@@ -1465,7 +1468,9 @@ class TestGrafanaDashboard:
921 mock_render,
922 ):
923 expected_datasource = "my_prometheus"
924- fake_config = dict(prometheus_datasource=expected_datasource)
925+ fake_config = dict(
926+ prometheus_datasource=expected_datasource, collect_intel_rdt_metrics=False
927+ )
928 expected_dashboard_context = dict(
929 datasource="{} - Juju generated source".format(expected_datasource),
930 bonds_enabled=True,
931@@ -1484,15 +1489,19 @@ class TestGrafanaDashboard:
932 mock_render.return_value = mock_rendered_content
933
934 telegraf.register_grafana_dashboard()
935+ dashboard_name = "telegraf"
936+ dashboard_filename = telegraf.GRAFANA_DASHBOARD_CONFIG[dashboard_name][
937+ "template_file"
938+ ]
939
940 mock_render.assert_called_once_with(
941- source=telegraf.GRAFANA_DASHBOARD_TELEGRAF_FILE_NAME,
942+ source=dashboard_filename,
943 render_context=expected_dashboard_context,
944 variable_start_string="<<",
945 variable_end_string=">>",
946 )
947 mock_grafana.register_dashboard.assert_called_once_with(
948- name=telegraf.GRAFANA_DASHBOARD_NAME, dashboard=mock_dashboard_dict
949+ name=dashboard_name, dashboard=mock_dashboard_dict
950 )
951 mock_set_flag.assert_called_once_with("grafana.configured")
952
953@@ -1617,3 +1626,108 @@ def test_collect_ipmi_sensor_metrics(monkeypatch, config):
954 """
955 config_file = base_dir().join("telegraf.conf")
956 assert expected in config_file.read()
957+
958+
959+def test_collect_intel_rdt_metrics(monkeypatch, config):
960+ monkeypatch.setattr(telegraf, "is_container", lambda: False)
961+ config["collect_intel_rdt_metrics"] = True
962+ monkeypatch.setattr(telegraf, "get_cpu_cores", lambda: '["0-23"]')
963+ monkeypatch.setattr(telegraf, "check_valid_intel_rdt_configuration", lambda: "")
964+ monkeypatch.setattr(kernel, "modprobe", lambda module, persist: None)
965+ telegraf.configure_telegraf()
966+
967+ expected = """
968+[[inputs.intel_rdt]]
969+cores = ["0-23"]
970+use_sudo = true
971+"""
972+ config_file = base_dir().join("telegraf.conf")
973+ assert expected in config_file.read()
974+
975+
976+def test_get_cpu_cores(monkeypatch):
977+ lscpu_output = """
978+{
979+ "lscpu": [
980+ {"field": "Architecture:", "data": "x86_64"},
981+ {"field": "On-line CPU(s) list:", "data": "0-23"}
982+ ]
983+}
984+""".encode(
985+ "utf8"
986+ )
987+ monkeypatch.setattr(subprocess, "check_output", lambda cmd: lscpu_output)
988+ cores = telegraf.get_cpu_cores()
989+ assert cores == '["0-23"]'
990+
991+
992+def test_check_valid_intel_rdt_configuration_kernel_version(monkeypatch):
993+ monkeypatch.setattr(telegraf, "is_container", lambda: False)
994+ monkeypatch.setattr(platform, "release", lambda: "4.4.0-73-generic")
995+ with pytest.raises(
996+ telegraf.InvalidIntelRDTConfigurationError, match="unsupported kernel version"
997+ ):
998+ telegraf.check_valid_intel_rdt_configuration()
999+
1000+
1001+def test_check_valid_intel_rdt_configuration_pkg_present(monkeypatch):
1002+ monkeypatch.setattr(telegraf, "is_container", lambda: False)
1003+ monkeypatch.setattr(platform, "release", lambda: "5.4.0-73-generic")
1004+ monkeypatch.setattr(fetch, "get_installed_version", lambda pkg: None)
1005+ with pytest.raises(
1006+ telegraf.InvalidIntelRDTConfigurationError,
1007+ match="package 'intel-cmt-cat' is not installed yet",
1008+ ):
1009+ telegraf.check_valid_intel_rdt_configuration()
1010+
1011+
1012+def test_check_valid_intel_rdt_configuration_pkg_version(monkeypatch):
1013+ monkeypatch.setattr(telegraf, "is_container", lambda: False)
1014+ monkeypatch.setattr(platform, "release", lambda: "5.4.0-73-generic")
1015+ monkeypatch.setattr(fetch, "get_installed_version", lambda pkg: {"ver_str": "0.0"})
1016+ monkeypatch.setattr(apt_pkg, "version_compare", lambda a, b: -1)
1017+ with pytest.raises(
1018+ telegraf.InvalidIntelRDTConfigurationError,
1019+ match="package 'intel-cmt-cat' is older than required",
1020+ ):
1021+ telegraf.check_valid_intel_rdt_configuration()
1022+
1023+
1024+def test_check_valid_intel_rdt_configuration_kernel_module(monkeypatch):
1025+ monkeypatch.setattr(telegraf, "is_container", lambda: False)
1026+ monkeypatch.setattr(platform, "release", lambda: "5.4.0-73-generic")
1027+ monkeypatch.setattr(kernel, "is_module_loaded", lambda module: False)
1028+ monkeypatch.setattr(
1029+ fetch,
1030+ "get_installed_version",
1031+ lambda pkg: {"ver_str": telegraf.RDT_MINIMUM_PKG_VERSION},
1032+ )
1033+ monkeypatch.setattr(apt_pkg, "version_compare", lambda a, b: 0)
1034+ with pytest.raises(
1035+ telegraf.InvalidIntelRDTConfigurationError,
1036+ match="required module",
1037+ ):
1038+ telegraf.check_valid_intel_rdt_configuration()
1039+
1040+
1041+def test_check_valid_intel_rdt_configuration_pqos(monkeypatch):
1042+ def mock_check_call(*args, **kwargs):
1043+ raise subprocess.CalledProcessError(
1044+ cmd="fake", returncode=1, output="fail".encode("utf8")
1045+ )
1046+
1047+ monkeypatch.setattr(telegraf, "is_container", lambda: False)
1048+ monkeypatch.setattr(platform, "release", lambda: "5.4.0-73-generic")
1049+ monkeypatch.setattr(kernel, "is_module_loaded", lambda module: True)
1050+ monkeypatch.setattr(
1051+ fetch,
1052+ "get_installed_version",
1053+ lambda pkg: {"ver_str": telegraf.RDT_MINIMUM_PKG_VERSION},
1054+ )
1055+ monkeypatch.setattr(apt_pkg, "version_compare", lambda a, b: 0)
1056+ monkeypatch.setattr(subprocess, "check_call", mock_check_call)
1057+ with pytest.raises(
1058+ telegraf.InvalidIntelRDTConfigurationError,
1059+ match="pqos -d failed",
1060+ ):
1061+ telegraf.check_valid_intel_rdt_configuration()
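
Note on test_get_cpu_cores: the test feeds JSON shaped like `lscpu --json` output and expects the string '["0-23"]' back, which is the value rendered into `cores = ...` in the [[inputs.intel_rdt]] section. A minimal sketch of parsing that would satisfy this contract is shown below. The function name cores_from_lscpu_json is hypothetical and used only for illustration; the actual get_cpu_cores in src/reactive/telegraf.py may be implemented differently.

```python
import json


def cores_from_lscpu_json(raw: bytes) -> str:
    """Hypothetical helper, not the charm's code.

    Takes raw bytes shaped like the lscpu JSON used in the test and
    returns the on-line CPU list as a JSON array string.
    """
    data = json.loads(raw.decode("utf8"))
    for entry in data["lscpu"]:
        # The field label matches the one used in the test fixture.
        if entry["field"] == "On-line CPU(s) list:":
            # Wrap the range in a list so it renders as ["0-23"].
            return json.dumps([entry["data"]])
    raise ValueError("no on-line CPU list found in lscpu output")


sample = b"""
{
  "lscpu": [
    {"field": "Architecture:", "data": "x86_64"},
    {"field": "On-line CPU(s) list:", "data": "0-23"}
  ]
}
"""
print(cores_from_lscpu_json(sample))
```

Raising on a missing field, rather than returning an empty string, is a choice made for this sketch only; the diff does not show how the charm handles that case.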
