Merge ~lcvcode/charm-nagios:LP1939068-duplicates into charm-nagios:master

Proposed by Connor Chamberlain
Status: Merged
Merged at revision: 67fe2db9f283fb4fdf584578597e8520d566b92d
Proposed branch: ~lcvcode/charm-nagios:LP1939068-duplicates
Merge into: charm-nagios:master
Prerequisite: ~lcvcode/charm-nagios:LP1939068
Diff against target: 153 lines (+51/-16)
1 file modified
hooks/monitors_relation_changed.py (+51/-16)
Reviewer                               Review Type              Date Requested   Status
Paul Goins                                                                       Needs Fixing
🤖 prod-jenkaas-bootstack (community)  continuous-integration                    Approve
Zachary Zehring (community)                                                      Needs Fixing
Drew Freiberger (community)                                                      Needs Fixing
James Troup                                                                      Pending
Review via email: mp+406873@code.launchpad.net

Commit message

Force uniqueness in cases of host_name duplication

Description of the change

This change ensures that overlapping host names in Nagios are made unique so that parent-child relationships are never ambiguous.
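
For reference, here is a minimal standalone sketch of the renaming scheme implemented in the diff below; the helper and variable names are illustrative only:

    from collections import defaultdict

    seen_hostnames = set()
    duplicate_counts = defaultdict(int)

    def dedupe_hostname(hostname):
        """Keep the first occurrence as-is; suffix later ones with "-[<n>]"."""
        if hostname not in seen_hostnames:
            seen_hostnames.add(hostname)
            return hostname
        duplicate_counts[hostname] += 1
        return "{}-[{}]".format(hostname, duplicate_counts[hostname])

    # dedupe_hostname("juju-ubuntu-0")  -> "juju-ubuntu-0"
    # dedupe_hostname("juju-ubuntu-0")  -> "juju-ubuntu-0-[1]"
    # dedupe_hostname("juju-ubuntu-0")  -> "juju-ubuntu-0-[2]"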

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Unable to determine commit message from repository - please click "Set commit message" and enter the commit message manually.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

FAILED: Continuous integration, rev:b4f0578daeeb92bdbcd0d51cc21262822a023cfc

No commit message was specified in the merge proposal. Click on the following link and set the commit message (if you want jenkins to rebuild you need to trigger it yourself):
https://code.launchpad.net/~lcvcode/charm-nagios/+git/charm-nagios/+merge/406873/+edit-commit-message

https://jenkins.canonical.com/bootstack/job/lp-charm-nagios-ci/6/
Executed test runs:
    SUCCESS: https://jenkins.canonical.com/bootstack/job/lp-charm-test-functest/42/
    None: https://jenkins.canonical.com/bootstack/job/lp-update-mp/407/

Click here to trigger a rebuild:
https://jenkins.canonical.com/bootstack/job/lp-charm-nagios-ci/6//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Drew Freiberger (afreiberger) wrote :

Please see comments inline.

review: Needs Fixing
Revision history for this message
Zachary Zehring (zzehring) wrote :

Added some comments.

review: Needs Fixing
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
Paul Goins (vultaire) wrote :

I'm +1 for the change, as it fills a gap in the case where nagios_host_context is specified with the same value on multiple models.

However, I believe it is superior to use distinct nagios_host_contexts for each model, and to include something in the host context to identify the model from which the record came.

For example, if you have keystone deployed to both an OpenStack and a K8s model, and if hostnames are autogenerated as keystone-0, keystone-1, etc., then including the model name in the host context would generate hostnames like <prefix>-openstack-keystone-0 and <prefix>-kubernetes-keystone-0. That would give more semantic meaning to the hostnames and would cause fewer problems with event history in cases where things get added or removed.
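
To illustrate the naming scheme (the "bootstack" prefix and model names below are made up; this is only a sketch, not charm code):

    # Hypothetical model-aware host contexts: identical unit names in different
    # models produce distinct Nagios host names instead of colliding.
    prefix = "bootstack"
    for model, unit_host in [("openstack", "keystone-0"), ("kubernetes", "keystone-0")]:
        host_context = "{}-{}".format(prefix, model)
        print("{}-{}".format(host_context, unit_host))
    # -> bootstack-openstack-keystone-0
    # -> bootstack-kubernetes-keystone-0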

review: Approve
Revision history for this message
Paul Goins (vultaire) wrote :

During a 1:1, I remembered: I had concerns over how the status was being set here, and whether it might be clobbered elsewhere in the charm. I very much support this MR as a stopgap to avoid having services clobber the wrong host in Nagios, especially if we have a message indicating that duplicate hostnames have been found, as that may guide users toward reconfiguring nagios_host_context to something more appropriate and avoid the collisions.

Temporarily marking as abstain; intending to check this tomorrow.

review: Abstain
Revision history for this message
Paul Goins (vultaire) wrote :

status_set is only really called in two places: once on charm upgrade, and once in the main() used by essentially all other hooks. With that in mind, this is probably good enough in my opinion. I'm +1.

review: Approve
Revision history for this message
Paul Goins (vultaire) wrote :

Just FYI: added a warning log message as part of a new MR which builds upon this one as a prerequisite: https://code.launchpad.net/~vultaire/charm-nagios/+git/charm-nagios/+merge/407277

To be clear: this MR still needs one more +1 before merge.

Revision history for this message
Paul Goins (vultaire) wrote :

Sorry for the thrash here.

I've tried to deploy this on serverstack, with multiple related models and nrpe related to apps on top-level nodes and to LXDs, and I'm seeing these issues:

* For a hostname collision at the parent level, I do see that I get records for both hosts; one has a suffix appended. However, the host group names are not consistent. For example, host juju-host-0 is associated with host group juju-host, while host juju-host-0-[1] is associated with host group juju-host-0. This is potentially confusing, especially since the latter host group name matches the former host, which isn't even part of that host group.

* If I have an app of the same name installed onto containers in two different models, the hosts collide and I only see one of them in Nagios.

This is with the following deployment: https://pastebin.ubuntu.com/p/26hb2ptYHt/
(Summary: nagios on one model, plus 2 other models, each with cs:ubuntu deployed as host and LXD, with nrpe related to each.)

This was done on serverstack (i.e. OpenStack), not a MAAS- or LXD-based controller.

I'm not sure what exactly in the code needs fixing yet, but I'm -1'ing for now.

review: Needs Fixing

Preview Diff

diff --git a/hooks/monitors_relation_changed.py b/hooks/monitors_relation_changed.py
index 9297cdf..441fe4e 100755
--- a/hooks/monitors_relation_changed.py
+++ b/hooks/monitors_relation_changed.py
@@ -28,6 +28,7 @@ from charmhelpers.core.hookenv import (
     related_units,
     relation_get,
     relation_ids,
+    status_set,
 )

 from common import (
@@ -41,8 +42,10 @@ from common import (

 import yaml

-REQUIRED_REL_DATA_KEYS = ["target-address", "monitors", "target-id"]
+MACHINE_ID_KEY = "machine_id"
 MODEL_ID_KEY = "model_id"
+TARGET_ID_KEY = "target-id"
+REQUIRED_REL_DATA_KEYS = ["target-address", "monitors", TARGET_ID_KEY]


 def _prepare_relation_data(unit, rid):
@@ -58,7 +61,7 @@ def _prepare_relation_data(unit, rid):

     if rid.split(":")[0] == "nagios":
         # Fake it for the more generic 'nagios' relation
-        relation_data["target-id"] = unit.replace("/", "-")
+        relation_data[TARGET_ID_KEY] = unit.replace("/", "-")
         relation_data["monitors"] = {"monitors": {"remote": {}}}

     if not relation_data.get("target-address"):
@@ -101,7 +104,7 @@ def main(argv):  # noqa: C901
     #

     if len(argv) > 1:
-        relation_settings = {"monitors": open(argv[1]).read(), "target-id": argv[2]}
+        relation_settings = {"monitors": open(argv[1]).read(), TARGET_ID_KEY: argv[2]}

         if len(argv) > 3:
             relation_settings["target-address"] = argv[3]
@@ -114,13 +117,13 @@ def main(argv):  # noqa: C901

     for relid, units in all_relations.iteritems():
         for unit, relation_settings in units.items():
-            if "target-id" in relation_settings:
-                targets_with_addresses.add(relation_settings["target-id"])
+            if TARGET_ID_KEY in relation_settings:
+                targets_with_addresses.add(relation_settings.get(TARGET_ID_KEY))
     new_all_relations = {}

     for relid, units in all_relations.iteritems():
         for unit, relation_settings in units.items():
-            if relation_settings["target-id"] in targets_with_addresses:
+            if relation_settings.get(TARGET_ID_KEY) in targets_with_addresses:
                 if relid not in new_all_relations:
                     new_all_relations[relid] = {}
                 new_all_relations[relid][unit] = relation_settings
@@ -129,23 +132,55 @@ def main(argv):  # noqa: C901

     initialize_inprogress_config()

+    uniq_hostnames = set()
+    duplicate_hostnames = defaultdict(int)
+
+    def record_hostname(hostname):
+        if hostname not in uniq_hostnames:
+            uniq_hostnames.add(hostname)
+        else:
+            duplicate_hostnames[hostname] += 1
+
     # make a dict of machine ids to target-id hostnames
     all_hosts = {}
     for relid, units in all_relations.items():
         for unit, relation_settings in units.iteritems():
-            machine_id = relation_settings.get("machine_id", None)
-            model_id = relation_settings.get(MODEL_ID_KEY, None)
+            machine_id = relation_settings.get(MACHINE_ID_KEY)
+            model_id = relation_settings.get(MODEL_ID_KEY)
+            target_id = relation_settings.get(TARGET_ID_KEY)

-            if not machine_id:
+            if not machine_id or not target_id:
                 continue

+            # Check for duplicate hostnames and amend them if needed
+            record_hostname(target_id)
+            if target_id in duplicate_hostnames:
+                # Duplicate hostname has been found
+                # Append "-[<number of duplicates>]" hostname
+                # Example:
+                # 1st hostname of "juju-ubuntu-0" is unchanged
+                # 2nd hostname of "juju-ubuntu-0" is changed to "juju-ubuntu-0-[1]"
+                # 3rd hostname of "juju-ubuntu-0" is changed to "juju-ubuntu-0-[2]"
+                target_id += "-[{}]".format(duplicate_hostnames[target_id])
+                relation_settings[TARGET_ID_KEY] = target_id
+
             # Backwards compatible hostname from machine id
             if not model_id:
-                all_hosts[machine_id] = relation_settings["target-id"]
+                all_hosts[machine_id] = target_id
             # New hostname from machine id using model id
             else:
                 all_hosts.setdefault(model_id, {})
-                all_hosts[model_id][machine_id] = relation_settings["target-id"]
+                all_hosts[model_id][machine_id] = target_id
+
+    if duplicate_hostnames:
+        status_set(
+            "active",
+            "Duplicate host names detected: {}".format(
+                ", ".join(duplicate_hostnames.keys())
+            ),
+        )
+    else:
+        status_set("active", "ready")

     for relid, units in all_relations.items():
         apply_relation_config(relid, units, all_hosts)
@@ -158,11 +193,11 @@ def main(argv):  # noqa: C901
 def apply_relation_config(relid, units, all_hosts):  # noqa: C901
     for unit, relation_settings in units.iteritems():
         monitors = relation_settings["monitors"]
-        target_id = relation_settings["target-id"]
-        machine_id = relation_settings.get("machine_id", None)
+        target_id = relation_settings[TARGET_ID_KEY]
+        machine_id = relation_settings.get(MACHINE_ID_KEY)
         parent_host = None

-        model_id = relation_settings.get(MODEL_ID_KEY, None)
+        model_id = relation_settings.get(MODEL_ID_KEY)

         if machine_id:

@@ -173,7 +208,7 @@ def apply_relation_config(relid, units, all_hosts):  # noqa: C901
                 # Get hostname using model id
                 if model_id:
                     model_hosts = all_hosts.get(model_id, {})
-                    parent_host = model_hosts.get(parent_machine, None)
+                    parent_host = model_hosts.get(parent_machine)

                 # Get hostname without model id
                 # this conserves backwards compatibility with older
@@ -184,7 +219,7 @@ def apply_relation_config(relid, units, all_hosts):  # noqa: C901
         # If not set, we don't mess with it, as multiple services may feed
         # monitors in for a particular address. Generally a primary will set
         # this to its own private-address
-        target_address = relation_settings.get("target-address", None)
+        target_address = relation_settings.get("target-address")

         if type(monitors) != dict:
             monitors = yaml.safe_load(monitors)
