Merge ~lcvcode/charm-nagios:LP1939068-duplicates into charm-nagios:master

Proposed by Connor Chamberlain
Status: Merged
Merged at revision: 67fe2db9f283fb4fdf584578597e8520d566b92d
Proposed branch: ~lcvcode/charm-nagios:LP1939068-duplicates
Merge into: charm-nagios:master
Prerequisite: ~lcvcode/charm-nagios:LP1939068
Diff against target: 153 lines (+51/-16)
1 file modified
hooks/monitors_relation_changed.py (+51/-16)
Reviewer Review Type Date Requested Status
Paul Goins Needs Fixing
🤖 prod-jenkaas-bootstack (community) continuous-integration Approve
Zachary Zehring (community) Needs Fixing
Drew Freiberger (community) Needs Fixing
James Troup Pending
Review via email: mp+406873@code.launchpad.net

Commit message

Force uniqueness in cases of host_name duplication

Description of the change

This change ensures that overlapping host names in nagios are made unique so that parent-child relationships are never ambiguous.
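The approach can be summarized as a standalone sketch (Python 3 for illustration, while the charm itself targets Python 2; `dedupe_hostnames` is a hypothetical helper, not a function from the charm, though the `-[<n>]` suffix convention matches the diff below):

```python
from collections import defaultdict

def dedupe_hostnames(hostnames):
    """Keep the first occurrence of each hostname as-is; append a
    "-[<n>]" suffix to each repeat so no two results collide."""
    seen = set()
    duplicates = defaultdict(int)
    result = []
    for name in hostnames:
        if name in seen:
            # nth duplicate of this name gets suffix "-[n]"
            duplicates[name] += 1
            result.append("{}-[{}]".format(name, duplicates[name]))
        else:
            seen.add(name)
            result.append(name)
    return result
```

For example, three units all reporting `juju-ubuntu-0` come out as `juju-ubuntu-0`, `juju-ubuntu-0-[1]`, and `juju-ubuntu-0-[2]`.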

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Unable to determine commit message from repository - please click "Set commit message" and enter the commit message manually.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

FAILED: Continuous integration, rev:b4f0578daeeb92bdbcd0d51cc21262822a023cfc

No commit message was specified in the merge proposal. Click on the following link and set the commit message (if you want jenkins to rebuild you need to trigger it yourself):
https://code.launchpad.net/~lcvcode/charm-nagios/+git/charm-nagios/+merge/406873/+edit-commit-message

https://jenkins.canonical.com/bootstack/job/lp-charm-nagios-ci/6/
Executed test runs:
    SUCCESS: https://jenkins.canonical.com/bootstack/job/lp-charm-test-functest/42/
    None: https://jenkins.canonical.com/bootstack/job/lp-update-mp/407/

Click here to trigger a rebuild:
https://jenkins.canonical.com/bootstack/job/lp-charm-nagios-ci/6//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Drew Freiberger (afreiberger) wrote :

Please see comments inline.

review: Needs Fixing
Revision history for this message
Zachary Zehring (zzehring) wrote :

Added some comments.

review: Needs Fixing
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
Paul Goins (vultaire) wrote :

I'm +1 for the change, as it fills a gap in the case where nagios_host_context is specified with the same value on multiple models.

However, I believe it is superior to use distinct nagios_host_contexts for each model, and to include something in the host context to identify the model from which the record came.

For example, if you have keystone deployed to both an OpenStack and a K8s model, and if hostnames are autogenerated as e.g. keystone-0, keystone-1, etc., then including the model name in the host context would generate hostnames like <prefix>-openstack-keystone-0 and <prefix>-kubernetes-keystone-0. That would give more semantic meaning to the hostnames and would cause fewer problems with event history when things get added or removed.
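Paul's alternative could be sketched as follows (the helper and its parameter names are illustrative, not charm API; Python 3 for brevity):

```python
def contextual_hostname(host_context, model, unit_name):
    """Build a hostname that is unique per model by construction:
    <host_context>-<model>-<unit>, with the Juju "/" turned into "-"."""
    return "{}-{}-{}".format(host_context, model, unit_name.replace("/", "-"))
```

With this scheme, two keystone/0 units in different models never collide, so no dedup suffix is needed in the first place.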

review: Approve
Revision history for this message
Paul Goins (vultaire) wrote :

During a 1:1, I remembered: I had concerns over how the status was being set here, and whether it might be clobbered elsewhere in the charm. I very much support this MR as a stopgap to avoid having services clobber the wrong host in Nagios, especially if we have a message indicating that duplicate hostnames have been found, as that may guide users toward reconfiguring nagios_host_context to something more appropriate to avoid the collisions.

Temporarily marking as abstain; intending to check this tomorrow.

review: Abstain
Revision history for this message
Paul Goins (vultaire) wrote :

status_set is really only called in two places: once on charm upgrade, and once in the main() used by basically all other hooks. With that in mind, this is probably good enough in my opinion. I'm +1.

review: Approve
Revision history for this message
Paul Goins (vultaire) wrote :

Just FYI: added a warning log message as part of a new MR which builds upon this one as a prerequisite: https://code.launchpad.net/~vultaire/charm-nagios/+git/charm-nagios/+merge/407277

To be clear: this MR still needs one more +1 before merge.

Revision history for this message
Paul Goins (vultaire) wrote :

Sorry for the thrash here.

I've tried to deploy this on serverstack, with multiple related models and nrpe related to apps on top-level nodes and to LXDs, and I'm seeing these issues:

* For a hostname collision at the parent level, I do see that I get records for both hosts; one has a suffix appended. However, the host group names are not consistent. For example, host juju-host-0 is associated with host group juju-host, while host juju-host-0-[1] is associated with host group juju-host-0. This is potentially confusing, especially since the latter host group name matches the former host, which isn't even part of that host group.

* If I have an app of the same name installed onto containers in two different models, they collide and I only see one in nagios.

This is with the following deployment: https://pastebin.ubuntu.com/p/26hb2ptYHt/
(Summary: nagios on one model, plus 2 other models, each with cs:ubuntu deployed as host and LXD, with nrpe related to each.)

This was done on serverstack (i.e. OpenStack), not a MAAS or LXD based controller.

I'm not sure what exactly in the code needs fixing yet, but I'm -1'ing for now.

review: Needs Fixing

Preview Diff

diff --git a/hooks/monitors_relation_changed.py b/hooks/monitors_relation_changed.py
index 9297cdf..441fe4e 100755
--- a/hooks/monitors_relation_changed.py
+++ b/hooks/monitors_relation_changed.py
@@ -28,6 +28,7 @@ from charmhelpers.core.hookenv import (
     related_units,
     relation_get,
     relation_ids,
+    status_set,
 )
 
 from common import (
@@ -41,8 +42,10 @@ from common import (
 
 import yaml
 
-REQUIRED_REL_DATA_KEYS = ["target-address", "monitors", "target-id"]
+MACHINE_ID_KEY = "machine_id"
 MODEL_ID_KEY = "model_id"
+TARGET_ID_KEY = "target-id"
+REQUIRED_REL_DATA_KEYS = ["target-address", "monitors", TARGET_ID_KEY]
 
 
 def _prepare_relation_data(unit, rid):
@@ -58,7 +61,7 @@ def _prepare_relation_data(unit, rid):
 
     if rid.split(":")[0] == "nagios":
         # Fake it for the more generic 'nagios' relation
-        relation_data["target-id"] = unit.replace("/", "-")
+        relation_data[TARGET_ID_KEY] = unit.replace("/", "-")
         relation_data["monitors"] = {"monitors": {"remote": {}}}
 
     if not relation_data.get("target-address"):
@@ -101,7 +104,7 @@ def main(argv):  # noqa: C901
     #
 
     if len(argv) > 1:
-        relation_settings = {"monitors": open(argv[1]).read(), "target-id": argv[2]}
+        relation_settings = {"monitors": open(argv[1]).read(), TARGET_ID_KEY: argv[2]}
 
         if len(argv) > 3:
             relation_settings["target-address"] = argv[3]
@@ -114,13 +117,13 @@ def main(argv):  # noqa: C901
 
     for relid, units in all_relations.iteritems():
         for unit, relation_settings in units.items():
-            if "target-id" in relation_settings:
-                targets_with_addresses.add(relation_settings["target-id"])
+            if TARGET_ID_KEY in relation_settings:
+                targets_with_addresses.add(relation_settings.get(TARGET_ID_KEY))
     new_all_relations = {}
 
     for relid, units in all_relations.iteritems():
         for unit, relation_settings in units.items():
-            if relation_settings["target-id"] in targets_with_addresses:
+            if relation_settings.get(TARGET_ID_KEY) in targets_with_addresses:
                 if relid not in new_all_relations:
                     new_all_relations[relid] = {}
                 new_all_relations[relid][unit] = relation_settings
@@ -129,23 +132,55 @@ def main(argv):  # noqa: C901
 
     initialize_inprogress_config()
 
+    uniq_hostnames = set()
+    duplicate_hostnames = defaultdict(int)
+
+    def record_hostname(hostname):
+        if hostname not in uniq_hostnames:
+            uniq_hostnames.add(hostname)
+        else:
+            duplicate_hostnames[hostname] += 1
+
     # make a dict of machine ids to target-id hostnames
     all_hosts = {}
     for relid, units in all_relations.items():
         for unit, relation_settings in units.iteritems():
-            machine_id = relation_settings.get("machine_id", None)
-            model_id = relation_settings.get(MODEL_ID_KEY, None)
+            machine_id = relation_settings.get(MACHINE_ID_KEY)
+            model_id = relation_settings.get(MODEL_ID_KEY)
+            target_id = relation_settings.get(TARGET_ID_KEY)
 
-            if not machine_id:
+            if not machine_id or not target_id:
                 continue
 
+            # Check for duplicate hostnames and amend them if needed
+            record_hostname(target_id)
+            if target_id in duplicate_hostnames:
+                # Duplicate hostname has been found
+                # Append "-[<number of duplicates>]" to hostname
+                # Example:
+                #   1st hostname of "juju-ubuntu-0" is unchanged
+                #   2nd hostname of "juju-ubuntu-0" is changed to "juju-ubuntu-0-[1]"
+                #   3rd hostname of "juju-ubuntu-0" is changed to "juju-ubuntu-0-[2]"
+                target_id += "-[{}]".format(duplicate_hostnames[target_id])
+                relation_settings[TARGET_ID_KEY] = target_id
+
             # Backwards compatible hostname from machine id
             if not model_id:
-                all_hosts[machine_id] = relation_settings["target-id"]
+                all_hosts[machine_id] = target_id
             # New hostname from machine id using model id
             else:
                 all_hosts.setdefault(model_id, {})
-                all_hosts[model_id][machine_id] = relation_settings["target-id"]
+                all_hosts[model_id][machine_id] = target_id
+
+    if duplicate_hostnames:
+        status_set(
+            "active",
+            "Duplicate host names detected: {}".format(
+                ", ".join(duplicate_hostnames.keys())
+            ),
+        )
+    else:
+        status_set("active", "ready")
 
     for relid, units in all_relations.items():
         apply_relation_config(relid, units, all_hosts)
@@ -158,11 +193,11 @@ def main(argv):  # noqa: C901
 def apply_relation_config(relid, units, all_hosts):  # noqa: C901
     for unit, relation_settings in units.iteritems():
         monitors = relation_settings["monitors"]
-        target_id = relation_settings["target-id"]
-        machine_id = relation_settings.get("machine_id", None)
+        target_id = relation_settings[TARGET_ID_KEY]
+        machine_id = relation_settings.get(MACHINE_ID_KEY)
         parent_host = None
 
-        model_id = relation_settings.get(MODEL_ID_KEY, None)
+        model_id = relation_settings.get(MODEL_ID_KEY)
 
         if machine_id:
 
@@ -173,7 +208,7 @@ def apply_relation_config(relid, units, all_hosts):  # noqa: C901
             # Get hostname using model id
             if model_id:
                 model_hosts = all_hosts.get(model_id, {})
-                parent_host = model_hosts.get(parent_machine, None)
+                parent_host = model_hosts.get(parent_machine)
 
             # Get hostname without model id
             # this conserves backwards compatibility with older
@@ -184,7 +219,7 @@ def apply_relation_config(relid, units, all_hosts):  # noqa: C901
         # If not set, we don't mess with it, as multiple services may feed
        # monitors in for a particular address. Generally a primary will set
         # this to its own private-address
-        target_address = relation_settings.get("target-address", None)
+        target_address = relation_settings.get("target-address")
 
         if type(monitors) != dict:
             monitors = yaml.safe_load(monitors)
