Merge ~vtqanh/cloud-init:ProvisioningTelemetry into cloud-init:master
- Git
- lp:~vtqanh/cloud-init
- ProvisioningTelemetry
- Merge into master
Status: Merged
Approved by: Chad Smith
Approved revision: 5193bea529862589394538d91ff4e0510dce3430
Merge reported by: Server Team CI bot
Merged at revision: not available
Proposed branch: ~vtqanh/cloud-init:ProvisioningTelemetry
Merge into: cloud-init:master
Diff against target:
674 lines (+353/-44), 4 files modified:
- cloudinit/sources/DataSourceAzure.py (+126/-31)
- cloudinit/sources/helpers/azure.py (+152/-8)
- tests/unittests/test_datasource/test_azure.py (+10/-5)
- tests/unittests/test_reporting_hyperv.py (+65/-0)
Related bugs:
Reviewer | Review Type | Date Requested | Status
---|---|---|---
Chad Smith | | | Approve
Server Team CI bot | continuous-integration | | Approve
Ryan Harper | | | Approve

Review via email: mp+369785@code.launchpad.net
Commit message
Azure: Record boot timestamps, system information, and diagnostic events
Collect and record the following information through KVP:
+ timestamps related to kernel initialization and systemd activation
of cloud-init services
+ system information including cloud-init version, kernel version,
distro version, and python version
+ diagnostic events for the most common provisioning error issues
such as empty dhcp lease, corrupted ovf-env.xml, etc.
+ reduced logging frequency while polling IMDS during reprovision
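The IMDS polling bullet refers to an exponential backoff of the logging threshold, visible in the `exc_cb` callback in the diff below. A minimal stand-alone sketch of that pattern (not the merged code, just an illustration of the counter/threshold logic):

```python
# Sketch of the exponential logging backoff used while polling IMDS:
# a repeated 404 is logged on polls 1, 2, 4, 8, ... so that long
# reprovision waits don't flood the log with identical messages.
class ImdsPollLogger:
    def __init__(self):
        self.logging_threshold = 1  # next poll count worth logging
        self.poll_counter = 1       # polls attempted so far

    def should_log(self):
        """Return True when the current poll should be logged."""
        log_now = self.poll_counter == self.logging_threshold
        if log_now:
            # Back off: double the threshold for the same exception.
            self.logging_threshold *= 2
        self.poll_counter += 1
        return log_now


p = ImdsPollLogger()
logged = [n for n in range(1, 20) if p.should_log()]
# logged == [1, 2, 4, 8, 16]
```

The total log volume therefore grows logarithmically with the number of polls rather than linearly.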
Description of the change
Server Team CI bot (server-team-bot) wrote:
Ryan Harper (raharper) wrote:
I have some inline questions and comments. Thanks!
Anh Vo (MSFT) (vtqanh) wrote:
Thanks Ryan, I've resolved all the comments (including updating the commit message).
Anh Vo (MSFT) (vtqanh) wrote:
Updated the author
Ryan Harper (raharper) wrote:
Thanks. I've added some inline comments and suggestions.
Anh Vo (MSFT) (vtqanh) wrote:
Thanks for the review and comments, Ryan. I've pushed an update that addresses all the issues you pointed out.
Ryan Harper (raharper) wrote:
Thank you for the refactor. I have a few inline comments on the code, and I'd like to change how the unit tests are run so they don't depend on the host environment.
Server Team CI bot (server-team-bot) wrote:
PASSED: Continuous integration, rev:e1844e2f1e6
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
SUCCESS: Ubuntu LTS: Integration
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/
- 2db1300... by Anh Vo (MSFT)
  Have unit tests not depend on the host system having systemd
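The commit above decouples the telemetry unit tests from the build host. The pattern is to patch the `systemctl` invocation so the parsing logic is exercised without systemd being present. A self-contained sketch of that idea, using a hypothetical helper that mirrors the shape of `get_boot_telemetry` (the function name and property are illustrative, not the merged test):

```python
from unittest import mock
import subprocess


def get_userspace_timestamp():
    # Hypothetical helper mirroring the pattern in get_boot_telemetry():
    # shell out to systemctl and parse a single "key=value" property.
    out = subprocess.check_output(
        ['systemctl', 'show', '-p', 'UserspaceTimestampMonotonic'],
        text=True)
    return int(out.strip().split('=')[1])


# In a unit test the subprocess call is patched, so the test passes on
# build hosts that don't run systemd at all:
with mock.patch.object(
        subprocess, 'check_output',
        return_value='UserspaceTimestampMonotonic=1844838\n'):
    assert get_userspace_timestamp() == 1844838
```

This is the same technique the merged test uses with `m_subp.side_effect`, which supplies one canned output per expected `systemctl` call.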
Server Team CI bot (server-team-bot) wrote:
FAILED: Continuous integration, rev:2db1300dd08
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
FAILED: Ubuntu LTS: Integration
Click here to trigger a rebuild:
https:/
Server Team CI bot (server-team-bot) wrote:
FAILED: Continuous integration, rev:2db1300dd08
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
FAILED: Ubuntu LTS: Integration
Click here to trigger a rebuild:
https:/
Server Team CI bot (server-team-bot) wrote:
PASSED: Continuous integration, rev:2db1300dd08
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
SUCCESS: Ubuntu LTS: Integration
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/
Ryan Harper (raharper) wrote:
Thanks for updating the unit tests. This looks good to me.
Chad Smith (chad.smith) wrote:
Looks good!
A couple of inline questions:
- Can we emit the same diagnostic messages as their complementary log messages?
- Can we add a message about how much the exception-handling logging backoff threshold is increased?
Validated the behavior of this branch on an Azure Eoan instance:
ubuntu@my-e1:~$ grep 'Did not find Azure data source' /var/lib/
Binary file /var/lib/
/var/log/
ubuntu@my-e1:~$ grep Trace /var/log/
ubuntu@my-e1:~$ fgrep -ir Free /var/log/
ubuntu@my-e1:~$ fgrep -r diagnostic /var/log/
2019-08-14 16:37:51,421 - handlers.py[DEBUG]: diagnostic: diagnostic message: Did not find Azure data source in /var/lib/
2019-08-14 16:37:54,792 - handlers.py[DEBUG]: diagnostic: diagnostic message: No Azure endpoint from dhcp options
2019-08-14 16:37:54,794 - handlers.py[DEBUG]: diagnostic: diagnostic message: Azure endpoint found at 168.63.129.16
2019-08-14 16:37:54,801 - handlers.py[DEBUG]: diagnostic: diagnostic message: container_id 88e472d3-
- 5193bea... by Anh Vo (MSFT)
  Consistent messages for diagnostic events and logging. Also add a log entry to indicate that logging is being backed off
Anh Vo (MSFT) (vtqanh) wrote:
Thanks, Chad, for reviewing. I addressed the two points you mentioned. For the first one I couldn't reuse the exact same message, because we don't want the full stack trace reported to KVP (just the top-level exception), but I did match the messages to make triage easier.
Server Team CI bot (server-team-bot) wrote:
PASSED: Continuous integration, rev:5193bea5298
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
SUCCESS: Ubuntu LTS: Integration
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/
Preview Diff
1 | diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py |
2 | index e6ed2f3..4984fa8 100755 |
3 | --- a/cloudinit/sources/DataSourceAzure.py |
4 | +++ b/cloudinit/sources/DataSourceAzure.py |
5 | @@ -26,9 +26,14 @@ from cloudinit.url_helper import UrlError, readurl, retry_on_url_exc |
6 | from cloudinit import util |
7 | from cloudinit.reporting import events |
8 | |
9 | -from cloudinit.sources.helpers.azure import (azure_ds_reporter, |
10 | - azure_ds_telemetry_reporter, |
11 | - get_metadata_from_fabric) |
12 | +from cloudinit.sources.helpers.azure import ( |
13 | + azure_ds_reporter, |
14 | + azure_ds_telemetry_reporter, |
15 | + get_metadata_from_fabric, |
16 | + get_boot_telemetry, |
17 | + get_system_info, |
18 | + report_diagnostic_event, |
19 | + EphemeralDHCPv4WithReporting) |
20 | |
21 | LOG = logging.getLogger(__name__) |
22 | |
23 | @@ -354,7 +359,7 @@ class DataSourceAzure(sources.DataSource): |
24 | bname = str(pk['fingerprint'] + ".crt") |
25 | fp_files += [os.path.join(ddir, bname)] |
26 | LOG.debug("ssh authentication: " |
27 | - "using fingerprint from fabirc") |
28 | + "using fingerprint from fabric") |
29 | |
30 | with events.ReportEventStack( |
31 | name="waiting-for-ssh-public-key", |
32 | @@ -419,12 +424,17 @@ class DataSourceAzure(sources.DataSource): |
33 | ret = load_azure_ds_dir(cdev) |
34 | |
35 | except NonAzureDataSource: |
36 | + report_diagnostic_event( |
37 | + "Did not find Azure data source in %s" % cdev) |
38 | continue |
39 | except BrokenAzureDataSource as exc: |
40 | msg = 'BrokenAzureDataSource: %s' % exc |
41 | + report_diagnostic_event(msg) |
42 | raise sources.InvalidMetaDataException(msg) |
43 | except util.MountFailedError: |
44 | - LOG.warning("%s was not mountable", cdev) |
45 | + msg = '%s was not mountable' % cdev |
46 | + report_diagnostic_event(msg) |
47 | + LOG.warning(msg) |
48 | continue |
49 | |
50 | perform_reprovision = reprovision or self._should_reprovision(ret) |
51 | @@ -432,6 +442,7 @@ class DataSourceAzure(sources.DataSource): |
52 | if util.is_FreeBSD(): |
53 | msg = "Free BSD is not supported for PPS VMs" |
54 | LOG.error(msg) |
55 | + report_diagnostic_event(msg) |
56 | raise sources.InvalidMetaDataException(msg) |
57 | ret = self._reprovision() |
58 | imds_md = get_metadata_from_imds( |
59 | @@ -450,7 +461,9 @@ class DataSourceAzure(sources.DataSource): |
60 | break |
61 | |
62 | if not found: |
63 | - raise sources.InvalidMetaDataException('No Azure metadata found') |
64 | + msg = 'No Azure metadata found' |
65 | + report_diagnostic_event(msg) |
66 | + raise sources.InvalidMetaDataException(msg) |
67 | |
68 | if found == ddir: |
69 | LOG.debug("using files cached in %s", ddir) |
70 | @@ -469,9 +482,14 @@ class DataSourceAzure(sources.DataSource): |
71 | self._report_ready(lease=self._ephemeral_dhcp_ctx.lease) |
72 | self._ephemeral_dhcp_ctx.clean_network() # Teardown ephemeral |
73 | else: |
74 | - with EphemeralDHCPv4() as lease: |
75 | - self._report_ready(lease=lease) |
76 | - |
77 | + try: |
78 | + with EphemeralDHCPv4WithReporting( |
79 | + azure_ds_reporter) as lease: |
80 | + self._report_ready(lease=lease) |
81 | + except Exception as e: |
82 | + report_diagnostic_event( |
83 | + "exception while reporting ready: %s" % e) |
84 | + raise |
85 | return crawled_data |
86 | |
87 | def _is_platform_viable(self): |
88 | @@ -493,6 +511,16 @@ class DataSourceAzure(sources.DataSource): |
89 | if not self._is_platform_viable(): |
90 | return False |
91 | try: |
92 | + get_boot_telemetry() |
93 | + except Exception as e: |
94 | + LOG.warning("Failed to get boot telemetry: %s", e) |
95 | + |
96 | + try: |
97 | + get_system_info() |
98 | + except Exception as e: |
99 | + LOG.warning("Failed to get system information: %s", e) |
100 | + |
101 | + try: |
102 | crawled_data = util.log_time( |
103 | logfunc=LOG.debug, msg='Crawl of metadata service', |
104 | func=self.crawl_metadata) |
105 | @@ -551,27 +579,55 @@ class DataSourceAzure(sources.DataSource): |
106 | headers = {"Metadata": "true"} |
107 | nl_sock = None |
108 | report_ready = bool(not os.path.isfile(REPORTED_READY_MARKER_FILE)) |
109 | + self.imds_logging_threshold = 1 |
110 | + self.imds_poll_counter = 1 |
111 | + dhcp_attempts = 0 |
112 | + vnet_switched = False |
113 | + return_val = None |
114 | |
115 | def exc_cb(msg, exception): |
116 | if isinstance(exception, UrlError) and exception.code == 404: |
117 | + if self.imds_poll_counter == self.imds_logging_threshold: |
118 | + # Reducing the logging frequency as we are polling IMDS |
119 | + self.imds_logging_threshold *= 2 |
120 | + LOG.debug("Call to IMDS with arguments %s failed " |
121 | + "with status code %s after %s retries", |
122 | + msg, exception.code, self.imds_poll_counter) |
123 | + LOG.debug("Backing off logging threshold for the same " |
124 | + "exception to %d", self.imds_logging_threshold) |
125 | + self.imds_poll_counter += 1 |
126 | return True |
127 | + |
128 | # If we get an exception while trying to call IMDS, we |
129 | # call DHCP and setup the ephemeral network to acquire the new IP. |
130 | + LOG.debug("Call to IMDS with arguments %s failed with " |
131 | + "status code %s", msg, exception.code) |
132 | + report_diagnostic_event("polling IMDS failed with exception %s" |
133 | + % exception.code) |
134 | return False |
135 | |
136 | LOG.debug("Wait for vnetswitch to happen") |
137 | while True: |
138 | try: |
139 | - # Save our EphemeralDHCPv4 context so we avoid repeated dhcp |
140 | - self._ephemeral_dhcp_ctx = EphemeralDHCPv4() |
141 | - lease = self._ephemeral_dhcp_ctx.obtain_lease() |
142 | + # Save our EphemeralDHCPv4 context to avoid repeated dhcp |
143 | + with events.ReportEventStack( |
144 | + name="obtain-dhcp-lease", |
145 | + description="obtain dhcp lease", |
146 | + parent=azure_ds_reporter): |
147 | + self._ephemeral_dhcp_ctx = EphemeralDHCPv4() |
148 | + lease = self._ephemeral_dhcp_ctx.obtain_lease() |
149 | + |
150 | + if vnet_switched: |
151 | + dhcp_attempts += 1 |
152 | if report_ready: |
153 | try: |
154 | nl_sock = netlink.create_bound_netlink_socket() |
155 | except netlink.NetlinkCreateSocketError as e: |
156 | + report_diagnostic_event(e) |
157 | LOG.warning(e) |
158 | self._ephemeral_dhcp_ctx.clean_network() |
159 | - return |
160 | + break |
161 | + |
162 | path = REPORTED_READY_MARKER_FILE |
163 | LOG.info( |
164 | "Creating a marker file to report ready: %s", path) |
165 | @@ -579,17 +635,33 @@ class DataSourceAzure(sources.DataSource): |
166 | pid=os.getpid(), time=time())) |
167 | self._report_ready(lease=lease) |
168 | report_ready = False |
169 | - try: |
170 | - netlink.wait_for_media_disconnect_connect( |
171 | - nl_sock, lease['interface']) |
172 | - except AssertionError as error: |
173 | - LOG.error(error) |
174 | - return |
175 | + |
176 | + with events.ReportEventStack( |
177 | + name="wait-for-media-disconnect-connect", |
178 | + description="wait for vnet switch", |
179 | + parent=azure_ds_reporter): |
180 | + try: |
181 | + netlink.wait_for_media_disconnect_connect( |
182 | + nl_sock, lease['interface']) |
183 | + except AssertionError as error: |
184 | + report_diagnostic_event(error) |
185 | + LOG.error(error) |
186 | + break |
187 | + |
188 | + vnet_switched = True |
189 | self._ephemeral_dhcp_ctx.clean_network() |
190 | else: |
191 | - return readurl(url, timeout=IMDS_TIMEOUT_IN_SECONDS, |
192 | - headers=headers, exception_cb=exc_cb, |
193 | - infinite=True, log_req_resp=False).contents |
194 | + with events.ReportEventStack( |
195 | + name="get-reprovision-data-from-imds", |
196 | + description="get reprovision data from imds", |
197 | + parent=azure_ds_reporter): |
198 | + return_val = readurl(url, |
199 | + timeout=IMDS_TIMEOUT_IN_SECONDS, |
200 | + headers=headers, |
201 | + exception_cb=exc_cb, |
202 | + infinite=True, |
203 | + log_req_resp=False).contents |
204 | + break |
205 | except UrlError: |
206 | # Teardown our EphemeralDHCPv4 context on failure as we retry |
207 | self._ephemeral_dhcp_ctx.clean_network() |
208 | @@ -598,6 +670,14 @@ class DataSourceAzure(sources.DataSource): |
209 | if nl_sock: |
210 | nl_sock.close() |
211 | |
212 | + if vnet_switched: |
213 | + report_diagnostic_event("attempted dhcp %d times after reuse" % |
214 | + dhcp_attempts) |
215 | + report_diagnostic_event("polled imds %d times after reuse" % |
216 | + self.imds_poll_counter) |
217 | + |
218 | + return return_val |
219 | + |
220 | @azure_ds_telemetry_reporter |
221 | def _report_ready(self, lease): |
222 | """Tells the fabric provisioning has completed """ |
223 | @@ -666,9 +746,12 @@ class DataSourceAzure(sources.DataSource): |
224 | self.ds_cfg['agent_command']) |
225 | try: |
226 | fabric_data = metadata_func() |
227 | - except Exception: |
228 | + except Exception as e: |
229 | + report_diagnostic_event( |
230 | + "Error communicating with Azure fabric; You may experience " |
231 | + "connectivity issues: %s" % e) |
232 | LOG.warning( |
233 | - "Error communicating with Azure fabric; You may experience." |
234 | + "Error communicating with Azure fabric; You may experience " |
235 | "connectivity issues.", exc_info=True) |
236 | return False |
237 | |
238 | @@ -1027,7 +1110,9 @@ def read_azure_ovf(contents): |
239 | try: |
240 | dom = minidom.parseString(contents) |
241 | except Exception as e: |
242 | - raise BrokenAzureDataSource("Invalid ovf-env.xml: %s" % e) |
243 | + error_str = "Invalid ovf-env.xml: %s" % e |
244 | + report_diagnostic_event(error_str) |
245 | + raise BrokenAzureDataSource(error_str) |
246 | |
247 | results = find_child(dom.documentElement, |
248 | lambda n: n.localName == "ProvisioningSection") |
249 | @@ -1299,8 +1384,13 @@ def get_metadata_from_imds(fallback_nic, retries): |
250 | if net.is_up(fallback_nic): |
251 | return util.log_time(**kwargs) |
252 | else: |
253 | - with EphemeralDHCPv4(fallback_nic): |
254 | - return util.log_time(**kwargs) |
255 | + try: |
256 | + with EphemeralDHCPv4WithReporting( |
257 | + azure_ds_reporter, fallback_nic): |
258 | + return util.log_time(**kwargs) |
259 | + except Exception as e: |
260 | + report_diagnostic_event("exception while getting metadata: %s" % e) |
261 | + raise |
262 | |
263 | |
264 | @azure_ds_telemetry_reporter |
265 | @@ -1313,11 +1403,14 @@ def _get_metadata_from_imds(retries): |
266 | url, timeout=IMDS_TIMEOUT_IN_SECONDS, headers=headers, |
267 | retries=retries, exception_cb=retry_on_url_exc) |
268 | except Exception as e: |
269 | - LOG.debug('Ignoring IMDS instance metadata: %s', e) |
270 | + msg = 'Ignoring IMDS instance metadata: %s' % e |
271 | + report_diagnostic_event(msg) |
272 | + LOG.debug(msg) |
273 | return {} |
274 | try: |
275 | return util.load_json(str(response)) |
276 | - except json.decoder.JSONDecodeError: |
277 | + except json.decoder.JSONDecodeError as e: |
278 | + report_diagnostic_event('non-json imds response' % e) |
279 | LOG.warning( |
280 | 'Ignoring non-json IMDS instance metadata: %s', str(response)) |
281 | return {} |
282 | @@ -1370,8 +1463,10 @@ def _is_platform_viable(seed_dir): |
283 | asset_tag = util.read_dmi_data('chassis-asset-tag') |
284 | if asset_tag == AZURE_CHASSIS_ASSET_TAG: |
285 | return True |
286 | - LOG.debug("Non-Azure DMI asset tag '%s' discovered.", asset_tag) |
287 | - evt.description = "Non-Azure DMI asset tag '%s' discovered.", asset_tag |
288 | + msg = "Non-Azure DMI asset tag '%s' discovered." % asset_tag |
289 | + LOG.debug(msg) |
290 | + evt.description = msg |
291 | + report_diagnostic_event(msg) |
292 | if os.path.exists(os.path.join(seed_dir, 'ovf-env.xml')): |
293 | return True |
294 | return False |
295 | diff --git a/cloudinit/sources/helpers/azure.py b/cloudinit/sources/helpers/azure.py |
296 | index 82c4c8c..f1fba17 100755 |
297 | --- a/cloudinit/sources/helpers/azure.py |
298 | +++ b/cloudinit/sources/helpers/azure.py |
299 | @@ -16,7 +16,11 @@ from xml.etree import ElementTree |
300 | |
301 | from cloudinit import url_helper |
302 | from cloudinit import util |
303 | +from cloudinit import version |
304 | +from cloudinit import distros |
305 | from cloudinit.reporting import events |
306 | +from cloudinit.net.dhcp import EphemeralDHCPv4 |
307 | +from datetime import datetime |
308 | |
309 | LOG = logging.getLogger(__name__) |
310 | |
311 | @@ -24,6 +28,10 @@ LOG = logging.getLogger(__name__) |
312 | # value is applied if the endpoint can't be found within a lease file |
313 | DEFAULT_WIRESERVER_ENDPOINT = "a8:3f:81:10" |
314 | |
315 | +BOOT_EVENT_TYPE = 'boot-telemetry' |
316 | +SYSTEMINFO_EVENT_TYPE = 'system-info' |
317 | +DIAGNOSTIC_EVENT_TYPE = 'diagnostic' |
318 | + |
319 | azure_ds_reporter = events.ReportEventStack( |
320 | name="azure-ds", |
321 | description="initialize reporter for azure ds", |
322 | @@ -40,6 +48,105 @@ def azure_ds_telemetry_reporter(func): |
323 | return impl |
324 | |
325 | |
326 | +@azure_ds_telemetry_reporter |
327 | +def get_boot_telemetry(): |
328 | + """Report timestamps related to kernel initialization and systemd |
329 | + activation of cloud-init""" |
330 | + if not distros.uses_systemd(): |
331 | + raise RuntimeError( |
332 | + "distro not using systemd, skipping boot telemetry") |
333 | + |
334 | + LOG.debug("Collecting boot telemetry") |
335 | + try: |
336 | + kernel_start = float(time.time()) - float(util.uptime()) |
337 | + except ValueError: |
338 | + raise RuntimeError("Failed to determine kernel start timestamp") |
339 | + |
340 | + try: |
341 | + out, _ = util.subp(['/bin/systemctl', |
342 | + 'show', '-p', |
343 | + 'UserspaceTimestampMonotonic'], |
344 | + capture=True) |
345 | + tsm = None |
346 | + if out and '=' in out: |
347 | + tsm = out.split("=")[1] |
348 | + |
349 | + if not tsm: |
350 | + raise RuntimeError("Failed to parse " |
351 | + "UserspaceTimestampMonotonic from systemd") |
352 | + |
353 | + user_start = kernel_start + (float(tsm) / 1000000) |
354 | + except util.ProcessExecutionError as e: |
355 | + raise RuntimeError("Failed to get UserspaceTimestampMonotonic: %s" |
356 | + % e) |
357 | + except ValueError as e: |
358 | + raise RuntimeError("Failed to parse " |
359 | + "UserspaceTimestampMonotonic from systemd: %s" |
360 | + % e) |
361 | + |
362 | + try: |
363 | + out, _ = util.subp(['/bin/systemctl', 'show', |
364 | + 'cloud-init-local', '-p', |
365 | + 'InactiveExitTimestampMonotonic'], |
366 | + capture=True) |
367 | + tsm = None |
368 | + if out and '=' in out: |
369 | + tsm = out.split("=")[1] |
370 | + if not tsm: |
371 | + raise RuntimeError("Failed to parse " |
372 | + "InactiveExitTimestampMonotonic from systemd") |
373 | + |
374 | + cloudinit_activation = kernel_start + (float(tsm) / 1000000) |
375 | + except util.ProcessExecutionError as e: |
376 | + raise RuntimeError("Failed to get InactiveExitTimestampMonotonic: %s" |
377 | + % e) |
378 | + except ValueError as e: |
379 | + raise RuntimeError("Failed to parse " |
380 | + "InactiveExitTimestampMonotonic from systemd: %s" |
381 | + % e) |
382 | + |
383 | + evt = events.ReportingEvent( |
384 | + BOOT_EVENT_TYPE, 'boot-telemetry', |
385 | + "kernel_start=%s user_start=%s cloudinit_activation=%s" % |
386 | + (datetime.utcfromtimestamp(kernel_start).isoformat() + 'Z', |
387 | + datetime.utcfromtimestamp(user_start).isoformat() + 'Z', |
388 | + datetime.utcfromtimestamp(cloudinit_activation).isoformat() + 'Z'), |
389 | + events.DEFAULT_EVENT_ORIGIN) |
390 | + events.report_event(evt) |
391 | + |
392 | + # return the event for unit testing purpose |
393 | + return evt |
394 | + |
395 | + |
396 | +@azure_ds_telemetry_reporter |
397 | +def get_system_info(): |
398 | + """Collect and report system information""" |
399 | + info = util.system_info() |
400 | + evt = events.ReportingEvent( |
401 | + SYSTEMINFO_EVENT_TYPE, 'system information', |
402 | + "cloudinit_version=%s, kernel_version=%s, variant=%s, " |
403 | + "distro_name=%s, distro_version=%s, flavor=%s, " |
404 | + "python_version=%s" % |
405 | + (version.version_string(), info['release'], info['variant'], |
406 | + info['dist'][0], info['dist'][1], info['dist'][2], |
407 | + info['python']), events.DEFAULT_EVENT_ORIGIN) |
408 | + events.report_event(evt) |
409 | + |
410 | + # return the event for unit testing purpose |
411 | + return evt |
412 | + |
413 | + |
414 | +def report_diagnostic_event(str): |
415 | + """Report a diagnostic event""" |
416 | + evt = events.ReportingEvent( |
417 | + DIAGNOSTIC_EVENT_TYPE, 'diagnostic message', |
418 | + str, events.DEFAULT_EVENT_ORIGIN) |
419 | + events.report_event(evt) |
420 | + |
421 | + # return the event for unit testing purpose |
422 | + return evt |
423 | + |
424 | + |
425 | @contextmanager |
426 | def cd(newdir): |
427 | prevdir = os.getcwd() |
428 | @@ -360,16 +467,19 @@ class WALinuxAgentShim(object): |
429 | value = dhcp245 |
430 | LOG.debug("Using Azure Endpoint from dhcp options") |
431 | if value is None: |
432 | + report_diagnostic_event("No Azure endpoint from dhcp options") |
433 | LOG.debug('Finding Azure endpoint from networkd...') |
434 | value = WALinuxAgentShim._networkd_get_value_from_leases() |
435 | if value is None: |
436 | # Option-245 stored in /run/cloud-init/dhclient.hooks/<ifc>.json |
437 | # a dhclient exit hook that calls cloud-init-dhclient-hook |
438 | + report_diagnostic_event("No Azure endpoint from networkd") |
439 | LOG.debug('Finding Azure endpoint from hook json...') |
440 | dhcp_options = WALinuxAgentShim._load_dhclient_json() |
441 | value = WALinuxAgentShim._get_value_from_dhcpoptions(dhcp_options) |
442 | if value is None: |
443 | # Fallback and check the leases file if unsuccessful |
444 | + report_diagnostic_event("No Azure endpoint from dhclient logs") |
445 | LOG.debug("Unable to find endpoint in dhclient logs. " |
446 | " Falling back to check lease files") |
447 | if fallback_lease_file is None: |
448 | @@ -381,11 +491,15 @@ class WALinuxAgentShim(object): |
449 | value = WALinuxAgentShim._get_value_from_leases_file( |
450 | fallback_lease_file) |
451 | if value is None: |
452 | - LOG.warning("No lease found; using default endpoint") |
453 | + msg = "No lease found; using default endpoint" |
454 | + report_diagnostic_event(msg) |
455 | + LOG.warning(msg) |
456 | value = DEFAULT_WIRESERVER_ENDPOINT |
457 | |
458 | endpoint_ip_address = WALinuxAgentShim.get_ip_from_lease_value(value) |
459 | - LOG.debug('Azure endpoint found at %s', endpoint_ip_address) |
460 | + msg = 'Azure endpoint found at %s' % endpoint_ip_address |
461 | + report_diagnostic_event(msg) |
462 | + LOG.debug(msg) |
463 | return endpoint_ip_address |
464 | |
465 | @azure_ds_telemetry_reporter |
466 | @@ -399,16 +513,19 @@ class WALinuxAgentShim(object): |
467 | try: |
468 | response = http_client.get( |
469 | 'http://{0}/machine/?comp=goalstate'.format(self.endpoint)) |
470 | - except Exception: |
471 | + except Exception as e: |
472 | if attempts < 10: |
473 | time.sleep(attempts + 1) |
474 | else: |
475 | + report_diagnostic_event( |
476 | + "failed to register with Azure: %s" % e) |
477 | raise |
478 | else: |
479 | break |
480 | attempts += 1 |
481 | LOG.debug('Successfully fetched GoalState XML.') |
482 | goal_state = GoalState(response.contents, http_client) |
483 | + report_diagnostic_event("container_id %s" % goal_state.container_id) |
484 | ssh_keys = [] |
485 | if goal_state.certificates_xml is not None and pubkey_info is not None: |
486 | LOG.debug('Certificate XML found; parsing out public keys.') |
487 | @@ -449,11 +566,20 @@ class WALinuxAgentShim(object): |
488 | container_id=goal_state.container_id, |
489 | instance_id=goal_state.instance_id, |
490 | ) |
491 | - http_client.post( |
492 | - "http://{0}/machine?comp=health".format(self.endpoint), |
493 | - data=document, |
494 | - extra_headers={'Content-Type': 'text/xml; charset=utf-8'}, |
495 | - ) |
496 | + # Host will collect kvps when cloud-init reports ready. |
497 | + # some kvps might still be in the queue. We yield the scheduler |
498 | + # to make sure we process all kvps up till this point. |
499 | + time.sleep(0) |
500 | + try: |
501 | + http_client.post( |
502 | + "http://{0}/machine?comp=health".format(self.endpoint), |
503 | + data=document, |
504 | + extra_headers={'Content-Type': 'text/xml; charset=utf-8'}, |
505 | + ) |
506 | + except Exception as e: |
507 | + report_diagnostic_event("exception while reporting ready: %s" % e) |
508 | + raise |
509 | + |
510 | LOG.info('Reported ready to Azure fabric.') |
511 | |
512 | |
513 | @@ -467,4 +593,22 @@ def get_metadata_from_fabric(fallback_lease_file=None, dhcp_opts=None, |
514 | finally: |
515 | shim.clean_up() |
516 | |
517 | + |
518 | +class EphemeralDHCPv4WithReporting(object): |
519 | + def __init__(self, reporter, nic=None): |
520 | + self.reporter = reporter |
521 | + self.ephemeralDHCPv4 = EphemeralDHCPv4(iface=nic) |
522 | + |
523 | + def __enter__(self): |
524 | + with events.ReportEventStack( |
525 | + name="obtain-dhcp-lease", |
526 | + description="obtain dhcp lease", |
527 | + parent=self.reporter): |
528 | + return self.ephemeralDHCPv4.__enter__() |
529 | + |
530 | + def __exit__(self, excp_type, excp_value, excp_traceback): |
531 | + self.ephemeralDHCPv4.__exit__( |
532 | + excp_type, excp_value, excp_traceback) |
533 | + |
534 | + |
535 | # vi: ts=4 expandtab |
536 | diff --git a/tests/unittests/test_datasource/test_azure.py b/tests/unittests/test_datasource/test_azure.py |
537 | index 4d57ceb..3547dd9 100644 |
538 | --- a/tests/unittests/test_datasource/test_azure.py |
539 | +++ b/tests/unittests/test_datasource/test_azure.py |
540 | @@ -181,7 +181,7 @@ class TestGetMetadataFromIMDS(HttprettyTestCase): |
541 | self.logs.getvalue()) |
542 | |
543 | @mock.patch(MOCKPATH + 'readurl') |
544 | - @mock.patch(MOCKPATH + 'EphemeralDHCPv4') |
545 | + @mock.patch(MOCKPATH + 'EphemeralDHCPv4WithReporting') |
546 | @mock.patch(MOCKPATH + 'net.is_up') |
547 | def test_get_metadata_performs_dhcp_when_network_is_down( |
548 | self, m_net_is_up, m_dhcp, m_readurl): |
549 | @@ -195,7 +195,7 @@ class TestGetMetadataFromIMDS(HttprettyTestCase): |
550 | dsaz.get_metadata_from_imds('eth9', retries=2)) |
551 | |
552 | m_net_is_up.assert_called_with('eth9') |
553 | - m_dhcp.assert_called_with('eth9') |
554 | + m_dhcp.assert_called_with(mock.ANY, 'eth9') |
555 | self.assertIn( |
556 | "Crawl of Azure Instance Metadata Service (IMDS) took", # log_time |
557 | self.logs.getvalue()) |
558 | @@ -552,7 +552,8 @@ scbus-1 on xpt0 bus 0 |
559 | dsrc.crawl_metadata() |
560 | self.assertEqual(str(cm.exception), error_msg) |
561 | |
562 | - @mock.patch('cloudinit.sources.DataSourceAzure.EphemeralDHCPv4') |
563 | + @mock.patch( |
564 | + 'cloudinit.sources.DataSourceAzure.EphemeralDHCPv4WithReporting') |
565 | @mock.patch('cloudinit.sources.DataSourceAzure.util.write_file') |
566 | @mock.patch( |
567 | 'cloudinit.sources.DataSourceAzure.DataSourceAzure._report_ready') |
568 | @@ -1308,7 +1309,9 @@ class TestAzureBounce(CiTestCase): |
569 | self.assertEqual(initial_host_name, |
570 | self.set_hostname.call_args_list[-1][0][0]) |
571 | |
572 | - def test_environment_correct_for_bounce_command(self): |
573 | + @mock.patch.object(dsaz, 'get_boot_telemetry') |
574 | + def test_environment_correct_for_bounce_command( |
575 | + self, mock_get_boot_telemetry): |
576 | interface = 'int0' |
577 | hostname = 'my-new-host' |
578 | old_hostname = 'my-old-host' |
579 | @@ -1324,7 +1327,9 @@ class TestAzureBounce(CiTestCase): |
580 | self.assertEqual(hostname, bounce_env['hostname']) |
581 | self.assertEqual(old_hostname, bounce_env['old_hostname']) |
582 | |
583 | - def test_default_bounce_command_ifup_used_by_default(self): |
584 | + @mock.patch.object(dsaz, 'get_boot_telemetry') |
585 | + def test_default_bounce_command_ifup_used_by_default( |
586 | + self, mock_get_boot_telemetry): |
587 | cfg = {'hostname_bounce': {'policy': 'force'}} |
588 | data = self.get_ovf_env_with_dscfg('some-hostname', cfg) |
589 | dsrc = self._get_ds(data, agent_command=['not', '__builtin__']) |
590 | diff --git a/tests/unittests/test_reporting_hyperv.py b/tests/unittests/test_reporting_hyperv.py |
591 | old mode 100755 |
592 | new mode 100644 |
593 | index d01ed5b..640895a |
594 | --- a/tests/unittests/test_reporting_hyperv.py |
595 | +++ b/tests/unittests/test_reporting_hyperv.py |
596 | @@ -7,9 +7,12 @@ import json |
597 | import os |
598 | import struct |
599 | import time |
600 | +import re |
601 | +import mock |
602 | |
603 | from cloudinit import util |
604 | from cloudinit.tests.helpers import CiTestCase |
605 | +from cloudinit.sources.helpers import azure |
606 | |
607 | |
608 | class TestKvpEncoding(CiTestCase): |
609 | @@ -126,3 +129,65 @@ class TextKvpReporter(CiTestCase): |
610 | reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path) |
611 | kvps = list(reporter._iterate_kvps(0)) |
612 | self.assertEqual(0, len(kvps)) |
613 | + |
614 | + @mock.patch('cloudinit.distros.uses_systemd') |
615 | + @mock.patch('cloudinit.util.subp') |
616 | + def test_get_boot_telemetry(self, m_subp, m_sysd): |
617 | + reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path) |
618 | + datetime_pattern = r"\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]" |
619 | + r"\d:[0-5]\d\.\d+([+-][0-2]\d:[0-5]\d|Z)" |
620 | + |
621 | + # get_boot_telemetry makes two subp calls to systemctl. We provide |
622 | + # a list of values that the subp calls should return |
623 | + m_subp.side_effect = [ |
624 | + ('UserspaceTimestampMonotonic=1844838', ''), |
625 | + ('InactiveExitTimestampMonotonic=3068203', '')] |
626 | + m_sysd.return_value = True |
627 | + |
628 | + reporter.publish_event(azure.get_boot_telemetry()) |
629 | + reporter.q.join() |
630 | + kvps = list(reporter._iterate_kvps(0)) |
631 | + self.assertEqual(1, len(kvps)) |
632 | + |
633 | + evt_msg = kvps[0]['value'] |
634 | + if not re.search("kernel_start=" + datetime_pattern, evt_msg): |
635 | + raise AssertionError("missing kernel_start timestamp") |
636 | + if not re.search("user_start=" + datetime_pattern, evt_msg): |
637 | + raise AssertionError("missing user_start timestamp") |
638 | + if not re.search("cloudinit_activation=" + datetime_pattern, |
639 | + evt_msg): |
640 | + raise AssertionError( |
641 | + "missing cloudinit_activation timestamp") |
642 | + |
643 | + def test_get_system_info(self): |
644 | + reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path) |
645 | + pattern = r"[^=\s]+" |
646 | + |
647 | + reporter.publish_event(azure.get_system_info()) |
648 | + reporter.q.join() |
649 | + kvps = list(reporter._iterate_kvps(0)) |
650 | + self.assertEqual(1, len(kvps)) |
651 | + evt_msg = kvps[0]['value'] |
652 | + |
653 | + # the most important information is cloudinit version, |
654 | + # kernel_version, and the distro variant. It is ok if |
655 | + # if the rest is not available |
656 | + if not re.search("cloudinit_version=" + pattern, evt_msg): |
657 | + raise AssertionError("missing cloudinit_version string") |
658 | + if not re.search("kernel_version=" + pattern, evt_msg): |
659 | + raise AssertionError("missing kernel_version string") |
660 | + if not re.search("variant=" + pattern, evt_msg): |
661 | + raise AssertionError("missing distro variant string") |
662 | + |
663 | + def test_report_diagnostic_event(self): |
664 | + reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path) |
665 | + |
666 | + reporter.publish_event( |
667 | + azure.report_diagnostic_event("test_diagnostic")) |
668 | + reporter.q.join() |
669 | + kvps = list(reporter._iterate_kvps(0)) |
670 | + self.assertEqual(1, len(kvps)) |
671 | + evt_msg = kvps[0]['value'] |
672 | + |
673 | + if "test_diagnostic" not in evt_msg: |
674 | + raise AssertionError("missing expected diagnostic message") |
PASSED: Continuous integration, rev:ff1b785b5ba28a0ef1b9de407a3f0c0bf94afc21
https://jenkins.ubuntu.com/server/job/cloud-init-ci/756/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
SUCCESS: Ubuntu LTS: Integration
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/756/rebuild