Merge ~vtqanh/cloud-init:ProvisioningTelemetry into cloud-init:master

Proposed by Anh Vo (MSFT)
Status: Merged
Approved by: Chad Smith
Approved revision: 5193bea529862589394538d91ff4e0510dce3430
Merge reported by: Server Team CI bot
Merged at revision: not available
Proposed branch: ~vtqanh/cloud-init:ProvisioningTelemetry
Merge into: cloud-init:master
Diff against target: 674 lines (+353/-44)
4 files modified
cloudinit/sources/DataSourceAzure.py (+126/-31)
cloudinit/sources/helpers/azure.py (+152/-8)
tests/unittests/test_datasource/test_azure.py (+10/-5)
tests/unittests/test_reporting_hyperv.py (+65/-0)
Reviewer / Review Type / Status
Chad Smith: Approve
Server Team CI bot (continuous-integration): Approve
Ryan Harper: Approve
Review via email: mp+369785@code.launchpad.net

Commit message

Azure: Record boot timestamps, system information, and diagnostic events

Collect and record the following information through KVP:
 + timestamps related to kernel initialization and systemd activation
   of cloud-init services
 + system information including cloud-init version, kernel version,
   distro version, and python version
 + diagnostic events for the most common provisioning errors, such as
   an empty dhcp lease, a corrupted ovf-env.xml, etc.
 + increasing the log frequency of polling IMDS during reprovision.
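For context, here is a minimal sketch of how the new diagnostic events reach the Hyper-V KVP pool, pieced together from the helper and the unit tests in the diff below. The kvp_file path and the example message are illustrative, and the handler import assumes it lives at cloudinit.reporting.handlers:

    from cloudinit import util
    from cloudinit.reporting.handlers import HyperVKvpReportingHandler
    from cloudinit.sources.helpers import azure

    # Stand-in pool file; on a real Hyper-V guest the pools live under
    # /var/lib/hyperv/.
    kvp_file = '/tmp/kvp_pool_test'
    util.ensure_file(kvp_file)
    reporter = HyperVKvpReportingHandler(kvp_file_path=kvp_file)

    # report_diagnostic_event() reports through the registered handlers
    # and also returns the event, so it can be published to a specific
    # handler (this is how the new unit tests exercise it).
    evt = azure.report_diagnostic_event("example: empty dhcp lease")
    reporter.publish_event(evt)
    reporter.q.join()  # wait for the handler's writer thread to drain
    print(list(reporter._iterate_kvps(0))[0]['value'])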

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:ff1b785b5ba28a0ef1b9de407a3f0c0bf94afc21
https://jenkins.ubuntu.com/server/job/cloud-init-ci/756/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/756/rebuild

review: Approve (continuous-integration)
Revision history for this message
Ryan Harper (raharper) wrote :

I've some inline questions and comments. Thanks!

review: Needs Fixing
Revision history for this message
Anh Vo (MSFT) (vtqanh) wrote :

Thanks Ryan, I've resolved all the comments (including updating the commit message)

Revision history for this message
Anh Vo (MSFT) (vtqanh) wrote :

Updated the author

Revision history for this message
Ryan Harper (raharper) wrote :

Thanks. I've added some inline comments and suggestions.

review: Needs Fixing
Revision history for this message
Anh Vo (MSFT) (vtqanh) wrote :

Thanks for the review and comments, Ryan. I've pushed an update that addresses all the issues you pointed out.

Revision history for this message
Ryan Harper (raharper) wrote :

Thank you for the refactor. I have a few comments inline on the code, and I'd like to change how the unit tests are run so they don't depend on the host environment.

review: Needs Fixing
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:e1844e2f1e635bd8015f3c947e542b93caef41f6
https://jenkins.ubuntu.com/server/job/cloud-init-ci/802/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/802/rebuild

review: Approve (continuous-integration)
2db1300... by Anh Vo (MSFT)

Have unit test not depending on host system having systemd
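The fix in this revision can be condensed to the sketch below (function and variable names are shortened from the real test, and it assumes a Linux host so the unmocked util.uptime() keeps working): the two places where get_boot_telemetry() touches systemd are replaced with canned values, so the unit test no longer needs systemd on the machine running it.

    import mock
    from cloudinit.sources.helpers import azure

    @mock.patch('cloudinit.distros.uses_systemd')
    @mock.patch('cloudinit.util.subp')
    def check_boot_telemetry(m_subp, m_sysd):
        m_sysd.return_value = True        # pretend the distro runs systemd
        m_subp.side_effect = [            # canned 'systemctl show' output
            ('UserspaceTimestampMonotonic=1844838', ''),
            ('InactiveExitTimestampMonotonic=3068203', '')]
        evt = azure.get_boot_telemetry()
        assert 'kernel_start=' in evt.description

    check_boot_telemetry()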

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:2db1300dd089bdd5d768482d9c50af5112c54c8a
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1038/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    FAILED: Ubuntu LTS: Integration

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1038//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:2db1300dd089bdd5d768482d9c50af5112c54c8a
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1039/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    FAILED: Ubuntu LTS: Integration

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1039//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:2db1300dd089bdd5d768482d9c50af5112c54c8a
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1040/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1040//rebuild

review: Approve (continuous-integration)
Revision history for this message
Ryan Harper (raharper) wrote :

Thanks for updating the unittests. This looks good to me.

review: Approve
Revision history for this message
Chad Smith (chad.smith) wrote :

Looks good!

A couple of inline questions:

- Can we emit the same diagnostic messages with their complementary logging.WARNING|DEBUG messages, to ease triage and correlation of KVP diagnostics with cloud-init.log files?
- Can we add a message about how much the exception-handling backoff algorithm is increased? (see the sketch after this comment)

Validated behavior of this branch on an Azure Eoan instance
ubuntu@my-e1:~$ grep 'Did not find Azure data source' /var/lib/hyperv/.kvp_pool_1 /var/log/cloud-init.log
Binary file /var/lib/hyperv/.kvp_pool_1 matches
/var/log/cloud-init.log:2019-08-14 16:37:51,421 - handlers.py[DEBUG]: diagnostic: diagnostic message: Did not find Azure data source in /var/lib/cloud/seed/azure
ubuntu@my-e1:~$ grep Trace /var/log/cloud-init.log
ubuntu@my-e1:~$ fgrep -ir Free /var/log/cloud-init.log
ubuntu@my-e1:~$ fgrep -r diagnostic /var/log/cloud-init.log
2019-08-14 16:37:51,421 - handlers.py[DEBUG]: diagnostic: diagnostic message: Did not find Azure data source in /var/lib/cloud/seed/azure
2019-08-14 16:37:54,792 - handlers.py[DEBUG]: diagnostic: diagnostic message: No Azure endpoint from dhcp options
2019-08-14 16:37:54,794 - handlers.py[DEBUG]: diagnostic: diagnostic message: Azure endpoint found at 168.63.129.16
2019-08-14 16:37:54,801 - handlers.py[DEBUG]: diagnostic: diagnostic message: container_id 88e472d3-5e18-43cd-b43b-59dfe8199cf7
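
The backoff in question works as sketched below. ImdsPollLogger is an illustrative name; in the branch the two counters live on DataSourceAzure and the check happens inside exc_cb while polling IMDS during reprovision. A line is logged only when the attempt counter reaches the threshold, and the threshold doubles each time, so attempts 1, 2, 4, 8, ... produce log lines and the rest stay silent.

    import logging

    LOG = logging.getLogger(__name__)

    class ImdsPollLogger(object):
        """Illustrative stand-in for the counters kept on DataSourceAzure."""

        def __init__(self):
            self.imds_logging_threshold = 1
            self.imds_poll_counter = 1

        def on_retryable_error(self, status_code):
            if self.imds_poll_counter == self.imds_logging_threshold:
                # Reduce logging frequency while polling IMDS: double the
                # threshold every time a line is emitted.
                self.imds_logging_threshold *= 2
                LOG.debug("Call to IMDS failed with status code %s after "
                          "%s retries; backing off logging threshold to %d",
                          status_code, self.imds_poll_counter,
                          self.imds_logging_threshold)
            self.imds_poll_counter += 1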

5193bea... by Anh Vo (MSFT)

consistent messages for diagnostic and logging. Also add log to indicate logging being backed off

Revision history for this message
Anh Vo (MSFT) (vtqanh) wrote :

Thanks Chad for reviewing. I addressed the two points you mentioned. For the first one I couldn't reuse the exact same message (we don't want the full stack trace reported to KVP, just the top-level exception), but I did match the messages to make triage easier.
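
The resulting pattern in the diff builds the message once and sends it both ways, so a string found in /var/lib/hyperv/.kvp_pool_1 can be grepped straight out of /var/log/cloud-init.log. Roughly (the wrapper function here is illustrative; the real call sites are in DataSourceAzure.py):

    import logging
    from cloudinit.sources.helpers.azure import report_diagnostic_event

    LOG = logging.getLogger(__name__)

    def fetch_fabric_data(metadata_func):
        try:
            return metadata_func()
        except Exception as e:
            # KVP gets the message plus the top-level exception only ...
            report_diagnostic_event(
                "Error communicating with Azure fabric; You may experience "
                "connectivity issues: %s" % e)
            # ... while cloud-init.log gets the same message with the full
            # traceback attached.
            LOG.warning(
                "Error communicating with Azure fabric; You may experience "
                "connectivity issues.", exc_info=True)
            return None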

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:5193bea529862589394538d91ff4e0510dce3430
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1043/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1043//rebuild

review: Approve (continuous-integration)
Revision history for this message
Chad Smith (chad.smith) wrote :

Thanks Anh Vo!

review: Approve

Preview Diff

1diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
2index e6ed2f3..4984fa8 100755
3--- a/cloudinit/sources/DataSourceAzure.py
4+++ b/cloudinit/sources/DataSourceAzure.py
5@@ -26,9 +26,14 @@ from cloudinit.url_helper import UrlError, readurl, retry_on_url_exc
6 from cloudinit import util
7 from cloudinit.reporting import events
8
9-from cloudinit.sources.helpers.azure import (azure_ds_reporter,
10- azure_ds_telemetry_reporter,
11- get_metadata_from_fabric)
12+from cloudinit.sources.helpers.azure import (
13+ azure_ds_reporter,
14+ azure_ds_telemetry_reporter,
15+ get_metadata_from_fabric,
16+ get_boot_telemetry,
17+ get_system_info,
18+ report_diagnostic_event,
19+ EphemeralDHCPv4WithReporting)
20
21 LOG = logging.getLogger(__name__)
22
23@@ -354,7 +359,7 @@ class DataSourceAzure(sources.DataSource):
24 bname = str(pk['fingerprint'] + ".crt")
25 fp_files += [os.path.join(ddir, bname)]
26 LOG.debug("ssh authentication: "
27- "using fingerprint from fabirc")
28+ "using fingerprint from fabric")
29
30 with events.ReportEventStack(
31 name="waiting-for-ssh-public-key",
32@@ -419,12 +424,17 @@ class DataSourceAzure(sources.DataSource):
33 ret = load_azure_ds_dir(cdev)
34
35 except NonAzureDataSource:
36+ report_diagnostic_event(
37+ "Did not find Azure data source in %s" % cdev)
38 continue
39 except BrokenAzureDataSource as exc:
40 msg = 'BrokenAzureDataSource: %s' % exc
41+ report_diagnostic_event(msg)
42 raise sources.InvalidMetaDataException(msg)
43 except util.MountFailedError:
44- LOG.warning("%s was not mountable", cdev)
45+ msg = '%s was not mountable' % cdev
46+ report_diagnostic_event(msg)
47+ LOG.warning(msg)
48 continue
49
50 perform_reprovision = reprovision or self._should_reprovision(ret)
51@@ -432,6 +442,7 @@ class DataSourceAzure(sources.DataSource):
52 if util.is_FreeBSD():
53 msg = "Free BSD is not supported for PPS VMs"
54 LOG.error(msg)
55+ report_diagnostic_event(msg)
56 raise sources.InvalidMetaDataException(msg)
57 ret = self._reprovision()
58 imds_md = get_metadata_from_imds(
59@@ -450,7 +461,9 @@ class DataSourceAzure(sources.DataSource):
60 break
61
62 if not found:
63- raise sources.InvalidMetaDataException('No Azure metadata found')
64+ msg = 'No Azure metadata found'
65+ report_diagnostic_event(msg)
66+ raise sources.InvalidMetaDataException(msg)
67
68 if found == ddir:
69 LOG.debug("using files cached in %s", ddir)
70@@ -469,9 +482,14 @@ class DataSourceAzure(sources.DataSource):
71 self._report_ready(lease=self._ephemeral_dhcp_ctx.lease)
72 self._ephemeral_dhcp_ctx.clean_network() # Teardown ephemeral
73 else:
74- with EphemeralDHCPv4() as lease:
75- self._report_ready(lease=lease)
76-
77+ try:
78+ with EphemeralDHCPv4WithReporting(
79+ azure_ds_reporter) as lease:
80+ self._report_ready(lease=lease)
81+ except Exception as e:
82+ report_diagnostic_event(
83+ "exception while reporting ready: %s" % e)
84+ raise
85 return crawled_data
86
87 def _is_platform_viable(self):
88@@ -493,6 +511,16 @@ class DataSourceAzure(sources.DataSource):
89 if not self._is_platform_viable():
90 return False
91 try:
92+ get_boot_telemetry()
93+ except Exception as e:
94+ LOG.warning("Failed to get boot telemetry: %s", e)
95+
96+ try:
97+ get_system_info()
98+ except Exception as e:
99+ LOG.warning("Failed to get system information: %s", e)
100+
101+ try:
102 crawled_data = util.log_time(
103 logfunc=LOG.debug, msg='Crawl of metadata service',
104 func=self.crawl_metadata)
105@@ -551,27 +579,55 @@ class DataSourceAzure(sources.DataSource):
106 headers = {"Metadata": "true"}
107 nl_sock = None
108 report_ready = bool(not os.path.isfile(REPORTED_READY_MARKER_FILE))
109+ self.imds_logging_threshold = 1
110+ self.imds_poll_counter = 1
111+ dhcp_attempts = 0
112+ vnet_switched = False
113+ return_val = None
114
115 def exc_cb(msg, exception):
116 if isinstance(exception, UrlError) and exception.code == 404:
117+ if self.imds_poll_counter == self.imds_logging_threshold:
118+ # Reducing the logging frequency as we are polling IMDS
119+ self.imds_logging_threshold *= 2
120+ LOG.debug("Call to IMDS with arguments %s failed "
121+ "with status code %s after %s retries",
122+ msg, exception.code, self.imds_poll_counter)
123+ LOG.debug("Backing off logging threshold for the same "
124+ "exception to %d", self.imds_logging_threshold)
125+ self.imds_poll_counter += 1
126 return True
127+
128 # If we get an exception while trying to call IMDS, we
129 # call DHCP and setup the ephemeral network to acquire the new IP.
130+ LOG.debug("Call to IMDS with arguments %s failed with "
131+ "status code %s", msg, exception.code)
132+ report_diagnostic_event("polling IMDS failed with exception %s"
133+ % exception.code)
134 return False
135
136 LOG.debug("Wait for vnetswitch to happen")
137 while True:
138 try:
139- # Save our EphemeralDHCPv4 context so we avoid repeated dhcp
140- self._ephemeral_dhcp_ctx = EphemeralDHCPv4()
141- lease = self._ephemeral_dhcp_ctx.obtain_lease()
142+ # Save our EphemeralDHCPv4 context to avoid repeated dhcp
143+ with events.ReportEventStack(
144+ name="obtain-dhcp-lease",
145+ description="obtain dhcp lease",
146+ parent=azure_ds_reporter):
147+ self._ephemeral_dhcp_ctx = EphemeralDHCPv4()
148+ lease = self._ephemeral_dhcp_ctx.obtain_lease()
149+
150+ if vnet_switched:
151+ dhcp_attempts += 1
152 if report_ready:
153 try:
154 nl_sock = netlink.create_bound_netlink_socket()
155 except netlink.NetlinkCreateSocketError as e:
156+ report_diagnostic_event(e)
157 LOG.warning(e)
158 self._ephemeral_dhcp_ctx.clean_network()
159- return
160+ break
161+
162 path = REPORTED_READY_MARKER_FILE
163 LOG.info(
164 "Creating a marker file to report ready: %s", path)
165@@ -579,17 +635,33 @@ class DataSourceAzure(sources.DataSource):
166 pid=os.getpid(), time=time()))
167 self._report_ready(lease=lease)
168 report_ready = False
169- try:
170- netlink.wait_for_media_disconnect_connect(
171- nl_sock, lease['interface'])
172- except AssertionError as error:
173- LOG.error(error)
174- return
175+
176+ with events.ReportEventStack(
177+ name="wait-for-media-disconnect-connect",
178+ description="wait for vnet switch",
179+ parent=azure_ds_reporter):
180+ try:
181+ netlink.wait_for_media_disconnect_connect(
182+ nl_sock, lease['interface'])
183+ except AssertionError as error:
184+ report_diagnostic_event(error)
185+ LOG.error(error)
186+ break
187+
188+ vnet_switched = True
189 self._ephemeral_dhcp_ctx.clean_network()
190 else:
191- return readurl(url, timeout=IMDS_TIMEOUT_IN_SECONDS,
192- headers=headers, exception_cb=exc_cb,
193- infinite=True, log_req_resp=False).contents
194+ with events.ReportEventStack(
195+ name="get-reprovision-data-from-imds",
196+ description="get reprovision data from imds",
197+ parent=azure_ds_reporter):
198+ return_val = readurl(url,
199+ timeout=IMDS_TIMEOUT_IN_SECONDS,
200+ headers=headers,
201+ exception_cb=exc_cb,
202+ infinite=True,
203+ log_req_resp=False).contents
204+ break
205 except UrlError:
206 # Teardown our EphemeralDHCPv4 context on failure as we retry
207 self._ephemeral_dhcp_ctx.clean_network()
208@@ -598,6 +670,14 @@ class DataSourceAzure(sources.DataSource):
209 if nl_sock:
210 nl_sock.close()
211
212+ if vnet_switched:
213+ report_diagnostic_event("attempted dhcp %d times after reuse" %
214+ dhcp_attempts)
215+ report_diagnostic_event("polled imds %d times after reuse" %
216+ self.imds_poll_counter)
217+
218+ return return_val
219+
220 @azure_ds_telemetry_reporter
221 def _report_ready(self, lease):
222 """Tells the fabric provisioning has completed """
223@@ -666,9 +746,12 @@ class DataSourceAzure(sources.DataSource):
224 self.ds_cfg['agent_command'])
225 try:
226 fabric_data = metadata_func()
227- except Exception:
228+ except Exception as e:
229+ report_diagnostic_event(
230+ "Error communicating with Azure fabric; You may experience "
231+ "connectivity issues: %s" % e)
232 LOG.warning(
233- "Error communicating with Azure fabric; You may experience."
234+ "Error communicating with Azure fabric; You may experience "
235 "connectivity issues.", exc_info=True)
236 return False
237
238@@ -1027,7 +1110,9 @@ def read_azure_ovf(contents):
239 try:
240 dom = minidom.parseString(contents)
241 except Exception as e:
242- raise BrokenAzureDataSource("Invalid ovf-env.xml: %s" % e)
243+ error_str = "Invalid ovf-env.xml: %s" % e
244+ report_diagnostic_event(error_str)
245+ raise BrokenAzureDataSource(error_str)
246
247 results = find_child(dom.documentElement,
248 lambda n: n.localName == "ProvisioningSection")
249@@ -1299,8 +1384,13 @@ def get_metadata_from_imds(fallback_nic, retries):
250 if net.is_up(fallback_nic):
251 return util.log_time(**kwargs)
252 else:
253- with EphemeralDHCPv4(fallback_nic):
254- return util.log_time(**kwargs)
255+ try:
256+ with EphemeralDHCPv4WithReporting(
257+ azure_ds_reporter, fallback_nic):
258+ return util.log_time(**kwargs)
259+ except Exception as e:
260+ report_diagnostic_event("exception while getting metadata: %s" % e)
261+ raise
262
263
264 @azure_ds_telemetry_reporter
265@@ -1313,11 +1403,14 @@ def _get_metadata_from_imds(retries):
266 url, timeout=IMDS_TIMEOUT_IN_SECONDS, headers=headers,
267 retries=retries, exception_cb=retry_on_url_exc)
268 except Exception as e:
269- LOG.debug('Ignoring IMDS instance metadata: %s', e)
270+ msg = 'Ignoring IMDS instance metadata: %s' % e
271+ report_diagnostic_event(msg)
272+ LOG.debug(msg)
273 return {}
274 try:
275 return util.load_json(str(response))
276- except json.decoder.JSONDecodeError:
277+ except json.decoder.JSONDecodeError as e:
278+ report_diagnostic_event('non-json imds response' % e)
279 LOG.warning(
280 'Ignoring non-json IMDS instance metadata: %s', str(response))
281 return {}
282@@ -1370,8 +1463,10 @@ def _is_platform_viable(seed_dir):
283 asset_tag = util.read_dmi_data('chassis-asset-tag')
284 if asset_tag == AZURE_CHASSIS_ASSET_TAG:
285 return True
286- LOG.debug("Non-Azure DMI asset tag '%s' discovered.", asset_tag)
287- evt.description = "Non-Azure DMI asset tag '%s' discovered.", asset_tag
288+ msg = "Non-Azure DMI asset tag '%s' discovered." % asset_tag
289+ LOG.debug(msg)
290+ evt.description = msg
291+ report_diagnostic_event(msg)
292 if os.path.exists(os.path.join(seed_dir, 'ovf-env.xml')):
293 return True
294 return False
295diff --git a/cloudinit/sources/helpers/azure.py b/cloudinit/sources/helpers/azure.py
296index 82c4c8c..f1fba17 100755
297--- a/cloudinit/sources/helpers/azure.py
298+++ b/cloudinit/sources/helpers/azure.py
299@@ -16,7 +16,11 @@ from xml.etree import ElementTree
300
301 from cloudinit import url_helper
302 from cloudinit import util
303+from cloudinit import version
304+from cloudinit import distros
305 from cloudinit.reporting import events
306+from cloudinit.net.dhcp import EphemeralDHCPv4
307+from datetime import datetime
308
309 LOG = logging.getLogger(__name__)
310
311@@ -24,6 +28,10 @@ LOG = logging.getLogger(__name__)
312 # value is applied if the endpoint can't be found within a lease file
313 DEFAULT_WIRESERVER_ENDPOINT = "a8:3f:81:10"
314
315+BOOT_EVENT_TYPE = 'boot-telemetry'
316+SYSTEMINFO_EVENT_TYPE = 'system-info'
317+DIAGNOSTIC_EVENT_TYPE = 'diagnostic'
318+
319 azure_ds_reporter = events.ReportEventStack(
320 name="azure-ds",
321 description="initialize reporter for azure ds",
322@@ -40,6 +48,105 @@ def azure_ds_telemetry_reporter(func):
323 return impl
324
325
326+@azure_ds_telemetry_reporter
327+def get_boot_telemetry():
328+ """Report timestamps related to kernel initialization and systemd
329+ activation of cloud-init"""
330+ if not distros.uses_systemd():
331+ raise RuntimeError(
332+ "distro not using systemd, skipping boot telemetry")
333+
334+ LOG.debug("Collecting boot telemetry")
335+ try:
336+ kernel_start = float(time.time()) - float(util.uptime())
337+ except ValueError:
338+ raise RuntimeError("Failed to determine kernel start timestamp")
339+
340+ try:
341+ out, _ = util.subp(['/bin/systemctl',
342+ 'show', '-p',
343+ 'UserspaceTimestampMonotonic'],
344+ capture=True)
345+ tsm = None
346+ if out and '=' in out:
347+ tsm = out.split("=")[1]
348+
349+ if not tsm:
350+ raise RuntimeError("Failed to parse "
351+ "UserspaceTimestampMonotonic from systemd")
352+
353+ user_start = kernel_start + (float(tsm) / 1000000)
354+ except util.ProcessExecutionError as e:
355+ raise RuntimeError("Failed to get UserspaceTimestampMonotonic: %s"
356+ % e)
357+ except ValueError as e:
358+ raise RuntimeError("Failed to parse "
359+ "UserspaceTimestampMonotonic from systemd: %s"
360+ % e)
361+
362+ try:
363+ out, _ = util.subp(['/bin/systemctl', 'show',
364+ 'cloud-init-local', '-p',
365+ 'InactiveExitTimestampMonotonic'],
366+ capture=True)
367+ tsm = None
368+ if out and '=' in out:
369+ tsm = out.split("=")[1]
370+ if not tsm:
371+ raise RuntimeError("Failed to parse "
372+ "InactiveExitTimestampMonotonic from systemd")
373+
374+ cloudinit_activation = kernel_start + (float(tsm) / 1000000)
375+ except util.ProcessExecutionError as e:
376+ raise RuntimeError("Failed to get InactiveExitTimestampMonotonic: %s"
377+ % e)
378+ except ValueError as e:
379+ raise RuntimeError("Failed to parse "
380+ "InactiveExitTimestampMonotonic from systemd: %s"
381+ % e)
382+
383+ evt = events.ReportingEvent(
384+ BOOT_EVENT_TYPE, 'boot-telemetry',
385+ "kernel_start=%s user_start=%s cloudinit_activation=%s" %
386+ (datetime.utcfromtimestamp(kernel_start).isoformat() + 'Z',
387+ datetime.utcfromtimestamp(user_start).isoformat() + 'Z',
388+ datetime.utcfromtimestamp(cloudinit_activation).isoformat() + 'Z'),
389+ events.DEFAULT_EVENT_ORIGIN)
390+ events.report_event(evt)
391+
392+ # return the event for unit testing purpose
393+ return evt
394+
395+
396+@azure_ds_telemetry_reporter
397+def get_system_info():
398+ """Collect and report system information"""
399+ info = util.system_info()
400+ evt = events.ReportingEvent(
401+ SYSTEMINFO_EVENT_TYPE, 'system information',
402+ "cloudinit_version=%s, kernel_version=%s, variant=%s, "
403+ "distro_name=%s, distro_version=%s, flavor=%s, "
404+ "python_version=%s" %
405+ (version.version_string(), info['release'], info['variant'],
406+ info['dist'][0], info['dist'][1], info['dist'][2],
407+ info['python']), events.DEFAULT_EVENT_ORIGIN)
408+ events.report_event(evt)
409+
410+ # return the event for unit testing purpose
411+ return evt
412+
413+
414+def report_diagnostic_event(str):
415+ """Report a diagnostic event"""
416+ evt = events.ReportingEvent(
417+ DIAGNOSTIC_EVENT_TYPE, 'diagnostic message',
418+ str, events.DEFAULT_EVENT_ORIGIN)
419+ events.report_event(evt)
420+
421+ # return the event for unit testing purpose
422+ return evt
423+
424+
425 @contextmanager
426 def cd(newdir):
427 prevdir = os.getcwd()
428@@ -360,16 +467,19 @@ class WALinuxAgentShim(object):
429 value = dhcp245
430 LOG.debug("Using Azure Endpoint from dhcp options")
431 if value is None:
432+ report_diagnostic_event("No Azure endpoint from dhcp options")
433 LOG.debug('Finding Azure endpoint from networkd...')
434 value = WALinuxAgentShim._networkd_get_value_from_leases()
435 if value is None:
436 # Option-245 stored in /run/cloud-init/dhclient.hooks/<ifc>.json
437 # a dhclient exit hook that calls cloud-init-dhclient-hook
438+ report_diagnostic_event("No Azure endpoint from networkd")
439 LOG.debug('Finding Azure endpoint from hook json...')
440 dhcp_options = WALinuxAgentShim._load_dhclient_json()
441 value = WALinuxAgentShim._get_value_from_dhcpoptions(dhcp_options)
442 if value is None:
443 # Fallback and check the leases file if unsuccessful
444+ report_diagnostic_event("No Azure endpoint from dhclient logs")
445 LOG.debug("Unable to find endpoint in dhclient logs. "
446 " Falling back to check lease files")
447 if fallback_lease_file is None:
448@@ -381,11 +491,15 @@ class WALinuxAgentShim(object):
449 value = WALinuxAgentShim._get_value_from_leases_file(
450 fallback_lease_file)
451 if value is None:
452- LOG.warning("No lease found; using default endpoint")
453+ msg = "No lease found; using default endpoint"
454+ report_diagnostic_event(msg)
455+ LOG.warning(msg)
456 value = DEFAULT_WIRESERVER_ENDPOINT
457
458 endpoint_ip_address = WALinuxAgentShim.get_ip_from_lease_value(value)
459- LOG.debug('Azure endpoint found at %s', endpoint_ip_address)
460+ msg = 'Azure endpoint found at %s' % endpoint_ip_address
461+ report_diagnostic_event(msg)
462+ LOG.debug(msg)
463 return endpoint_ip_address
464
465 @azure_ds_telemetry_reporter
466@@ -399,16 +513,19 @@ class WALinuxAgentShim(object):
467 try:
468 response = http_client.get(
469 'http://{0}/machine/?comp=goalstate'.format(self.endpoint))
470- except Exception:
471+ except Exception as e:
472 if attempts < 10:
473 time.sleep(attempts + 1)
474 else:
475+ report_diagnostic_event(
476+ "failed to register with Azure: %s" % e)
477 raise
478 else:
479 break
480 attempts += 1
481 LOG.debug('Successfully fetched GoalState XML.')
482 goal_state = GoalState(response.contents, http_client)
483+ report_diagnostic_event("container_id %s" % goal_state.container_id)
484 ssh_keys = []
485 if goal_state.certificates_xml is not None and pubkey_info is not None:
486 LOG.debug('Certificate XML found; parsing out public keys.')
487@@ -449,11 +566,20 @@ class WALinuxAgentShim(object):
488 container_id=goal_state.container_id,
489 instance_id=goal_state.instance_id,
490 )
491- http_client.post(
492- "http://{0}/machine?comp=health".format(self.endpoint),
493- data=document,
494- extra_headers={'Content-Type': 'text/xml; charset=utf-8'},
495- )
496+ # Host will collect kvps when cloud-init reports ready.
497+ # some kvps might still be in the queue. We yield the scheduler
498+ # to make sure we process all kvps up till this point.
499+ time.sleep(0)
500+ try:
501+ http_client.post(
502+ "http://{0}/machine?comp=health".format(self.endpoint),
503+ data=document,
504+ extra_headers={'Content-Type': 'text/xml; charset=utf-8'},
505+ )
506+ except Exception as e:
507+ report_diagnostic_event("exception while reporting ready: %s" % e)
508+ raise
509+
510 LOG.info('Reported ready to Azure fabric.')
511
512
513@@ -467,4 +593,22 @@ def get_metadata_from_fabric(fallback_lease_file=None, dhcp_opts=None,
514 finally:
515 shim.clean_up()
516
517+
518+class EphemeralDHCPv4WithReporting(object):
519+ def __init__(self, reporter, nic=None):
520+ self.reporter = reporter
521+ self.ephemeralDHCPv4 = EphemeralDHCPv4(iface=nic)
522+
523+ def __enter__(self):
524+ with events.ReportEventStack(
525+ name="obtain-dhcp-lease",
526+ description="obtain dhcp lease",
527+ parent=self.reporter):
528+ return self.ephemeralDHCPv4.__enter__()
529+
530+ def __exit__(self, excp_type, excp_value, excp_traceback):
531+ self.ephemeralDHCPv4.__exit__(
532+ excp_type, excp_value, excp_traceback)
533+
534+
535 # vi: ts=4 expandtab
536diff --git a/tests/unittests/test_datasource/test_azure.py b/tests/unittests/test_datasource/test_azure.py
537index 4d57ceb..3547dd9 100644
538--- a/tests/unittests/test_datasource/test_azure.py
539+++ b/tests/unittests/test_datasource/test_azure.py
540@@ -181,7 +181,7 @@ class TestGetMetadataFromIMDS(HttprettyTestCase):
541 self.logs.getvalue())
542
543 @mock.patch(MOCKPATH + 'readurl')
544- @mock.patch(MOCKPATH + 'EphemeralDHCPv4')
545+ @mock.patch(MOCKPATH + 'EphemeralDHCPv4WithReporting')
546 @mock.patch(MOCKPATH + 'net.is_up')
547 def test_get_metadata_performs_dhcp_when_network_is_down(
548 self, m_net_is_up, m_dhcp, m_readurl):
549@@ -195,7 +195,7 @@ class TestGetMetadataFromIMDS(HttprettyTestCase):
550 dsaz.get_metadata_from_imds('eth9', retries=2))
551
552 m_net_is_up.assert_called_with('eth9')
553- m_dhcp.assert_called_with('eth9')
554+ m_dhcp.assert_called_with(mock.ANY, 'eth9')
555 self.assertIn(
556 "Crawl of Azure Instance Metadata Service (IMDS) took", # log_time
557 self.logs.getvalue())
558@@ -552,7 +552,8 @@ scbus-1 on xpt0 bus 0
559 dsrc.crawl_metadata()
560 self.assertEqual(str(cm.exception), error_msg)
561
562- @mock.patch('cloudinit.sources.DataSourceAzure.EphemeralDHCPv4')
563+ @mock.patch(
564+ 'cloudinit.sources.DataSourceAzure.EphemeralDHCPv4WithReporting')
565 @mock.patch('cloudinit.sources.DataSourceAzure.util.write_file')
566 @mock.patch(
567 'cloudinit.sources.DataSourceAzure.DataSourceAzure._report_ready')
568@@ -1308,7 +1309,9 @@ class TestAzureBounce(CiTestCase):
569 self.assertEqual(initial_host_name,
570 self.set_hostname.call_args_list[-1][0][0])
571
572- def test_environment_correct_for_bounce_command(self):
573+ @mock.patch.object(dsaz, 'get_boot_telemetry')
574+ def test_environment_correct_for_bounce_command(
575+ self, mock_get_boot_telemetry):
576 interface = 'int0'
577 hostname = 'my-new-host'
578 old_hostname = 'my-old-host'
579@@ -1324,7 +1327,9 @@ class TestAzureBounce(CiTestCase):
580 self.assertEqual(hostname, bounce_env['hostname'])
581 self.assertEqual(old_hostname, bounce_env['old_hostname'])
582
583- def test_default_bounce_command_ifup_used_by_default(self):
584+ @mock.patch.object(dsaz, 'get_boot_telemetry')
585+ def test_default_bounce_command_ifup_used_by_default(
586+ self, mock_get_boot_telemetry):
587 cfg = {'hostname_bounce': {'policy': 'force'}}
588 data = self.get_ovf_env_with_dscfg('some-hostname', cfg)
589 dsrc = self._get_ds(data, agent_command=['not', '__builtin__'])
590diff --git a/tests/unittests/test_reporting_hyperv.py b/tests/unittests/test_reporting_hyperv.py
591old mode 100755
592new mode 100644
593index d01ed5b..640895a
594--- a/tests/unittests/test_reporting_hyperv.py
595+++ b/tests/unittests/test_reporting_hyperv.py
596@@ -7,9 +7,12 @@ import json
597 import os
598 import struct
599 import time
600+import re
601+import mock
602
603 from cloudinit import util
604 from cloudinit.tests.helpers import CiTestCase
605+from cloudinit.sources.helpers import azure
606
607
608 class TestKvpEncoding(CiTestCase):
609@@ -126,3 +129,65 @@ class TextKvpReporter(CiTestCase):
610 reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path)
611 kvps = list(reporter._iterate_kvps(0))
612 self.assertEqual(0, len(kvps))
613+
614+ @mock.patch('cloudinit.distros.uses_systemd')
615+ @mock.patch('cloudinit.util.subp')
616+ def test_get_boot_telemetry(self, m_subp, m_sysd):
617+ reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path)
618+ datetime_pattern = r"\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]"
619+ r"\d:[0-5]\d\.\d+([+-][0-2]\d:[0-5]\d|Z)"
620+
621+ # get_boot_telemetry makes two subp calls to systemctl. We provide
622+ # a list of values that the subp calls should return
623+ m_subp.side_effect = [
624+ ('UserspaceTimestampMonotonic=1844838', ''),
625+ ('InactiveExitTimestampMonotonic=3068203', '')]
626+ m_sysd.return_value = True
627+
628+ reporter.publish_event(azure.get_boot_telemetry())
629+ reporter.q.join()
630+ kvps = list(reporter._iterate_kvps(0))
631+ self.assertEqual(1, len(kvps))
632+
633+ evt_msg = kvps[0]['value']
634+ if not re.search("kernel_start=" + datetime_pattern, evt_msg):
635+ raise AssertionError("missing kernel_start timestamp")
636+ if not re.search("user_start=" + datetime_pattern, evt_msg):
637+ raise AssertionError("missing user_start timestamp")
638+ if not re.search("cloudinit_activation=" + datetime_pattern,
639+ evt_msg):
640+ raise AssertionError(
641+ "missing cloudinit_activation timestamp")
642+
643+ def test_get_system_info(self):
644+ reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path)
645+ pattern = r"[^=\s]+"
646+
647+ reporter.publish_event(azure.get_system_info())
648+ reporter.q.join()
649+ kvps = list(reporter._iterate_kvps(0))
650+ self.assertEqual(1, len(kvps))
651+ evt_msg = kvps[0]['value']
652+
653+ # the most important information is cloudinit version,
654+ # kernel_version, and the distro variant. It is ok if
655+ # if the rest is not available
656+ if not re.search("cloudinit_version=" + pattern, evt_msg):
657+ raise AssertionError("missing cloudinit_version string")
658+ if not re.search("kernel_version=" + pattern, evt_msg):
659+ raise AssertionError("missing kernel_version string")
660+ if not re.search("variant=" + pattern, evt_msg):
661+ raise AssertionError("missing distro variant string")
662+
663+ def test_report_diagnostic_event(self):
664+ reporter = HyperVKvpReportingHandler(kvp_file_path=self.tmp_file_path)
665+
666+ reporter.publish_event(
667+ azure.report_diagnostic_event("test_diagnostic"))
668+ reporter.q.join()
669+ kvps = list(reporter._iterate_kvps(0))
670+ self.assertEqual(1, len(kvps))
671+ evt_msg = kvps[0]['value']
672+
673+ if "test_diagnostic" not in evt_msg:
674+ raise AssertionError("missing expected diagnostic message")
