Merge ~samgilson/cloud-init:cloud-init-analyze-boot into cloud-init:master

Proposed by Sam Gilson on 2019-06-17
Status: Merged
Approved by: Dan Watkins on 2019-07-15
Approved revision: 54c7df160a48df264447727a85f684f18e2428ef
Merge reported by: Server Team CI bot
Merged at revision: not available
Proposed branch: ~samgilson/cloud-init:cloud-init-analyze-boot
Merge into: cloud-init:master
Diff against target: 661 lines (+546/-12)
6 files modified
cloudinit/analyze/__main__.py (+86/-2)
cloudinit/analyze/show.py (+192/-10)
cloudinit/analyze/tests/test_boot.py (+170/-0)
doc/rtd/topics/analyze.rst (+84/-0)
doc/rtd/topics/capabilities.rst (+1/-0)
doc/rtd/topics/debugging.rst (+13/-0)
Reviewer Review Type Date Requested Status
Server Team CI bot continuous-integration Approve on 2019-07-15
Dan Watkins Approve on 2019-07-15
Sam Gilson (community) Approve on 2019-07-15
Chad Smith 2019-06-17 Approve on 2019-07-12
Review via email: mp+368943@code.launchpad.net

Commit message

Cloud-init analyze module: Added ability to analyze boot events.

This branch introduces a new command line feature for cloud-init.
Currently, the cloud-init module has the capability to analyze events in
cloud-init.log in three ways: 'show', 'blame', 'dump'.
These changes add a fourth capability, called 'boot'.
Running the command 'cloud-init analyze boot' will provide the user three
timestamps.
1) Timestamp for when the kernel starts initializing.
2) Timestamp for when the kernel finishes its initialization.
3) Timestamp for when systemd activates cloud-init.
This feature enables cloud-init users to analyze different boot phases.
This would aid in debugging performance issues related
to cloud-init startup or tracking regression.

To post a comment you must log in.
Chad Smith (chad.smith) wrote :

Thank you for this branch Sam and for making cloud-init better!

A couple of notes:
 1. ^M characters present throughout:
    - cloudinit/analyze/tests/test_boot.py
    - doc/rtd/topics/analyze.rst

 2. Timestamps don't seem to be reported properly in an lxc container for me: (note that the kernel ended boot time abd cloud-init start time is in the future on my system:

root@x1:~# cloud-init analyze boot
-- Most Recent Boot Record --
   Kernel Started at: 2019-06-24 20:24:25.527606
   Kernel ended boot at: 2019-06-27 18:53:04.667329
   Kernel time to boot (seconds): 253719.139722
   Cloud-init start: 2019-06-27 18:53:05.277550
   Time between Kernel boot and Cloud-init start (seconds): 0.610222

root@x1:~# date
Mon Jun 24 20:39:25 UTC 2019

 2. timestamp reported by cloud-init boot doesn't match timestamp in cloud-init logs
  on my system. This may lead to confusing errors because you

 cloud-init analyze boot
   Cloud-init start: 2019-06-21 21:56:24.461153

YET cloud-init does startup @ 21.56:08

grep "running 'init-local" /var/log/cloud-init.log | tail -n 1
2019-06-21 21:56:09,067 - util.py[DEBUG]: Cloud-init v. 18.5-59-gb05832a5-1~bddeb running 'init-local' at Fri, 21 Jun 2019 21:56:08 +0000. Up 5.00 seconds.

Maybe we should be using ActiveEnterTimestampMonotonic or ExecMainStartTimestampMonotonic to represesent on cloud-init-local starts it's datasource discovery?

I'll continue to try to triage what might give us a better handle on cloud-init start time

Chad Smith (chad.smith) :
Chad Smith (chad.smith) wrote :

Marking needs information until we have an idea of how best to align cloud-init start time to something that aligns more with the timestamp in cloud-init logs.
Currently ActiveEnterTimestamp and InactiveExitTimestamp do show matching string values with the timestampl recoreded in cloud-init.log

# cloud-init.log

2019-06-24 20:24:25,463 - util.py[DEBUG]: Cloud-init v. 19.1-11-ge44586f2-1~bddeb running 'init-local' at Mon, 24 Jun 2019 20:24:25 +0000. Up 0.00 seconds.

# systemctl properties
root@x1:~# systemctl show -p ActiveEnterTimestamp cloud-init-local
ActiveEnterTimestamp=Mon 2019-06-24 20:24:25 UTC
root@x1:~# systemctl show -p InactiveExitTimestamp cloud-init-local
InactiveExitTimestamp=Mon 2019-06-24 20:24:25 UTC

So I think that maybe the timestamp arithmetic in the branch may be suspect

review: Needs Information
Sam Gilson (samgilson) wrote :

This is very interesting. I created an Ubuntu 18.04:latest VM on Azure and did the same checks as you, and here is what I got.
cat /var/log/cloud-init.log | grep "running 'init-local"
2019-06-24 18:59:34,735 - util.py[DEBUG]: Cloud-init v. 19.1-1-gbaa47854-0ubuntu1~18.04.1 running 'init-local' at Mon, 24 Jun 2019 18:59:34 +0000. Up 22.97 seconds.

systemctl show -p UserspaceTimestamp
UserspaceTimestamp=Mon 2019-06-24 18:59:22 UTC

systemctl show -p ExecMainStartTimestamp cloud-init-local
ExecMainStartTimestamp=Mon 2019-06-24 19:43:33 UTC

systemctl show -p ActiveEnterTimestamp cloud-init-local
ActiveEnterTimestamp=Mon 2019-06-24 19:43:33 UTC

systemctl show -p InactiveExitTimestamp cloud-init-local
InactiveExitTimestamp=Mon 2019-06-24 18:59:30 UTC

cloud-init analyze boot
-- Most Recent Boot Record --
   Kernel Started at: 2019-06-24 18:59:11.739511
   Kernel ended boot at: 2019-06-24 18:59:22.125180
   Kernel time to boot (seconds): 10.385669
   Cloud-init start: 2019-06-24 18:59:30.426426
   Time between Kernel boot and Cloud-init start (seconds): 8.301246

In my case, the InactiveExitTimestamp and the timestamp shown in cloud-init.log are the closest together (still not correct), with the ActiveEnter and ExecMainStart timestamps being very far off the mark. I am not really certain as to why the ActiveEnter and ExecMainStart are so far off in this case.

Sam Gilson (samgilson) wrote :

Keep in mind I had created this VM earlier in the day, hence why all the timestamps are so far off the current time.
> This is very interesting. I created an Ubuntu 18.04:latest VM on Azure and did
> the same checks as you, and here is what I got.
> cat /var/log/cloud-init.log | grep "running 'init-local"
> 2019-06-24 18:59:34,735 - util.py[DEBUG]: Cloud-init v.
> 19.1-1-gbaa47854-0ubuntu1~18.04.1 running 'init-local' at Mon, 24 Jun 2019
> 18:59:34 +0000. Up 22.97 seconds.
>
> systemctl show -p UserspaceTimestamp
> UserspaceTimestamp=Mon 2019-06-24 18:59:22 UTC
>
> systemctl show -p ExecMainStartTimestamp cloud-init-local
> ExecMainStartTimestamp=Mon 2019-06-24 19:43:33 UTC
>
> systemctl show -p ActiveEnterTimestamp cloud-init-local
> ActiveEnterTimestamp=Mon 2019-06-24 19:43:33 UTC
>
> systemctl show -p InactiveExitTimestamp cloud-init-local
> InactiveExitTimestamp=Mon 2019-06-24 18:59:30 UTC
>
> cloud-init analyze boot
> -- Most Recent Boot Record --
> Kernel Started at: 2019-06-24 18:59:11.739511
> Kernel ended boot at: 2019-06-24 18:59:22.125180
> Kernel time to boot (seconds): 10.385669
> Cloud-init start: 2019-06-24 18:59:30.426426
> Time between Kernel boot and Cloud-init start (seconds): 8.301246
>
> In my case, the InactiveExitTimestamp and the timestamp shown in cloud-
> init.log are the closest together (still not correct), with the ActiveEnter
> and ExecMainStart timestamps being very far off the mark. I am not really
> certain as to why the ActiveEnter and ExecMainStart are so far off in this
> case.

Sam Gilson (samgilson) wrote :

Hey Chad, in your first comment, what did you mean by "on my system"? I have been testing this on Ubuntu 18.04:latest and have found that on all the VMs I booted and tried, the cloud-init boot analyze is always consistently 4 seconds behind the cloud-init.log reported time. However, when checking the timestamps, it seems to be that ActiveEnter and ExecMainStart are consistently never correct, often off by several factors of 10. However in your test, it seems that InactiveExit and ActiveEnter are the same.

Sam Gilson (samgilson) wrote :

It seems that on Azure VMs that have been pre-provisioned, the ActiveEnter and ExecMainStart timestamps are incorrect due to the waiting that the pre-provisioning must do during the cloud-init-local phase of cloud-init in order to get the necessary data from Azure. Due to this, the more consistent time stamp is the InactiveExit timestamp, as it is written before any pre-provisioning occurs. However, InactiveExit is consistently 4 seconds off of the cloud-init.log timestamp.

However, when on a VM that has not been pre-provisioned, using ActiveEnter instead of InactiveExit always nets you the correct answer.

Sam Gilson (samgilson) wrote :

Hey Chad, thanks for all the great help in finding these bugs. I went back through and fixed all of the ones you found, so this MP is now ready for a re-review.

review: Resubmit
Chad Smith (chad.smith) wrote :

looks much better Sam, one suggestion inline to drop/adapt get_ci_start_from_log to use exinst event parsing from cloud-init.log

Chad Smith (chad.smith) wrote :

one more on some of the formatting success/failure too

Sam Gilson (samgilson) wrote :

Hey Chad, thank you for the reviews. I made all the changes you specified, specifically being dropping get_ci_start_from_log and replacing it with your inline comment, and changing the constant functions to formatted strings in analyze_boot. The previous two commits represent these changes, and this merge proposal is now ready for a re-review.

FAILED: Continuous integration, rev:8e6d3c9f176e18cf60d3e3281d8f87334ce72417
https://jenkins.ubuntu.com/server/job/cloud-init-ci/766/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    FAILED: Ubuntu LTS: Build

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/766/rebuild

review: Needs Fixing (continuous-integration)

PASSED: Continuous integration, rev:8e6d3c9f176e18cf60d3e3281d8f87334ce72417
https://jenkins.ubuntu.com/server/job/cloud-init-ci/772/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/772/rebuild

review: Approve (continuous-integration)
Chad Smith (chad.smith) wrote :

One final round of review commments from me on this.
here's a pastebin of the changes if you think it'll work:

1. I think we need to document that we are currently working around an lxc issue in this code
with a comment to pointing to https://github.com/lxc/lxcfs/issues/292

2. we probably shouldn't subshell with subp to stat a file, but use python's os.stat

here's a pastebin of the changes I suggest:
https://paste.ubuntu.com/p/TYbddMbVJ2/

Chad Smith (chad.smith) wrote :

This looks good to me. Will talk with other upstream devs on Monday to see if we have any concerns with including this in 19.2.

review: Approve
Dan Watkins (daniel-thewatkins) wrote :

I haven't done a full review (because I'm on a plane and about to be interrupted), but I did find some stylistic issues as well as perhaps some functional ones.

Sam Gilson (samgilson) :
review: Approve
Dan Watkins (daniel-thewatkins) wrote :

Thanks Sam!

review: Approve

Commit message lints:
 - Line #0 has 26 too many characters. Line starts with: "Cloud-init analyze module:"... - Line #2 has 264 too many characters. Line starts with: "This branch introduces"... - Line #6 has 139 too many characters. Line starts with: "This feature enables"...

review: Needs Fixing
review: Approve (continuous-integration)

Commit message lints:
 - Line #12 has 26 too many characters. Line starts with: "This would aid in debugging"...

review: Needs Fixing
review: Approve (continuous-integration)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/cloudinit/analyze/__main__.py b/cloudinit/analyze/__main__.py
2index f861365..99e5c20 100644
3--- a/cloudinit/analyze/__main__.py
4+++ b/cloudinit/analyze/__main__.py
5@@ -7,7 +7,7 @@ import re
6 import sys
7
8 from cloudinit.util import json_dumps
9-
10+from datetime import datetime
11 from . import dump
12 from . import show
13
14@@ -52,9 +52,93 @@ def get_parser(parser=None):
15 dest='outfile', default='-',
16 help='specify where to write output. ')
17 parser_dump.set_defaults(action=('dump', analyze_dump))
18+ parser_boot = subparsers.add_parser(
19+ 'boot', help='Print list of boot times for kernel and cloud-init')
20+ parser_boot.add_argument('-i', '--infile', action='store',
21+ dest='infile', default='/var/log/cloud-init.log',
22+ help='specify where to read input. ')
23+ parser_boot.add_argument('-o', '--outfile', action='store',
24+ dest='outfile', default='-',
25+ help='specify where to write output.')
26+ parser_boot.set_defaults(action=('boot', analyze_boot))
27 return parser
28
29
30+def analyze_boot(name, args):
31+ """Report a list of how long different boot operations took.
32+
33+ For Example:
34+ -- Most Recent Boot Record --
35+ Kernel Started at: <time>
36+ Kernel ended boot at: <time>
37+ Kernel time to boot (seconds): <time>
38+ Cloud-init activated by systemd at: <time>
39+ Time between Kernel end boot and Cloud-init activation (seconds):<time>
40+ Cloud-init start: <time>
41+ """
42+ infh, outfh = configure_io(args)
43+ kernel_info = show.dist_check_timestamp()
44+ status_code, kernel_start, kernel_end, ci_sysd_start = \
45+ kernel_info
46+ kernel_start_timestamp = datetime.utcfromtimestamp(kernel_start)
47+ kernel_end_timestamp = datetime.utcfromtimestamp(kernel_end)
48+ ci_sysd_start_timestamp = datetime.utcfromtimestamp(ci_sysd_start)
49+ try:
50+ last_init_local = \
51+ [e for e in _get_events(infh) if e['name'] == 'init-local' and
52+ 'starting search' in e['description']][-1]
53+ ci_start = datetime.utcfromtimestamp(last_init_local['timestamp'])
54+ except IndexError:
55+ ci_start = 'Could not find init-local log-line in cloud-init.log'
56+ status_code = show.FAIL_CODE
57+
58+ FAILURE_MSG = 'Your Linux distro or container does not support this ' \
59+ 'functionality.\n' \
60+ 'You must be running a Kernel Telemetry supported ' \
61+ 'distro.\nPlease check ' \
62+ 'https://cloudinit.readthedocs.io/en/latest' \
63+ '/topics/analyze.html for more ' \
64+ 'information on supported distros.\n'
65+
66+ SUCCESS_MSG = '-- Most Recent Boot Record --\n' \
67+ ' Kernel Started at: {k_s_t}\n' \
68+ ' Kernel ended boot at: {k_e_t}\n' \
69+ ' Kernel time to boot (seconds): {k_r}\n' \
70+ ' Cloud-init activated by systemd at: {ci_sysd_t}\n' \
71+ ' Time between Kernel end boot and Cloud-init ' \
72+ 'activation (seconds): {bt_r}\n' \
73+ ' Cloud-init start: {ci_start}\n'
74+
75+ CONTAINER_MSG = '-- Most Recent Container Boot Record --\n' \
76+ ' Container started at: {k_s_t}\n' \
77+ ' Cloud-init activated by systemd at: {ci_sysd_t}\n' \
78+ ' Cloud-init start: {ci_start}\n' \
79+
80+ status_map = {
81+ show.FAIL_CODE: FAILURE_MSG,
82+ show.CONTAINER_CODE: CONTAINER_MSG,
83+ show.SUCCESS_CODE: SUCCESS_MSG
84+ }
85+
86+ kernel_runtime = kernel_end - kernel_start
87+ between_process_runtime = ci_sysd_start - kernel_end
88+
89+ kwargs = {
90+ 'k_s_t': kernel_start_timestamp,
91+ 'k_e_t': kernel_end_timestamp,
92+ 'k_r': kernel_runtime,
93+ 'bt_r': between_process_runtime,
94+ 'k_e': kernel_end,
95+ 'k_s': kernel_start,
96+ 'ci_sysd': ci_sysd_start,
97+ 'ci_sysd_t': ci_sysd_start_timestamp,
98+ 'ci_start': ci_start
99+ }
100+
101+ outfh.write(status_map[status_code].format(**kwargs))
102+ return status_code
103+
104+
105 def analyze_blame(name, args):
106 """Report a list of records sorted by largest time delta.
107
108@@ -119,7 +203,7 @@ def analyze_dump(name, args):
109
110 def _get_events(infile):
111 rawdata = None
112- events, rawdata = show.load_events(infile, None)
113+ events, rawdata = show.load_events_infile(infile)
114 if not events:
115 events, _ = dump.dump_events(rawdata=rawdata)
116 return events
117diff --git a/cloudinit/analyze/show.py b/cloudinit/analyze/show.py
118index 3e778b8..511b808 100644
119--- a/cloudinit/analyze/show.py
120+++ b/cloudinit/analyze/show.py
121@@ -8,8 +8,11 @@ import base64
122 import datetime
123 import json
124 import os
125+import time
126+import sys
127
128 from cloudinit import util
129+from cloudinit.distros import uses_systemd
130
131 # An event:
132 '''
133@@ -49,6 +52,10 @@ format_key = {
134
135 formatting_help = " ".join(["{0}: {1}".format(k.replace('%', '%%'), v)
136 for k, v in format_key.items()])
137+SUCCESS_CODE = 'successful'
138+FAIL_CODE = 'failure'
139+CONTAINER_CODE = 'container'
140+TIMESTAMP_UNKNOWN = (FAIL_CODE, -1, -1, -1)
141
142
143 def format_record(msg, event):
144@@ -125,9 +132,175 @@ def total_time_record(total_time):
145 return 'Total Time: %3.5f seconds\n' % total_time
146
147
148+class SystemctlReader(object):
149+ '''
150+ Class for dealing with all systemctl subp calls in a consistent manner.
151+ '''
152+ def __init__(self, property, parameter=None):
153+ self.epoch = None
154+ self.args = ['/bin/systemctl', 'show']
155+ if parameter:
156+ self.args.append(parameter)
157+ self.args.extend(['-p', property])
158+ # Don't want the init of our object to break. Instead of throwing
159+ # an exception, set an error code that gets checked when data is
160+ # requested from the object
161+ self.failure = self.subp()
162+
163+ def subp(self):
164+ '''
165+ Make a subp call based on set args and handle errors by setting
166+ failure code
167+
168+ :return: whether the subp call failed or not
169+ '''
170+ try:
171+ value, err = util.subp(self.args, capture=True)
172+ if err:
173+ return err
174+ self.epoch = value
175+ return None
176+ except Exception as systemctl_fail:
177+ return systemctl_fail
178+
179+ def parse_epoch_as_float(self):
180+ '''
181+ If subp call succeeded, return the timestamp from subp as a float.
182+
183+ :return: timestamp as a float
184+ '''
185+ # subp has 2 ways to fail: it either fails and throws an exception,
186+ # or returns an error code. Raise an exception here in order to make
187+ # sure both scenarios throw exceptions
188+ if self.failure:
189+ raise RuntimeError('Subprocess call to systemctl has failed, '
190+ 'returning error code ({})'
191+ .format(self.failure))
192+ # Output from systemctl show has the format Property=Value.
193+ # For example, UserspaceMonotonic=1929304
194+ timestamp = self.epoch.split('=')[1]
195+ # Timestamps reported by systemctl are in microseconds, converting
196+ return float(timestamp) / 1000000
197+
198+
199+def dist_check_timestamp():
200+ '''
201+ Determine which init system a particular linux distro is using.
202+ Each init system (systemd, upstart, etc) has a different way of
203+ providing timestamps.
204+
205+ :return: timestamps of kernelboot, kernelendboot, and cloud-initstart
206+ or TIMESTAMP_UNKNOWN if the timestamps cannot be retrieved.
207+ '''
208+
209+ if uses_systemd():
210+ return gather_timestamps_using_systemd()
211+
212+ # Use dmesg to get timestamps if the distro does not have systemd
213+ if util.is_FreeBSD() or 'gentoo' in \
214+ util.system_info()['system'].lower():
215+ return gather_timestamps_using_dmesg()
216+
217+ # this distro doesn't fit anything that is supported by cloud-init. just
218+ # return error codes
219+ return TIMESTAMP_UNKNOWN
220+
221+
222+def gather_timestamps_using_dmesg():
223+ '''
224+ Gather timestamps that corresponds to kernel begin initialization,
225+ kernel finish initialization using dmesg as opposed to systemctl
226+
227+ :return: the two timestamps plus a dummy timestamp to keep consistency
228+ with gather_timestamps_using_systemd
229+ '''
230+ try:
231+ data, _ = util.subp(['dmesg'], capture=True)
232+ split_entries = data[0].splitlines()
233+ for i in split_entries:
234+ if i.decode('UTF-8').find('user') != -1:
235+ splitup = i.decode('UTF-8').split()
236+ stripped = splitup[1].strip(']')
237+
238+ # kernel timestamp from dmesg is equal to 0,
239+ # with the userspace timestamp relative to it.
240+ user_space_timestamp = float(stripped)
241+ kernel_start = float(time.time()) - float(util.uptime())
242+ kernel_end = kernel_start + user_space_timestamp
243+
244+ # systemd wont start cloud-init in this case,
245+ # so we cannot get that timestamp
246+ return SUCCESS_CODE, kernel_start, kernel_end, \
247+ kernel_end
248+
249+ except Exception:
250+ pass
251+ return TIMESTAMP_UNKNOWN
252+
253+
254+def gather_timestamps_using_systemd():
255+ '''
256+ Gather timestamps that corresponds to kernel begin initialization,
257+ kernel finish initialization. and cloud-init systemd unit activation
258+
259+ :return: the three timestamps
260+ '''
261+ kernel_start = float(time.time()) - float(util.uptime())
262+ try:
263+ delta_k_end = SystemctlReader('UserspaceTimestampMonotonic')\
264+ .parse_epoch_as_float()
265+ delta_ci_s = SystemctlReader('InactiveExitTimestampMonotonic',
266+ 'cloud-init-local').parse_epoch_as_float()
267+ base_time = kernel_start
268+ status = SUCCESS_CODE
269+ # lxc based containers do not set their monotonic zero point to be when
270+ # the container starts, instead keep using host boot as zero point
271+ # time.CLOCK_MONOTONIC_RAW is only available in python 3.3
272+ if util.is_container():
273+ # clock.monotonic also uses host boot as zero point
274+ if sys.version_info >= (3, 3):
275+ base_time = float(time.time()) - float(time.monotonic())
276+ # TODO: lxcfs automatically truncates /proc/uptime to seconds
277+ # in containers when https://github.com/lxc/lxcfs/issues/292
278+ # is fixed, util.uptime() should be used instead of stat on
279+ try:
280+ file_stat = os.stat('/proc/1/cmdline')
281+ kernel_start = file_stat.st_atime
282+ except OSError as err:
283+ raise RuntimeError('Could not determine container boot '
284+ 'time from /proc/1/cmdline. ({})'
285+ .format(err))
286+ status = CONTAINER_CODE
287+ else:
288+ status = FAIL_CODE
289+ kernel_end = base_time + delta_k_end
290+ cloudinit_sysd = base_time + delta_ci_s
291+
292+ except Exception as e:
293+ # Except ALL exceptions as Systemctl reader can throw many different
294+ # errors, but any failure in systemctl means that timestamps cannot be
295+ # obtained
296+ print(e)
297+ return TIMESTAMP_UNKNOWN
298+ return status, kernel_start, kernel_end, cloudinit_sysd
299+
300+
301 def generate_records(events, blame_sort=False,
302 print_format="(%n) %d seconds in %I%D",
303 dump_files=False, log_datafiles=False):
304+ '''
305+ Take in raw events and create parent-child dependencies between events
306+ in order to order events in chronological order.
307+
308+ :param events: JSONs from dump that represents events taken from logs
309+ :param blame_sort: whether to sort by timestamp or by time taken.
310+ :param print_format: formatting to represent event, time stamp,
311+ and time taken by the event in one line
312+ :param dump_files: whether to dump files into JSONs
313+ :param log_datafiles: whether or not to log events generated
314+
315+ :return: boot records ordered chronologically
316+ '''
317
318 sorted_events = sorted(events, key=lambda x: x['timestamp'])
319 records = []
320@@ -189,19 +362,28 @@ def generate_records(events, blame_sort=False,
321
322
323 def show_events(events, print_format):
324+ '''
325+ A passthrough method that makes it easier to call generate_records()
326+
327+ :param events: JSONs from dump that represents events taken from logs
328+ :param print_format: formatting to represent event, time stamp,
329+ and time taken by the event in one line
330+
331+ :return: boot records ordered chronologically
332+ '''
333 return generate_records(events, print_format=print_format)
334
335
336-def load_events(infile, rawdata=None):
337- if rawdata:
338- data = rawdata.read()
339- else:
340- data = infile.read()
341+def load_events_infile(infile):
342+ '''
343+ Takes in a log file, read it, and convert to json.
344+
345+ :param infile: The Log file to be read
346
347- j = None
348+ :return: json version of logfile, raw file
349+ '''
350+ data = infile.read()
351 try:
352- j = json.loads(data)
353+ return json.loads(data), data
354 except ValueError:
355- pass
356-
357- return j, data
358+ return None, data
359diff --git a/cloudinit/analyze/tests/test_boot.py b/cloudinit/analyze/tests/test_boot.py
360new file mode 100644
361index 0000000..706e2cc
362--- /dev/null
363+++ b/cloudinit/analyze/tests/test_boot.py
364@@ -0,0 +1,170 @@
365+import os
366+from cloudinit.analyze.__main__ import (analyze_boot, get_parser)
367+from cloudinit.tests.helpers import CiTestCase, mock
368+from cloudinit.analyze.show import dist_check_timestamp, SystemctlReader, \
369+ FAIL_CODE, CONTAINER_CODE
370+
371+err_code = (FAIL_CODE, -1, -1, -1)
372+
373+
374+class TestDistroChecker(CiTestCase):
375+
376+ @mock.patch('cloudinit.util.system_info', return_value={'dist': ('', '',
377+ ''),
378+ 'system': ''})
379+ @mock.patch('platform.linux_distribution', return_value=('', '', ''))
380+ @mock.patch('cloudinit.util.is_FreeBSD', return_value=False)
381+ def test_blank_distro(self, m_sys_info, m_linux_distribution, m_free_bsd):
382+ self.assertEqual(err_code, dist_check_timestamp())
383+
384+ @mock.patch('cloudinit.util.system_info', return_value={'dist': ('', '',
385+ '')})
386+ @mock.patch('platform.linux_distribution', return_value=('', '', ''))
387+ @mock.patch('cloudinit.util.is_FreeBSD', return_value=True)
388+ def test_freebsd_gentoo_cant_find(self, m_sys_info,
389+ m_linux_distribution, m_is_FreeBSD):
390+ self.assertEqual(err_code, dist_check_timestamp())
391+
392+ @mock.patch('cloudinit.util.subp', return_value=(0, 1))
393+ def test_subp_fails(self, m_subp):
394+ self.assertEqual(err_code, dist_check_timestamp())
395+
396+
397+class TestSystemCtlReader(CiTestCase):
398+
399+ def test_systemctl_invalid_property(self):
400+ reader = SystemctlReader('dummyProperty')
401+ with self.assertRaises(RuntimeError):
402+ reader.parse_epoch_as_float()
403+
404+ def test_systemctl_invalid_parameter(self):
405+ reader = SystemctlReader('dummyProperty', 'dummyParameter')
406+ with self.assertRaises(RuntimeError):
407+ reader.parse_epoch_as_float()
408+
409+ @mock.patch('cloudinit.util.subp', return_value=('U=1000000', None))
410+ def test_systemctl_works_correctly_threshold(self, m_subp):
411+ reader = SystemctlReader('dummyProperty', 'dummyParameter')
412+ self.assertEqual(1.0, reader.parse_epoch_as_float())
413+ thresh = 1.0 - reader.parse_epoch_as_float()
414+ self.assertTrue(thresh < 1e-6)
415+ self.assertTrue(thresh > (-1 * 1e-6))
416+
417+ @mock.patch('cloudinit.util.subp', return_value=('U=0', None))
418+ def test_systemctl_succeed_zero(self, m_subp):
419+ reader = SystemctlReader('dummyProperty', 'dummyParameter')
420+ self.assertEqual(0.0, reader.parse_epoch_as_float())
421+
422+ @mock.patch('cloudinit.util.subp', return_value=('U=1', None))
423+ def test_systemctl_succeed_distinct(self, m_subp):
424+ reader = SystemctlReader('dummyProperty', 'dummyParameter')
425+ val1 = reader.parse_epoch_as_float()
426+ m_subp.return_value = ('U=2', None)
427+ reader2 = SystemctlReader('dummyProperty', 'dummyParameter')
428+ val2 = reader2.parse_epoch_as_float()
429+ self.assertNotEqual(val1, val2)
430+
431+ @mock.patch('cloudinit.util.subp', return_value=('100', None))
432+ def test_systemctl_epoch_not_splittable(self, m_subp):
433+ reader = SystemctlReader('dummyProperty', 'dummyParameter')
434+ with self.assertRaises(IndexError):
435+ reader.parse_epoch_as_float()
436+
437+ @mock.patch('cloudinit.util.subp', return_value=('U=foobar', None))
438+ def test_systemctl_cannot_convert_epoch_to_float(self, m_subp):
439+ reader = SystemctlReader('dummyProperty', 'dummyParameter')
440+ with self.assertRaises(ValueError):
441+ reader.parse_epoch_as_float()
442+
443+
444+class TestAnalyzeBoot(CiTestCase):
445+
446+ def set_up_dummy_file_ci(self, path, log_path):
447+ infh = open(path, 'w+')
448+ infh.write('2019-07-08 17:40:49,601 - util.py[DEBUG]: Cloud-init v. '
449+ '19.1-1-gbaa47854-0ubuntu1~18.04.1 running \'init-local\' '
450+ 'at Mon, 08 Jul 2019 17:40:49 +0000. Up 18.84 seconds.')
451+ infh.close()
452+ outfh = open(log_path, 'w+')
453+ outfh.close()
454+
455+ def set_up_dummy_file(self, path, log_path):
456+ infh = open(path, 'w+')
457+ infh.write('dummy data')
458+ infh.close()
459+ outfh = open(log_path, 'w+')
460+ outfh.close()
461+
462+ def remove_dummy_file(self, path, log_path):
463+ if os.path.isfile(path):
464+ os.remove(path)
465+ if os.path.isfile(log_path):
466+ os.remove(log_path)
467+
468+ @mock.patch('cloudinit.analyze.show.dist_check_timestamp',
469+ return_value=err_code)
470+ def test_boot_invalid_distro(self, m_dist_check_timestamp):
471+
472+ path = os.path.dirname(os.path.abspath(__file__))
473+ log_path = path + '/boot-test.log'
474+ path += '/dummy.log'
475+ self.set_up_dummy_file(path, log_path)
476+
477+ parser = get_parser()
478+ args = parser.parse_args(args=['boot', '-i', path, '-o',
479+ log_path])
480+ name_default = ''
481+ analyze_boot(name_default, args)
482+ # now args have been tested, go into outfile and make sure error
483+ # message is in the outfile
484+ outfh = open(args.outfile, 'r')
485+ data = outfh.read()
486+ err_string = 'Your Linux distro or container does not support this ' \
487+ 'functionality.\nYou must be running a Kernel ' \
488+ 'Telemetry supported distro.\nPlease check ' \
489+ 'https://cloudinit.readthedocs.io/en/latest/topics' \
490+ '/analyze.html for more information on supported ' \
491+ 'distros.\n'
492+
493+ self.remove_dummy_file(path, log_path)
494+ self.assertEqual(err_string, data)
495+
496+ @mock.patch("cloudinit.util.is_container", return_value=True)
497+ @mock.patch('cloudinit.util.subp', return_value=('U=1000000', None))
498+ def test_container_no_ci_log_line(self, m_is_container, m_subp):
499+ path = os.path.dirname(os.path.abspath(__file__))
500+ log_path = path + '/boot-test.log'
501+ path += '/dummy.log'
502+ self.set_up_dummy_file(path, log_path)
503+
504+ parser = get_parser()
505+ args = parser.parse_args(args=['boot', '-i', path, '-o',
506+ log_path])
507+ name_default = ''
508+
509+ finish_code = analyze_boot(name_default, args)
510+
511+ self.remove_dummy_file(path, log_path)
512+ self.assertEqual(FAIL_CODE, finish_code)
513+
514+ @mock.patch("cloudinit.util.is_container", return_value=True)
515+ @mock.patch('cloudinit.util.subp', return_value=('U=1000000', None))
516+ @mock.patch('cloudinit.analyze.__main__._get_events', return_value=[{
517+ 'name': 'init-local', 'description': 'starting search', 'timestamp':
518+ 100000}])
519+ @mock.patch('cloudinit.analyze.show.dist_check_timestamp',
520+ return_value=(CONTAINER_CODE, 1, 1, 1))
521+ def test_container_ci_log_line(self, m_is_container, m_subp, m_get, m_g):
522+ path = os.path.dirname(os.path.abspath(__file__))
523+ log_path = path + '/boot-test.log'
524+ path += '/dummy.log'
525+ self.set_up_dummy_file_ci(path, log_path)
526+
527+ parser = get_parser()
528+ args = parser.parse_args(args=['boot', '-i', path, '-o',
529+ log_path])
530+ name_default = ''
531+ finish_code = analyze_boot(name_default, args)
532+
533+ self.remove_dummy_file(path, log_path)
534+ self.assertEqual(CONTAINER_CODE, finish_code)
535diff --git a/doc/rtd/topics/analyze.rst b/doc/rtd/topics/analyze.rst
536new file mode 100644
537index 0000000..5cf38bd
538--- /dev/null
539+++ b/doc/rtd/topics/analyze.rst
540@@ -0,0 +1,84 @@
541+*************************
542+Cloud-init Analyze Module
543+*************************
544+
545+Overview
546+========
547+The analyze module was added to cloud-init in order to help analyze cloud-init boot time
548+performance. It is loosely based on systemd-analyze where there are 4 main actions:
549+show, blame, dump, and boot.
550+
551+The 'show' action is similar to 'systemd-analyze critical-chain' which prints a list of units, the
552+time they started and how long they took. For cloud-init, we have four stages, and within each stage
553+a number of modules may run depending on configuration. ‘cloudinit-analyze show’ will, for each
554+boot, print this information and a summary total time, per boot.
555+
556+The 'blame' action matches 'systemd-analyze blame' where it prints, in descending order,
557+the units that took the longest to run. This output is highly useful for examining where cloud-init
558+is spending its time during execution.
559+
560+The 'dump' action simply dumps the cloud-init logs that the analyze module is performing
561+the analysis on and returns a list of dictionaries that can be consumed for other reporting needs.
562+
563+The 'boot' action prints out kernel related timestamps that are not included in any of the
564+cloud-init logs. There are three different timestamps that are presented to the user:
565+kernel start, kernel finish boot, and cloud-init start. This was added for additional
566+clarity into the boot process that cloud-init does not have control over, to aid in debugging of
567+performance issues related to cloudinit startup or tracking regression.
568+
569+Usage
570+=====
571+Using each of the printing formats is as easy as running one of the following bash commands:
572+
573+.. code-block:: shell-session
574+
575+ cloud-init analyze show
576+ cloud-init analyze blame
577+ cloud-init analyze dump
578+ cloud-init analyze boot
579+
580+Cloud-init analyze boot Timestamp Gathering
581+===========================================
582+The following boot related timestamps are gathered on demand when cloud-init analyze boot runs:
583+- Kernel Startup, which is inferred from system uptime
584+- Kernel Finishes Initialization, which is inferred from systemd UserSpaceMonotonicTimestamp property
585+- Cloud-init activation, which is inferred from the property InactiveExitTimestamp of the cloud-init
586+local systemd unit.
587+
588+In order to gather the necessary timestamps using systemd, running the commands
589+
590+.. code-block:: shell-session
591+
592+ systemctl show -p UserspaceTimestampMonotonic
593+ systemctl show cloud-init-local -p InactiveExitTimestampMonotonic
594+
595+will gather the UserspaceTimestamp and InactiveExitTimestamp.
596+The UserspaceTimestamp tracks when the init system starts, which is used as an indicator of kernel
597+finishing initialization. The InactiveExitTimestamp tracks when a particular systemd unit transitions
598+from the Inactive to Active state, which can be used to mark the beginning of systemd's activation
599+of cloud-init.
600+
601+Currently this only works for distros that use systemd as the init process. We will be expanding
602+support for other distros in the future and this document will be updated accordingly.
603+
604+If systemd is not present on the system, dmesg is used to attempt to find an event that logs the
605+beginning of the init system. However, with this method only the first two timestamps are able to be found;
606+dmesg does not monitor userspace processes, so no cloud-init start timestamps are emitted like when
607+using systemd.
608+
609+List of Cloud-init analyze boot supported distros
610+=================================================
611+- Arch
612+- CentOS
613+- Debian
614+- Fedora
615+- OpenSuSE
616+- Red Hat Enterprise Linux
617+- Ubuntu
618+- SUSE Linux Enterprise Server
619+- CoreOS
620+
621+List of Cloud-init analyze boot unsupported distros
622+===================================================
623+- FreeBSD
624+- Gentoo
625\ No newline at end of file
626diff --git a/doc/rtd/topics/capabilities.rst b/doc/rtd/topics/capabilities.rst
627index 0d8b894..6d85a99 100644
628--- a/doc/rtd/topics/capabilities.rst
629+++ b/doc/rtd/topics/capabilities.rst
630@@ -217,6 +217,7 @@ Get detailed reports of where cloud-init spends most of its time. See
631 * **dump** Machine-readable JSON dump of all cloud-init tracked events.
632 * **show** show time-ordered report of the cost of operations during each
633 boot stage.
634+* **boot** show timestamps from kernel initialization, kernel finish initialization, and cloud-init start.
635
636 .. _cli_devel:
637
638diff --git a/doc/rtd/topics/debugging.rst b/doc/rtd/topics/debugging.rst
639index 51363ea..e13d915 100644
640--- a/doc/rtd/topics/debugging.rst
641+++ b/doc/rtd/topics/debugging.rst
642@@ -68,6 +68,19 @@ subcommands default to reading /var/log/cloud-init.log.
643 00.00100s (modules-final/config-rightscale_userdata)
644 ...
645
646+* ``analyze boot`` Make subprocess calls to the kernel in order to get relevant
647+ pre-cloud-init timestamps, such as the kernel start, kernel finish boot, and cloud-init start.
648+
649+.. code-block:: shell-session
650+
651+ $ cloud-init analyze boot
652+ -- Most Recent Boot Record --
653+ Kernel Started at: 2019-06-13 15:59:55.809385
654+ Kernel ended boot at: 2019-06-13 16:00:00.944740
655+ Kernel time to boot (seconds): 5.135355
656+ Cloud-init start: 2019-06-13 16:00:05.738396
657+ Time between Kernel boot and Cloud-init start (seconds): 4.793656
658+
659
660 Analyze quickstart - LXC
661 ---------------------------

Subscribers

People subscribed via source and target branches