Merge lp:~qzhang/lava-scheduler/retrieve-job-fail-reason into lp:lava-scheduler

Proposed by Spring Zhang
Status: Needs review
Proposed branch: lp:~qzhang/lava-scheduler/retrieve-job-fail-reason
Merge into: lp:lava-scheduler
Diff against target: 150 lines (+70/-5)
3 files modified
lava_scheduler_app/logfile_helper.py (+63/-3)
lava_scheduler_app/models.py (+0/-1)
lava_scheduler_daemon/dbjobsource.py (+7/-1)
To merge this branch: bzr merge lp:~qzhang/lava-scheduler/retrieve-job-fail-reason
Reviewer Review Type Date Requested Status
Spring Zhang (community) Needs Resubmitting
Zygmunt Krynicki (community) Needs Fixing
Paul Larson (community) Disapprove
Review via email: mp+96338@code.launchpad.net

Description of the change

Based on the job failure reason summary, I made the commit to find out why health job failure from job logfile. If the failure is caused by 'LAVA', I prefer to keep health job status as it is. Other reasons will trigger the board goes offline.

Versus retrieving information from result bundle, I think the log information make sense, I'd like to keep the task getting information from result bundle in the future if necessary.

One question is about job.log_file, is it a string which could be use .find() method as:
if self.logfile.find(item[0]) != -1

To post a comment you must log in.
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

flake8 (pip install flake8) check your code please

9 +class JobFailReason:

Inherit object please

+ self.type = ["LAVA", "Master Image", "HW"]

Ugly way to specify an enum, I'd prefer
FAULT_LAVA, FAULT_MASTER, FAULT_HW = range(3)
And put it in the class scope

13 + # struct: [ "match string": ["error type", "detail reason"]]
16 + self.match_matrix = [

Array-of-arrays, ugh, use namedtuple to make that readable:

from collections import namedtuple
M = namedtuple("M", "match error detail")

M("tar: You must specify one of the", FAULT_LAVA, "deployment failed due to untar issue"),

... and so on

34 + def get_health_check_reason(self):
35 + return self.health_check_reason

Useless, attribute was public anyway, remove it

37 + def get_error_types(self):
38 + return self.type

This should be a class level property, JobFailReason.FAULT_xxx

+ detail_msg = "\r" + r[1] + " -- " + r[2]

\r ? really?

review: Needs Fixing
148. By Spring Zhang

use flake8 check to align PEP8

149. By Spring Zhang

Modify due to comments.

1. use namedtuple
2. use class property

Revision history for this message
Spring Zhang (qzhang) wrote :

Modify due to comments. And add a warning status of health check.

review: Needs Resubmitting
Revision history for this message
Paul Larson (pwlars) wrote :

I don't think we want to do this. One of the uses of health checks is to catch problems with lava after performing an update. The idea is that we would set all health statuses to unknown when doing the update, and force a health check before they process any new jobs. If there's a bug we introduced that drastically interferes with processing jobs then it can catch it, and prevent any more jobs from running.

review: Disapprove
150. By Spring Zhang

Revert to put board offline onece check job failed

151. By Spring Zhang

merge from upstream and revert

Revision history for this message
Spring Zhang (qzhang) wrote :

Now it is back to put board offline once there is a job failure, only add some possible reasons from job log file.

review: Needs Resubmitting
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

72 + JobFailReason.M("tar: You must specify one of the",

You can just say M as it is already in scope.

144 + if not 'server' in params or \
145 + not isinstance(params['server'], unicode):

The general consensus around using trailing \ is not to do it. Please use parentheses in this case.

>>> if (something \n rest of if):

+ JobFailReason.FAULT_LAVA, "deployment failed due to untar issue"),

You can say self.FAULT_LAVA, this is shorter and is usually preferred.

99 + errors = ""
100 + for r in self._health_check_reason:
101 + errors += "\n" + r.match_text + " -- " + r.reason
102 + return errors

Don't join strings like this in python. It is quadratic in complexity.
Instead use str.join(), it is linear in complexity and you get the number of glue elements right (you otherwise get the unneeded leading newline). Compare:

return "\n".join(
           ["{0.match_text} -- {0.reason}".format(r)
           for f in self._health_check_reason])

review: Needs Fixing
152. By Spring Zhang

merge from mainline

153. By Spring Zhang

modify due to comments

154. By Spring Zhang

Add more error message text

155. By Spring Zhang

1. modify nameturple name to a make-sense one
2. use FieldFile to read logfile

Revision history for this message
Spring Zhang (qzhang) wrote :

1. Fix due to comments.
2. Have a test.

review: Needs Resubmitting
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Firstly I should apologise; I think I said that I'd look at this weeks ago and haven't.

I definitely like the idea of attempting, even quite crudely, to automatically fish problems out of the log.

I don't think having a static list in the source is a very good way of doing this though -- perhaps we could have a simple table containing regular expressions and explanations that you could edit through the admin interface?

Also, I think if we're going to put more information in the transition table (which sounds fine), we should display it better. We have markdown support in lava-server already so it should be possible to render it as markdown if we really want to... (I also think we should be able to edit the most recent offline reason, but that surely is a separate bug).

Revision history for this message
Spring Zhang (qzhang) wrote :

yes, I also think it's a lot hard coded, a user-defined table is better.

Unmerged revisions

155. By Spring Zhang

1. modify nameturple name to a make-sense one
2. use FieldFile to read logfile

154. By Spring Zhang

Add more error message text

153. By Spring Zhang

modify due to comments

152. By Spring Zhang

merge from mainline

151. By Spring Zhang

merge from upstream and revert

150. By Spring Zhang

Revert to put board offline onece check job failed

149. By Spring Zhang

Modify due to comments.

1. use namedtuple
2. use class property

148. By Spring Zhang

use flake8 check to align PEP8

147. By Spring Zhang

Looking for health job failure reason in log file

146. By Spring Zhang

Set dispatcher error in log file as offline reason

A concern is if the logfile can be reached in job complete jobCompleted_impl, for there is also a getLogFileForJobOnBoard() in this file.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lava_scheduler_app/logfile_helper.py'
--- lava_scheduler_app/logfile_helper.py 2012-03-19 02:00:36 +0000
+++ lava_scheduler_app/logfile_helper.py 2012-03-22 10:46:19 +0000
@@ -1,10 +1,11 @@
1import re1import re
2from collections import namedtuple
23
3def getDispatcherErrors(logfile):4def getDispatcherErrors(logfile):
4 errors = ""5 errors = ""
5 for line in logfile:6 for line in logfile:
6 if line.find("CriticalError:") != -1 or \7 if line.find("CriticalError:") != -1 or \
7 line.find("Lava failed on test:") != -1 :8 line.find("Lava failed on test:") != -1:
8 errors += line9 errors += line
910
10 return errors11 return errors
@@ -35,16 +36,19 @@
35 logs.append((match.group(2), line, ""))36 logs.append((match.group(2), line, ""))
36 return logs37 return logs
3738
38class Sections:39
40class Sections(object):
39 def __init__(self):41 def __init__(self):
40 self.sections = []42 self.sections = []
41 self.cur_section_type = None43 self.cur_section_type = None
42 self.cur_section = []44 self.cur_section = []
45
43 def push(self, type, line):46 def push(self, type, line):
44 if type != self.cur_section_type:47 if type != self.cur_section_type:
45 self.close()48 self.close()
46 self.cur_section_type = type49 self.cur_section_type = type
47 self.cur_section.append(line)50 self.cur_section.append(line)
51
48 def close(self):52 def close(self):
49 if self.cur_section_type is not None:53 if self.cur_section_type is not None:
50 self.sections.append(54 self.sections.append(
@@ -54,6 +58,7 @@
54 self.cur_section_type = None58 self.cur_section_type = None
55 self.cur_section = []59 self.cur_section = []
5660
61
57def formatLogFile(logfile):62def formatLogFile(logfile):
58 if not logfile:63 if not logfile:
59 return [('log', 1, "Log file is missing")]64 return [('log', 1, "Log file is missing")]
@@ -74,10 +79,65 @@
74 continue79 continue
75 elif line.find("<LAVA_DISPATCHER>") != -1 or \80 elif line.find("<LAVA_DISPATCHER>") != -1 or \
76 line.find("lava_dispatcher") != -1 or \81 line.find("lava_dispatcher") != -1 or \
77 line.find("CriticalError:") != -1 :82 line.find("CriticalError:") != -1:
78 sections.push('log', line)83 sections.push('log', line)
79 else:84 else:
80 sections.push('console', line)85 sections.push('console', line)
81 sections.close()86 sections.close()
8287
83 return sections.sections88 return sections.sections
89
90
91class JobFailReason(object):
92 # Error types
93 FAULT_LAVA, FAULT_MASTERIMAGE, FAULT_HW = range(3)
94 # match item struct
95 MATCH = namedtuple("MATCH", ['match_text', 'error_type', 'reason'])
96
97 def __init__(self, logfile):
98 self.logfile = logfile
99 self._health_check_reason = []
100 self._error_types = []
101 # TODO: load a user-defined text file to match matrix
102 # some hard coded
103 self.match_matrix = [
104 self.MATCH("tar: You must specify one of the",
105 self.FAULT_LAVA, "deployment failed due to untar issue"),
106 self.MATCH("tar: It is possible that the compressed file(s) have become corrupted.",
107 self.FAULT_LAVA, "deployment failed due to untar issue"),
108 self.MATCH("mount: invalid offset 'None' specified",
109 self.FAULT_LAVA, "deployment failed due to mount issue"),
110 self.MATCH("(initramfs)",
111 self.FAULT_MASTERIMAGE, "Master image corrupt"),
112 self.MATCH("---[ end trace",
113 self.FAULT_MASTERIMAGE, "kernel dump"),
114 self.MATCH("mmcblk0: error -110 transferring data",
115 self.FAULT_MASTERIMAGE, "SD card corrupted"),
116 self.MATCH("mmc0: error -110 whilst initialising SD card",
117 self.FAULT_MASTERIMAGE, "SD card corrupted"),
118 self.MATCH("<<<Not Connected>>>",
119 self.FAULT_HW, "conmux console connection corrupted"),
120 ]
121
122 def look_for_health_check_reason(self):
123 errors = []
124
125 self.logfile.open()
126 logfile_text = self.logfile.read()
127 self.logfile.close()
128 for item in self.match_matrix:
129 if logfile_text.find(item.match_text) != -1:
130 errors.append(item)
131 if not item.error_type in self._error_types:
132 self._error_types.append(item.error_type)
133
134 self._health_check_reason = errors
135
136 @property
137 def health_check_reason(self):
138 return "\n".join(["{0.match_text} -- {0.reason}".format(r)
139 for r in self._health_check_reason])
140
141 @property
142 def error_types(self):
143 return self._error_types
84144
=== modified file 'lava_scheduler_app/models.py'
--- lava_scheduler_app/models.py 2012-03-15 20:44:39 +0000
+++ lava_scheduler_app/models.py 2012-03-22 10:46:19 +0000
@@ -159,7 +159,6 @@
159 # return device_type.device_set.all()159 # return device_type.device_set.all()
160160
161161
162
163class TestJob(RestrictedResource):162class TestJob(RestrictedResource):
164 """163 """
165 A test job is a test process that will be run on a Device.164 A test job is a test process that will be run on a Device.
166165
=== modified file 'lava_scheduler_daemon/dbjobsource.py'
--- lava_scheduler_daemon/dbjobsource.py 2012-03-09 01:27:51 +0000
+++ lava_scheduler_daemon/dbjobsource.py 2012-03-22 10:46:19 +0000
@@ -17,6 +17,7 @@
17from zope.interface import implements17from zope.interface import implements
1818
19from lava_scheduler_app.models import Device, DeviceStateTransition, TestJob19from lava_scheduler_app.models import Device, DeviceStateTransition, TestJob
20from lava_scheduler_app.logfile_helper import getDispatcherErrors, JobFailReason
20from lava_scheduler_daemon.jobsource import IJobSource21from lava_scheduler_daemon.jobsource import IJobSource
2122
2223
@@ -238,8 +239,13 @@
238 if job.health_check is True:239 if job.health_check is True:
239 device.last_health_report_job = job240 device.last_health_report_job = job
240 if job.status == TestJob.INCOMPLETE:241 if job.status == TestJob.INCOMPLETE:
242 offline_reason = "Health Check Job Failed: "
243 offline_reason += getDispatcherErrors(job.log_file)
244 reasons = JobFailReason(job.log_file)
245 reasons.look_for_health_check_reason()
246 offline_reason += reasons.health_check_reason
241 device.health_status = Device.HEALTH_FAIL247 device.health_status = Device.HEALTH_FAIL
242 device.put_into_maintenance_mode(None, "Health Check Job Failed")248 device.put_into_maintenance_mode(None, offline_reason)
243 elif job.status == TestJob.COMPLETE:249 elif job.status == TestJob.COMPLETE:
244 device.health_status = Device.HEALTH_PASS250 device.health_status = Device.HEALTH_PASS
245251

Subscribers

People subscribed via source and target branches