Merge lp:~qzhang/lava-scheduler/retrieve-job-fail-reason into lp:lava-scheduler

Proposed by Spring Zhang
Status: Needs review
Proposed branch: lp:~qzhang/lava-scheduler/retrieve-job-fail-reason
Merge into: lp:lava-scheduler
Diff against target: 150 lines (+70/-5)
3 files modified
lava_scheduler_app/logfile_helper.py (+63/-3)
lava_scheduler_app/models.py (+0/-1)
lava_scheduler_daemon/dbjobsource.py (+7/-1)
To merge this branch: bzr merge lp:~qzhang/lava-scheduler/retrieve-job-fail-reason
Reviewer Review Type Date Requested Status
Spring Zhang (community) Needs Resubmitting
Zygmunt Krynicki (community) Needs Fixing
Paul Larson (community) Disapprove
Review via email: mp+96338@code.launchpad.net

Description of the change

Based on the job failure reason summary, I made the commit to find out why health job failure from job logfile. If the failure is caused by 'LAVA', I prefer to keep health job status as it is. Other reasons will trigger the board goes offline.

Versus retrieving information from result bundle, I think the log information make sense, I'd like to keep the task getting information from result bundle in the future if necessary.

One question is about job.log_file, is it a string which could be use .find() method as:
if self.logfile.find(item[0]) != -1

To post a comment you must log in.
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

flake8 (pip install flake8) check your code please

9 +class JobFailReason:

Inherit object please

+ self.type = ["LAVA", "Master Image", "HW"]

Ugly way to specify an enum, I'd prefer
FAULT_LAVA, FAULT_MASTER, FAULT_HW = range(3)
And put it in the class scope

13 + # struct: [ "match string": ["error type", "detail reason"]]
16 + self.match_matrix = [

Array-of-arrays, ugh, use namedtuple to make that readable:

from collections import namedtuple
M = namedtuple("M", "match error detail")

M("tar: You must specify one of the", FAULT_LAVA, "deployment failed due to untar issue"),

... and so on

34 + def get_health_check_reason(self):
35 + return self.health_check_reason

Useless, attribute was public anyway, remove it

37 + def get_error_types(self):
38 + return self.type

This should be a class level property, JobFailReason.FAULT_xxx

+ detail_msg = "\r" + r[1] + " -- " + r[2]

\r ? really?

review: Needs Fixing
148. By Spring Zhang

use flake8 check to align PEP8

149. By Spring Zhang

Modify due to comments.

1. use namedtuple
2. use class property

Revision history for this message
Spring Zhang (qzhang) wrote :

Modify due to comments. And add a warning status of health check.

review: Needs Resubmitting
Revision history for this message
Paul Larson (pwlars) wrote :

I don't think we want to do this. One of the uses of health checks is to catch problems with lava after performing an update. The idea is that we would set all health statuses to unknown when doing the update, and force a health check before they process any new jobs. If there's a bug we introduced that drastically interferes with processing jobs then it can catch it, and prevent any more jobs from running.

review: Disapprove
150. By Spring Zhang

Revert to put board offline onece check job failed

151. By Spring Zhang

merge from upstream and revert

Revision history for this message
Spring Zhang (qzhang) wrote :

Now it is back to put board offline once there is a job failure, only add some possible reasons from job log file.

review: Needs Resubmitting
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

72 + JobFailReason.M("tar: You must specify one of the",

You can just say M as it is already in scope.

144 + if not 'server' in params or \
145 + not isinstance(params['server'], unicode):

The general consensus around using trailing \ is not to do it. Please use parentheses in this case.

>>> if (something \n rest of if):

+ JobFailReason.FAULT_LAVA, "deployment failed due to untar issue"),

You can say self.FAULT_LAVA, this is shorter and is usually preferred.

99 + errors = ""
100 + for r in self._health_check_reason:
101 + errors += "\n" + r.match_text + " -- " + r.reason
102 + return errors

Don't join strings like this in python. It is quadratic in complexity.
Instead use str.join(), it is linear in complexity and you get the number of glue elements right (you otherwise get the unneeded leading newline). Compare:

return "\n".join(
           ["{0.match_text} -- {0.reason}".format(r)
           for f in self._health_check_reason])

review: Needs Fixing
152. By Spring Zhang

merge from mainline

153. By Spring Zhang

modify due to comments

154. By Spring Zhang

Add more error message text

155. By Spring Zhang

1. modify nameturple name to a make-sense one
2. use FieldFile to read logfile

Revision history for this message
Spring Zhang (qzhang) wrote :

1. Fix due to comments.
2. Have a test.

review: Needs Resubmitting
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Firstly I should apologise; I think I said that I'd look at this weeks ago and haven't.

I definitely like the idea of attempting, even quite crudely, to automatically fish problems out of the log.

I don't think having a static list in the source is a very good way of doing this though -- perhaps we could have a simple table containing regular expressions and explanations that you could edit through the admin interface?

Also, I think if we're going to put more information in the transition table (which sounds fine), we should display it better. We have markdown support in lava-server already so it should be possible to render it as markdown if we really want to... (I also think we should be able to edit the most recent offline reason, but that surely is a separate bug).

Revision history for this message
Spring Zhang (qzhang) wrote :

yes, I also think it's a lot hard coded, a user-defined table is better.

Unmerged revisions

155. By Spring Zhang

1. modify nameturple name to a make-sense one
2. use FieldFile to read logfile

154. By Spring Zhang

Add more error message text

153. By Spring Zhang

modify due to comments

152. By Spring Zhang

merge from mainline

151. By Spring Zhang

merge from upstream and revert

150. By Spring Zhang

Revert to put board offline onece check job failed

149. By Spring Zhang

Modify due to comments.

1. use namedtuple
2. use class property

148. By Spring Zhang

use flake8 check to align PEP8

147. By Spring Zhang

Looking for health job failure reason in log file

146. By Spring Zhang

Set dispatcher error in log file as offline reason

A concern is if the logfile can be reached in job complete jobCompleted_impl, for there is also a getLogFileForJobOnBoard() in this file.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lava_scheduler_app/logfile_helper.py'
2--- lava_scheduler_app/logfile_helper.py 2012-03-19 02:00:36 +0000
3+++ lava_scheduler_app/logfile_helper.py 2012-03-22 10:46:19 +0000
4@@ -1,10 +1,11 @@
5 import re
6+from collections import namedtuple
7
8 def getDispatcherErrors(logfile):
9 errors = ""
10 for line in logfile:
11 if line.find("CriticalError:") != -1 or \
12- line.find("Lava failed on test:") != -1 :
13+ line.find("Lava failed on test:") != -1:
14 errors += line
15
16 return errors
17@@ -35,16 +36,19 @@
18 logs.append((match.group(2), line, ""))
19 return logs
20
21-class Sections:
22+
23+class Sections(object):
24 def __init__(self):
25 self.sections = []
26 self.cur_section_type = None
27 self.cur_section = []
28+
29 def push(self, type, line):
30 if type != self.cur_section_type:
31 self.close()
32 self.cur_section_type = type
33 self.cur_section.append(line)
34+
35 def close(self):
36 if self.cur_section_type is not None:
37 self.sections.append(
38@@ -54,6 +58,7 @@
39 self.cur_section_type = None
40 self.cur_section = []
41
42+
43 def formatLogFile(logfile):
44 if not logfile:
45 return [('log', 1, "Log file is missing")]
46@@ -74,10 +79,65 @@
47 continue
48 elif line.find("<LAVA_DISPATCHER>") != -1 or \
49 line.find("lava_dispatcher") != -1 or \
50- line.find("CriticalError:") != -1 :
51+ line.find("CriticalError:") != -1:
52 sections.push('log', line)
53 else:
54 sections.push('console', line)
55 sections.close()
56
57 return sections.sections
58+
59+
60+class JobFailReason(object):
61+ # Error types
62+ FAULT_LAVA, FAULT_MASTERIMAGE, FAULT_HW = range(3)
63+ # match item struct
64+ MATCH = namedtuple("MATCH", ['match_text', 'error_type', 'reason'])
65+
66+ def __init__(self, logfile):
67+ self.logfile = logfile
68+ self._health_check_reason = []
69+ self._error_types = []
70+ # TODO: load a user-defined text file to match matrix
71+ # some hard coded
72+ self.match_matrix = [
73+ self.MATCH("tar: You must specify one of the",
74+ self.FAULT_LAVA, "deployment failed due to untar issue"),
75+ self.MATCH("tar: It is possible that the compressed file(s) have become corrupted.",
76+ self.FAULT_LAVA, "deployment failed due to untar issue"),
77+ self.MATCH("mount: invalid offset 'None' specified",
78+ self.FAULT_LAVA, "deployment failed due to mount issue"),
79+ self.MATCH("(initramfs)",
80+ self.FAULT_MASTERIMAGE, "Master image corrupt"),
81+ self.MATCH("---[ end trace",
82+ self.FAULT_MASTERIMAGE, "kernel dump"),
83+ self.MATCH("mmcblk0: error -110 transferring data",
84+ self.FAULT_MASTERIMAGE, "SD card corrupted"),
85+ self.MATCH("mmc0: error -110 whilst initialising SD card",
86+ self.FAULT_MASTERIMAGE, "SD card corrupted"),
87+ self.MATCH("<<<Not Connected>>>",
88+ self.FAULT_HW, "conmux console connection corrupted"),
89+ ]
90+
91+ def look_for_health_check_reason(self):
92+ errors = []
93+
94+ self.logfile.open()
95+ logfile_text = self.logfile.read()
96+ self.logfile.close()
97+ for item in self.match_matrix:
98+ if logfile_text.find(item.match_text) != -1:
99+ errors.append(item)
100+ if not item.error_type in self._error_types:
101+ self._error_types.append(item.error_type)
102+
103+ self._health_check_reason = errors
104+
105+ @property
106+ def health_check_reason(self):
107+ return "\n".join(["{0.match_text} -- {0.reason}".format(r)
108+ for r in self._health_check_reason])
109+
110+ @property
111+ def error_types(self):
112+ return self._error_types
113
114=== modified file 'lava_scheduler_app/models.py'
115--- lava_scheduler_app/models.py 2012-03-15 20:44:39 +0000
116+++ lava_scheduler_app/models.py 2012-03-22 10:46:19 +0000
117@@ -159,7 +159,6 @@
118 # return device_type.device_set.all()
119
120
121-
122 class TestJob(RestrictedResource):
123 """
124 A test job is a test process that will be run on a Device.
125
126=== modified file 'lava_scheduler_daemon/dbjobsource.py'
127--- lava_scheduler_daemon/dbjobsource.py 2012-03-09 01:27:51 +0000
128+++ lava_scheduler_daemon/dbjobsource.py 2012-03-22 10:46:19 +0000
129@@ -17,6 +17,7 @@
130 from zope.interface import implements
131
132 from lava_scheduler_app.models import Device, DeviceStateTransition, TestJob
133+from lava_scheduler_app.logfile_helper import getDispatcherErrors, JobFailReason
134 from lava_scheduler_daemon.jobsource import IJobSource
135
136
137@@ -238,8 +239,13 @@
138 if job.health_check is True:
139 device.last_health_report_job = job
140 if job.status == TestJob.INCOMPLETE:
141+ offline_reason = "Health Check Job Failed: "
142+ offline_reason += getDispatcherErrors(job.log_file)
143+ reasons = JobFailReason(job.log_file)
144+ reasons.look_for_health_check_reason()
145+ offline_reason += reasons.health_check_reason
146 device.health_status = Device.HEALTH_FAIL
147- device.put_into_maintenance_mode(None, "Health Check Job Failed")
148+ device.put_into_maintenance_mode(None, offline_reason)
149 elif job.status == TestJob.COMPLETE:
150 device.health_status = Device.HEALTH_PASS
151

Subscribers

People subscribed via source and target branches