Merge lp:~roadmr/checkbox/fix-backend-protocol into lp:checkbox

Proposed by Daniel Manrique
Status: Merged
Merged at revision: 954
Proposed branch: lp:~roadmr/checkbox/fix-backend-protocol
Merge into: lp:checkbox
Diff against target: 146 lines (+58/-14)
3 files modified
backend (+3/-0)
debian/changelog (+6/-1)
plugins/backend_info.py (+49/-13)
To merge this branch: bzr merge lp:~roadmr/checkbox/fix-backend-protocol
Reviewer: Marc Tardif (community), status: Approve
Review via email: mp+68091@code.launchpad.net

Description of the change

So my previous "improvements" to the frontend/backend were not so good, because reads from the backend timed out after 15 seconds. This meant that the frontend wouldn't get stuck waiting for the backend to start, but also that if the backend took longer than 15 seconds to complete a job, the frontend would just leave it behind and go on its merry way. At the end of execution the frontend exits, but the backend keeps working on whatever it was doing, so we'd have "lingering backends" again.

This code fixes that. I built packages with these fixes rolled in, and tested them with a default.whitelist run on Oneiric.

I'd welcome input on the code itself, and particularly on the logging strings I used, especially the FAIL result and comment we return when there's no backend to run the jobs. Maybe this should be "unsupported" instead?

Even though this is all expressed in the code, here's a small explanation of what it does.

We try to spawn the backend: we fork it, then the frontend sends a "ping" over the FIFO. If it gets a "pong" reply within one minute, the backend has started, so we continue execution.
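In code, the handshake is just an extra message type on both ends; condensed from the diff below, it looks roughly like this:

# backend: inside the main message loop, answer pings so the frontend
# can tell we are alive
message = reader.read_object()
if message == "ping":
    writer.write_object("pong")
    continue

# frontend (plugins/backend_info.py): the backend counts as alive only
# if it answers "pong" before the read times out
def ping_backend(self):
    if not self.parent_reader or not self.parent_writer:
        return False
    self.parent_writer.write_object("ping")
    result = self.parent_reader.read_object()
    return result == "pong"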

If there's no "pong" reply, we attempt to spawn again, up to 3 times in total. If after 3 tries (3 minutes) there's still no backend, we assume something is wrong with it and continue without one. In that case, all jobs that need the backend will return FAIL with an error message indicating the backend is absent.
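The retry policy lives in the new gather(); condensed from the diff below:

self.backend_is_alive = False
for attempt in range(1, 4):
    self.spawn_backend(child_input, child_output)
    self.parent_writer = FifoWriter(child_input, timeout=self.timeout)
    self.parent_reader = FifoReader(child_output, timeout=self.timeout)
    if self.ping_backend():   # each ping waits at most self.timeout (60 s)
        self.backend_is_alive = True
        break

if not self.backend_is_alive:
    logging.warning("Privileged backend not responding. "
                    "jobs specifying user will not be run")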

Prior to sending a job to the backend, we ping it again. If we get a reply, we go ahead with job execution. If not, we do NOT try to respawn it; we just fail that job and set a flag recording that the backend died between jobs. Subsequent backend jobs then fail quickly instead of waiting a full minute to discover the backend is gone, since no attempt is made to revive it.

When we send a job to the backend, the read will time out after one minute. In that case we assume the backend is still working on the job, so we log a message and keep waiting indefinitely; the log file at least shows what we're waiting on.
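Both behaviours (the re-ping and the patient wait) live in the new message_exec(); condensed from the diff below:

if "user" in message:
    # re-ping before dispatching; if the backend died between jobs,
    # remember it so later jobs fail immediately instead of waiting
    # out the 60-second timeout
    if self.backend_is_alive and not self.ping_backend():
        self.backend_is_alive = False

    if self.backend_is_alive:
        self.parent_writer.write_object(message)
        while True:
            # read_object() returns with nothing after self.timeout (60 s);
            # assume the backend is still working and keep waiting
            result = self.parent_reader.read_object()
            if result:
                break
            logging.info("Waiting for result...")
    else:
        result = (FAIL, "Unable to test. Privileges are "
                  "required for this job.", 0)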

With this design, about the only situation where things can hang indefinitely is if the backend dies while executing a job, in which case the frontend will wait forever. However, since the backend is very simple, this is an unlikely case.

Revision history for this message
Marc Tardif (cr3) wrote:

Can't wait to try it!

review: Approve

Preview Diff

=== modified file 'backend'
--- backend	2011-06-03 20:30:45 +0000
+++ backend	2011-07-15 14:28:28 +0000
@@ -30,6 +30,9 @@
         message = reader.read_object()
         if message == "stop":
             break
+        if message == "ping":
+            writer.write_object("pong")
+            continue
         if isinstance(message, dict) and "command" in message:
             job = Job(message["command"], message.get("environ"),
                 message.get("timeout"))

=== modified file 'debian/changelog'
--- debian/changelog	2011-07-13 15:40:31 +0000
+++ debian/changelog	2011-07-15 14:28:28 +0000
@@ -1,5 +1,9 @@
 checkbox (0.12.4) oneiric; urgency=low
 
+  * Further improvements to make frontend/backend communication more reliable.
+    Prevents stuck backends, failure to close the GUI due to lack of reply
+    from the backend, and test specifying "user" not being run.
+
   [Javier Collado]
   * Checkbox exits with EX_NOINPUT if a whitelist or blacklist file is
     specified and cannot be found.
@@ -8,8 +12,9 @@
   * Refactored job definition files.
   * Fixed dependencies and test naming.
   * Added Online CPU before/after suspend test.
+
 
- -- Daniel Manrique <daniel.manrique@canonical.com>  Wed, 13 Jul 2011 11:26:17 -0400
+ -- Daniel Manrique <daniel.manrique@canonical.com>  Fri, 15 Jul 2011 09:31:10 -0400
 
 checkbox (0.12.3) oneiric; urgency=low
 

=== modified file 'plugins/backend_info.py'
--- plugins/backend_info.py	2011-06-03 20:30:45 +0000
+++ plugins/backend_info.py	2011-07-15 14:28:28 +0000
@@ -18,6 +18,7 @@
 #
 import os
 import shutil
+import logging
 
 from subprocess import call, PIPE
 from tempfile import mkdtemp
@@ -26,13 +27,13 @@
 
 from checkbox.plugin import Plugin
 from checkbox.properties import Path, Float
-
+from checkbox.job import FAIL
 
 class BackendInfo(Plugin):
 
-    # how long to wait for I/O from/to the backend. If it takes longer
-    # than this, the message is ignored.
-    timeout = Float(default=15.0)
+    # how long to wait for I/O from/to the backend before the call returns.
+    # How we behave if I/O times out is dependent on the situation.
+    timeout = Float(default=60.0)
 
     command = Path(default="%(checkbox_share)s/backend")
 
@@ -73,25 +74,59 @@
 
         return prefix + self.get_command(*args)
 
+    def spawn_backend(self, input_fifo, output_fifo):
+        self.pid = os.fork()
+        if self.pid == 0:
+            root_command = self.get_root_command(input_fifo, output_fifo)
+            os.execvp(root_command[0], root_command)
+            # Should never get here
+
+    def ping_backend(self):
+        if not self.parent_reader or not self.parent_writer:
+            return False
+        self.parent_writer.write_object("ping")
+        result = self.parent_reader.read_object()
+        return result == "pong"
+
+
     def gather(self):
         self.directory = mkdtemp(prefix="checkbox")
         child_input = create_fifo(os.path.join(self.directory, "input"), 0600)
         child_output = create_fifo(os.path.join(self.directory, "output"), 0600)
 
-        self.pid = os.fork()
-        if self.pid > 0:
+        self.backend_is_alive = False
+        for attempt in range(1,4):
+            self.spawn_backend(child_input, child_output)
+            #Only returns if I'm still the parent, so I can do parent stuff here
             self.parent_writer = FifoWriter(child_input, timeout=self.timeout)
             self.parent_reader = FifoReader(child_output, timeout=self.timeout)
+            if self.ping_backend():
+                logging.debug("Backend responded, continuing execution.")
+                self.backend_is_alive = True
+                break
+            else:
+                logging.debug("Backend didn't respond, trying to create again.")
 
-        else:
-            root_command = self.get_root_command(child_input, child_output)
-            os.execvp(root_command[0], root_command)
-            # Should never get here
+        if not self.backend_is_alive:
+            logging.warning("Privileged backend not responding. " +
+                            "jobs specifying user will not be run")
 
     def message_exec(self, message):
         if "user" in message:
-            self.parent_writer.write_object(message)
-            result = self.parent_reader.read_object()
+            if (self.backend_is_alive and not self.ping_backend()):
+                self.backend_is_alive = False
+
+            if self.backend_is_alive:
+                self.parent_writer.write_object(message)
+                while True:
+                    result = self.parent_reader.read_object()
+                    if result:
+                        break
+                    else:
+                        logging.info("Waiting for result...")
+            else:
+                result = (FAIL, "Unable to test. Privileges are " +
+                          "required for this job.", 0,)
             if result:
                 self._manager.reactor.fire("message-result", *result)
 
@@ -101,7 +136,8 @@
         self.parent_reader.close()
         shutil.rmtree(self.directory)
 
-        os.waitpid(self.pid, 0)
+        if self.backend_is_alive:
+            os.waitpid(self.pid, 0)
 
 
 factory = BackendInfo
