Merge lp:~roadmr/checkbox/fix-backend-protocol into lp:checkbox

Proposed by Daniel Manrique
Status: Merged
Merged at revision: 954
Proposed branch: lp:~roadmr/checkbox/fix-backend-protocol
Merge into: lp:checkbox
Diff against target: 146 lines (+58/-14)
3 files modified
backend (+3/-0)
debian/changelog (+6/-1)
plugins/backend_info.py (+49/-13)
To merge this branch: bzr merge lp:~roadmr/checkbox/fix-backend-protocol
Reviewer: Marc Tardif (community), status: Approve
Review via email: mp+68091@code.launchpad.net

Description of the change

So my previous "improvements" to the frontend/backend were not so good, because reads from the backend timed out after 15 seconds. This meant that the frontend wouldn't get stuck waiting for the backend to start, but also that if the backend took longer than 15 seconds to complete a job, the frontend would just leave it behind and go on its merry way. At the end of execution the frontend exits, but the backend keeps working on whatever it was doing, so we'd have "lingering backends" again.

This code fixes that. I built packages with these fixes rolled in, and tested them with a default.whitelist run on Oneiric.

I'd welcome input on the code itself, and particularly on the logging strings I used, especially the FAIL result and comment we return when there's no backend to run the jobs. Maybe this should be "unsupported" instead?

Even though this is all expressed in the code, here's a small explanation of what it does.

We try to spawn the backend: we fork it, then the frontend sends a "ping" over the FIFO. If it gets a "pong" reply within one minute, the backend has started, so we continue execution.
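In code, the handshake is just an extra message type on both ends; condensed from the diff below, it looks roughly like this:

# backend: inside the main message loop, answer pings so the frontend
# can tell we are alive
message = reader.read_object()
if message == "ping":
    writer.write_object("pong")
    continue

# frontend (plugins/backend_info.py): the backend counts as alive only
# if it answers "pong" before the read times out
def ping_backend(self):
    if not self.parent_reader or not self.parent_writer:
        return False
    self.parent_writer.write_object("ping")
    result = self.parent_reader.read_object()
    return result == "pong"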

If there's no "pong" reply, we attempt to spawn again, up to 3 times in total. If after 3 tries (3 minutes) there's still no backend, we assume something is wrong with it and continue without one. In that case, all jobs that need the backend will return FAIL with an error message indicating the backend is absent.
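The retry policy lives in the new gather(); condensed from the diff below:

self.backend_is_alive = False
for attempt in range(1, 4):
    self.spawn_backend(child_input, child_output)
    self.parent_writer = FifoWriter(child_input, timeout=self.timeout)
    self.parent_reader = FifoReader(child_output, timeout=self.timeout)
    if self.ping_backend():   # each ping waits at most self.timeout (60 s)
        self.backend_is_alive = True
        break

if not self.backend_is_alive:
    logging.warning("Privileged backend not responding. "
                    "jobs specifying user will not be run")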

Prior to sending a job to the backend, we ping it again. If we get a reply, we go ahead with job execution. If not, we do NOT try to respawn it; we just fail that job and set a flag recording that the backend died between jobs. Subsequent backend jobs then fail quickly instead of waiting a full minute to discover the backend is gone, since no attempt is made to revive it.

When we send a job to the backend, the read will time out after one minute. In that case we assume the backend is still working on the job, so we log a message and keep waiting indefinitely; the log file at least shows what we're waiting on.
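Both behaviours (the re-ping and the patient wait) live in the new message_exec(); condensed from the diff below:

if "user" in message:
    # re-ping before dispatching; if the backend died between jobs,
    # remember it so later jobs fail immediately instead of waiting
    # out the 60-second timeout
    if self.backend_is_alive and not self.ping_backend():
        self.backend_is_alive = False

    if self.backend_is_alive:
        self.parent_writer.write_object(message)
        while True:
            # read_object() returns with nothing after self.timeout (60 s);
            # assume the backend is still working and keep waiting
            result = self.parent_reader.read_object()
            if result:
                break
            logging.info("Waiting for result...")
    else:
        result = (FAIL, "Unable to test. Privileges are "
                  "required for this job.", 0)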

With this design, about the only situation where things can hang indefinitely is if the backend dies while executing a job, in which case the frontend will wait forever. However, since the backend is very simple, this is an unlikely case.

Revision history for this message
Marc Tardif (cr3) wrote:

Can't wait to try it!

review: Approve

Preview Diff

=== modified file 'backend'
--- backend	2011-06-03 20:30:45 +0000
+++ backend	2011-07-15 14:28:28 +0000
@@ -30,6 +30,9 @@
         message = reader.read_object()
         if message == "stop":
             break
+        if message == "ping":
+            writer.write_object("pong")
+            continue
         if isinstance(message, dict) and "command" in message:
             job = Job(message["command"], message.get("environ"),
                 message.get("timeout"))

=== modified file 'debian/changelog'
--- debian/changelog	2011-07-13 15:40:31 +0000
+++ debian/changelog	2011-07-15 14:28:28 +0000
@@ -1,5 +1,9 @@
 checkbox (0.12.4) oneiric; urgency=low
 
+  * Further improvements to make frontend/backend communication more reliable.
+    Prevents stuck backends, failure to close the GUI due to lack of reply
+    from the backend, and test specifying "user" not being run.
+
   [Javier Collado]
   * Checkbox exits with EX_NOINPUT if a whitelist or blacklist file is
     specified and cannot be found.
@@ -8,8 +12,9 @@
   * Refactored job definition files.
   * Fixed dependencies and test naming.
   * Added Online CPU before/after suspend test.
+
 
- -- Daniel Manrique <daniel.manrique@canonical.com>  Wed, 13 Jul 2011 11:26:17 -0400
+ -- Daniel Manrique <daniel.manrique@canonical.com>  Fri, 15 Jul 2011 09:31:10 -0400
 
 checkbox (0.12.3) oneiric; urgency=low
 

=== modified file 'plugins/backend_info.py'
--- plugins/backend_info.py	2011-06-03 20:30:45 +0000
+++ plugins/backend_info.py	2011-07-15 14:28:28 +0000
@@ -18,6 +18,7 @@
 #
 import os
 import shutil
+import logging
 
 from subprocess import call, PIPE
 from tempfile import mkdtemp
@@ -26,13 +27,13 @@
 
 from checkbox.plugin import Plugin
 from checkbox.properties import Path, Float
-
+from checkbox.job import FAIL
 
 class BackendInfo(Plugin):
 
-    # how long to wait for I/O from/to the backend. If it takes longer
-    # than this, the message is ignored.
-    timeout = Float(default=15.0)
+    # how long to wait for I/O from/to the backend before the call returns.
+    # How we behave if I/O times out is dependent on the situation.
+    timeout = Float(default=60.0)
 
     command = Path(default="%(checkbox_share)s/backend")
 
@@ -73,25 +74,59 @@
 
         return prefix + self.get_command(*args)
 
+    def spawn_backend(self, input_fifo, output_fifo):
+        self.pid = os.fork()
+        if self.pid == 0:
+            root_command = self.get_root_command(input_fifo, output_fifo)
+            os.execvp(root_command[0], root_command)
+            # Should never get here
+
+    def ping_backend(self):
+        if not self.parent_reader or not self.parent_writer:
+            return False
+        self.parent_writer.write_object("ping")
+        result = self.parent_reader.read_object()
+        return result == "pong"
+
+
     def gather(self):
         self.directory = mkdtemp(prefix="checkbox")
         child_input = create_fifo(os.path.join(self.directory, "input"), 0600)
         child_output = create_fifo(os.path.join(self.directory, "output"), 0600)
 
-        self.pid = os.fork()
-        if self.pid > 0:
+        self.backend_is_alive = False
+        for attempt in range(1,4):
+            self.spawn_backend(child_input, child_output)
+            #Only returns if I'm still the parent, so I can do parent stuff here
             self.parent_writer = FifoWriter(child_input, timeout=self.timeout)
             self.parent_reader = FifoReader(child_output, timeout=self.timeout)
+            if self.ping_backend():
+                logging.debug("Backend responded, continuing execution.")
+                self.backend_is_alive = True
+                break
+            else:
+                logging.debug("Backend didn't respond, trying to create again.")
 
-        else:
-            root_command = self.get_root_command(child_input, child_output)
-            os.execvp(root_command[0], root_command)
-            # Should never get here
+        if not self.backend_is_alive:
+            logging.warning("Privileged backend not responding. " +
+                            "jobs specifying user will not be run")
 
     def message_exec(self, message):
         if "user" in message:
-            self.parent_writer.write_object(message)
-            result = self.parent_reader.read_object()
+            if (self.backend_is_alive and not self.ping_backend()):
+                self.backend_is_alive = False
+
+            if self.backend_is_alive:
+                self.parent_writer.write_object(message)
+                while True:
+                    result = self.parent_reader.read_object()
+                    if result:
+                        break
+                    else:
+                        logging.info("Waiting for result...")
+            else:
+                result = (FAIL, "Unable to test. Privileges are " +
+                          "required for this job.", 0,)
             if result:
                 self._manager.reactor.fire("message-result", *result)
 
@@ -101,7 +136,8 @@
         self.parent_reader.close()
         shutil.rmtree(self.directory)
 
-        os.waitpid(self.pid, 0)
+        if self.backend_is_alive:
+            os.waitpid(self.pid, 0)
 
 
 factory = BackendInfo
