Merge lp:~vila/udd/795321-make-tea into lp:udd
Status: | Merged |
---|---|
Merged at revision: | 513 |
Proposed branch: | lp:~vila/udd/795321-make-tea |
Merge into: | lp:udd |
Diff against target: |
587 lines (+397/-17) 7 files modified
mass_import.py (+107/-13) selftest.py (+1/-0) udd/circuit_breaker.dot (+23/-0) udd/circuit_breaker.py (+109/-0) udd/icommon.py (+25/-4) udd/tests/test_circuit_breaker.py (+129/-0) udd/threads.py (+3/-0) |
To merge this branch: | bzr merge lp:~vila/udd/795321-make-tea |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
James Westby | Approve | ||
Review via email: mp+76058@code.launchpad.net |
Description of the change
This implements a circuit breaker to track launchpad down times and avoid
the associated spurious failures.
Strictly speaking, a circuit breaker protects the usage of a resource by
tracking the attempts, successes and failures of these usages. When a
failure occurs, it blocks further attempts for some grace period and then
waits for a success to close the breaker again.
See the udd/circuit_
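The classic pattern described above can be sketched roughly as follows. This is an illustrative minimal version, not the implementation in this branch; the class name, method names and `grace_period` parameter are all made up for the example:

```python
import time


class CircuitBreaker(object):
    """Minimal closed/open/half-open circuit breaker (illustrative only)."""

    def __init__(self, grace_period=60, clock=time.time):
        self.grace_period = grace_period
        self.clock = clock
        self.state = 'closed'
        self.opened_at = None

    def allow_attempt(self):
        if self.state == 'closed':
            return True
        if self.clock() - self.opened_at >= self.grace_period:
            # Grace period elapsed: let one attempt through to probe
            # whether the resource is back.
            self.state = 'half-open'
            return True
        return False

    def record_success(self):
        # A success closes the breaker again.
        self.state = 'closed'
        self.opened_at = None

    def record_failure(self):
        # Any failure (re)opens the breaker and restarts the grace period.
        self.state = 'open'
        self.opened_at = self.clock()
```

The `clock` parameter is only there so the grace period can be tested without sleeping.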
The pattern roughly applies to launchpad with some tweaks: launchpad is used
in many different ways and not all failure scenarios are currently known
(in the importer code base). Additionally, several imports are running in
parallel, querying launchpad in different ways and producing successes and
failures that can be seen by the circuit breaker in unexpected orders.
I chose to implement a variant that relies on the existing known transient
failures (the ones recorded in the status database in the 'should_retry'
table) and just pauses for a while, without starting new imports, as soon as
a launchpad failure is detected. After the grace period, the *same* imports
are retried and should provide a good test that launchpad is back.
The net effect will be to *limit* the number of spurious failures.
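The variant above could look something like the following sketch. Everything here is hypothetical (the `run_import` callback, the reason strings, the `is_known_transient` helper); only the overall behaviour, pausing on a known-transient failure and retrying the *same* job as a probe, reflects the description:

```python
import time

# Hypothetical list of known-transient reasons; the real importer
# drives this from the 'should_retry' table in the status database.
KNOWN_TRANSIENT = ('timeout', 'lp-down')


def is_known_transient(failure_reason):
    return failure_reason in KNOWN_TRANSIENT


def import_loop(jobs, run_import, grace_period=300, sleep=time.sleep):
    """Pause on a known-transient launchpad failure, then retry the
    *same* job first as a probe that launchpad is back.

    ``run_import`` is a hypothetical callback returning
    ``(ok, failure_reason)``.
    """
    queue = list(jobs)
    while queue:
        job = queue[0]
        ok, reason = run_import(job)
        if ok:
            queue.pop(0)
        elif is_known_transient(reason):
            # Don't start new imports; wait out the grace period and
            # retry the same job, which is a good test of recovery.
            sleep(grace_period)
        else:
            # Hard failure: drop the job here (recorded elsewhere
            # in the real importer).
            queue.pop(0)
```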
The nice property of this table is that we can add more scenarios as we see
them appear with:
requeue_
This means we can complete our knowledge as we discover new failures.
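Consulting such a table could be as simple as the sketch below. The schema and the signature value are assumptions for illustration; the real status database layout may differ:

```python
import sqlite3

# Assumed schema: one row per known-transient failure signature.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE should_retry (signature TEXT PRIMARY KEY)')
conn.execute("INSERT INTO should_retry VALUES ('xmlrpclib.ProtocolError')")


def should_retry(conn, failure_signature):
    # A failure whose signature is recorded is treated as transient,
    # so the matching import is worth retrying.
    row = conn.execute(
        'SELECT 1 FROM should_retry WHERE signature = ?',
        (failure_signature,)).fetchone()
    return row is not None
```

Adding a newly discovered transient failure is then a single INSERT, which matches the "complete our knowledge as we discover new failures" workflow.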
I made only rough tests locally... but couldn't find an easy way to simulate
a local launchpad, let alone a transiently down one :-/
But given that I've witnessed a storm of transient failures last Friday and
one today, I expect more to come.
Note that I added the failure for today's storm (~600 failures) with:
requeue --auto --all-of-type libgnomeui
(requeue is provided after sourcing fixit.sh and is really
requeue_package.py), but the package importer still has 165 outstanding jobs
as I write this (i.e. the time needed to recover from a lp downtime is far
longer than the downtime itself).
Note that since all these failures were for the same traceback, it's highly
likely that this is the first (or one of the first) query made against
launchpad during an import, so it *is* a good candidate to retry imports.
I.e. even if all in-flight imports fail for an unknown reason (or one we
haven't correctly identified), the next one started will likely fail with a
*known* transient one and stop the storm.
This seems to be a good first step to cope with the actual code base without
requiring a full analysis of all the failure modes :)
If this requires too much manual tweaking and leads to too many false
positives, we can focus on finding a better way to identify launchpad
transient failures, but the circuit breaker will still be there, should be
easy to change and, above all, should avoid the huge number of failures.
Hi Vincent,
Thanks for working on this, it looks great, and I'm looking forward
to seeing it in production!
77 + """Launcpad circuit breaker.
Missing "h" in Launchpad.
131 + # We want to retry asap ('priority')
132 + self.status_db.retry(package_name, priority=True)
Maybe try with the same priority as before?
I don't think it's that important though, given that there
will only be a few jobs in that state.
303 + self.state = 'open'
I prefer class "constants" for that sort of thing, so that
typos lead to obvious runtime errors, rather than errant
behaviour.
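The suggestion can be illustrated like this (class and attribute names are made up for the example, not taken from the branch):

```python
class CircuitBreaker(object):
    # Class "constants" for the states: a typo such as
    # ``self.state = self.OPNE`` raises AttributeError at runtime
    # instead of silently setting a misspelled string.
    OPEN = 'open'
    CLOSED = 'closed'

    def __init__(self):
        self.state = self.CLOSED

    def trip(self):
        self.state = self.OPEN
```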
Thanks,
James