Merge lp:~clint-fewbar/txzookeeper/backoff-retry-on-fail into lp:txzookeeper

Proposed by Clint Byrum
Status: Rejected
Rejected by: Kapil Thangavelu
Proposed branch: lp:~clint-fewbar/txzookeeper/backoff-retry-on-fail
Merge into: lp:txzookeeper
Diff against target: 45 lines (+12/-0)
1 file modified
txzookeeper/managed.py (+12/-0)
To merge this branch: bzr merge lp:~clint-fewbar/txzookeeper/backoff-retry-on-fail
Reviewer: Juju Engineering
Review status: Pending
Review via email: mp+132113@code.launchpad.net

Description of the change

This morning the ZooKeeper server that serves the juju agents on about 45 machines crashed and began recovering. While it recovered, the agents raised ConnectionLost errors at an extremely high rate and hammered the ZooKeeper server with reconnect attempts, which made the situation worse. This change backs off the retries so the server has some breathing room to recover.
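
As a rough illustration of the schedule this branch aims for, the sketch below computes the wait applied before each consecutive retry. It assumes the 10-second step and 60-second cap that appear in the preview diff. The function name `backoff_delays` and its parameters are hypothetical and are not part of txzookeeper.

```python
def backoff_delays(failures, step=10, cap=60):
    """Return the wait in seconds before each of `failures` consecutive retries.

    Hypothetical helper: mirrors the "add a step while below the cap" rule,
    it is not code from txzookeeper.
    """
    delays = []
    delay = 0
    for _ in range(failures):
        # Grow the delay only while it is still below the cap.
        if delay < cap:
            delay += step
        delays.append(delay)
    return delays


print(backoff_delays(8))  # [10, 20, 30, 40, 50, 60, 60, 60]
```

With these values a client that keeps failing settles at one reconnect attempt per minute instead of retrying immediately.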

Unmerged revisions

52. By Clint Byrum

back off retry rate on connection problems to prevent dog-piling on a dead server

51. By Clint Byrum

Handle ConnectionLossException more gracefully

Preview Diff

=== modified file 'txzookeeper/managed.py'
--- txzookeeper/managed.py 2012-09-19 12:54:25 +0000
+++ txzookeeper/managed.py 2012-10-30 14:27:21 +0000
@@ -154,6 +154,8 @@
         self._session_notifications = []
         self._reconnect_lock = DeferredLock()
         self.set_connection_error_callback(self._cb_connection_error)
+        self._backoff_seconds = 0
+        self._max_backoff_seconds = 60

     def subscribe_new_session(self):
         d = Deferred()
@@ -204,6 +206,10 @@
     def _cb_restablish_session(self):
         """Re-establish a new session, and recreate ephemerals and watches.
         """
+        # If we have some failures, back off
+        if self._backoff_seconds:
+            yield deferLater(reactor, self._backoff_seconds, lambda: None)
+
         # Reconnect
         while 1:
             if self.handle is None:
@@ -234,6 +240,9 @@
         notifications = self._session_notifications
         self._session_notifications = []

+        # all good, reset backoff
+        self._backoff_seconds = 0
+
         for n in notifications:
             n.callback(True)

@@ -252,9 +261,12 @@
         """
         if not isinstance(error, (
             zookeeper.SessionExpiredException,
+            zookeeper.ConnectionLossException,
             NotConnectedException,
             zookeeper.ClosingException)):
             raise error
+        if self._backoff_seconds < self._max_backoff_seconds:
+            self._backoff_seconds += 10
         yield self._cb_restablish_session()
         raise zookeeper.ConnectionLossException
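
The control flow of the patch (grow the delay on each connection error, wait before reconnecting, reset once the session is re-established) can be sketched outside Twisted as follows. This is a minimal standalone sketch, not txzookeeper code: the class `BackoffReconnector`, its `connect` callable and the injectable `sleep` are hypothetical names chosen for illustration. In the patch itself the wait is a `deferLater(reactor, ...)` call. The hunks shown do not add imports for those names, so the module would presumably also need `from twisted.internet import reactor` and `from twisted.internet.task import deferLater` if it does not already have them.

```python
import time


class BackoffReconnector:
    """Hypothetical helper that mirrors the patch's backoff bookkeeping."""

    def __init__(self, connect, sleep=time.sleep, step=10, cap=60):
        self._connect = connect      # callable returning True on success
        self._sleep = sleep          # injectable so the demo does not block
        self._step = step
        self._cap = cap
        self.backoff_seconds = 0

    def on_connection_error(self):
        # Grow the delay while below the cap, then try to reconnect.
        if self.backoff_seconds < self._cap:
            self.backoff_seconds += self._step
        return self.reestablish()

    def reestablish(self):
        # If there were earlier failures, wait before reconnecting.
        if self.backoff_seconds:
            self._sleep(self.backoff_seconds)
        ok = self._connect()
        if ok:
            # All good, reset the backoff.
            self.backoff_seconds = 0
        return ok


# Demo: record the waits instead of sleeping.
waits = []
outcomes = iter([False, False, True, False])
r = BackoffReconnector(connect=lambda: next(outcomes), sleep=waits.append)
for _ in range(4):
    r.on_connection_error()
print(waits)  # [10, 20, 30, 10]
```

The fourth wait drops back to 10 seconds because the third attempt succeeded and reset the delay. All clients in this sketch follow the same fixed schedule, so clients that fail at the same moment will also retry at the same moment.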
