Merge lp:~clint-fewbar/txzookeeper/backoff-retry-on-fail into lp:txzookeeper

Proposed by Clint Byrum
Status: Rejected
Rejected by: Kapil Thangavelu
Proposed branch: lp:~clint-fewbar/txzookeeper/backoff-retry-on-fail
Merge into: lp:txzookeeper
Diff against target: 45 lines (+12/-0)
1 file modified
txzookeeper/managed.py (+12/-0)
To merge this branch: bzr merge lp:~clint-fewbar/txzookeeper/backoff-retry-on-fail
Reviewer          Review Type  Date Requested  Status
Juju Engineering                               Pending
Review via email: mp+132113@code.launchpad.net

Description of the change

This morning the ZooKeeper server that serves the juju agents on about 45 boxes crashed and began recovering. The agents spewed ConnectionLost errors and retried at an incredibly high rate, pounding on the ZooKeeper server and slowing its recovery. This branch backs off the retries, adding 10 seconds of delay per connection error up to a 60-second cap, so the server has some breathing room to recover.
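
The backoff schedule described above can be sketched in isolation. This is a minimal illustration rather than code from the branch: the function name `next_backoff` and its parameters are hypothetical, while the 10-second step and 60-second cap are the values used in the preview diff.

```python
def next_backoff(current, step=10, maximum=60):
    """Return the delay to use after one more connection error.

    Follows the rule in the diff: grow by `step` while below `maximum`,
    otherwise stay where we are. (Hypothetical helper, not in the branch.)
    """
    if current < maximum:
        return current + step
    return current


# Delay an agent would use over eight consecutive connection errors.
delay = 0
schedule = []
for _ in range(8):
    delay = next_backoff(delay)
    schedule.append(delay)
```

With these values the delay grows linearly and then stays at the cap. An agent that reconnects successfully starts again from zero, which the branch does by resetting `_backoff_seconds`.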


Unmerged revisions

52. By Clint Byrum

back off retry rate on connection problems to prevent dog-piling on a dead server

51. By Clint Byrum

Handle ConnectionLossException more gracefully

Preview Diff

=== modified file 'txzookeeper/managed.py'
--- txzookeeper/managed.py 2012-09-19 12:54:25 +0000
+++ txzookeeper/managed.py 2012-10-30 14:27:21 +0000
@@ -154,6 +154,8 @@
         self._session_notifications = []
         self._reconnect_lock = DeferredLock()
         self.set_connection_error_callback(self._cb_connection_error)
+        self._backoff_seconds = 0
+        self._max_backoff_seconds = 60
 
     def subscribe_new_session(self):
         d = Deferred()
@@ -204,6 +206,10 @@
     def _cb_restablish_session(self):
         """Re-establish a new session, and recreate ephemerals and watches.
         """
+        # If we have some failures, back off
+        if self._backoff_seconds:
+            yield deferLater(reactor, self._backoff_seconds, lambda: None)
+
         # Reconnect
         while 1:
             if self.handle is None:
@@ -234,6 +240,9 @@
         notifications = self._session_notifications
         self._session_notifications = []
 
+        # all good, reset backoff
+        self._backoff_seconds = 0
+
         for n in notifications:
             n.callback(True)
 
@@ -252,9 +261,12 @@
         """
         if not isinstance(error, (
                 zookeeper.SessionExpiredException,
+                zookeeper.ConnectionLossException,
                 NotConnectedException,
                 zookeeper.ClosingException)):
             raise error
+        if self._backoff_seconds < self._max_backoff_seconds:
+            self._backoff_seconds += 10
         yield self._cb_restablish_session()
         raise zookeeper.ConnectionLossException
 
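
The control flow the diff adds (lengthen the delay on each connection error, wait before reconnecting, reset once the session is back) can be sketched as a standalone example. This is an illustration under stated assumptions, not the branch's code: it uses `asyncio` instead of Twisted so it runs with the standard library alone, and the class `BackoffReconnector`, its injectable `sleep` parameter, and the fake server in `demo` are all hypothetical.

```python
import asyncio


class BackoffReconnector:
    """Hypothetical stand-in for the managed client's reconnect path."""

    def __init__(self, connect, sleep=asyncio.sleep, step=10, maximum=60):
        self._connect = connect  # coroutine function that opens a session
        self._sleep = sleep      # injectable so the demo need not really wait
        self._step = step
        self._maximum = maximum
        self.backoff_seconds = 0

    async def on_connection_error(self):
        # Each error lengthens the wait, up to the cap.
        if self.backoff_seconds < self._maximum:
            self.backoff_seconds += self._step
        await self.reestablish()

    async def reestablish(self):
        # If there have been failures, wait before touching the server again.
        if self.backoff_seconds:
            await self._sleep(self.backoff_seconds)
        await self._connect()
        # Connected: reset so the next outage starts from zero.
        self.backoff_seconds = 0


async def demo():
    waits = []
    outcomes = iter([False, False, True])  # server recovers on the third try

    async def fake_sleep(seconds):
        waits.append(seconds)  # record the delay instead of sleeping

    async def fake_connect():
        if not next(outcomes):
            raise ConnectionError("server still recovering")

    client = BackoffReconnector(fake_connect, sleep=fake_sleep)
    for _ in range(3):
        try:
            await client.on_connection_error()
        except ConnectionError:
            pass  # the next error callback will retry with a longer delay
    return waits, client.backoff_seconds


waits, final_backoff = asyncio.run(demo())
```

In this sketch a failed reconnect leaves the accumulated delay in place, so repeated failures space the attempts further apart, and a successful reconnect clears it. Note that the diff itself calls `deferLater` and `reactor`, which would need to be imported in `managed.py` if they are not already.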
