Merge lp:~jk0/nova/lp661472 into lp:~hudson-openstack/nova/trunk

Proposed by Josh Kearney
Status: Merged
Merge reported by: Josh Kearney
Merged at revision: not available
Proposed branch: lp:~jk0/nova/lp661472
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 43 lines (+16/-3)
1 file modified
nova/rpc.py (+16/-3)
To merge this branch: bzr merge lp:~jk0/nova/lp661472
Reviewer Review Type Date Requested Status
Vish Ishaya (community) Approve
Joshua McKenty (community) Approve
Review via email: mp+41373@code.launchpad.net

Commit message

Check for running AMQP instances.

Description of the change

To post a comment you must log in.
Revision history for this message
Vish Ishaya (vishvananda) wrote :

A couple of issues issues:

1. This won't pass pep8

2. We may need something better than time.sleep(30) on fetch. This will block the reactor while we're waiting to reconnect, stopping us from doing any other work like checking the current state of the instances and doing recovery or updating the db. Is there a reason we can't keep the failed connection flag and just add the self.declare() to the existing code?

Revision history for this message
Matt Dietz (cerberus) wrote :

Is there some reason we don't want to actually log the exceptions on #22 and #58?

lp:~jk0/nova/lp661472 updated
409. By Josh Kearney

Reverted some changes

Revision history for this message
Vish Ishaya (vishvananda) wrote :

LGTM. Consider changing logging.warning to logging.exception in the init code. Might be good to have traceback in case it is a configuration issue instead of rabbit actually being down.

review: Approve
lp:~jk0/nova/lp661472 updated
410. By Josh Kearney

Use logging.exception instead

Revision history for this message
Joshua McKenty (joshua-mckenty) wrote :

LGTM. Would consider adding a maximum number of retries, or a decay on the 30 second limit. The 30 second should be a FLAG or a CONST as well, IMO.

review: Approve
Revision history for this message
Josh Kearney (jk0) wrote :

I like the idea of making it a constant with a maximum number of retries. Would the best course of action be to sys.exit() once that limit is hit (giving a friendly error message), or should we let it continue to run?

Also -- is 30 seconds ideal and what should the retry max be?

Revision history for this message
Vish Ishaya (vishvananda) wrote :

lgtm.

review: Approve
Revision history for this message
Joshua McKenty (joshua-mckenty) wrote :

I would make it 10 seconds (I think a lot of Rabbit errors are transient network issues), with a maximum 12 tries for a total of 2 minutes. (OR, a bunch more config FLAGS.)

And I'd definitely sys.exit(<nonzero>), so that a monitoring system like monit, etc. can show the shutdown as a failure.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'nova/rpc.py'
2--- nova/rpc.py 2010-11-03 23:04:01 +0000
3+++ nova/rpc.py 2010-11-20 18:19:13 +0000
4@@ -24,6 +24,7 @@
5 import json
6 import logging
7 import sys
8+import time
9 import uuid
10
11 from carrot import connection as carrot_connection
12@@ -83,7 +84,18 @@
13 """
14 def __init__(self, *args, **kwargs):
15 self.failed_connection = False
16- super(Consumer, self).__init__(*args, **kwargs)
17+
18+ while True:
19+ try:
20+ super(Consumer, self).__init__(*args, **kwargs)
21+ break
22+ except: # Catching all because carrot sucks
23+ logging.exception("AMQP server on %s:%d is unreachable. " \
24+ "Trying again in 30 seconds." % (
25+ FLAGS.rabbit_host,
26+ FLAGS.rabbit_port))
27+ time.sleep(30)
28+ continue
29
30 def fetch(self, no_ack=None, auto_ack=None, enable_callbacks=False):
31 """Wraps the parent fetch with some logic for failed connections"""
32@@ -94,8 +106,9 @@
33 # NOTE(vish): conn is defined in the parent class, we can
34 # recreate it as long as we create the backend too
35 # pylint: disable-msg=W0201
36- self.conn = Connection.recreate()
37- self.backend = self.conn.create_backend()
38+ self.connection = Connection.recreate()
39+ self.backend = self.connection.create_backend()
40+ self.declare()
41 super(Consumer, self).fetch(no_ack, auto_ack, enable_callbacks)
42 if self.failed_connection:
43 logging.error("Reconnected to queue")