Merge lp:~rconradharris/nova/lp794672 into lp:~hudson-openstack/nova/trunk

Proposed by Rick Harris
Status: Merged
Approved by: Rick Harris
Approved revision: 1169
Merged at revision: 1169
Proposed branch: lp:~rconradharris/nova/lp794672
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 203 lines (+54/-27)
6 files modified
nova/flags.py (+4/-2)
nova/image/__init__.py (+6/-1)
nova/image/glance.py (+31/-6)
nova/tests/image/test_glance.py (+1/-3)
nova/virt/images.py (+1/-10)
nova/virt/xenapi/vm_utils.py (+11/-5)
To merge this branch: bzr merge lp:~rconradharris/nova/lp794672
Reviewer Review Type Date Requested Status
Mark Washenberger (community) Abstain
Brian Waldon (community) Abstain
Dan Prince (community) Approve
Vish Ishaya (community) Abstain
Brian Lamar (community) Needs Information
Ed Leafe (community) Approve
Review via email: mp+64025@code.launchpad.net

Description of the change

Allows Nova to talk to multiple Glance API servers (without the need for an external load-balancer). Chooses a random Glance API server for each request.
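
In sketch form, the selection is just a random pick from a configured list (condensed from pick_glance_api_server() in nova/image/glance.py, shown in the diff below; FLAGS is nova's usual flags object):

    import random

    def pick_glance_api_server():
        """Return (host, port) for a randomly chosen Glance API server."""
        host_port = random.choice(FLAGS.glance_api_servers)  # e.g. '127.0.0.1:9292'
        host, port_str = host_port.split(':')
        return host, int(port_str)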

Revision history for this message
Ed Leafe (ed-leafe) wrote :

lgtm.

review: Approve
Revision history for this message
Dan Prince (dan-prince) wrote :

Hi Rick,

I think we need to add '/v1' to the generated URL in the image_url function (line 158).

Also, how would an end user actually specify the new Glance API server list in the nova.conf file?

Thanks.

Dan

review: Needs Fixing
Revision history for this message
Brian Waldon (bcwaldon) wrote :

I'm not quite so sure this belongs in Nova. This is adding a very simple load-balancing implementation that, to be useful, needs to be very well thought out. For example, what happens when one of your glance servers is down? What if the glance servers are running different versions of the registry api? I think this is a good idea that should be left up to a deployer.

As a side note, I couldn't get the flag set up correctly. What should that look like?

review: Needs Information
Revision history for this message
Dan Prince (dan-prince) wrote :

Sorry. Couple more thoughts here.

Do we want to maintain backwards compatibility with the old glance_host/glance_port flags as well? I mean, if a user specifies only the glance_host/glance_port options, we could use those; if the new glance_api_servers is specified, we could use it instead of the legacy flags.
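
Something along these lines (a hypothetical sketch; it assumes glance_api_servers would default to empty so the legacy pair can win):

    # Hypothetical fallback: prefer the new flag, else synthesize a
    # single-entry list from the legacy host/port pair.
    if FLAGS.glance_api_servers:
        api_servers = FLAGS.glance_api_servers
    else:
        api_servers = ['%s:%s' % (FLAGS.glance_host, FLAGS.glance_port)]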

Revision history for this message
Brian Lamar (blamar) wrote :

I feel like we're trying to program in a solution where a better solution already exists. What is the problem with making glance_host == a load balanced IP and using conventional load balancing to do the job here? Load balancing logic doesn't belong in Nova in my opinion. It's just going to complicate the code unnecessarily.

review: Needs Information
Revision history for this message
Jay Pipes (jaypipes) wrote :

I'd agree with the rest of the comments about better solutions being available (using an LB in the deployment). This works, of course, but it's not ideal.

Revision history for this message
Brian Lamar (blamar) wrote :

What's next? Weighted-least-connections algorithm support? Automatic health checks? Caching? Adding/removing nodes on the fly?

I'd really prefer to leave this out as the only time I'd want to use this is in development...where you don't need load balancing.

Revision history for this message
Rick Harris (rconradharris) wrote :

Brian Lamar--

> I feel like we're trying to program in a solution where a better solution already exists. What is
> the problem with making glance_host == a load balanced IP and using conventional load balancing to
> do the job here? Load balancing logic doesn't belong in Nova in my opinion. It's just going to
> complicate the code unnecessarily.

I really can't argue with this because, for the most part, this is exactly right. In fact, as mentioned above, that was the original idea behind having a single IP-- we just shove load-balancing off to something that can do it well.

But, leading into your next observation...

> I'd really prefer to leave this out as the only time I'd want to use this is in development...where you don't need load balancing.

There's the rub. According to Ant, they're running into issues in testing where Glance is the bottleneck. Of course, we could always say, 'just toss an LB in there'.

The issue with that is, they're going to be spinning up and tearing down test envs left and right. It could be a burden to constantly have to configure an external component each time. In this case, dirt-simple `random.choice` is good 'nuff though.

> What's next? Weighted-least-connections algorithm support? Automatic health checks? Caching?
> Adding/removing nodes on the fly?

If each one of those had compelling arguments for why it should be included for testing and dev, then yeah, maybe we should consider adding rudimentary versions of those. But none of the above seem very compelling :).

I think the take-away is, from my perspective, that we may need to add some less-than-ideal versions of components to facilitate testing. Sure, it will complicate the code somewhat. Hopefully the reduced friction, amortized over hundreds of test iterations, makes the trade-off worth it.

Revision history for this message
Rick Harris (rconradharris) wrote :

Brian Waldon--

> This is adding a very simple load-balancing implementation that, to be useful, needs to be very well thought out. For example, what happens when one of your glance servers is down

This is just to facilitate load-testing. The recommendation, as before, stands: use a real load-balancer in prod.

> As a side note, I couldn't get the flag set up correctly. What should that look like?

Hmm, I'll look into that.

Dan--

> Do we want to maintain backwards compatibility with the old glance_host/glance_port flags as well?

Good question.

My preference is to just rip the band-aid off, make the change to using a list, and have one correct way going forward. Post-diablo it will be much harder, I think, to deprecate glance_host and glance_port.

Revision history for this message
Brian Lamar (blamar) wrote :

Hey Rick, while I was writing that response I had a feeling it was something Ant/Ozone was running into, and I really see the motivation behind this change.

However, if we're running into load issues during testing I feel the next logical step is to increase the realism in the testing environment instead of molding the code to fit the needs of testing.

We're going to have to automate the deployment of load balancers eventually anyway, right? I know it's one more thing but I still feel this isn't the place to put this code. Maybe others disagree, and I'll step aside if so.

Revision history for this message
Rick Harris (rconradharris) wrote :

> However, if we're running into load issues during testing I feel the next logical step is to
> increase the realism in the testing environment instead of molding the code to fit the needs of
> testing.

I like this idea, but I think that's a medium-term goal; it's probably not going to solve the testing problem tomorrow.

I would consider this patch an (admittedly hacky) stop-gap to get us by until we get something more prod-like set up in our testing envs.

Revision history for this message
termie (termie) wrote :

I'd second blamar's last statement: the testing deployment should include the load balancer setup and config. This isn't radically different from the other steps we already take via the chef scripts we've used.

Revision history for this message
Mark Washenberger (markwash) wrote :

124 + self.stubs = stubout.StubOutForTesting()
125 self.client = StubGlanceClient(None)
126 - self.service = glance.GlanceImageService(self.client)
127 + self.service = glance.GlanceImageService()
128 + self.stubs.Set(self.service, 'client', self.client)
129 self.context = context.RequestContext(None, None)

This change seems to hurt rather than help. For one, now I have to know about stubout in order to fully understand the tests. For another, if I change what the glance image service calls its internal variables, I break a bunch of tests without doing anything wrong. Can we stick with the way we were doing it before, please?

Revision history for this message
Rick Harris (rconradharris) wrote :

> Can we stick with the way we were doing it before, please?

Sure.

Though, `stubout` and `mox` are used across a bunch of our tests already. My completely non-scientific opinion is that dependency-injection is used much less frequently. So, there is a risk that we're making things worse by having two separate ways of achieving the exact same thing.

Personally, I'd prefer we settle on using stubs, but that might be a question better suited for the ML.
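
For concreteness, the two styles at issue look like this (names taken from the test code quoted above):

    # Dependency injection: hand the fake client to the constructor.
    service = glance.GlanceImageService(client=StubGlanceClient(None))

    # Stubs: build the real object, then patch the attribute afterwards.
    stubs = stubout.StubOutForTesting()
    service = glance.GlanceImageService()
    stubs.Set(service, 'client', StubGlanceClient(None))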

Revision history for this message
Antony Messerli (antonym) wrote :

Mainly this came out of my discussion with Rick about getting around the bottleneck of hitting a single glance server when doing 500 builds at once. The Glance server was getting slammed hard by the sheer number of simultaneous builds.

I agree that a full production environment would utilize load balancers to properly spread load across multiple glance-api servers. For a user with a smaller environment, or for stress testing, the patch might be useful.

Revision history for this message
Rick Harris (rconradharris) wrote :

Re-proposing this with fixes to address the comments above. Notes in-line.

> As a side note, I couldn't get the flag set up correctly. What should that look like?

Good catch. Looks like DEFINE_list doesn't work with arbitrarily nested structures, just lists of strings. To handle this, I reworked the code to use a host:port format. So, the config would look like:

  --glance_api_servers=127.0.0.1:9292,92.0.0.2:9292
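
(In a nova.conf flagfile this would presumably be the same flag on its own line; hostnames here are placeholders:)

  --glance_api_servers=glance1.example.com:9292,glance2.example.com:9292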

> I think we need to add '/v1' into generated URL in the image_url function (line 158).

Looks like this was moot. The function `image_url` was only called within an objectstore-specific method in xenapi.vm_utils, so the Glance part of the switch could just go away.

Also, since it's not re-used anywhere else (nor would it be at this point), I just moved the code over to where it's used.

> Can we stick with the way we were doing it before, please?

Done

Revision history for this message
Vish Ishaya (vishvananda) wrote :

I'm leaning toward the above comments about using a load balancer. Is it too complex to write a testing script that throws up 4 glance servers and HAProxy? We do the above for nova-api in all of our production deploys and it isn't too difficult.

review: Abstain
Revision history for this message
Dan Prince (dan-prince) wrote :

> Dan--
>
> > Do we want to maintain backwards compatibility with the old
> glance_host/glance_port flags as well?
>
> Good question.
>
> My preference is to just rip the band-aid off, make the change to using a
> list, and have one correct way going forward. Post-diablo it will be much
> harder, I think, to deprecate glance_host and glance_port.

Thanks Rick. I agree.

After a bit of pondering I think I actually like the single --glance_api_servers flag better anyway.

Regarding some of the other comments, I would say that nothing you've done here prevents anyone from using a load balancer, which is certainly a good option. However, the Glance image service is unique in that it is bandwidth-intensive, and providing a simple random-choice option within nova seems reasonable. Sounds like a practical approach to me.

Good work!

review: Approve
Revision history for this message
Dan Prince (dan-prince) wrote :

BTW, if it gives anyone any confidence: smoke tests for EC2 and the OSAPI pass for this branch after I make the change to nova.conf to use --glance_api_servers.

Cheers.

Revision history for this message
Brian Waldon (bcwaldon) wrote :

I'm personally not interested in this feature, so I'll let everybody else decide if it goes in. This will definitely achieve the goal of randomizing requests across multiple glance nodes, but it will also randomly expose inconsistencies between them. If we really want a robust "load-balancing" feature we should implement it generically after a blueprint design discussion.

review: Abstain
Revision history for this message
Mark Washenberger (markwash) wrote :

Sorry, Rick, I should have asked about this earlier:

93 + self._client = client
94 +
95 + def _get_client(self):
96 + # NOTE(sirp): we want to load balance each request across glance
97 + # servers. Since GlanceImageService is a long-lived object, `client`
98 + # is made to choose a new server each time via this property.
99 + if self._client is not None:
100 + return self._client
101 + glance_host, glance_port = pick_glance_api_server()
102 + return GlanceClient(glance_host, glance_port)
103 +
104 + def _set_client(self, client):
105 + self._client = client
106 +
107 + client = property(_get_client, _set_client)

I am concerned about consistency problems. If there is a problem with one glance server, is it going to make it too hard to debug? For example, if I connect to one server to list images, and then another server to get images, it might not be clear where the problem is. It also might create some odd situations with pagination. (A lot of these problems would exist with insufficiently sticky load balancers as well.)

If my concern points to an actual problem, perhaps the solution is to look at making the glance image service a short-lived object.
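
Concretely, the sketch I have in mind (using names from this branch) would pin one backend per request:

    # Pick a backend once, then pin it for the whole request, so every
    # call within that request hits the same glance server.
    host, port = glance.pick_glance_api_server()
    service = glance.GlanceImageService(client=GlanceClient(host, port))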

review: Needs Information
Revision history for this message
Mark Washenberger (markwash) wrote :

After some more thought, I'm not sure if there's any good way to handle an inconsistency among glance servers that are supposed to be identical (short of statically configuring each compute node in a round-robin fashion). So I'll pull my 'needs information' but I'd still love to hear any thoughts you have about this.

Revision history for this message
Mark Washenberger (markwash) :
review: Abstain
Revision history for this message
Rick Harris (rconradharris) wrote :

Brian:

> If we really want a robust "load-balancing" feature we should implement it generically after a blueprint design discussion.

From the sound of it, I don't think anybody wants to include real load-balancing logic in Nova.

Mark:
> I am concerned about consistency problems. If there is a problem with one glance server, is it going to make it too hard to debug?

Sporadic failures (timeouts, hung nodes, etc.) are a fact of life in any load-balanced setup.

Any step we take to mitigate that will lead us down the rabbit-hole of re-inventing a real-world load-balancer; definitely not a problem we should be solving.

I think this gives us the best trade-off of simplicity vs. does-the-job.

> It also might create some odd situations with pagination

AFAIK we're passing in the pagination state with the request. Since there's no shared state, this shouldn't be a problem.
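
A quick sketch of why that holds (index() as defined in this branch; treating 'id' as the marker field is an assumption):

    # Each call carries its own pagination state, so any backend can
    # serve the next page; no server-side cursor is involved.
    page = service.index(context, limit=25)
    while page:
        marker = page[-1]['id']
        page = service.index(context, marker=marker, limit=25)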

Things could get trickier when we start managing sessions as part of authn & authz. Though, as long as we keep that state outside of the glance-api server (using memcached or something), we shouldn't have a problem.

Revision history for this message
Rick Harris (rconradharris) wrote :

Given that it's at +2 with 3 +0's, I'm going to go ahead and merge this in.

Preview Diff

1=== modified file 'nova/flags.py'
2--- nova/flags.py 2011-06-03 20:32:42 +0000
3+++ nova/flags.py 2011-06-09 21:51:44 +0000
4@@ -270,8 +270,10 @@
5 DEFINE_string('connection_type', 'libvirt', 'libvirt, xenapi or fake')
6 DEFINE_string('aws_access_key_id', 'admin', 'AWS Access ID')
7 DEFINE_string('aws_secret_access_key', 'admin', 'AWS Access Key')
8-DEFINE_integer('glance_port', 9292, 'glance port')
9-DEFINE_string('glance_host', '$my_ip', 'glance host')
10+# NOTE(sirp): my_ip interpolation doesn't work within nested structures
11+DEFINE_list('glance_api_servers',
12+ ['127.0.0.1:9292'],
13+ 'list of glance api servers available to nova (host:port)')
14 DEFINE_integer('s3_port', 3333, 's3 port')
15 DEFINE_string('s3_host', '$my_ip', 's3 host (for infrastructure)')
16 DEFINE_string('s3_dmz', '$my_ip', 's3 dmz ip (for instances)')
17
18=== modified file 'nova/image/__init__.py'
19--- nova/image/__init__.py 2011-06-02 15:12:35 +0000
20+++ nova/image/__init__.py 2011-06-09 21:51:44 +0000
21@@ -22,6 +22,7 @@
22 from nova import exception
23 from nova import utils
24 from nova import flags
25+from nova.image import glance as glance_image_service
26
27 FLAGS = flags.FLAGS
28
29@@ -48,6 +49,8 @@
30 return ImageService()
31
32
33+# FIXME(sirp): perhaps this should be moved to nova/images/glance so that we
34+# keep Glance specific code together for the most part
35 def get_glance_client(image_href):
36 """Get the correct glance client and id for the given image_href.
37
38@@ -62,7 +65,9 @@
39 """
40 image_href = image_href or 0
41 if str(image_href).isdigit():
42- glance_client = GlanceClient(FLAGS.glance_host, FLAGS.glance_port)
43+ glance_host, glance_port = \
44+ glance_image_service.pick_glance_api_server()
45+ glance_client = GlanceClient(glance_host, glance_port)
46 return (glance_client, int(image_href))
47
48 try:
49
50=== modified file 'nova/image/glance.py'
51--- nova/image/glance.py 2011-06-01 18:58:17 +0000
52+++ nova/image/glance.py 2011-06-09 21:51:44 +0000
53@@ -20,6 +20,7 @@
54 from __future__ import absolute_import
55
56 import datetime
57+import random
58
59 from glance.common import exception as glance_exception
60
61@@ -39,6 +40,21 @@
62 GlanceClient = utils.import_class('glance.client.Client')
63
64
65+def pick_glance_api_server():
66+ """Return which Glance API server to use for the request
67+
68+ This method provides a very primitive form of load-balancing suitable for
69+ testing and sandbox environments. In production, it would be better to use
70+ one IP and route that to a real load-balancer.
71+
72+ Returns (host, port)
73+ """
74+ host_port = random.choice(FLAGS.glance_api_servers)
75+ host, port_str = host_port.split(':')
76+ port = int(port_str)
77+ return host, port
78+
79+
80 class GlanceImageService(service.BaseImageService):
81 """Provides storage and retrieval of disk image objects within Glance."""
82
83@@ -51,12 +67,21 @@
84 GLANCE_ONLY_ATTRS
85
86 def __init__(self, client=None):
87- # FIXME(sirp): can we avoid dependency-injection here by using
88- # stubbing out a fake?
89- if client is None:
90- self.client = GlanceClient(FLAGS.glance_host, FLAGS.glance_port)
91- else:
92- self.client = client
93+ self._client = client
94+
95+ def _get_client(self):
96+ # NOTE(sirp): we want to load balance each request across glance
97+ # servers. Since GlanceImageService is a long-lived object, `client`
98+ # is made to choose a new server each time via this property.
99+ if self._client is not None:
100+ return self._client
101+ glance_host, glance_port = pick_glance_api_server()
102+ return GlanceClient(glance_host, glance_port)
103+
104+ def _set_client(self, client):
105+ self._client = client
106+
107+ client = property(_get_client, _set_client)
108
109 def index(self, context, filters=None, marker=None, limit=None):
110 """Calls out to Glance for a list of images available."""
111
112=== modified file 'nova/tests/image/test_glance.py'
113--- nova/tests/image/test_glance.py 2011-05-31 20:30:09 +0000
114+++ nova/tests/image/test_glance.py 2011-06-09 21:51:44 +0000
115@@ -60,10 +60,8 @@
116 NOW_DATETIME = datetime.datetime(2010, 10, 11, 10, 30, 22)
117
118 def setUp(self):
119- # FIXME(sirp): we can probably use stubs library here rather than
120- # dependency injection
121 self.client = StubGlanceClient(None)
122- self.service = glance.GlanceImageService(self.client)
123+ self.service = glance.GlanceImageService(client=self.client)
124 self.context = context.RequestContext(None, None)
125
126 def assertDateTimesFilled(self, image_meta):
127
128=== modified file 'nova/virt/images.py'
129--- nova/virt/images.py 2011-05-25 23:57:04 +0000
130+++ nova/virt/images.py 2011-06-09 21:51:44 +0000
131@@ -23,6 +23,7 @@
132
133 from nova import context
134 from nova import flags
135+from nova.image import glance as glance_image_service
136 import nova.image
137 from nova import log as logging
138 from nova import utils
139@@ -42,13 +43,3 @@
140 elevated = context.get_admin_context()
141 metadata = image_service.get(elevated, image_id, image_file)
142 return metadata
143-
144-
145-# TODO(vish): xenapi should use the glance client code directly instead
146-# of retrieving the image using this method.
147-def image_url(image):
148- if FLAGS.image_service == "nova.image.glance.GlanceImageService":
149- return "http://%s:%s/images/%s" % (FLAGS.glance_host,
150- FLAGS.glance_port, image)
151- return "http://%s:%s/_images/%s/image" % (FLAGS.s3_host, FLAGS.s3_port,
152- image)
153
154=== modified file 'nova/virt/xenapi/vm_utils.py'
155--- nova/virt/xenapi/vm_utils.py 2011-06-02 23:24:09 +0000
156+++ nova/virt/xenapi/vm_utils.py 2011-06-09 21:51:44 +0000
157@@ -33,6 +33,7 @@
158 from nova import exception
159 from nova import flags
160 import nova.image
161+from nova.image import glance as glance_image_service
162 from nova import log as logging
163 from nova import utils
164 from nova.auth.manager import AuthManager
165@@ -358,10 +359,12 @@
166
167 os_type = instance.os_type or FLAGS.default_os_type
168
169+ glance_host, glance_port = \
170+ glance_image_service.pick_glance_api_server()
171 params = {'vdi_uuids': vdi_uuids,
172 'image_id': image_id,
173- 'glance_host': FLAGS.glance_host,
174- 'glance_port': FLAGS.glance_port,
175+ 'glance_host': glance_host,
176+ 'glance_port': glance_port,
177 'sr_path': cls.get_sr_path(session),
178 'os_type': os_type}
179
180@@ -409,9 +412,11 @@
181 # here (under Python 2.6+) and pass them as arguments
182 uuid_stack = [str(uuid.uuid4()) for i in xrange(2)]
183
184+ glance_host, glance_port = \
185+ glance_image_service.pick_glance_api_server()
186 params = {'image_id': image,
187- 'glance_host': FLAGS.glance_host,
188- 'glance_port': FLAGS.glance_port,
189+ 'glance_host': glance_host,
190+ 'glance_port': glance_port,
191 'uuid_stack': uuid_stack,
192 'sr_path': cls.get_sr_path(session)}
193
194@@ -576,7 +581,8 @@
195 Returns: A single filename if image_type is KERNEL_RAMDISK
196 A list of dictionaries that describe VDIs, otherwise
197 """
198- url = images.image_url(image)
199+ url = "http://%s:%s/_images/%s/image" % (FLAGS.s3_host, FLAGS.s3_port,
200+ image)
201 LOG.debug(_("Asking xapi to fetch %(url)s as %(access)s") % locals())
202 if image_type == ImageType.KERNEL_RAMDISK:
203 fn = 'get_kernel'