Merge lp:~rconradharris/nova/lp794672 into lp:~hudson-openstack/nova/trunk
Status: Merged
Approved by: Rick Harris
Approved revision: 1169
Merged at revision: 1169
Proposed branch: lp:~rconradharris/nova/lp794672
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 203 lines (+54/-27), 6 files modified
- nova/flags.py (+4/-2)
- nova/image/__init__.py (+6/-1)
- nova/image/glance.py (+31/-6)
- nova/tests/image/test_glance.py (+1/-3)
- nova/virt/images.py (+1/-10)
- nova/virt/xenapi/vm_utils.py (+11/-5)

To merge this branch: bzr merge lp:~rconradharris/nova/lp794672
Related bugs: (none)
| Reviewer | Review Type | Status |
|---|---|---|
| Mark Washenberger | community | Abstain |
| Brian Waldon | community | Abstain |
| Dan Prince | community | Approve |
| Vish Ishaya | community | Abstain |
| Brian Lamar | community | Needs Information |
| Ed Leafe | community | Approve |

Review via email: mp+64025@code.launchpad.net
Commit message
Description of the change
Allows Nova to talk to multiple Glance APIs (without the need for an external load-balancer). Chooses a random Glance API for each request.
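The selection logic at the heart of the change can be sketched as follows (the server list values here are illustrative; in the patch itself the list comes from the new `glance_api_servers` flag):

```python
import random

# Illustrative server list; the real patch reads FLAGS.glance_api_servers,
# a list of "host:port" strings.
glance_api_servers = ["10.0.0.1:9292", "10.0.0.2:9292", "10.0.0.3:9292"]


def pick_glance_api_server(servers):
    """Pick a random Glance API server and return (host, port)."""
    host, port_str = random.choice(servers).split(':')
    return host, int(port_str)


host, port = pick_glance_api_server(glance_api_servers)
```

Because the choice is made per request rather than per process, even a long-lived service object spreads its traffic across all configured servers.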
Dan Prince (dan-prince) wrote:
Hi Rick,
I think we need to add '/v1' into the generated URL in the image_url function (line 158).
Also, how would an end user actually specify the new Glance API server list in the nova.conf file?
Thanks.
Dan
Brian Waldon (bcwaldon) wrote:
I'm not quite so sure this belongs in Nova. This is adding a very simple load-balancing implementation that, to be useful, needs to be very well thought out. For example, what happens when one of your glance servers is down? What if the glance servers are running different versions of the registry api? I think this is a good idea that should be left up to a deployer.
As a side note, I couldn't get the flag set up correctly. What should that look like?
Dan Prince (dan-prince) wrote:
Sorry. Couple more thoughts here.
Do we want to maintain backwards compatibility with the old glance_host and glance_port flags?
Brian Lamar (blamar) wrote:
I feel like we're trying to program in a solution where a better solution already exists. What is the problem with making glance_host == a load balanced IP and using conventional load balancing to do the job here? Load balancing logic doesn't belong in Nova in my opinion. It's just going to complicate the code unnecessarily.
Jay Pipes (jaypipes) wrote:
I'd agree with the rest of the comments about better solutions being available (using a LB in the deployment). This works, of course, but it's not ideal.
Brian Lamar (blamar) wrote:
What's next? Weighted round-robin? Adding/removing nodes on the fly?
I'd really prefer to leave this out as the only time I'd want to use this is in development...where you don't need load balancing.
Rick Harris (rconradharris) wrote:
Brian Lamar--
> I feel like we're trying to program in a solution where a better solution already exists. What is
> the problem with making glance_host == a load balanced IP and using conventional load balancing to
> do the job here? Load balancing logic doesn't belong in Nova in my opinion. It's just going to
> complicate the code unnecessarily.
I really can't argue with this because, for the most part, this is exactly right. In fact, as mentioned above, that was the original idea behind having a single IP-- we just shove load-balancing off to something that can do it well.
But, leading into your next observation...
> I'd really prefer to leave this out as the only time I'd want to use this is in development...where you don't need load balancing.
There's the rub. According to Ant, they're running into issues testing where Glance is the bottleneck. Of course, we could always say, 'just toss a LB in there'.
The issue with that is, they're going to be spinning-up and tearing-down test envs left and right. It could be a burden to constantly have to configure an external component each time. In this case, dirt-simple `random.choice` is good 'nuff though.
> What's next? Weighted round-robin?
> Adding/removing nodes on the fly?
If each one of those had compelling arguments for why it should be included for testing and dev, then yeah, maybe we should consider adding rudimentary versions of those. But none of the above seem very compelling :).
I think the take-away, from my perspective, is that we may need to add some less-than-ideal versions of components to facilitate testing. Sure, it will complicate the code somewhat. Hopefully the reduced friction, amortized over hundreds of test iterations, makes the trade-off worth it.
Rick Harris (rconradharris) wrote:
Brian Waldon--
> This is adding a very simple load-balancing implementation that, to be useful, needs to be very well thought out. For example, what happens when one of your glance servers is down?
This is just to facilitate load-testing. The recommendation, as before, stands: use a real load-balancer in prod.
> As a side note, I couldn't get the flag set up correctly. What should that look like?
Hmm, I'll look into that.
Dan--
> Do we want to maintain backwards compatibility with the old glance_host and glance_port flags?
Good question.
My preference is to just rip the band-aid off, make the change to using a list, and have one correct way going forward. Post-diablo it will be much harder, I think, to deprecate glance_host and glance_port.
Brian Lamar (blamar) wrote:
Hey Rick, while I was writing that response I had a feeling it was something Ant/Ozone was running into, and I really do see the motivation behind this change.
However, if we're running into load issues during testing I feel the next logical step is to increase the realism in the testing environment instead of molding the code to fit the needs of testing.
We're going to have to automate the deployment of load balancers eventually anyway, right? I know it's one more thing but I still feel this isn't the place to put this code. Maybe others disagree, and I'll step aside if so.
Rick Harris (rconradharris) wrote:
> However, if we're running into load issues during testing I feel the next logical step is to
> increase the realism in the testing environment instead of molding the code to fit the needs of
> testing.
I like this idea, but I think that's a medium-term goal; it's probably not going to solve the testing problem tomorrow.
I would consider this patch (an admittedly hacky) stop-gap to get us by until we get something more prod-like setup in our testing envs.
termie (termie) wrote:
I'd second blamar's last statement, the testing deployment should include the load balancer setup and config, this isn't radically different from any other steps we take via the chef scripts we've used already.
Mark Washenberger (markwash) wrote:
+ self.stubs = stubout.StubOutForTesting()
  self.client = StubGlanceClient(None)
- self.service = glance.GlanceImageService(self.client)
+ self.service = glance.GlanceImageService()
+ self.stubs.Set(…)
  self.context = context.RequestContext(None, None)
This change seems to hurt rather than help. For one, now I have to know about stubout in order to fully understand the tests. For another, now if I change how the glance image service calls its internal variables, I have broken a bunch of tests without doing anything wrong. Can we stick with the way we were doing it before, please?
Rick Harris (rconradharris) wrote:
> Can we stick with the way we were doing it before, please?
Sure.
Though, `stubout` and `mox` are used across a bunch of our tests already. My completely non-scientific opinion is that dependency-injection is the less common pattern in our tests.
Personally, I'd prefer we settle on using stubs, but that might be a question better suited for the ML.
Antony Messerli (antonym) wrote:
Mainly this came out of my discussion with Rick about getting around the bottlenecks of hitting the single glance server when doing 500 builds at once. The Glance server was getting slammed hard from the sheer number of simultaneous builds.
I agree that a full production environment would utilize load balancers to properly spread load across multiple glance-api servers. For a user with a smaller environment, or for stress testing, the patch might be useful to some.
Rick Harris (rconradharris) wrote:
Re-proposing this with fixes to address the comments above. Notes in-line.
> As a side note, I couldn't get the flag set up correctly. What should that look like?
Good catch. Looks like DEFINE_list doesn't work with arbitrarily nested structures, just lists of strings. To handle this, I reworked the code to use a host:port format. So, the config would look like:
--glance_api_servers=host1:9292,host2:9292
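The host:port encoding works because a list-type flag yields flat strings, which each request path then splits apart. A minimal sketch of that parsing (the flag value here is invented for illustration):

```python
# A list-type flag parses a comma-separated value into flat strings, so any
# structure (here, host + port) must be packed into each element.
raw_flag_value = "glance1:9292,glance2:9293"

servers = raw_flag_value.split(',')
parsed = []
for entry in servers:
    host, port_str = entry.split(':')
    parsed.append((host, int(port_str)))
```

This is also why `$my_ip` interpolation can't be used inside the list: the flag machinery only sees opaque strings, not a nested (host, port) structure.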
> I think we need to add '/v1' into generated URL in the image_url function (line 158).
Looks like this was moot. The function `image_urls` was only called within an objectstore specific method in xenapi.vm_utils, so the Glance part of the switch could just go away.
Also, since it's not re-used anywhere else (nor would it be at this point), I just moved the code over to where it's used.
> Can we stick with the way we were doing it before, please?
Done
Vish Ishaya (vishvananda) wrote:
I'm leaning toward the above comments about using a load balancer. Is it too complex to write a testing script that throws up 4 glance servers and HAProxy? We do the above for nova-api in all of our production deploys and it isn't too difficult.
Dan Prince (dan-prince) wrote:
> Dan--
>
> > Do we want to maintain backwards compatibility with the old
> > glance_host and glance_port flags?
>
> Good question.
>
> My preference is to just rip the band-aid off, make the change to using a
> list, and have one correct way going forward. Post-diablo it will be much
> harder, I think, to deprecate glance_host and glance_port.
Thanks Rick. I agree.
After a bit of pondering I think I actually like the single --glance_api_servers flag.
Regarding some of the other comments I would say that nothing you've done here prevents anyone from using a load balancer. Using a load balancer is certainly a good option. However the Glance image service is unique in that it is bandwidth intensive and providing a simple random choice option within nova seems reasonable to me. Sounds like a practical approach to me.
Good work!
Dan Prince (dan-prince) wrote:
BTW, if it gives anyone any confidence: smoke tests for EC2 and the OSAPI pass for this branch after I change nova.conf to use --glance_api_servers.
Cheers.
Brian Waldon (bcwaldon) wrote:
I'm personally not interested in this feature, so I'll let everybody else decide if it goes in. This will definitely achieve the goal of randomizing requests across multiple glance nodes, but it will also randomly expose inconsistencies between them. If we really want a robust "load-balancing" feature we should implement it generically after a blueprint design discussion.
Mark Washenberger (markwash) wrote:
Sorry, Rick, I should have asked about this earlier:
+ self._client = client
+
+ def _get_client(self):
+     # NOTE(sirp): we want to load balance each request across glance
+     # servers. Since GlanceImageService is a long-lived object, `client`
+     # is made to choose a new server each time via this property.
+     if self._client is not None:
+         return self._client
+     glance_host, glance_port = pick_glance_api_server()
+     return GlanceClient(glance_host, glance_port)
+
+ def _set_client(self, client):
+     self._client = client
+
+ client = property(_get_client, _set_client)
I am concerned about consistency problems. If there is a problem with one glance server, is it going to make it too hard to debug? For example, if I connect to one server to list images, and then another server to get images, it might not be clear where the problem is. It also might create some odd situations with pagination. (A lot of these problems would exist with insufficiently sticky load balancers as well.)
If my concern points to an actual problem, perhaps the solution is to look at making the glance image service a short-lived object.
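The pattern Mark quotes can be exercised in isolation. A sketch with a fake client class (names simplified from the patch; `FakeClient` and `SERVERS` are stand-ins, not nova code):

```python
import random

SERVERS = [("10.0.0.1", 9292), ("10.0.0.2", 9292)]


class FakeClient:
    """Stand-in for glance.client.Client."""
    def __init__(self, host, port):
        self.host, self.port = host, port


class ImageService:
    def __init__(self, client=None):
        self._client = client  # an injected client pins every request

    def _get_client(self):
        # No pinned client: choose a fresh server on every attribute
        # access, so even a long-lived service spreads its requests.
        if self._client is not None:
            return self._client
        host, port = random.choice(SERVERS)
        return FakeClient(host, port)

    def _set_client(self, client):
        self._client = client

    client = property(_get_client, _set_client)
```

This makes the trade-off concrete: two consecutive `service.client` accesses may hit different servers, which is exactly the consistency concern raised above, while tests can still inject one fixed client.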
Mark Washenberger (markwash) wrote:
After some more thought, I'm not sure if there's any good way to handle an inconsistency among glance servers that are supposed to be identical (short of statically configuring each compute node in a round-robin fashion). So I'll pull my 'needs information' but I'd still love to hear any thoughts you have about this.
Rick Harris (rconradharris) wrote:
Brian:
> If we really want a robust "load-balancing" feature we should implement it generically after a blueprint design discussion.
From the sound of it, I don't think anybody wants to include real load-balancing logic into Nova.
Mark:
> I am concerned about consistency problems. If there is a problem with one glance server, is it going to make it too hard to debug?
Sporadic failures (timeouts, hung nodes, etc) are a fact of life in any load-balanced setup.
Any step we take to mitigate that will lead us down the rabbit-hole of re-inventing a real-world load-balancer; definitely not a problem we should be solving.
I think this gives us the best trade-off of simplicity vs. does-the-job.
> It also might create some odd situations with pagination.
AFAIK we're passing in the pagination state with the request. Since there's no shared state, this shouldn't be a problem.
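A toy illustration of why marker-based pagination survives server-hopping: the cursor travels with the request, so any identically-stocked server can serve the next page (data and names invented here, not the Glance API):

```python
# Each server holds the same image id list; the marker is the last id the
# client saw, so no server-side session state is needed.
IMAGE_IDS = [1, 2, 3, 4, 5, 6, 7]


def list_images(marker=None, limit=3):
    """Return the next page of ids after `marker`."""
    start = 0 if marker is None else IMAGE_IDS.index(marker) + 1
    return IMAGE_IDS[start:start + limit]


first = list_images()            # answered by one server
second = list_images(marker=3)   # could be answered by a different one
```

The scheme only breaks down if the servers' catalogs drift out of sync, which is the inconsistency risk discussed earlier in the thread.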
Things could get trickier when we start managing sessions as part of authn & authz. Though, as long as we keep that state outside of the glance-api server (using memcached or something), we shouldn't have a problem.
Rick Harris (rconradharris) wrote:
Given that it's at +2 with 3 +0's, I'm going to go ahead and merge this in.
Preview Diff
=== modified file 'nova/flags.py'
--- nova/flags.py	2011-06-03 20:32:42 +0000
+++ nova/flags.py	2011-06-09 21:51:44 +0000
@@ -270,8 +270,10 @@
 DEFINE_string('connection_type', 'libvirt', 'libvirt, xenapi or fake')
 DEFINE_string('aws_access_key_id', 'admin', 'AWS Access ID')
 DEFINE_string('aws_secret_access_key', 'admin', 'AWS Access Key')
-DEFINE_integer('glance_port', 9292, 'glance port')
-DEFINE_string('glance_host', '$my_ip', 'glance host')
+# NOTE(sirp): my_ip interpolation doesn't work within nested structures
+DEFINE_list('glance_api_servers',
+            ['127.0.0.1:9292'],
+            'list of glance api servers available to nova (host:port)')
 DEFINE_integer('s3_port', 3333, 's3 port')
 DEFINE_string('s3_host', '$my_ip', 's3 host (for infrastructure)')
 DEFINE_string('s3_dmz', '$my_ip', 's3 dmz ip (for instances)')

=== modified file 'nova/image/__init__.py'
--- nova/image/__init__.py	2011-06-02 15:12:35 +0000
+++ nova/image/__init__.py	2011-06-09 21:51:44 +0000
@@ -22,6 +22,7 @@
 from nova import exception
 from nova import utils
 from nova import flags
+from nova.image import glance as glance_image_service

 FLAGS = flags.FLAGS

@@ -48,6 +49,8 @@
     return ImageService()


+# FIXME(sirp): perhaps this should be moved to nova/images/glance so that we
+# keep Glance specific code together for the most part
 def get_glance_client(image_href):
     """Get the correct glance client and id for the given image_href.

@@ -62,7 +65,9 @@
     """
     image_href = image_href or 0
     if str(image_href).isdigit():
-        glance_client = GlanceClient(FLAGS.glance_host, FLAGS.glance_port)
+        glance_host, glance_port = \
+            glance_image_service.pick_glance_api_server()
+        glance_client = GlanceClient(glance_host, glance_port)
         return (glance_client, int(image_href))

     try:

=== modified file 'nova/image/glance.py'
--- nova/image/glance.py	2011-06-01 18:58:17 +0000
+++ nova/image/glance.py	2011-06-09 21:51:44 +0000
@@ -20,6 +20,7 @@
 from __future__ import absolute_import

 import datetime
+import random

 from glance.common import exception as glance_exception

@@ -39,6 +40,21 @@
 GlanceClient = utils.import_class('glance.client.Client')


+def pick_glance_api_server():
+    """Return which Glance API server to use for the request
+
+    This method provides a very primitive form of load-balancing suitable for
+    testing and sandbox environments. In production, it would be better to use
+    one IP and route that to a real load-balancer.
+
+    Returns (host, port)
+    """
+    host_port = random.choice(FLAGS.glance_api_servers)
+    host, port_str = host_port.split(':')
+    port = int(port_str)
+    return host, port
+
+
 class GlanceImageService(service.BaseImageService):
     """Provides storage and retrieval of disk image objects within Glance."""

@@ -51,12 +67,21 @@
         GLANCE_ONLY_ATTRS

     def __init__(self, client=None):
-        # FIXME(sirp): can we avoid dependency-injection here by using
-        # stubbing out a fake?
-        if client is None:
-            self.client = GlanceClient(FLAGS.glance_host, FLAGS.glance_port)
-        else:
-            self.client = client
+        self._client = client
+
+    def _get_client(self):
+        # NOTE(sirp): we want to load balance each request across glance
+        # servers. Since GlanceImageService is a long-lived object, `client`
+        # is made to choose a new server each time via this property.
+        if self._client is not None:
+            return self._client
+        glance_host, glance_port = pick_glance_api_server()
+        return GlanceClient(glance_host, glance_port)
+
+    def _set_client(self, client):
+        self._client = client
+
+    client = property(_get_client, _set_client)

     def index(self, context, filters=None, marker=None, limit=None):
         """Calls out to Glance for a list of images available."""

=== modified file 'nova/tests/image/test_glance.py'
--- nova/tests/image/test_glance.py	2011-05-31 20:30:09 +0000
+++ nova/tests/image/test_glance.py	2011-06-09 21:51:44 +0000
@@ -60,10 +60,8 @@
     NOW_DATETIME = datetime.datetime(2010, 10, 11, 10, 30, 22)

     def setUp(self):
-        # FIXME(sirp): we can probably use stubs library here rather than
-        # dependency injection
         self.client = StubGlanceClient(None)
-        self.service = glance.GlanceImageService(self.client)
+        self.service = glance.GlanceImageService(client=self.client)
         self.context = context.RequestContext(None, None)

     def assertDateTimesFilled(self, image_meta):

=== modified file 'nova/virt/images.py'
--- nova/virt/images.py	2011-05-25 23:57:04 +0000
+++ nova/virt/images.py	2011-06-09 21:51:44 +0000
@@ -23,6 +23,7 @@

 from nova import context
 from nova import flags
+from nova.image import glance as glance_image_service
 import nova.image
 from nova import log as logging
 from nova import utils
@@ -42,13 +43,3 @@
     elevated = context.get_admin_context()
     metadata = image_service.get(elevated, image_id, image_file)
     return metadata
-
-
-# TODO(vish): xenapi should use the glance client code directly instead
-# of retrieving the image using this method.
-def image_url(image):
-    if FLAGS.image_service == "nova.image.glance.GlanceImageService":
-        return "http://%s:%s/images/%s" % (FLAGS.glance_host,
-                                           FLAGS.glance_port, image)
-    return "http://%s:%s/_images/%s/image" % (FLAGS.s3_host, FLAGS.s3_port,
-                                              image)

=== modified file 'nova/virt/xenapi/vm_utils.py'
--- nova/virt/xenapi/vm_utils.py	2011-06-02 23:24:09 +0000
+++ nova/virt/xenapi/vm_utils.py	2011-06-09 21:51:44 +0000
@@ -33,6 +33,7 @@
 from nova import exception
 from nova import flags
 import nova.image
+from nova.image import glance as glance_image_service
 from nova import log as logging
 from nova import utils
 from nova.auth.manager import AuthManager
@@ -358,10 +359,12 @@

         os_type = instance.os_type or FLAGS.default_os_type

+        glance_host, glance_port = \
+            glance_image_service.pick_glance_api_server()
         params = {'vdi_uuids': vdi_uuids,
                   'image_id': image_id,
-                  'glance_host': FLAGS.glance_host,
-                  'glance_port': FLAGS.glance_port,
+                  'glance_host': glance_host,
+                  'glance_port': glance_port,
                   'sr_path': cls.get_sr_path(session),
                   'os_type': os_type}

@@ -409,9 +412,11 @@
         # here (under Python 2.6+) and pass them as arguments
         uuid_stack = [str(uuid.uuid4()) for i in xrange(2)]

+        glance_host, glance_port = \
+            glance_image_service.pick_glance_api_server()
         params = {'image_id': image,
-                  'glance_host': FLAGS.glance_host,
-                  'glance_port': FLAGS.glance_port,
+                  'glance_host': glance_host,
+                  'glance_port': glance_port,
                   'uuid_stack': uuid_stack,
                   'sr_path': cls.get_sr_path(session)}

@@ -576,7 +581,8 @@
         Returns: A single filename if image_type is KERNEL_RAMDISK
                  A list of dictionaries that describe VDIs, otherwise
         """
-        url = images.image_url(image)
+        url = "http://%s:%s/_images/%s/image" % (FLAGS.s3_host, FLAGS.s3_port,
+                                                image)
         LOG.debug(_("Asking xapi to fetch %(url)s as %(access)s") % locals())
         if image_type == ImageType.KERNEL_RAMDISK:
             fn = 'get_kernel'
lgtm.