Merge lp:~sandy-walsh/nova/zones-dev-docs into lp:~hudson-openstack/nova/trunk

Proposed by Sandy Walsh
Status: Merged
Approved by: Sandy Walsh
Approved revision: 1124
Merged at revision: 1151
Proposed branch: lp:~sandy-walsh/nova/zones-dev-docs
Merge into: lp:~hudson-openstack/nova/trunk
Diff against target: 172 lines (+168/-0)
1 file modified
doc/source/devref/distributed_scheduler.rst (+168/-0)
To merge this branch: bzr merge lp:~sandy-walsh/nova/zones-dev-docs
Reviewer Review Type Date Requested Status
fred yang (community) Approve
Brian Waldon (community) Approve
Brian Lamar (community) Approve
Review via email: mp+62870@code.launchpad.net

Commit message

Distributed Scheduler developer docs.

Description of the change

Docs for Distributed Scheduler.

Assumes knowledge of the dist-sched-1, dist-sched-2a, dist-sched-2b and dist-sched-3 branches.

It also talks about the upcoming dist-sched-4 branch (reservation ids)

No illustrations yet, but it could definitely use some.

Kind of stream-of-consciousness, so any feedback would be appreciated. I'll adjust accordingly.

Revision history for this message
Ed Leafe (ed-leafe) wrote :

Line 24: Not crazy about the "dating service" analogy, but if you use that, you should follow through and explain the similarities in terms of matching interests for people with matching requirements for instances.

Line 36: "Trying to put an Instance with 8G of RAM on a Host that only has 4G remaining would have a very high cost." - That is incorrect; attempting to do that should be impossible, and should fall under the filtering section, not the weighing.

Line 55: 'Filtered' and 'Weighed' should not be capitalized. If you want to emphasize those words, they should be bolded. This pattern is repeated often throughout the document, making it read like German where every noun is capitalized.

Line 56: 'DISK' is not an acronym and should not be in caps.

Line 66: Filtering is not subjective; it is absolute. Either an instance meets the requirements, or it doesn't.

Line 70: CPU and RAM should be capitalized. And if you're going to say "we'll discuss that later", you should specify the section of the document in which it is discussed.

Lines 68-82: It should be made clear at the start of this section that you are describing the original flow of events before the addition of zones. When I read this, I started writing up things that are not accurate, until I got a little further along and realized you were talking about the old way (pre-zone) of doing things.

Line 78: No apostrophe is needed for plurals.

Line 80: The class is `ChanceScheduler`, not `ChangeScheduler`.

Line 91: No apostrophe is needed for plurals.

Lines 96-106: It is not clear here that when the scheduler handling the create request gets back the weighted list of hosts, it has no idea about the location or identity of any host that is not local to its own zone. It can therefore not send a `POST /servers` request to the "relevant child zone", since it has no idea which child is the relevant one. Instead, the `POST /servers` call is made to *all* child zones, and these zones then attempt to decrypt the body. If a zone cannot, the request is not for it, and it returns a 404 or equivalent. If the child can decrypt the request, it proceeds with the request and either processes it locally, or repeats the process with its own child zones.

Line 102: It should be emphasized here that the encryption key must be unique for each zone. That is mentioned much later in the document, but as the key uniqueness is the fundamental mechanism for handling zone routing, it deserves mention here, too.

Line 108: No apostrophe is needed for plurals.

Line 115: instead of using the humorous '... and that would be bad', it would be better to say ', which is not what is intended by the original GET request'.

Line 126: change 'deployment specific' to 'deployment-specific'.

Line 128: should end with a colon, and line 130-131 should be indented or made into a list.

Line 133: filtering and weighing should be distinct concepts. Either something makes it through a filter, or it doesn't. It makes no sense to talk of "weight tuples"; they should simply be "host tuples".

Line 141: for demonstration purposes, it would make more sense for the basic weigh_hosts() method to return a random number for the weight, so ...


Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

Thanks Ed, making changes.

Your point about lines 96-106 isn't correct, however; we do know which child zone the request has to go to, just not the host within the zone (or any child zone within it). We know which zone we sent the /zones/select to, so we know where to send the /servers request.

Revision history for this message
fred yang (fred-yang) wrote :

Sandy,

We are working to enable OpenStack with a trusted computing pool capability, but are wondering how to enable this if the host information returned from select is an encrypted blob, since the trusted computing pool is based on target host names. Can you suggest a method, or correct me if my understanding is wrong?

Background of Trusted computing pool -
Intel Trusted Execution Technology (TXT) http://www.intel.com/technology/security/ provides a platform Root of Trust to verify that a platform is booted with the expected Hypervisor/ServiceOS by measuring the Hypervisor/ServiceOS's hash during platform boot. We have already enabled Intel TXT in Xen/KVM/VMware.

The following describes the flow and highlights the usage model -
1. A target host is booted with TXT enabled - the hypervisor/ServiceOS is measured by TXT and the measured hash value is saved into TPM registers per http://www.trustedcomputinggroup.org/developers/
2. A standalone Attestation Server challenges target hosts, at run-time, to retrieve the TPM registers
3. The Attestation Server verifies the retrieved registers against known-good hashes pre-configured by the Administrator to decide whether the target host was indeed booted with the expected Hypervisor

The standalone Attestation Server is 1) hosted by the cloud provider, 2) exports a query API over https for the admin to verify a target host, and 3) provides the service by target host name or IP address.

Approach to enhancing OpenStack -
1. Add an API to call the https API exported by the Attestation Server.
2. Tag the hosts on the list returned from the flavor filter.
3. Call Query(Host) through the Attestation Server to verify a host's trustworthiness if the user specifies trust in the flavor; drop the host from the list if it fails the verification.

Through the above process, the cloud provider can build a trusted computing pool and provide a premium service.

If the returned host is an encrypted blob, the service has no way to verify the host's trustworthiness before launching instances.

Suggestion?

Thanks,
-Fred

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

Hi Fred,

Interesting problem. As I mentioned in the document, the intention is not to let the inner-workings of a Zone leak outside. I'm assuming the Attestation Server is running inside the Zone so it can have access to the Hosts? And by "running inside the Zone" I mean there are no firewall/network constraints to keep it from accessing any host within the Zone.

If I understand the flow correctly:
1. A TXT-enabled instance is requested. The InstanceID or ReservationID is returned to the caller.
2. At the same time a notice would be sent to the Attestation Server (from within Nova) to tell it about the new instance (and the Host it resides on)?
3. From there, the Attestation Server could query the TPM chip to confirm the host.

If my guess about step 2 is correct this wouldn't be a problem. If, however, the Attestation Server gets notified by the Caller (and not from Nova directly), we would have a problem, because we can't leak Host information outside of a Zone.

But all is not lost.

We could easily provide a means for someone outside the Zone to request internal details about the instance. This would be returned as an encrypted message again (likely using a separate key from the Scheduler encryption). This message could be sent to the Attestation Server where it would be decrypted and the Host probed. (this implies the key is shared with the Attestation Server)

Am I close to understanding the problem?

Cheers,
S

Revision history for this message
Brian Lamar (blamar) wrote :

I feel it's pretty verbose and needs more pictures, but for the most part it makes sense and helped me understand some of the aspects of zones.

review: Approve
Revision history for this message
fred yang (fred-yang) wrote :

Sandy,

We would need to perform the step #2 attestation earlier, before setting up the weighing within each zone manager, to filter out non-trusted nodes.

After your clarification, I believe the best place to verify would be in filter_hosts.

Thanks,
-Fred

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

Brian, agree 100% ... we have it earmarked for Diablo-2 to get some illustrations in there. Thanks for looking at it.

Fred, ah! I would simply derive from an existing scheduler and override the filter_hosts() or weigh_hosts() methods ... do your magic and then call the base class (or vice versa). Should work fine.
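
Something along these lines would do it (just a rough, untested sketch; `attestation` stands in for a hypothetical wrapper around your Attestation Server API, and the filter_hosts() signature follows the devref description):

    from nova.scheduler import host_filter

    import attestation  # hypothetical client for the Attestation Server


    class TrustedHostsFilter(host_filter.AllHostsFilter):
        """Drop any host the Attestation Server won't vouch for."""

        def filter_hosts(self, zone_manager, query):
            # Start from the unfiltered (hostname, capabilities) tuples,
            # then keep only the hosts that attest as trusted.
            hosts = super(TrustedHostsFilter, self).filter_hosts(
                zone_manager, query)
            return [(hostname, caps) for hostname, caps in hosts
                    if attestation.is_trusted(hostname)]

Then point --default_host_filter at your new class, as described in the Host Filter section of the doc.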

Otherwise let me know if you need any special hooks in there or how we can help.

Cheers,
S

Revision history for this message
Brian Waldon (bcwaldon) wrote :

This is great, Sandy. I learned quite a bit reading over it. I did find one typo:

47: than -> than

review: Needs Fixing
Revision history for this message
Brian Lamar (blamar) wrote :

Also:

32: This -> This

Revision history for this message
Brian Waldon (bcwaldon) wrote :

> Also:
>
> 32: This -> This

You, Brian II, are a jerk.

47: than -> that

Revision history for this message
fred yang (fred-yang) wrote :

Hi Sandy,

Two more questions -

1. From a zone manager perspective, it can receive flavor requests in different formats (JSON or rigid), so all three types of filter drivers would be dynamically invoked to parse requests, correct?
2. Will zone_aware_scheduler handle different schedulers being used in a child_zone, or does a deployment assume that parent & child zones all use zone_aware_scheduler?

Thanks
-Fred

review: Needs Information
Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

Brian^2 ... heh. fixed

Fred,

1. correct, the idea is that the caller passes in the Host Filter type and that's the one applied. Two different calls may use two different filters.

That said, the OS API doesn't support this parameter (yet, we need to make an extension for it), which is why each Filter has the instance_type_to_filter() method ... for backwards compatibility.

2. Each Zone can specify its own Scheduler. Only the ZoneAwareScheduler derivations understand things like HostFilters, etc., but theoretically it could be anything. The biggest concern is making sure the weights returned are comparable to the weights returned from other schedulers.

One feature we have slated for the next sprint is a child zone weight adjustment:

cooked_weight = offset + (scale * raw_weight)

offset = 0.0 by default
scale = 1.0 by default

Where offset and scale may be specified when you add a child zone via 'nova zone-add'. That way you'll be able to put more or less emphasis on the weights returned from a child zone. I think you could use this to make different Scheduler weights comparable.
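
In code, the merge step would look something like this (just a sketch; the attribute names are placeholders until the feature lands):

    def cook_child_weights(child_zone, weighted_hosts):
        # offset/scale default to a no-op; they'd be set via 'nova zone-add'.
        offset = getattr(child_zone, 'weight_offset', 0.0)
        scale = getattr(child_zone, 'weight_scale', 1.0)
        for entry in weighted_hosts:  # each entry is a weight dict from the child
            entry['weight'] = offset + (scale * entry['weight'])
        return weighted_hosts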

Hope it helps!

Revision history for this message
Brian Waldon (bcwaldon) wrote :

Great work, Sandy.

review: Approve
Revision history for this message
fred yang (fred-yang) wrote :

great!

review: Approve

Preview Diff

1=== added file 'doc/source/devref/distributed_scheduler.rst'
2--- doc/source/devref/distributed_scheduler.rst 1970-01-01 00:00:00 +0000
3+++ doc/source/devref/distributed_scheduler.rst 2011-06-03 12:26:51 +0000
4@@ -0,0 +1,168 @@
5+..
6+ Copyright 2011 OpenStack LLC
7+ All Rights Reserved.
8+
9+ Licensed under the Apache License, Version 2.0 (the "License"); you may
10+ not use this file except in compliance with the License. You may obtain
11+ a copy of the License at
12+
13+ http://www.apache.org/licenses/LICENSE-2.0
14+
15+ Unless required by applicable law or agreed to in writing, software
16+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
17+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
18+ License for the specific language governing permissions and limitations
19+ under the License.
20+
21+Distributed Scheduler
22+=====================
23+
24+The Scheduler is akin to a Dating Service. Requests for the creation of new instances come in and the most applicable Compute nodes are selected from a large pool of potential candidates. In a small deployment we may be happy with the currently available Chance Scheduler, which randomly selects a Host from the available pool. Or, if you need something a little fancier, you may want to use the Availability Zone Scheduler, which selects Compute hosts from a logical partitioning of available hosts (within a single Zone).
25+
26+But for larger deployments a more complex scheduling algorithm is required. Additionally, if you are using Zones in your Nova setup, you'll need a scheduler that understands how to pass instance requests from Zone to Zone.
27+
28+This is the purpose of the Distributed Scheduler (DS). The DS utilizes the Capabilities of a Zone and its component services to make informed decisions on where a new instance should be created. When making this decision it consults not only all the Compute nodes in the current Zone, but the Compute nodes in each Child Zone. This continues recursively until the ideal host is found.
29+
30+So, how does this all work?
31+
32+This document will explain the strategy employed by the `ZoneAwareScheduler` and its derivations. You should read the Zones documentation before reading this.
33+
34+Costs & Weights
35+---------------
36+When deciding where to place an Instance, we compare a Weighted Cost for each Host. The Weighting, currently, is just the sum of each Cost. Costs are nothing more than integers from `0 - max_int`. Costs are computed by looking at the various Capabilities of the Host relative to the specs of the Instance being asked for. Trying to put a plain vanilla instance on a high-performance host should have a very high cost. But putting a vanilla instance on a vanilla Host should have a low cost.
37+
38+Some Costs are more esoteric. Consider a rule that says we should prefer Hosts that don't already have an instance on them owned by the user requesting it (to mitigate against machine failures). Here we have to look at all the other Instances on the host to compute our cost.
39+
40+An example of some other costs might include selecting:
41+* a GPU-based host over a standard CPU
42+* a host with fast ethernet over a 10mbps line
43+* a host that can run Windows instances
44+* a host in the EU vs North America
45+* etc
46+
47+This Weight is computed for each Instance requested. If the customer asked for 1000 instances, the consumed resources on each Host are "virtually" depleted so the Cost can change accordingly.
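
As a rough sketch (illustrative only; the function names here are hypothetical, not the module layout in this branch), the Weighting step boils down to summing per-host cost functions::

    def weigh_host(cost_functions, host_caps, request_spec):
        """The Weight of a Host is simply the sum of its Costs."""
        return sum(fn(host_caps, request_spec) for fn in cost_functions)

    def weigh_hosts(cost_functions, hosts, request_spec):
        # 'hosts' is a list of (hostname, capabilities) tuples; lower weight wins.
        return [{'hostname': name,
                 'weight': weigh_host(cost_functions, caps, request_spec)}
                for name, caps in hosts]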
48+
49+nova.scheduler.zone_aware_scheduler.ZoneAwareScheduler
50+-------------------------------------------------------
51+As we explained in the Zones documentation, each Scheduler has a `ZoneManager` object that collects "Capabilities" about child Zones and each of the services running in the current Zone. The `ZoneAwareScheduler` uses this information to make its decisions.
52+
53+Here is how it works:
54+
55+1. The compute nodes are filtered and the nodes remaining are weighed.
56+1a. Filtering the hosts is a simple matter of ensuring the compute node has ample resources (CPU, RAM, Disk, etc) to fulfil the request.
57+1b. Weighing of the remaining compute nodes assigns a number based on their suitability for the request.
58+2. The same request is sent to each child Zone and step #1 is done there too. The resulting weighted list is returned to the parent.
59+3. The parent Zone sorts and aggregates all the weights and a final build plan is constructed.
60+4. The build plan is executed upon. Concurrently, instance create requests are sent to each of the selected hosts, be they local or in a child zone. Child Zones may forward the requests to their child Zones as needed.
61+
62+`ZoneAwareScheduler` by itself is not capable of handling all the provisioning. Derived classes are used to select which host filtering and weighing strategy will be used.
63+
64+Filtering and Weighing
65+----------------------
66+The filtering (excluding compute nodes incapable of fulfilling the request) and weighing (computing the relative "fitness" of a compute node to fulfill the request) rules used are very subjective operations ... Service Providers will probably have a very different set of filtering and weighing rules than private cloud administrators. The filtering and weighing aspects of the `ZoneAwareScheduler` are flexible and extensible.
67+
68+Requesting a new instance
69+-------------------------
70+Prior to the `ZoneAwareScheduler`, to request a new instance, a call was made to `nova.compute.api.create()`. The type of instance created depended on the value of the `InstanceType` record being passed in. The `InstanceType` determined the amount of disk, CPU, RAM and network required for the instance. Administrators can add new `InstanceType` records to suit their needs. For more complicated instance requests we need to go beyond the default fields in the `InstanceType` table.
71+
72+`nova.compute.api.create()` performed the following actions:
73+1. it validated all the fields passed into it.
74+2. it created an entry in the `Instance` table for each instance requested
75+3. it put one `run_instance` message in the scheduler queue for each instance requested
76+4. the schedulers picked off the messages and decided which compute node should handle the request.
77+5. the `run_instance` message was forwarded to the compute node for processing and the instance was created.
78+6. it returned a list of dicts representing each of the `Instance` records (even if the instance has not been activated yet). At least the `instance_id`s are valid.
79+
80+Generally, the standard schedulers (like `ChanceScheduler` and `AvailabilityZoneScheduler`) only operate in the current Zone. They have no concept of child Zones.
81+
82+The problem with this approach is that the requests are scattered amongst the schedulers. If we are asking for 1000 instances, each scheduler gets the requests one at a time. There is no possibility of optimizing the requests to take all 1000 instances into account as a group. We call this Single-Shot vs. All-at-Once.
83+
84+For the `ZoneAwareScheduler` we need to use the All-at-Once approach. We need to consider all the hosts across all the Zones before deciding where they should reside. In order to handle this we have a new method `nova.compute.api.create_all_at_once()`. This method does things a little differently:
85+1. it validates all the fields passed into it.
86+2. it creates a single `reservation_id` for all of the instances created. This is a UUID.
87+3. it creates a single `run_instance` request in the scheduler queue
88+4. a scheduler picks the message off the queue and works on it.
89+5. the scheduler sends off an OS API `POST /zones/select` command to each child Zone. The `BODY` payload of the call contains the `request_spec`.
90+6. the child Zones use the `request_spec` to compute a weighted list for each instance requested. No attempt to actually create an instance is done at this point. We're only estimating the suitability of the Zones.
91+7. if the child Zone has its own child Zones, the `/zones/select` call will be sent down to them as well.
92+8. Finally, when all the estimates have bubbled back to the Zone that initiated the call, all the results are merged, sorted and processed.
93+9. Now the instances can be created. The initiating Zone either forwards the `run_instance` message to the local Compute node to do the work, or it issues a `POST /servers` call to the relevant child Zone. The parameters to the child Zone call are the same as what was passed in by the user.
94+10. The `reservation_id` is passed back to the caller. Later we explain how the user can check on the status of the command with this `reservation_id`.
95+
96+The Catch
97+-------------
98+This all seems pretty straightforward but, like most things, there's a catch. Zones are expected to operate in complete isolation from each other. Each Zone has its own AMQP service, database and set of Nova services. But, for security reasons, Zones should never leak information about their internal architectural layout. That means a Zone cannot leak information about hostnames or service IP addresses outside of its world.
99+
100+When `POST /zones/select` is called to estimate which compute node to use, time passes until the `POST /servers` call is issued. If we only passed the weight back from the `select` we would have to re-compute the appropriate compute node for the create command ... and we could end up with a different host. Somehow we need to remember the results of our computations and pass them outside of the Zone. Now, we could store this information in the local database and return a reference to it, but remember that the vast majority of weights are going to be ignored. Storing them in the database would result in a flood of disk access, and then we would have to clean up all these entries periodically. Recall that there are going to be many, many `select` calls issued to child Zones asking for estimates.
101+
102+Instead, we take a rather innovative approach to the problem. We encrypt all the child zone internal details and pass them back to the parent Zone. If the parent Zone decides to use a child Zone for the instance, it simply passes the encrypted data back to the child during the `POST /servers` call as an extra parameter. The child Zone can then decrypt the hint and go directly to the Compute node previously selected. If the estimate isn't used, it is simply discarded by the parent. It's for this reason that it is so important that each Zone defines a unique encryption key via `--build_plan_encryption_key`.
103+
104+In the case of nested child Zones, each Zone re-encrypts the weighted list results and passes those values to the parent.
105+
106+Throughout the `nova.api.openstack.servers`, `nova.api.openstack.zones`, `nova.compute.api.create*` and `nova.scheduler.zone_aware_scheduler` code you'll see references to `blob` and `child_blob`. These are the encrypted hints about which Compute node to use.
107+
108+Reservation IDs
109+---------------
110+
111+NOTE: The features described in this section are related to the up-coming 'merge-4' branch.
112+
113+The OpenStack API allows a user to list all the instances they own via the `GET /servers/` command or the details on a particular instance via `GET /servers/###`. This mechanism is usually sufficient since OS API only allows for creating one instance at a time, unlike the EC2 API which allows you to specify a quantity of instances to be created.
114+
115+NOTE: currently the `GET /servers` command is not Zone-aware since all operations done in child Zones are done via a single administrative account. Therefore, asking a child Zone to `GET /servers` would return all the active instances ... and that would not be what the user intended. Later, when the Keystone Auth system is integrated with Nova, this functionality will be enabled.
116+
117+We could use the OS API 1.1 Extensions mechanism to accept a `num_instances` parameter, but this would result in a different return type. Instead of getting back an `Instance` record, we would be getting back a `reservation_id`. So, instead, we've implemented a new `POST /zones/boot` command which is nearly identical to `POST /servers` except that it takes a `num_instances` parameter and returns a `reservation_id`. Perhaps in OS API 2.x we can unify these approaches.
118+
119+Finally, we need to give the user a way to get information on each of the instances created under this `reservation_id`. Fortunately, this is still possible with the existing `GET /servers` command, so long as we add a new optional `reservation_id` parameter.
120+
121+`python-novaclient` will be extended to support both of these changes.
122+
123+Host Filter
124+--------------
125+
126+As we mentioned earlier, filtering hosts is a very deployment-specific process. Service Providers may have a different set of criteria for filtering Compute nodes than a University. To facilitate this, the `nova.scheduler.host_filter` module supports a variety of filtering strategies as well as an easy means for plugging in your own algorithms.
127+
128+The filter used is determined by the `--default_host_filter` flag, which points to a Python Class. By default this flag is set to `nova.scheduler.host_filter.AllHostsFilter` which simply returns all available hosts. But there are others:
129+
130+ * `nova.scheduler.host_filter.InstanceTypeFilter` provides host filtering based on the memory and disk size specified in the `InstanceType` record passed into `run_instance`.
131+
132+ * `nova.scheduler.host_filter.JSONFilter` filters hosts based on a simple JSON expression grammar. Using a LISP-like JSON structure, the caller can request instances based on criteria well beyond what `InstanceType` specifies. See `nova.tests.test_host_filter` for examples.
133+
134+To create your own `HostFilter` you simply derive from `nova.scheduler.host_filter.HostFilter` and implement two methods: `instance_type_to_filter` and `filter_hosts`. Since Nova is currently dependent on the `InstanceType` structure, the `instance_type_to_filter` method should take an `InstanceType` and turn it into an internal data structure usable by your filter. This is for backward compatibility with existing OpenStack and EC2 API calls. If you decide to create your own call for creating instances not based on `Flavors` or `InstanceTypes` you can ignore this method. The real work is done in `filter_hosts`, which must return a list of host tuples, one for each appropriate host. The set of all available hosts is in the `ZoneManager` object passed into the call, along with the filter query. The host tuple contains (`<hostname>`, `<additional data>`) where `<additional data>` is whatever you want it to be.
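
A minimal sketch of such a filter (assuming the two-method interface described above; the exact signatures and the shape of the `ZoneManager` capability data may differ in the code)::

    from nova.scheduler import host_filter

    class BigRamFilter(host_filter.HostFilter):
        """Only pass hosts reporting at least the requested amount of RAM."""

        def instance_type_to_filter(self, instance_type):
            # Reduce the InstanceType record to the single value we care about.
            return instance_type['memory_mb']

        def filter_hosts(self, zone_manager, query):
            required_ram = query
            hosts = []
            # The capability layout used here is assumed, for illustration only.
            for hostname, capabilities in zone_manager.service_states.items():
                free_ram = capabilities.get('compute', {}).get('host_memory_free', 0)
                if free_ram >= required_ram:
                    hosts.append((hostname, capabilities))
            return hosts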
135+
136+Cost Scheduler Weighing
137+-----------------------
138+Every `ZoneAwareScheduler` derivation must also override the `weigh_hosts` method. This takes the list of filtered hosts (generated by the `filter_hosts` method) and returns a list of weight dicts. The weight dicts must contain two keys: `weight` and `hostname`, where `weight` is simply an integer (lower is better) and `hostname` is the name of the host. The list does not need to be sorted; this will be done by the `ZoneAwareScheduler` base class when all the results have been assembled.
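
For example, a derivation that prefers the tightest RAM fit might look like this (a sketch only; the argument names and the `request_spec`/capability keys are assumptions)::

    def weigh_hosts(self, request_spec, hosts):
        """Lower weight is better: prefer the host that leaves the least RAM unused."""
        requested_ram = request_spec['instance_type']['memory_mb']
        # Filtering already guaranteed free >= requested, so the smallest
        # leftover (tightest packing) gets the lowest weight.
        return [{'hostname': hostname,
                 'weight': capabilities.get('host_memory_free', 0) - requested_ram}
                for hostname, capabilities in hosts]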
139+
140+Simple Zone Aware Scheduling
141+----------------------------
142+The easiest way to get started with the `ZoneAwareScheduler` is to use the `nova.scheduler.host_filter.HostFilterScheduler`. This scheduler uses the default Host Filter, and its `weigh_hosts` method simply returns a weight of 1 for all hosts. But, from this, you can see calls being routed from Zone to Zone and follow the flow of things.
143+
144+The `--scheduler_driver` flag is how you specify the scheduler class name.
145+
146+Flags
147+--------------
148+
149+All this Zone and Distributed Scheduler stuff can seem a little daunting to configure, but it's actually not too bad. Here are some of the main flags you should set in your `nova.conf` file:
150+
151+::
152+ --allow_admin_api=true
153+ --enable_zone_routing=true
154+ --zone_name=zone1
155+ --build_plan_encryption_key=c286696d887c9aa0611bbb3e2025a45b
156+ --scheduler_driver=nova.scheduler.host_filter.HostFilterScheduler
157+ --default_host_filter=nova.scheduler.host_filter.AllHostsFilter
158+
159+* `--allow_admin_api` must be set for the OS API to enable the new `/zones/*` commands.
160+* `--enable_zone_routing` must be set for OS API commands such as `create()`, `pause()` and `delete()` to get routed from Zone to Zone when looking for instances.
161+* `--zone_name` is only required in child Zones. The default Zone name is `nova`, but you may want to name your child Zones something useful. Duplicate Zone names are not an issue.
162+* `--build_plan_encryption_key` is the encryption key used to encrypt/decrypt the Host information when it leaves a Zone. Be sure to change this key for each Zone you create. Do not duplicate keys.
163+* `--scheduler_driver` is the real workhorse of the operation. For the Distributed Scheduler, you need to specify a class derived from `nova.scheduler.zone_aware_scheduler.ZoneAwareScheduler`.
164+* `--default_host_filter` is the host filter to be used for filtering candidate Compute nodes.
165+
166+Some optional flags which are handy for debugging are:
167+
168+::
169+ --connection_type=fake
170+ --verbose
171+
172+Using the `Fake` virtualization driver is handy when you're setting this stuff up so you're not dealing with a million possible issues at once. When things seem to be working correctly, switch back to whatever hypervisor your deployment uses.