OpenStack Compute (nova)

Merge lp:~nttdata/nova/live-migration into lp:~hudson-openstack/nova/trunk

live-migration
Merge into trunk

Proposed by Kei Masumoto on 2010-12-31

Status:

Merged

Approved by:

Eric Day on 2011-01-18

Approved revision:

466

Merged at revision:

573

Proposed branch:

lp:~nttdata/nova/live-migration

Merge into:

lp:~hudson-openstack/nova/trunk

Diff against target:

1380 lines (+956/-17)

19 files modified

.mailmap (+2/-0)
Authors (+2/-0)
bin/nova-manage (+81/-1)
nova/api/ec2/cloud.py (+1/-1)
nova/compute/manager.py (+117/-1)
nova/db/api.py (+30/-0)
nova/db/sqlalchemy/api.py (+64/-0)
nova/db/sqlalchemy/models.py (+24/-2)
nova/network/manager.py (+8/-6)
nova/scheduler/driver.py (+183/-0)
nova/scheduler/manager.py (+48/-0)
nova/service.py (+4/-0)
nova/virt/cpuinfo.xml.template (+9/-0)
nova/virt/fake.py (+32/-0)
nova/virt/libvirt_conn.py (+287/-0)
nova/virt/xenapi_conn.py (+30/-0)
nova/volume/driver.py (+25/-5)
nova/volume/manager.py (+8/-1)
setup.py (+1/-0)

To merge this branch:

bzr merge lp:~nttdata/nova/live-migration

High

Fix Released

Link a bug report

Reviewer	Review Type	Date Requested	Status
Soren Hansen (community)		2010-12-31	Approve on 2011-01-18
Masanori Itoh (community)		2011-01-14	Approve on 2011-01-18
Vish Ishaya (community)		2011-01-13	Approve on 2011-01-18
Thierry Carrez (community)	ffe	2011-01-13	Approve on 2011-01-17
Review via email: mp+44940@code.launchpad.net

Commit message

Risk of Regression: This patch don’t modify existing functionlities, but I have added some.
1. nova.db.service.sqlalchemy.model.Serivce (adding a column to database)
2. nova.service ( nova-compute needes to insert information defined by 1 above)

So, db migration is necessary for existing user, but just adding columns.

Description of the change

Adding live migration features.
Please refer detail design at:
<http://wiki.openstack.org/LiveMigration?action=AttachFile&do=view&target=bexar-migration-live-update1.pdf>

Also, usage is described at:
<http://wiki.openstack.org/UsageOfLiveMigration>

Changes has been done at:
nova-manage:
    this feature can be used only from nova-manage
nova/scheduler/driver.py
nova/scheduler/manager.py
    pre-checking at schedule_live_migration().
nova/compute/manager.py and nova/virt/libvirt_conn.py
    executing live_migration
nova/db/sqlalchemy/*
    we added Host table because live_migration needs to check which host has
    enough resource, then we have to record total resource that physical servers has.

Revision history for this message

Soren Hansen (soren) wrote on 2011-01-10:

Download full text (46.4 KiB)

Hello.

Please find my comments inline.

2010/12/31 Kei Masumoto <email address hidden>:
> === modified file 'bin/nova-manage'
> --- bin/nova-manage 2010-12-16 22:52:08 +0000
> +++ bin/nova-manage 2010-12-31 04:08:57 +0000
> @@ -79,7 +79,10 @@
> from nova import quota
> from nova import utils
> from nova.auth import manager
> +from nova import rpc
> from nova.cloudpipe import pipelib
> +from nova.api.ec2 import cloud
> +
>
>
> FLAGS = flags.FLAGS
> @@ -452,6 +455,86 @@
> int(network_size), int(vlan_start),
> int(vpn_start))
>
> +
> +class InstanceCommands(object):
> + """Class for mangaging VM instances."""
> +
> + def live_migration(self, ec2_id, dest):
> + """live_migration"""
> +
> + logging.basicConfig()

Todd's newlog branch landed very recently. This changed how we do
logging. Can you update your branch accordingly? Thanks.

> + ctxt = context.get_admin_context()
> +
> + try:
> + internal_id = cloud.ec2_id_to_internal_id(ec2_id)
> + instance_ref = db.instance_get_by_internal_id(ctxt, internal_id)
> + instance_id = instance_ref['id']

There's no longer any difference between ec2_id and internal id. This
should simplify this bit of your patch somewhat.

> + except exception.NotFound as e:
> + msg = _('instance(%s) is not found')
> + e.args += (msg % ec2_id,)
> + raise e

I don't think it's a good idea to add elements to existing Exception
instances' args attribute this way. I'd prefer if you either just raised
a new NotFound exception or simply printed "No such instance: %s" % id
or something like that.

> + ret = rpc.call(ctxt,
> + FLAGS.scheduler_topic,
> + {"method": "live_migration",
> + "args": {"instance_id": instance_id,
> + "dest": dest,
> + "topic": FLAGS.compute_topic}})

I don't understand why you pass the compute_topic in the rpc call rather
than letting the scheduler worry about that?

> + if None != ret:
> + raise ret

"if ret:" is better.

You can (or at least should) only raise Exceptions. rpc.call never
*returns* an Exception. It may *raise* one, but will never return one.

> +
> + print 'Finished all procedure. Check migrating finishes successfully'
> + print 'check status by using euca-describe-instances.'

Perhaps something like this instead:
"Migration of %s initiated. Check its progress using euca-describe-instances."

Hello.

Please find my comments inline.

2010/12/31 Kei Masumoto <masumotok@nttdata.co.jp>:
> === modified file 'bin/nova-manage'
> --- bin/nova-manage     2010-12-16 22:52:08 +0000
> +++ bin/nova-manage     2010-12-31 04:08:57 +0000
> @@ -79,7 +79,10 @@
>  from nova import quota
>  from nova import utils
>  from nova.auth import manager
> +from nova import rpc
>  from nova.cloudpipe import pipelib
> +from nova.api.ec2 import cloud
> +
>
>
>  FLAGS = flags.FLAGS
> @@ -452,6 +455,86 @@
>                                     int(network_size), int(vlan_start),
>                                     int(vpn_start))
>
> +
> +class InstanceCommands(object):
> +    """Class for mangaging VM instances."""
> +
> +    def live_migration(self, ec2_id, dest):
> +        """live_migration"""
> +
> +        logging.basicConfig()

Todd's newlog branch landed very recently. This changed how we do
logging. Can you update your branch accordingly? Thanks.

> +        ctxt = context.get_admin_context()
> +
> +        try:
> +            internal_id = cloud.ec2_id_to_internal_id(ec2_id)
> +            instance_ref = db.instance_get_by_internal_id(ctxt, internal_id)
> +            instance_id = instance_ref['id']

There's no longer any difference between ec2_id and internal id. This
should simplify this bit of your patch somewhat.

> +        except exception.NotFound as e:
> +            msg = _('instance(%s) is not found')
> +            e.args += (msg % ec2_id,)
> +            raise e

> +        ret = rpc.call(ctxt,
> +                       FLAGS.scheduler_topic,
> +                       {"method": "live_migration",
> +                        "args": {"instance_id": instance_id,
> +                                "dest": dest,
> +                                "topic": FLAGS.compute_topic}})

I don't understand why you pass the compute_topic in the rpc call rather
than letting the scheduler worry about that?

> +        if None != ret:
> +            raise ret

"if ret:" is better.

You can (or at least should) only raise Exceptions. rpc.call never
*returns* an Exception. It may *raise* one, but will never return one.

> +
> +        print 'Finished all procedure. Check migrating finishes successfully'
> +        print 'check status by using euca-describe-instances.'

Perhaps something like this instead:
"Migration of %s initiated. Check its progress using euca-describe-instances."

> +class HostCommands(object):
> +    """Class for mangaging host(physical nodes)."""
> +
> +
> +    def list(self):
> +        """describe host list."""
> +
> +        # To supress msg: No handlers could be found for logger "amqplib"
> +        logging.basicConfig()
> +
> +        host_refs = db.host_get_all(context.get_admin_context())
> +        for host_ref in host_refs:
> +            print host_ref['name']
> +
> +
> +    def show(self, host):
> +        """describe cpu/memory/hdd info for host."""
> +
> +        # To supress msg: No handlers could be found for logger "amqplib"
> +        logging.basicConfig()
> +
> +        result = rpc.call(context.get_admin_context(),
> +                         FLAGS.scheduler_topic,
> +                         {"method": "show_host_resource",
> +                          "args": {"host": host}})
> +
> +        # Checking result msg format is necessary, that will have done
> +        # when this feture is included in API.
> +        if dict != type(result):

Can you make this "if type(result) == dict" instead? Everywhere else in
the code, that's how we order comparisons (i.e. "variable == constant"
or "variable != constant" etc.). You do the same a bunch of places in
this patch.

> +            print 'Unexpected error occurs'
> +        elif not result['ret']:
> +            print '%s' % result['msg']
> +        else:
> +            cpu = result['phy_resource']['vcpus']
> +            mem = result['phy_resource']['memory_mb']
> +            hdd = result['phy_resource']['local_gb']
> +
> +            print 'HOST\t\tPROJECT\t\tcpu\tmem(mb)\tdisk(gb)'
> +            print '%s\t\t\t%s\t%s\t%s' % (host, cpu, mem, hdd)
> +            for p_id, val in result['usage'].items():
> +                print '%s\t%s\t\t%s\t%s\t%s' % (host,
> +                                             p_id,
> +                                             val['vcpus'],
> +                                             val['memory_mb'],
> +                                             val['local_gb'])
> +
> +
>  CATEGORIES = [
>     ('user', UserCommands),
>     ('project', ProjectCommands),
> @@ -459,8 +542,9 @@
>     ('shell', ShellCommands),
>     ('vpn', VpnCommands),
>     ('floating', FloatingIpCommands),
> -    ('network', NetworkCommands)]
> -
> +    ('network', NetworkCommands),
> +    ('instance', InstanceCommands),
> +    ('host', HostCommands)]
>
>  def lazy_match(name, key_value_tuples):
>     """Finds all objects that have a key that case insensitively contains
>
> === modified file 'nova/api/ec2/cloud.py'
> --- nova/api/ec2/cloud.py       2010-12-22 21:38:44 +0000
> +++ nova/api/ec2/cloud.py       2010-12-31 04:08:57 +0000
> @@ -679,13 +679,22 @@
>             ec2_id = None
>             if (floating_ip_ref['fixed_ip']
>                 and floating_ip_ref['fixed_ip']['instance']):
> -                internal_id = floating_ip_ref['fixed_ip']['instance']['ec2_id']
> +                # modified by masumotok
> +                internal_id = \
> +                    floating_ip_ref['fixed_ip']['instance']['internal_id']
>                 ec2_id = internal_id_to_ec2_id(internal_id)
>             address_rv = {'public_ip': address,
>                           'instance_id': ec2_id}
>             if context.user.is_admin():
> +                # modified by masumotok- b/c proj_id is never inserted
> +                #details = "%s (%s)" % (address_rv['instance_id'],
> +                #                       floating_ip_ref['project_id'])
> +                if None != address_rv['instance_id']:
> +                    status = 'reserved'
> +                else:
> +                    status = None
>                 details = "%s (%s)" % (address_rv['instance_id'],
> -                                       floating_ip_ref['project_id'])
> +                                       status)
>                 address_rv['instance_id'] = details
>             addresses.append(address_rv)
>         return {'addressesSet': addresses}
>

Again, we've consolidated the instance ID's into a single one, so the
ec2_id no longer exists.

> === modified file 'nova/compute/api.py'
> --- nova/compute/api.py 2010-12-30 22:27:31 +0000
> +++ nova/compute/api.py 2010-12-31 04:08:57 +0000
> @@ -233,6 +233,7 @@
>                              terminated_at=datetime.datetime.utcnow())
>
>         host = instance['host']
> +        logging.error('terminate %s %s %s %s',context, FLAGS.compute_topic, host, self.db.queue_get_for(context, FLAGS.compute_topic, host))

I don't think including all those things in an error message really is
helpful?

>         if host:
>             rpc.cast(context,
>                      self.db.queue_get_for(context, FLAGS.compute_topic, host),
>
> === modified file 'nova/compute/manager.py'
> --- nova/compute/manager.py     2010-12-30 22:06:48 +0000
> +++ nova/compute/manager.py     2010-12-31 04:08:57 +0000
> @@ -36,6 +36,11 @@
>
>  import datetime
>  import logging
> +import sys
> +import traceback
> +import os
> +import time
> +
>
>  from nova import exception
>  from nova import flags
> @@ -43,12 +48,16 @@
>  from nova import rpc
>  from nova import utils
>  from nova.compute import power_state
> +from nova import rpc

Duplicate import.

> +from nova import db

Move further up, so the imports are alphabetical.

>
>  FLAGS = flags.FLAGS
>  flags.DEFINE_string('instances_path', '$state_path/instances',
>                     'where instances are stored on disk')
>  flags.DEFINE_string('compute_driver', 'nova.virt.connection.get_connection',
>                     'Driver to use for controlling virtualization')
> +flags.DEFINE_string('live_migration_timeout', 10,
> +                    'Timeout value for pre_live_migration is completed.')

What's the unit? Seconds? Minutes?

>  flags.DEFINE_string('stub_network', False,
>                     'Stub network related code')
>
> @@ -111,10 +120,9 @@
>         instance_ref = self.db.instance_get(context, instance_id)
>         if instance_ref['name'] in self.driver.list_instances():
>             raise exception.Error(_("Instance has already been created"))
> -        logging.debug(_("instance %s: starting..."), instance_id)

Why remove this?

>         self.db.instance_update(context,
>                                 instance_id,
> -                                {'host': self.host})
> +                                {'host': self.host, 'launched_on':self.host})

Why pass the same value twice?

>
>         self.db.instance_set_state(context,
>                                    instance_id,
> @@ -415,3 +423,108 @@
>         self.volume_manager.remove_compute_volume(context, volume_id)
>         self.db.volume_detached(context, volume_id)
>         return True
> +
> +    def compareCPU(self, context, xml):
> +        """ Check the host cpu is compatible to a cpu given by xml."""
> +        return self.driver.compareCPU(xml)
> +
> +    def get_memory_mb(self):
> +        """Get the memory size of physical computer ."""
> +        meminfo = open('/proc/meminfo').read().split()
> +        idx = meminfo.index('MemTotal:')
> +        # transforming kb to mb.
> +        return int(meminfo[idx + 1]) / 1024
> +
> +    def get_local_gb(self):
> +        """Get the hdd size of physical computer ."""
> +        hddinfo = os.statvfs(FLAGS.instances_path)
> +        return hddinfo.f_bsize * hddinfo.f_blocks / 1024 / 1024 / 1024

I think get_memory_mb and get_local_gb belong in the virt driver, too.

> +
> +    def pre_live_migration(self, context, instance_id, dest):
> +        """Any preparation for live migration at dst host."""
> +
> +        # Getting volume info ( shlf/slot number )
> +        instance_ref = db.instance_get(context, instance_id)
> +        ec2_id = instance_ref['hostname']
> +
> +        volumes = []
> +        try:
> +            volumes = db.volume_get_by_ec2_id(context, ec2_id)
> +        except exception.NotFound:
> +            logging.info(_('%s has no volume.'), ec2_id)

This should never find any volumes. ec2_id refers to an instance.
volume_get_by_ec2_id looks up volume by the volume id, not an instance
id.

> +
> +        shelf_slots = {}
> +        for vol in volumes:
> +            shelf, slot = db.volume_get_shelf_and_blade(context, vol['id'])
> +            shelf_slots[vol.id] = (shelf, slot)
> +
> +        # Getting fixed ips
> +        fixed_ip = db.instance_get_fixed_address(context, instance_id)
> +        if None == fixed_ip:

"if not fixed_ip:" is better.

> +            exc_type = 'NotFoundError'
> +            val = _('%s(%s) doesnt have fixed_ip') % (instance_id, ec2_id)
> +            tb = ''.join(traceback.format_tb(sys.exc_info()[2]))
> +            raise rpc.RemoteError(exc_type, val, tb)

Why isn't the usual exception handling in the rpc code sufficient here?

> +
> +        # If any volume is mounted, prepare here.
> +        if 0 != len(shelf_slots):
> +            pass

It seems odd to argue over no-op code, but this would look better to me:

for shelf, slot in shelf_slots:
		# Prepare volumes for migration
		pass

> +
> +        #  Creating nova-instance-instance-xxx, this is written to libvirt.xml,
> +        #  and can be seen when executin "virsh nwfiter-list" On destination host,
> +        #  this nwfilter is necessary.
> +        #  In addition this method is creating security rule ingress rule onto
> +        #  destination host.
> +        self.driver.setup_nwfilters_for_instance(instance_ref)

Only libvirt has a setup_nwfilters_for_instance method, and we've
actually added another filtering backend, so this has been generalised
somewhat.

More generally speaking, though, I think this pre-migration code belongs
in the virt driver.

> +
> +        # 5. bridge settings
> +        self.network_manager.setup_compute_network(context, instance_id)
> +        return True
> +
> +    def nwfilter_for_instance_exists(self, context, instance_id):
> +        """Check nova-instance-instance-xxx filter exists """
> +        instance_ref = db.instance_get(context, instance_id)
> +        return self.driver.nwfilter_for_instance_exists(instance_ref)
> +

> +    def live_migration(self, context, instance_id, dest):
> +        """executes live migration."""
> +
> +        # Asking dest host to preparing live migration.
> +        compute_topic = db.queue_get_for(context, FLAGS.compute_topic, dest)
> +        ret = rpc.call(context,
> +                        compute_topic,
> +                        {"method": "pre_live_migration",
> +                         "args": {'instance_id': instance_id,
> +                                    'dest': dest}})
> +
> +        if True != ret:

Just "if ret" is better.

> +            logging.error(_('Pre live migration failed(err at %s)'), dest)
> +            db.instance_set_state(context,
> +                                  instance_id,
> +                                  power_state.RUNNING,
> +                                  'running')
> +            return
> +
> +        # Waiting for setting up nwfilter such as, nova-instance-instance-xxx.
> +        # otherwise, live migration fail.
> +        timeout_count = range(FLAGS.live_migration_timeout * 2)
> +        while 0 != len(timeout_count):
> +            ret = rpc.call(context,
> +                        compute_topic,
> +                        {"method": "nwfilter_for_instance_exists",
> +                         "args": {'instance_id': instance_id}})
> +            if ret:
> +                break
> +
> +            timeout_count.pop()
> +            time.sleep(0.5)
> +
> +        if not ret:
> +            logging.error(_('Timeout for pre_live_migration at %s'), dest)
> +            return
> +
> +        # Executing live migration
> +        # live_migration might raises ProcessExecution error, but
> +        # nothing must be recovered in this version.
> +        instance_ref = db.instance_get(context, instance_id)
> +        self.driver.live_migration(context, instance_ref, dest)
>
> === modified file 'nova/db/api.py'
> --- nova/db/api.py      2010-12-28 17:49:07 +0000
> +++ nova/db/api.py      2010-12-31 04:08:57 +0000
> @@ -234,6 +234,10 @@
>     return IMPL.floating_ip_get_by_address(context, address)
>
>
> +def floating_ip_update(context, address, values):
> +    """update floating ip information."""
> +    return IMPL.floating_ip_update(context, address, values)
> +
>  ####################
>
>
> @@ -378,6 +382,32 @@
>                                             security_group_id)
>
>
> +def instance_get_all_by_host(context, hostname):
> +    """Get instances by host"""
> +    return IMPL.instance_get_all_by_host(context, hostname)
> +
> +
> +def instance_get_vcpu_sum_by_host_and_project(context, hostname, proj_id):
> +    """Get instances.vcpus by host and project"""
> +    return IMPL.instance_get_vcpu_sum_by_host_and_project(context,
> +                                                          hostname,
> +                                                          proj_id)
> +
> +
> +def instance_get_memory_sum_by_host_and_project(context, hostname, proj_id):
> +    """Get amount of memory by host and project """
> +    return IMPL.instance_get_memory_sum_by_host_and_project(context,
> +                                                            hostname,
> +                                                            proj_id)
> +
> +
> +def instance_get_disk_sum_by_host_and_project(context, hostname, proj_id):
> +    """Get total amount of disk by host and project """
> +    return IMPL.instance_get_disk_sum_by_host_and_project(context,
> +                                                          hostname,
> +                                                          proj_id)
> +
> +
>  def instance_action_create(context, values):
>     """Create an instance action from the values dictionary."""
>     return IMPL.instance_action_create(context, values)
> @@ -889,3 +919,36 @@
>
>     """
>     return IMPL.host_get_networks(context, host)
> +
> +
> +###################
> +
> +
> +def host_create(context, value):
> +    """Create a host from the values dictionary."""
> +    return IMPL.host_create(context, value)
> +
> +
> +def host_get(context, host_id):
> +    """Get an host or raise if it does not exist."""
> +    return IMPL.host_get(context, host_id)
> +
> +
> +def host_get_all(context, session=None):
> +    """Get all hosts or raise if it does not exist."""
> +    return IMPL.host_get_all(context)
> +
> +
> +def host_get_by_name(context, host):
> +    """Get an host or raise if it does not exist."""
> +    return IMPL.host_get_by_name(context, host)
> +
> +
> +def host_update(context, host, values):
> +    """Set the given properties on an host and update it."""
> +    return IMPL.host_update(context, host, values)
> +
> +

> +def host_deactivated(context, host):
> +    """set deleted flag to a given host"""
> +    return IMPL.host_deactivated(context, host)

How is this different from the other *_destroy() methods?
>
> === modified file 'nova/db/sqlalchemy/api.py'
> --- nova/db/sqlalchemy/api.py   2010-12-29 20:15:04 +0000
> +++ nova/db/sqlalchemy/api.py   2010-12-31 04:08:57 +0000
> @@ -473,6 +473,16 @@
>     return result
>
>
> +@require_context
> +def floating_ip_update(context, address, values):
> +    session = get_session()
> +    with session.begin():
> +        floating_ip_ref = floating_ip_get_by_address(context, address, session)
> +        for (key, value) in values.iteritems():
> +            floating_ip_ref[key] = value
> +        floating_ip_ref.save(session=session)
> +
> +
>  ###################
>
>
> @@ -832,6 +842,7 @@
>         return instance_ref
>
>
> +@require_context
>  def instance_add_security_group(context, instance_id, security_group_id):
>     """Associate the given security group with the given instance"""
>     session = get_session()
> @@ -845,6 +856,51 @@
>
>
>  @require_context
> +def instance_get_all_by_host(context, hostname):
> +    session = get_session()
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Instance
> +                       ).filter_by(host=hostname
> +                       ).filter_by(deleted=can_read_deleted(context)
> +                       ).all()
> +    if None == result:

"if not result:" is better.

> +        return []
> +    return result
> +
> +
> +@require_context
> +def _instance_get_sum_by_host_and_project(context, column, hostname, proj_id):
> +    session = get_session()
> +
> +    result = session.query(models.Instance
> +                       ).filter_by(host=hostname
> +                       ).filter_by(project_id=proj_id
> +                       ).filter_by(deleted=can_read_deleted(context)
> +                       ).value(column)
> +    if None == result:

"if not result:" is better.

> +        return 0
> +    return result
> +
> +
> +@require_context
> +def instance_get_vcpu_sum_by_host_and_project(context, hostname, proj_id):
> +    return _instance_get_sum_by_host_and_project(context, 'vcpus', hostname,
> +                                                 proj_id)
> +
> +
> +@require_context
> +def instance_get_memory_sum_by_host_and_project(context, hostname, proj_id):
> +    return _instance_get_sum_by_host_and_project(context, 'memory_mb',
> +                                                 hostname, proj_id)
> +
> +
> +@require_context
> +def instance_get_disk_sum_by_host_and_project(context, hostname, proj_id):
> +    return _instance_get_sum_by_host_and_project(context, 'local_gb',
> +                                                 hostname, proj_id)
> +@require_context
>  def instance_action_create(context, values):
>     """Create an instance action from the values dictionary."""
>     action_ref = models.InstanceActions()
> @@ -1875,3 +1931,76 @@
>                        filter_by(deleted=False).\
>                        filter_by(host=host).\
>                        all()
> +
> +
> +###################
> +
> +@require_admin_context
> +def host_create(context, values):
> +    host_ref = models.Host()
> +    for (key, value) in values.iteritems():
> +        host_ref[key] = value
> +    host_ref.save()
> +    return host_ref
> +
> +
> +@require_admin_context
> +def host_get(context, host_id, session=None):
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Host
> +                     ).filter_by(deleted=False
> +                     ).filter_by(id=host_id
> +                     ).first()
> +
> +    if not result:
> +        raise exception.NotFound('No host for id %s' % host_id)
> +
> +    return result
> +
> +
> +@require_admin_context
> +def host_get_all(context, session=None):
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Host
> +                     ).filter_by(deleted=False
> +                     ).all()
> +
> +    if not result:
> +        raise exception.NotFound('No host record found .')
> +
> +    return result
> +
> +
> +@require_admin_context
> +def host_get_by_name(context, host, session=None):
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Host
> +                     ).filter_by(deleted=False
> +                     ).filter_by(name=host
> +                     ).first()
> +
> +    if not result:
> +        raise exception.NotFound('No host for name %s' % host)
> +
> +    return result
> +
> +
> +@require_admin_context
> +def host_update(context, host_id, values):
> +    session = get_session()
> +    with session.begin():
> +        host_ref = host_get(context, host_id, session=session)
> +        for (key, value) in values.iteritems():
> +            host_ref[key] = value
> +        host_ref.save(session=session)
> +
> +
> +@require_admin_context
> +def host_deactivated(context, host):
> +    host_update(context, host, {'deleted': True})
>
> === modified file 'nova/db/sqlalchemy/models.py'
> --- nova/db/sqlalchemy/models.py        2010-12-31 02:04:39 +0000
> +++ nova/db/sqlalchemy/models.py        2010-12-31 04:08:57 +0000
> @@ -138,6 +138,26 @@
>  #    __tablename__ = 'hosts'
>  #    id = Column(String(255), primary_key=True)
>
> +# this class is created by masumotok

No need to claim ownership over individual model classes. :)

> +class Host(BASE, NovaBase):
> +    """Represents a host where services are running"""
> +    __tablename__ = 'hosts'
> +    id = Column(Integer, primary_key=True)
> +    name = Column(String(255))
> +    vcpus = Column(Integer, nullable=False, default=-1)
> +    memory_mb = Column(Integer, nullable=False, default=-1)
> +    local_gb = Column(Integer, nullable=False, default=-1)
> +    hypervisor_type = Column(String(128))
> +    hypervisor_version = Column(Integer, nullable=False, default=-1)
> +    cpu_info = Column(String(1024))
> +    deleted = Column(Boolean, default=False)
> +    # C: when calling service_create()
> +    # D: never deleted. instead of deleting cloumn "deleted" is true
> +    #    when host is down
> +    #    b/c Host.id is foreign key of service, and records
> +    #    of the "service" table are not deleted.
> +    # R: Column "deleted" is true when calling hosts_up() and host is down.
> +
>
>  class Service(BASE, NovaBase):
>     """Represents a running service on a host."""
> @@ -224,6 +244,10 @@
>     display_name = Column(String(255))
>     display_description = Column(String(255))
>
> +    # To remember on which host a instance booted.
> +    # An instance may moved to other host by live migraiton.
> +    launched_on = Column(String(255))
> +
>     # TODO(vish): see Ewan's email about state improvements, probably
>     #             should be in a driver base class or some such
>     # vmstate_state = running, halted, suspended, paused
> @@ -550,7 +574,7 @@
>               Volume, ExportDevice, IscsiTarget, FixedIp, FloatingIp,
>               Network, SecurityGroup, SecurityGroupIngressRule,
>               SecurityGroupInstanceAssociation, AuthToken, User,
> -              Project, Certificate)  # , Image, Host
> +              Project, Certificate, Host)  # , Image
>     engine = create_engine(FLAGS.sql_connection, echo=False)
>     for model in models:
>         model.metadata.create_all(engine)
>
> === modified file 'nova/network/manager.py'
> --- nova/network/manager.py     2010-12-22 21:38:44 +0000
> +++ nova/network/manager.py     2010-12-31 04:08:57 +0000
> @@ -154,7 +154,7 @@
>         """Called when this host becomes the host for a network."""
>         raise NotImplementedError()
>
> -    def setup_compute_network(self, context, instance_id):
> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Sets up matching network for compute hosts."""
>         raise NotImplementedError()
>
> @@ -314,7 +314,7 @@
>         self.db.fixed_ip_update(context, address, {'allocated': False})
>         self.db.fixed_ip_disassociate(context.elevated(), address)
>
> -    def setup_compute_network(self, context, instance_id):
> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Network is created manually."""
>         pass
>
> @@ -381,9 +381,10 @@
>         super(FlatDHCPManager, self).init_host()
>         self.driver.metadata_forward()
>
> -    def setup_compute_network(self, context, instance_id):
> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Sets up matching network for compute hosts."""
> -        network_ref = db.network_get_by_instance(context, instance_id)
> +        if network_ref is None:
> +            network_ref = db.network_get_by_instance(context, instance_id)
>         self.driver.ensure_bridge(network_ref['bridge'],
>                                   FLAGS.flat_interface)
>
> @@ -473,9 +474,11 @@
>         """Returns a fixed ip to the pool."""
>         self.db.fixed_ip_update(context, address, {'allocated': False})
>
> -    def setup_compute_network(self, context, instance_id):
> +    #def setup_compute_network(self, context, instance_id):

Please remove the old code, rather than just commenting it out.

> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Sets up matching network for compute hosts."""
> -        network_ref = db.network_get_by_instance(context, instance_id)
> +        if network_ref is None:
> +            network_ref = db.network_get_by_instance(context, instance_id)
>         self.driver.ensure_vlan_bridge(network_ref['vlan'],
>                                        network_ref['bridge'])
>
>
> === modified file 'nova/scheduler/driver.py'
> --- nova/scheduler/driver.py    2010-12-22 20:59:53 +0000
> +++ nova/scheduler/driver.py    2010-12-31 04:08:57 +0000
> @@ -22,10 +22,14 @@
>  """
>
>  import datetime
> +import logging
>
>  from nova import db
>  from nova import exception
>  from nova import flags
> +from nova import rpc
> +from nova.api.ec2 import cloud
> +from nova.compute import power_state
>
>  FLAGS = flags.FLAGS
>  flags.DEFINE_integer('service_down_time', 60,
> @@ -59,3 +63,136 @@
>     def schedule(self, context, topic, *_args, **_kwargs):
>         """Must override at least this method for scheduler to work."""
>         raise NotImplementedError(_("Must implement a fallback schedule"))
> +
> +    def schedule_live_migration(self, context, instance_id, dest):
> +        """ live migration method """
> +
> +        # Whether instance exists and running
> +        # try-catch clause is necessary because only internal_id is shown
> +        # when NotFound exception occurs. it isnot understandable to admins.
> +        try:
> +            instance_ref = db.instance_get(context, instance_id)
> +            ec2_id = instance_ref['hostname']
> +            internal_id = instance_ref['internal_id']
> +        except exception.NotFound, e:
> +            msg = _('Unexpected error: instance is not found')
> +            e.args += ('\n' + msg, )
> +            raise e

Same comment as above wrt to exception mangling. I now see your comment
above, but I don't understand how it helps to just not show the id at
all?

> +
> +        # Checking instance state.
> +        if power_state.RUNNING != instance_ref['state'] or \
> +           'running' != instance_ref['state_description']:
> +            msg = _('Instance(%s) is not running')
> +            raise exception.Invalid(msg % ec2_id)
> +
> +        # Checking destination host exists
> +        dhost_ref = db.host_get_by_name(context, dest)
> +
> +        # Checking whether The host where instance is running
> +        # and dest is not same.
> +        src = instance_ref['host']
> +        if dest == src:
> +            msg = _('%s is where %s is running now. choose other host.')
> +            raise exception.Invalid(msg % (dest, ec2_id))
> +
> +        # Checking dest is compute node.
> +        services = db.service_get_all_by_topic(context, 'compute')
> +        if dest not in [service.host for service in services]:
> +            msg = _('%s must be compute node')
> +            raise exception.Invalid(msg % dest)
> +
> +        # Checking dest host is alive.
> +        service = [service for service in services if service.host == dest]
> +        service = service[0]
> +        if not self.service_is_up(service):
> +            msg = _('%s is not alive(time synchronize problem?)')
> +            raise exception.Invalid(msg % dest)
> +
> +        # NOTE(masumotok): Below pre-checkings are followed by
> +        # http://wiki.libvirt.org/page/TodoPreMigrationChecks
> +
> +        # Checking hypervisor is same.
> +        orighost = instance_ref['launched_on']
> +        ohost_ref = db.host_get_by_name(context, orighost)
> +
> +        otype = ohost_ref['hypervisor_type']
> +        dtype = dhost_ref['hypervisor_type']
> +        if otype != dtype:
> +            msg = _('Different hypervisor type(%s->%s)')
> +            raise exception.Invalid(msg % (otype, dtype))
> +
> +        # Checkng hypervisor version.
> +        oversion = ohost_ref['hypervisor_version']
> +        dversion = dhost_ref['hypervisor_version']
> +        if oversion > dversion:
> +            msg = _('Older hypervisor version(%s->%s)')
> +            raise exception.Invalid(msg % (oversion, dversion))
> +
> +        # Checking cpuinfo.
> +        cpuinfo = ohost_ref['cpu_info']
> +        if str != type(cpuinfo):
> +            msg = _('Unexpected err: not found cpu_info for %s on DB.hosts')
> +            raise exception.Invalid(msg % orighost)
> +
> +        ret = rpc.call(context,
> +                       db.queue_get_for(context, FLAGS.compute_topic, dest),
> +                       {"method": 'compareCPU',
> +                        "args": {'xml': cpuinfo}})
> +
> +        if int != type(ret):
> +            raise ret

Same comment as above. rpc.call doesn't ever return exceptions, so don't
raise its return value.

> +
> +        if 0 >= ret:
> +            u = 'http://libvirt.org/html/libvirt-libvirt.html'
> +            u += '#virCPUCompareResult'
> +            msg = '%s doesnt have compatibility to %s(where %s launching at)\n'
> +            msg += 'result:%d \n'
> +            msg += 'Refer to %s'
> +            msg = _(msg)
> +            raise exception.Invalid(msg % (dest, src, ec2_id, ret, u))

This is several levels removed from libvirt, so errors should either be
more generic, or generated closer to the virt layer and sent back to
here.

> +
> +        # Checking dst host still has enough capacities.
> +        self.has_enough_resource(context, instance_id, dest)
> +
> +        # Changing instance_state.
> +        db.instance_set_state(context,
> +                              instance_id,
> +                              power_state.PAUSED,
> +                              'migrating')
> +
> +        # Requesting live migration.
> +        return src
> +
> +    def has_enough_resource(self, context, instance_id, dest):
> +        """ Check if destination host has enough resource for live migration"""
> +
> +        # Getting instance information
> +        instance_ref = db.instance_get(context, instance_id)
> +        ec2_id = instance_ref['hostname']
> +        vcpus = instance_ref['vcpus']
> +        mem = instance_ref['memory_mb']
> +        hdd = instance_ref['local_gb']
> +
> +        # Gettin host information
> +        host_ref = db.host_get_by_name(context, dest)
> +        total_cpu = int(host_ref['vcpus'])
> +        total_mem = int(host_ref['memory_mb'])
> +        total_hdd = int(host_ref['local_gb'])
> +
> +        instances_ref = db.instance_get_all_by_host(context, dest)
> +        for i_ref in instances_ref:
> +            total_cpu -= int(i_ref['vcpus'])
> +            total_mem -= int(i_ref['memory_mb'])
> +            total_hdd -= int(i_ref['local_gb'])
> +
> +        # Checking host has enough information
> +        logging.debug('host(%s) remains vcpu:%s mem:%s hdd:%s,' %
> +                      (dest, total_cpu, total_mem, total_hdd))
> +        logging.debug('instance(%s) has vcpu:%s mem:%s hdd:%s,' %
> +                      (ec2_id, vcpus, mem, hdd))
> +
> +        if total_cpu <= vcpus or total_mem <= mem or total_hdd <= hdd:
> +            msg = '%s doesnt have enough resource for %s' % (dest, ec2_id)
> +            raise exception.NotEmpty(msg)
> +
> +        logging.debug(_('%s has enough resource for %s') % (dest, ec2_id))
>
> === modified file 'nova/scheduler/manager.py'
> --- nova/scheduler/manager.py   2010-12-22 20:59:53 +0000
> +++ nova/scheduler/manager.py   2010-12-31 04:08:57 +0000
> @@ -29,6 +29,7 @@
>  from nova import manager
>  from nova import rpc
>  from nova import utils
> +from nova import exception
>
>  FLAGS = flags.FLAGS
>  flags.DEFINE_string('scheduler_driver',
> @@ -66,3 +67,46 @@
>                  {"method": method,
>                   "args": kwargs})
>         logging.debug(_("Casting to %s %s for %s"), topic, host, method)
> +
> +    # NOTE (masumotok) : This method should be moved to nova.api.ec2.admin.
> +    #                    Based on bear design summit discussion,
> +    #                    just put this here for bexar release.
> +    def show_host_resource(self, context, host, *args):
> +        """ show the physical/usage resource given by hosts."""
> +
> +        try:
> +            host_ref = db.host_get_by_name(context, host)
> +        except exception.NotFound:
> +            return {'ret': False, 'msg': 'No such Host'}
> +        except:
> +            raise
> +
> +        # Getting physical resource information
> +        h_resource = {'vcpus': host_ref['vcpus'],
> +                     'memory_mb': host_ref['memory_mb'],
> +                     'local_gb': host_ref['local_gb']}
> +
> +        # Getting usage resource information
> +        u_resource = {}
> +        instances_ref = db.instance_get_all_by_host(context, host_ref['name'])
> +
> +        if 0 == len(instances_ref):
> +            return {'ret': True, 'phy_resource': h_resource, 'usage': {}}
> +
> +        project_ids = [i['project_id'] for i in instances_ref]
> +        project_ids = list(set(project_ids))
> +        for p_id in project_ids:
> +            vcpus = db.instance_get_vcpu_sum_by_host_and_project(context,
> +                                                               host,
> +                                                               p_id)
> +            mem = db.instance_get_memory_sum_by_host_and_project(context,
> +                                                                host,
> +                                                                p_id)
> +            hdd = db.instance_get_disk_sum_by_host_and_project(context,
> +                                                               host,
> +                                                               p_id)
> +            u_resource[p_id] = {'vcpus': vcpus,
> +                                'memory_mb': mem,
> +                                'local_gb': hdd}
> +
> +        return {'ret': True, 'phy_resource': h_resource, 'usage': u_resource}
>
> === modified file 'nova/service.py'
> --- nova/service.py     2010-12-22 20:59:53 +0000
> +++ nova/service.py     2010-12-31 04:08:57 +0000
> @@ -77,6 +77,13 @@
>         self.manager.init_host()
>         self.model_disconnected = False
>         ctxt = context.get_admin_context()
> +
> +        try:
> +            host_ref = db.host_get_by_name(ctxt, self.host)
> +        except exception.NotFound:
> +            host_ref = db.host_create(ctxt, {'name': self.host})
> +        host_ref = self._update_host_ref(ctxt, host_ref)
> +
>         try:
>             service_ref = db.service_get_by_args(ctxt,
>                                                  self.host,
> @@ -117,6 +124,26 @@
>                                          'report_count': 0})
>         self.service_id = service_ref['id']
>
> +    def _update_host_ref(self, context, host_ref):
> +
> +        if 0 <= self.manager_class_name.find('ComputeManager'):
> +            vcpu = self.manager.driver.get_vcpu_number()
> +            memory_mb = self.manager.get_memory_mb()
> +            local_gb = self.manager.get_local_gb()
> +            hypervisor = self.manager.driver.get_hypervisor_type()
> +            version = self.manager.driver.get_hypervisor_version()
> +            cpu_xml = self.manager.driver.get_cpu_xml()
> +
> +            db.host_update(context,
> +                           host_ref['id'],
> +                           {'vcpus': vcpu,
> +                           'memory_mb': memory_mb,
> +                           'local_gb': local_gb,
> +                           'hypervisor_type': hypervisor,
> +                           'hypervisor_version': version,
> +                           'cpu_info':cpu_xml })
> +        return host_ref
> +
>     def __getattr__(self, key):
>         manager = self.__dict__.get('manager', None)
>         return getattr(manager, key)
>
> === added file 'nova/service.py.THIS'
> --- nova/service.py.THIS        1970-01-01 00:00:00 +0000
> +++ nova/service.py.THIS        2010-12-31 04:08:57 +0000

Please remove this file.

> === modified file 'nova/utils.py'
> --- nova/utils.py       2010-12-26 14:08:38 +0000
> +++ nova/utils.py       2010-12-31 04:08:57 +0000
> @@ -134,7 +134,6 @@
>         result = obj.communicate()
>     obj.stdin.close()
>     if obj.returncode:
> -        logging.debug(_("Result was %s") % (obj.returncode))
>         if check_exit_code and obj.returncode != 0:
>             (stdout, stderr) = result
>             raise ProcessExecutionError(exit_code=obj.returncode,
>

Why remove this?

> === modified file 'nova/virt/libvirt_conn.py'
> --- nova/virt/libvirt_conn.py   2010-12-30 21:23:14 +0000
> +++ nova/virt/libvirt_conn.py   2010-12-31 04:08:57 +0000
> @@ -39,6 +39,7 @@
>  import logging
>  import os
>  import shutil
> +import re
>
>  from eventlet import greenthread
>  from eventlet import event
> @@ -82,6 +83,9 @@
>                     '',
>                     'Override the default libvirt URI (which is dependent'
>                     ' on libvirt_type)')
> +flags.DEFINE_string('live_migration_uri',
> +                  "qemu+tcp://%s/system",
> +                  'Define protocol used by live_migration feature')
>  flags.DEFINE_bool('allow_project_net_traffic',
>                   True,
>                   'Whether to allow in project network traffic')
> @@ -674,6 +678,30 @@
>
>         return interfaces
>
> +    def get_vcpu_number(self):
> +        """ Get vcpu number of physical computer.  """
> +        return self._conn.getMaxVcpus(None)
> +
> +    def get_hypervisor_type(self):
> +        """ Get hypervisor type """
> +        return self._conn.getType()
> +
> +    def get_hypervisor_version(self):
> +        """ Get hypervisor version """
> +        return self._conn.getVersion()
> +
> +    def get_cpu_xml(self):
> +        """ Get cpuinfo information """
> +        xmlstr = self._conn.getCapabilities()
> +        xml = libxml2.parseDoc(xmlstr)
> +        nodes = xml.xpathEval('//cpu')
> +        if 1 != len(nodes):
> +            msg = 'Unexpected xml format. tag "cpu" must be 1, but %d.' % len(nodes)
> +            msg += '\n' + xml.serialize()
> +            raise exception.Invalid(_(msg))
> +        cpuxmlstr = re.sub("\n|[ ]+", ' ', nodes[0].serialize())
> +        return cpuxmlstr
> +
>     def block_stats(self, instance_name, disk):
>         """
>         Note that this function takes an instance name, not an Instance, so
> @@ -694,6 +722,119 @@
>         fw = NWFilterFirewall(self._conn)
>         fw.ensure_security_group_filter(security_group_id)
>
> +    def setup_nwfilters_for_instance(self, instance):
> +        """ See same method of NWFilterFirewall class """
> +        nwfilter = NWFilterFirewall(self._conn)
> +        return nwfilter.setup_nwfilters_for_instance(instance)
> +

This has been refactored somewhat in trunk. Please adjust accordingly.

> +    def nwfilter_for_instance_exists(self, instance_ref):
> +        try:
> +            filter = 'nova-instance-%s' % instance_ref.name
> +            self._conn.nwfilterLookupByName(filter)
> +            return True
> +        except libvirt.libvirtError:
> +            return False
> +

I'd prefer it if we could just call into the firewall driver and tell it
to prepare filters for a given instance and leave it to the driver to
decide if it actually has anything left to do.

> +    def compareCPU(self, xml):
> +        """
> +           Check the host cpu is compatible to a cpu given by xml.
> +           "xml" must be a part of libvirt.openReadonly().getCapabilities().
> +           return values follows by virCPUCompareResult.
> +           if 0 > return value, do live migration.
> +
> +           'http://libvirt.org/html/libvirt-libvirt.html#virCPUCompareResult'
> +        """
> +        return self._conn.compareCPU(xml, 0)
> +
> +    def live_migration(self, context, instance_ref, dest):
> +        """
> +           Just spawning live_migration operation for
> +           distributing high-load.
> +        """
> +        greenthread.spawn(self._live_migration, context, instance_ref, dest)
> +
> +    def _live_migration(self, context, instance_ref, dest):
> +        """ Do live migration."""
> +
> +        # Do live migration.
> +        try:
> +            uri = FLAGS.live_migration_uri % dest
> +            out, err = utils.execute("sudo virsh migrate --live %s %s"
> +                                    % (instance_ref.name, uri))

We shouldn't use virsh for this. We should use the python bindings for
libvirt. Doing so also lets us throttle the bandwidth used by the
migration process.

> +        except exception.ProcessExecutionError:
> +            id = instance_ref['id']
> +            db.instance_set_state(context, id, power_state.RUNNING, 'running')
> +            raise
> +
> +        # Waiting for completion of live_migration.
> +        timer = utils.LoopingCall(f=None)
> +
> +        def wait_for_live_migration():
> +
> +            try:
> +                state = self.get_info(instance_ref.name)['state']
> +            except exception.NotFound:
> +                timer.stop()
> +                self._post_live_migration(context, instance_ref, dest)
> +
> +        timer.f = wait_for_live_migration
> +        timer.start(interval=0.5, now=True)
> +
> +    def _post_live_migration(self, context, instance_ref, dest):
> +        """
> +           Post operations for live migration.
> +           Mainly, database updating.
> +        """
> +        # Detaching volumes.
> +        # (not necessary in current version )
> +
> +        # Releasing vlan.
> +        #   (not necessary in current implementation?)
> +
> +        # Releasing security group ingress rule.
> +        #   (not necessary in current implementation?)
> +
> +        # Database updating.
> +        ec2_id = instance_ref['hostname']
> +
> +        instance_id = instance_ref['id']
> +        fixed_ip = db.instance_get_fixed_address(context, instance_id)
> +        # Not return if fixed_ip is not found, otherwise,
> +        # instance never be accessible..
> +        if None == fixed_ip:
> +            logging.warn('fixed_ip is not found for %s ' % ec2_id)
> +        db.fixed_ip_update(context, fixed_ip, {'host': dest})
> +        network_ref = db.fixed_ip_get_network(context, fixed_ip)
> +        db.network_update(context, network_ref['id'], {'host': dest})
> +
> +        try:
> +            floating_ip = db.instance_get_floating_address(context, instance_id)
> +            # Not return if floating_ip is not found, otherwise,
> +            # instance never be accessible..
> +            if None == floating_ip:
> +                logging.error('floating_ip is not found for %s ' % ec2_id)
> +            else:
> +                floating_ip_ref = db.floating_ip_get_by_address(context,
> +                                                                floating_ip)
> +                db.floating_ip_update(context,
> +                                      floating_ip_ref['address'],
> +                                      {'host': dest})
> +        except exception.NotFound:
> +            logging.debug('%s doesnt have floating_ip.. ' % ec2_id)
> +        except:
> +            msg = 'Live migration: Unexpected error:'
> +            msg += '%s cannot inherit floating ip.. ' % ec2_id
> +            logging.error(_(msg))
> +
> +        db.instance_update(context,
> +                           instance_id,
> +                           {'state_description': 'running',
> +                            'state': power_state.RUNNING,
> +                            'host': dest})
> +
> +        logging.info(_('Live migrating %s to %s finishes successfully')
> +                     % (ec2_id, dest))
> +
>
>  class NWFilterFirewall(object):
>     """
>
> === modified file 'setup.py'
> --- setup.py    2010-12-23 16:57:04 +0000
> +++ setup.py    2010-12-31 04:08:57 +0000
> @@ -25,6 +25,7 @@
>
>  from nova.utils import parse_mailmap, str_dict_replace
>
> +
>  class local_BuildDoc(BuildDoc):
>     def run(self):
>         for builder in ['html', 'man']:
> @@ -54,8 +55,8 @@
>       author='OpenStack',
>       author_email='nova@lists.launchpad.net',
>       url='http://www.openstack.org/',
> -      cmdclass={ 'sdist': local_sdist,
> -                 'build_sphinx' : local_BuildDoc },
> +      cmdclass={'sdist': local_sdist,
> +                 'build_sphinx': local_BuildDoc},
>       packages=find_packages(exclude=['bin', 'smoketests']),
>       include_package_data=True,
>       test_suite='nose.collector',

Good pep8 catches! Thanks.

-- 
Soren Hansen
Ubuntu Developer    http://www.ubuntu.com/
OpenStack Developer http://www.openstack.org/

Revision history for this message

Soren Hansen (soren) on 2011-01-10:

review: Needs Fixing

Revision history for this message

Kei Masumoto (masumotok) wrote on 2011-01-11:

Download full text (49.2 KiB)

Soren,

Thanks for reviewing, and I'm trying to fix based on your comments.
By the way, I have some questions - please give me a hand to solve one by one.

Regarding to the comment on nova/compute/manager.py:

>> self.db.instance_update(context,
>> instance_id,
>> - {'host': self.host})
>> + {'host': self.host, 'launched_on':self.host})
>
>Why pass the same value twice?

You mentioned "'launched_on':self.host" should be removed didn't you?
Before doing so, let me explain.

"host' column on Instance table is to record which host an instance is running on.
Therefore, values are updated if an instance is moved by live migration.
On the other hand, 'launched_on' column that I created is to record which host an instance was launched, and is not updated.
This information is necessary because cpuflag of launched host must have compatibility to the one of live migration destination host.
For this reason, I insert save value twice to different column when an instance is launched.

Please let me know if it does not make sense.

Regards,
Kei Masumoto

-----Original Message-----
From: <email address hidden> [mailto:<email address hidden>] On Behalf Of Soren Hansen
Sent: Monday, January 10, 2011 9:47 PM
To: <email address hidden>
Subject: Re: [Merge] lp:~nttdata/nova/live-migration into lp:nova

Hello.

Please find my comments inline.

Todd's newlog branch landed very recently. This changed how we do
logging. Can you update your branch accordingly? Thanks.

There's no longer any difference between ec2_id and internal id. This
should simplify this bit of your patch somewhat.

> + except exception.NotFound as e:
> + msg = _('instance(%s) is not found')
> + e.args += (msg % ec2_id,)
> + raise e

Soren,

Thanks for reviewing, and I'm trying to fix based on your comments.
By the way, I have some questions - please give me a hand to solve one by one.

Regarding to the comment on nova/compute/manager.py:

>>         self.db.instance_update(context,
>>                                 instance_id,
>> -                                {'host': self.host})
>> +                                {'host': self.host, 'launched_on':self.host})
>
>Why pass the same value twice?

You mentioned "'launched_on':self.host" should be removed didn't you?
Before doing so, let me explain.

Please let me know if it does not make sense.

Regards,
Kei Masumoto

-----Original Message-----
From: bounces@canonical.com [mailto:bounces@canonical.com] On Behalf Of Soren Hansen
Sent: Monday, January 10, 2011 9:47 PM
To: mp+44940@code.launchpad.net
Subject: Re: [Merge] lp:~nttdata/nova/live-migration into lp:nova

Hello.

Please find my comments inline.

2010/12/31 Kei Masumoto <masumotok@nttdata.co.jp>:
> === modified file 'bin/nova-manage'
> --- bin/nova-manage     2010-12-16 22:52:08 +0000
> +++ bin/nova-manage     2010-12-31 04:08:57 +0000
> @@ -79,7 +79,10 @@
>  from nova import quota
>  from nova import utils
>  from nova.auth import manager
> +from nova import rpc
>  from nova.cloudpipe import pipelib
> +from nova.api.ec2 import cloud
> +
>
>
>  FLAGS = flags.FLAGS
> @@ -452,6 +455,86 @@
>                                     int(network_size), int(vlan_start),
>                                     int(vpn_start))
>
> +
> +class InstanceCommands(object):
> +    """Class for mangaging VM instances."""
> +
> +    def live_migration(self, ec2_id, dest):
> +        """live_migration"""
> +
> +        logging.basicConfig()

Todd's newlog branch landed very recently. This changed how we do
logging. Can you update your branch accordingly? Thanks.

> +        ctxt = context.get_admin_context()
> +
> +        try:
> +            internal_id = cloud.ec2_id_to_internal_id(ec2_id)
> +            instance_ref = db.instance_get_by_internal_id(ctxt, internal_id)
> +            instance_id = instance_ref['id']

There's no longer any difference between ec2_id and internal id. This
should simplify this bit of your patch somewhat.

> +        except exception.NotFound as e:
> +            msg = _('instance(%s) is not found')
> +            e.args += (msg % ec2_id,)
> +            raise e

> +        ret = rpc.call(ctxt,
> +                       FLAGS.scheduler_topic,
> +                       {"method": "live_migration",
> +                        "args": {"instance_id": instance_id,
> +                                "dest": dest,
> +                                "topic": FLAGS.compute_topic}})

I don't understand why you pass the compute_topic in the rpc call rather
than letting the scheduler worry about that?

> +        if None != ret:
> +            raise ret

"if ret:" is better.

You can (or at least should) only raise Exceptions. rpc.call never
*returns* an Exception. It may *raise* one, but will never return one.

> +
> +        print 'Finished all procedure. Check migrating finishes successfully'
> +        print 'check status by using euca-describe-instances.'

Perhaps something like this instead:
"Migration of %s initiated. Check its progress using euca-describe-instances."

> +class HostCommands(object):
> +    """Class for mangaging host(physical nodes)."""
> +
> +
> +    def list(self):
> +        """describe host list."""
> +
> +        # To supress msg: No handlers could be found for logger "amqplib"
> +        logging.basicConfig()
> +
> +        host_refs = db.host_get_all(context.get_admin_context())
> +        for host_ref in host_refs:
> +            print host_ref['name']
> +
> +
> +    def show(self, host):
> +        """describe cpu/memory/hdd info for host."""
> +
> +        # To supress msg: No handlers could be found for logger "amqplib"
> +        logging.basicConfig()
> +
> +        result = rpc.call(context.get_admin_context(),
> +                         FLAGS.scheduler_topic,
> +                         {"method": "show_host_resource",
> +                          "args": {"host": host}})
> +
> +        # Checking result msg format is necessary, that will have done
> +        # when this feture is included in API.
> +        if dict != type(result):

> +            print 'Unexpected error occurs'
> +        elif not result['ret']:
> +            print '%s' % result['msg']
> +        else:
> +            cpu = result['phy_resource']['vcpus']
> +            mem = result['phy_resource']['memory_mb']
> +            hdd = result['phy_resource']['local_gb']
> +
> +            print 'HOST\t\tPROJECT\t\tcpu\tmem(mb)\tdisk(gb)'
> +            print '%s\t\t\t%s\t%s\t%s' % (host, cpu, mem, hdd)
> +            for p_id, val in result['usage'].items():
> +                print '%s\t%s\t\t%s\t%s\t%s' % (host,
> +                                             p_id,
> +                                             val['vcpus'],
> +                                             val['memory_mb'],
> +                                             val['local_gb'])
> +
> +
>  CATEGORIES = [
>     ('user', UserCommands),
>     ('project', ProjectCommands),
> @@ -459,8 +542,9 @@
>     ('shell', ShellCommands),
>     ('vpn', VpnCommands),
>     ('floating', FloatingIpCommands),
> -    ('network', NetworkCommands)]
> -
> +    ('network', NetworkCommands),
> +    ('instance', InstanceCommands),
> +    ('host', HostCommands)]
>
>  def lazy_match(name, key_value_tuples):
>     """Finds all objects that have a key that case insensitively contains
>
> === modified file 'nova/api/ec2/cloud.py'
> --- nova/api/ec2/cloud.py       2010-12-22 21:38:44 +0000
> +++ nova/api/ec2/cloud.py       2010-12-31 04:08:57 +0000
> @@ -679,13 +679,22 @@
>             ec2_id = None
>             if (floating_ip_ref['fixed_ip']
>                 and floating_ip_ref['fixed_ip']['instance']):
> -                internal_id = floating_ip_ref['fixed_ip']['instance']['ec2_id']
> +                # modified by masumotok
> +                internal_id = \
> +                    floating_ip_ref['fixed_ip']['instance']['internal_id']
>                 ec2_id = internal_id_to_ec2_id(internal_id)
>             address_rv = {'public_ip': address,
>                           'instance_id': ec2_id}
>             if context.user.is_admin():
> +                # modified by masumotok- b/c proj_id is never inserted
> +                #details = "%s (%s)" % (address_rv['instance_id'],
> +                #                       floating_ip_ref['project_id'])
> +                if None != address_rv['instance_id']:
> +                    status = 'reserved'
> +                else:
> +                    status = None
>                 details = "%s (%s)" % (address_rv['instance_id'],
> -                                       floating_ip_ref['project_id'])
> +                                       status)
>                 address_rv['instance_id'] = details
>             addresses.append(address_rv)
>         return {'addressesSet': addresses}
>

Again, we've consolidated the instance ID's into a single one, so the
ec2_id no longer exists.

> === modified file 'nova/compute/api.py'
> --- nova/compute/api.py 2010-12-30 22:27:31 +0000
> +++ nova/compute/api.py 2010-12-31 04:08:57 +0000
> @@ -233,6 +233,7 @@
>                              terminated_at=datetime.datetime.utcnow())
>
>         host = instance['host']
> +        logging.error('terminate %s %s %s %s',context, FLAGS.compute_topic, host, self.db.queue_get_for(context, FLAGS.compute_topic, host))

I don't think including all those things in an error message really is
helpful?

>         if host:
>             rpc.cast(context,
>                      self.db.queue_get_for(context, FLAGS.compute_topic, host),
>
> === modified file 'nova/compute/manager.py'
> --- nova/compute/manager.py     2010-12-30 22:06:48 +0000
> +++ nova/compute/manager.py     2010-12-31 04:08:57 +0000
> @@ -36,6 +36,11 @@
>
>  import datetime
>  import logging
> +import sys
> +import traceback
> +import os
> +import time
> +
>
>  from nova import exception
>  from nova import flags
> @@ -43,12 +48,16 @@
>  from nova import rpc
>  from nova import utils
>  from nova.compute import power_state
> +from nova import rpc

Duplicate import.

> +from nova import db

Move further up, so the imports are alphabetical.

>
>  FLAGS = flags.FLAGS
>  flags.DEFINE_string('instances_path', '$state_path/instances',
>                     'where instances are stored on disk')
>  flags.DEFINE_string('compute_driver', 'nova.virt.connection.get_connection',
>                     'Driver to use for controlling virtualization')
> +flags.DEFINE_string('live_migration_timeout', 10,
> +                    'Timeout value for pre_live_migration is completed.')

What's the unit? Seconds? Minutes?

>  flags.DEFINE_string('stub_network', False,
>                     'Stub network related code')
>
> @@ -111,10 +120,9 @@
>         instance_ref = self.db.instance_get(context, instance_id)
>         if instance_ref['name'] in self.driver.list_instances():
>             raise exception.Error(_("Instance has already been created"))
> -        logging.debug(_("instance %s: starting..."), instance_id)

Why remove this?

>         self.db.instance_update(context,
>                                 instance_id,
> -                                {'host': self.host})
> +                                {'host': self.host, 'launched_on':self.host})

Why pass the same value twice?

>
>         self.db.instance_set_state(context,
>                                    instance_id,
> @@ -415,3 +423,108 @@
>         self.volume_manager.remove_compute_volume(context, volume_id)
>         self.db.volume_detached(context, volume_id)
>         return True
> +
> +    def compareCPU(self, context, xml):
> +        """ Check the host cpu is compatible to a cpu given by xml."""
> +        return self.driver.compareCPU(xml)
> +
> +    def get_memory_mb(self):
> +        """Get the memory size of physical computer ."""
> +        meminfo = open('/proc/meminfo').read().split()
> +        idx = meminfo.index('MemTotal:')
> +        # transforming kb to mb.
> +        return int(meminfo[idx + 1]) / 1024
> +
> +    def get_local_gb(self):
> +        """Get the hdd size of physical computer ."""
> +        hddinfo = os.statvfs(FLAGS.instances_path)
> +        return hddinfo.f_bsize * hddinfo.f_blocks / 1024 / 1024 / 1024

I think get_memory_mb and get_local_gb belong in the virt driver, too.

> +
> +    def pre_live_migration(self, context, instance_id, dest):
> +        """Any preparation for live migration at dst host."""
> +
> +        # Getting volume info ( shlf/slot number )
> +        instance_ref = db.instance_get(context, instance_id)
> +        ec2_id = instance_ref['hostname']
> +
> +        volumes = []
> +        try:
> +            volumes = db.volume_get_by_ec2_id(context, ec2_id)
> +        except exception.NotFound:
> +            logging.info(_('%s has no volume.'), ec2_id)

This should never find any volumes. ec2_id refers to an instance.
volume_get_by_ec2_id looks up volume by the volume id, not an instance
id.

> +
> +        shelf_slots = {}
> +        for vol in volumes:
> +            shelf, slot = db.volume_get_shelf_and_blade(context, vol['id'])
> +            shelf_slots[vol.id] = (shelf, slot)
> +
> +        # Getting fixed ips
> +        fixed_ip = db.instance_get_fixed_address(context, instance_id)
> +        if None == fixed_ip:

"if not fixed_ip:" is better.

> +            exc_type = 'NotFoundError'
> +            val = _('%s(%s) doesnt have fixed_ip') % (instance_id, ec2_id)
> +            tb = ''.join(traceback.format_tb(sys.exc_info()[2]))
> +            raise rpc.RemoteError(exc_type, val, tb)

Why isn't the usual exception handling in the rpc code sufficient here?

> +
> +        # If any volume is mounted, prepare here.
> +        if 0 != len(shelf_slots):
> +            pass

It seems odd to argue over no-op code, but this would look better to me:

for shelf, slot in shelf_slots:
                # Prepare volumes for migration
                pass

> +
> +        #  Creating nova-instance-instance-xxx, this is written to libvirt.xml,
> +        #  and can be seen when executin "virsh nwfiter-list" On destination host,
> +        #  this nwfilter is necessary.
> +        #  In addition this method is creating security rule ingress rule onto
> +        #  destination host.
> +        self.driver.setup_nwfilters_for_instance(instance_ref)

Only libvirt has a setup_nwfilters_for_instance method, and we've
actually added another filtering backend, so this has been generalised
somewhat.

More generally speaking, though, I think this pre-migration code belongs
in the virt driver.

> +
> +        # 5. bridge settings
> +        self.network_manager.setup_compute_network(context, instance_id)
> +        return True
> +
> +    def nwfilter_for_instance_exists(self, context, instance_id):
> +        """Check nova-instance-instance-xxx filter exists """
> +        instance_ref = db.instance_get(context, instance_id)
> +        return self.driver.nwfilter_for_instance_exists(instance_ref)
> +

> +    def live_migration(self, context, instance_id, dest):
> +        """executes live migration."""
> +
> +        # Asking dest host to preparing live migration.
> +        compute_topic = db.queue_get_for(context, FLAGS.compute_topic, dest)
> +        ret = rpc.call(context,
> +                        compute_topic,
> +                        {"method": "pre_live_migration",
> +                         "args": {'instance_id': instance_id,
> +                                    'dest': dest}})
> +
> +        if True != ret:

Just "if ret" is better.

> +            logging.error(_('Pre live migration failed(err at %s)'), dest)
> +            db.instance_set_state(context,
> +                                  instance_id,
> +                                  power_state.RUNNING,
> +                                  'running')
> +            return
> +
> +        # Waiting for setting up nwfilter such as, nova-instance-instance-xxx.
> +        # otherwise, live migration fail.
> +        timeout_count = range(FLAGS.live_migration_timeout * 2)
> +        while 0 != len(timeout_count):
> +            ret = rpc.call(context,
> +                        compute_topic,
> +                        {"method": "nwfilter_for_instance_exists",
> +                         "args": {'instance_id': instance_id}})
> +            if ret:
> +                break
> +
> +            timeout_count.pop()
> +            time.sleep(0.5)
> +
> +        if not ret:
> +            logging.error(_('Timeout for pre_live_migration at %s'), dest)
> +            return
> +
> +        # Executing live migration
> +        # live_migration might raises ProcessExecution error, but
> +        # nothing must be recovered in this version.
> +        instance_ref = db.instance_get(context, instance_id)
> +        self.driver.live_migration(context, instance_ref, dest)
>
> === modified file 'nova/db/api.py'
> --- nova/db/api.py      2010-12-28 17:49:07 +0000
> +++ nova/db/api.py      2010-12-31 04:08:57 +0000
> @@ -234,6 +234,10 @@
>     return IMPL.floating_ip_get_by_address(context, address)
>
>
> +def floating_ip_update(context, address, values):
> +    """update floating ip information."""
> +    return IMPL.floating_ip_update(context, address, values)
> +
>  ####################
>
>
> @@ -378,6 +382,32 @@
>                                             security_group_id)
>
>
> +def instance_get_all_by_host(context, hostname):
> +    """Get instances by host"""
> +    return IMPL.instance_get_all_by_host(context, hostname)
> +
> +
> +def instance_get_vcpu_sum_by_host_and_project(context, hostname, proj_id):
> +    """Get instances.vcpus by host and project"""
> +    return IMPL.instance_get_vcpu_sum_by_host_and_project(context,
> +                                                          hostname,
> +                                                          proj_id)
> +
> +
> +def instance_get_memory_sum_by_host_and_project(context, hostname, proj_id):
> +    """Get amount of memory by host and project """
> +    return IMPL.instance_get_memory_sum_by_host_and_project(context,
> +                                                            hostname,
> +                                                            proj_id)
> +
> +
> +def instance_get_disk_sum_by_host_and_project(context, hostname, proj_id):
> +    """Get total amount of disk by host and project """
> +    return IMPL.instance_get_disk_sum_by_host_and_project(context,
> +                                                          hostname,
> +                                                          proj_id)
> +
> +
>  def instance_action_create(context, values):
>     """Create an instance action from the values dictionary."""
>     return IMPL.instance_action_create(context, values)
> @@ -889,3 +919,36 @@
>
>     """
>     return IMPL.host_get_networks(context, host)
> +
> +
> +###################
> +
> +
> +def host_create(context, value):
> +    """Create a host from the values dictionary."""
> +    return IMPL.host_create(context, value)
> +
> +
> +def host_get(context, host_id):
> +    """Get an host or raise if it does not exist."""
> +    return IMPL.host_get(context, host_id)
> +
> +
> +def host_get_all(context, session=None):
> +    """Get all hosts or raise if it does not exist."""
> +    return IMPL.host_get_all(context)
> +
> +
> +def host_get_by_name(context, host):
> +    """Get an host or raise if it does not exist."""
> +    return IMPL.host_get_by_name(context, host)
> +
> +
> +def host_update(context, host, values):
> +    """Set the given properties on an host and update it."""
> +    return IMPL.host_update(context, host, values)
> +
> +

> +def host_deactivated(context, host):
> +    """set deleted flag to a given host"""
> +    return IMPL.host_deactivated(context, host)

How is this different from the other *_destroy() methods?
>
> === modified file 'nova/db/sqlalchemy/api.py'
> --- nova/db/sqlalchemy/api.py   2010-12-29 20:15:04 +0000
> +++ nova/db/sqlalchemy/api.py   2010-12-31 04:08:57 +0000
> @@ -473,6 +473,16 @@
>     return result
>
>
> +@require_context
> +def floating_ip_update(context, address, values):
> +    session = get_session()
> +    with session.begin():
> +        floating_ip_ref = floating_ip_get_by_address(context, address, session)
> +        for (key, value) in values.iteritems():
> +            floating_ip_ref[key] = value
> +        floating_ip_ref.save(session=session)
> +
> +
>  ###################
>
>
> @@ -832,6 +842,7 @@
>         return instance_ref
>
>
> +@require_context
>  def instance_add_security_group(context, instance_id, security_group_id):
>     """Associate the given security group with the given instance"""
>     session = get_session()
> @@ -845,6 +856,51 @@
>
>
>  @require_context
> +def instance_get_all_by_host(context, hostname):
> +    session = get_session()
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Instance
> +                       ).filter_by(host=hostname
> +                       ).filter_by(deleted=can_read_deleted(context)
> +                       ).all()
> +    if None == result:

"if not result:" is better.

> +        return []
> +    return result
> +
> +
> +@require_context
> +def _instance_get_sum_by_host_and_project(context, column, hostname, proj_id):
> +    session = get_session()
> +
> +    result = session.query(models.Instance
> +                       ).filter_by(host=hostname
> +                       ).filter_by(project_id=proj_id
> +                       ).filter_by(deleted=can_read_deleted(context)
> +                       ).value(column)
> +    if None == result:

"if not result:" is better.

> +        return 0
> +    return result
> +
> +
> +@require_context
> +def instance_get_vcpu_sum_by_host_and_project(context, hostname, proj_id):
> +    return _instance_get_sum_by_host_and_project(context, 'vcpus', hostname,
> +                                                 proj_id)
> +
> +
> +@require_context
> +def instance_get_memory_sum_by_host_and_project(context, hostname, proj_id):
> +    return _instance_get_sum_by_host_and_project(context, 'memory_mb',
> +                                                 hostname, proj_id)
> +
> +
> +@require_context
> +def instance_get_disk_sum_by_host_and_project(context, hostname, proj_id):
> +    return _instance_get_sum_by_host_and_project(context, 'local_gb',
> +                                                 hostname, proj_id)
> +@require_context
>  def instance_action_create(context, values):
>     """Create an instance action from the values dictionary."""
>     action_ref = models.InstanceActions()
> @@ -1875,3 +1931,76 @@
>                        filter_by(deleted=False).\
>                        filter_by(host=host).\
>                        all()
> +
> +
> +###################
> +
> +@require_admin_context
> +def host_create(context, values):
> +    host_ref = models.Host()
> +    for (key, value) in values.iteritems():
> +        host_ref[key] = value
> +    host_ref.save()
> +    return host_ref
> +
> +
> +@require_admin_context
> +def host_get(context, host_id, session=None):
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Host
> +                     ).filter_by(deleted=False
> +                     ).filter_by(id=host_id
> +                     ).first()
> +
> +    if not result:
> +        raise exception.NotFound('No host for id %s' % host_id)
> +
> +    return result
> +
> +
> +@require_admin_context
> +def host_get_all(context, session=None):
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Host
> +                     ).filter_by(deleted=False
> +                     ).all()
> +
> +    if not result:
> +        raise exception.NotFound('No host record found .')
> +
> +    return result
> +
> +
> +@require_admin_context
> +def host_get_by_name(context, host, session=None):
> +    if not session:
> +        session = get_session()
> +
> +    result = session.query(models.Host
> +                     ).filter_by(deleted=False
> +                     ).filter_by(name=host
> +                     ).first()
> +
> +    if not result:
> +        raise exception.NotFound('No host for name %s' % host)
> +
> +    return result
> +
> +
> +@require_admin_context
> +def host_update(context, host_id, values):
> +    session = get_session()
> +    with session.begin():
> +        host_ref = host_get(context, host_id, session=session)
> +        for (key, value) in values.iteritems():
> +            host_ref[key] = value
> +        host_ref.save(session=session)
> +
> +
> +@require_admin_context
> +def host_deactivated(context, host):
> +    host_update(context, host, {'deleted': True})
>
> === modified file 'nova/db/sqlalchemy/models.py'
> --- nova/db/sqlalchemy/models.py        2010-12-31 02:04:39 +0000
> +++ nova/db/sqlalchemy/models.py        2010-12-31 04:08:57 +0000
> @@ -138,6 +138,26 @@
>  #    __tablename__ = 'hosts'
>  #    id = Column(String(255), primary_key=True)
>
> +# this class is created by masumotok

No need to claim ownership over individual model classes. :)

> +class Host(BASE, NovaBase):
> +    """Represents a host where services are running"""
> +    __tablename__ = 'hosts'
> +    id = Column(Integer, primary_key=True)
> +    name = Column(String(255))
> +    vcpus = Column(Integer, nullable=False, default=-1)
> +    memory_mb = Column(Integer, nullable=False, default=-1)
> +    local_gb = Column(Integer, nullable=False, default=-1)
> +    hypervisor_type = Column(String(128))
> +    hypervisor_version = Column(Integer, nullable=False, default=-1)
> +    cpu_info = Column(String(1024))
> +    deleted = Column(Boolean, default=False)
> +    # C: when calling service_create()
> +    # D: never deleted. instead of deleting cloumn "deleted" is true
> +    #    when host is down
> +    #    b/c Host.id is foreign key of service, and records
> +    #    of the "service" table are not deleted.
> +    # R: Column "deleted" is true when calling hosts_up() and host is down.
> +
>
>  class Service(BASE, NovaBase):
>     """Represents a running service on a host."""
> @@ -224,6 +244,10 @@
>     display_name = Column(String(255))
>     display_description = Column(String(255))
>
> +    # To remember on which host a instance booted.
> +    # An instance may moved to other host by live migraiton.
> +    launched_on = Column(String(255))
> +
>     # TODO(vish): see Ewan's email about state improvements, probably
>     #             should be in a driver base class or some such
>     # vmstate_state = running, halted, suspended, paused
> @@ -550,7 +574,7 @@
>               Volume, ExportDevice, IscsiTarget, FixedIp, FloatingIp,
>               Network, SecurityGroup, SecurityGroupIngressRule,
>               SecurityGroupInstanceAssociation, AuthToken, User,
> -              Project, Certificate)  # , Image, Host
> +              Project, Certificate, Host)  # , Image
>     engine = create_engine(FLAGS.sql_connection, echo=False)
>     for model in models:
>         model.metadata.create_all(engine)
>
> === modified file 'nova/network/manager.py'
> --- nova/network/manager.py     2010-12-22 21:38:44 +0000
> +++ nova/network/manager.py     2010-12-31 04:08:57 +0000
> @@ -154,7 +154,7 @@
>         """Called when this host becomes the host for a network."""
>         raise NotImplementedError()
>
> -    def setup_compute_network(self, context, instance_id):
> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Sets up matching network for compute hosts."""
>         raise NotImplementedError()
>
> @@ -314,7 +314,7 @@
>         self.db.fixed_ip_update(context, address, {'allocated': False})
>         self.db.fixed_ip_disassociate(context.elevated(), address)
>
> -    def setup_compute_network(self, context, instance_id):
> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Network is created manually."""
>         pass
>
> @@ -381,9 +381,10 @@
>         super(FlatDHCPManager, self).init_host()
>         self.driver.metadata_forward()
>
> -    def setup_compute_network(self, context, instance_id):
> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Sets up matching network for compute hosts."""
> -        network_ref = db.network_get_by_instance(context, instance_id)
> +        if network_ref is None:
> +            network_ref = db.network_get_by_instance(context, instance_id)
>         self.driver.ensure_bridge(network_ref['bridge'],
>                                   FLAGS.flat_interface)
>
> @@ -473,9 +474,11 @@
>         """Returns a fixed ip to the pool."""
>         self.db.fixed_ip_update(context, address, {'allocated': False})
>
> -    def setup_compute_network(self, context, instance_id):
> +    #def setup_compute_network(self, context, instance_id):

Please remove the old code, rather than just commenting it out.

> +    def setup_compute_network(self, context, instance_id, network_ref=None):
>         """Sets up matching network for compute hosts."""
> -        network_ref = db.network_get_by_instance(context, instance_id)
> +        if network_ref is None:
> +            network_ref = db.network_get_by_instance(context, instance_id)
>         self.driver.ensure_vlan_bridge(network_ref['vlan'],
>                                        network_ref['bridge'])
>
>
> === modified file 'nova/scheduler/driver.py'
> --- nova/scheduler/driver.py    2010-12-22 20:59:53 +0000
> +++ nova/scheduler/driver.py    2010-12-31 04:08:57 +0000
> @@ -22,10 +22,14 @@
>  """
>
>  import datetime
> +import logging
>
>  from nova import db
>  from nova import exception
>  from nova import flags
> +from nova import rpc
> +from nova.api.ec2 import cloud
> +from nova.compute import power_state
>
>  FLAGS = flags.FLAGS
>  flags.DEFINE_integer('service_down_time', 60,
> @@ -59,3 +63,136 @@
>     def schedule(self, context, topic, *_args, **_kwargs):
>         """Must override at least this method for scheduler to work."""
>         raise NotImplementedError(_("Must implement a fallback schedule"))
> +
> +    def schedule_live_migration(self, context, instance_id, dest):
> +        """ live migration method """
> +
> +        # Whether instance exists and running
> +        # try-catch clause is necessary because only internal_id is shown
> +        # when NotFound exception occurs. it isnot understandable to admins.
> +        try:
> +            instance_ref = db.instance_get(context, instance_id)
> +            ec2_id = instance_ref['hostname']
> +            internal_id = instance_ref['internal_id']
> +        except exception.NotFound, e:
> +            msg = _('Unexpected error: instance is not found')
> +            e.args += ('\n' + msg, )
> +            raise e

Same comment as above wrt to exception mangling. I now see your comment
above, but I don't understand how it helps to just not show the id at
all?

> +
> +        # Checking instance state.
> +        if power_state.RUNNING != instance_ref['state'] or \
> +           'running' != instance_ref['state_description']:
> +            msg = _('Instance(%s) is not running')
> +            raise exception.Invalid(msg % ec2_id)
> +
> +        # Checking destination host exists
> +        dhost_ref = db.host_get_by_name(context, dest)
> +
> +        # Checking whether The host where instance is running
> +        # and dest is not same.
> +        src = instance_ref['host']
> +        if dest == src:
> +            msg = _('%s is where %s is running now. choose other host.')
> +            raise exception.Invalid(msg % (dest, ec2_id))
> +
> +        # Checking dest is compute node.
> +        services = db.service_get_all_by_topic(context, 'compute')
> +        if dest not in [service.host for service in services]:
> +            msg = _('%s must be compute node')
> +            raise exception.Invalid(msg % dest)
> +
> +        # Checking dest host is alive.
> +        service = [service for service in services if service.host == dest]
> +        service = service[0]
> +        if not self.service_is_up(service):
> +            msg = _('%s is not alive(time synchronize problem?)')
> +            raise exception.Invalid(msg % dest)
> +
> +        # NOTE(masumotok): Below pre-checkings are followed by
> +        # http://wiki.libvirt.org/page/TodoPreMigrationChecks
> +
> +        # Checking hypervisor is same.
> +        orighost = instance_ref['launched_on']
> +        ohost_ref = db.host_get_by_name(context, orighost)
> +
> +        otype = ohost_ref['hypervisor_type']
> +        dtype = dhost_ref['hypervisor_type']
> +        if otype != dtype:
> +            msg = _('Different hypervisor type(%s->%s)')
> +            raise exception.Invalid(msg % (otype, dtype))
> +
> +        # Checkng hypervisor version.
> +        oversion = ohost_ref['hypervisor_version']
> +        dversion = dhost_ref['hypervisor_version']
> +        if oversion > dversion:
> +            msg = _('Older hypervisor version(%s->%s)')
> +            raise exception.Invalid(msg % (oversion, dversion))
> +
> +        # Checking cpuinfo.
> +        cpuinfo = ohost_ref['cpu_info']
> +        if str != type(cpuinfo):
> +            msg = _('Unexpected err: not found cpu_info for %s on DB.hosts')
> +            raise exception.Invalid(msg % orighost)
> +
> +        ret = rpc.call(context,
> +                       db.queue_get_for(context, FLAGS.compute_topic, dest),
> +                       {"method": 'compareCPU',
> +                        "args": {'xml': cpuinfo}})
> +
> +        if int != type(ret):
> +            raise ret

Same comment as above. rpc.call doesn't ever return exceptions, so don't
raise its return value.

> +
> +        if 0 >= ret:
> +            u = 'http://libvirt.org/html/libvirt-libvirt.html'
> +            u += '#virCPUCompareResult'
> +            msg = '%s doesnt have compatibility to %s(where %s launching at)\n'
> +            msg += 'result:%d \n'
> +            msg += 'Refer to %s'
> +            msg = _(msg)
> +            raise exception.Invalid(msg % (dest, src, ec2_id, ret, u))

This is several levels removed from libvirt, so errors should either be
more generic, or generated closer to the virt layer and sent back to
here.

> +
> +        # Checking dst host still has enough capacities.
> +        self.has_enough_resource(context, instance_id, dest)
> +
> +        # Changing instance_state.
> +        db.instance_set_state(context,
> +                              instance_id,
> +                              power_state.PAUSED,
> +                              'migrating')
> +
> +        # Requesting live migration.
> +        return src
> +
> +    def has_enough_resource(self, context, instance_id, dest):
> +        """ Check if destination host has enough resource for live migration"""
> +
> +        # Getting instance information
> +        instance_ref = db.instance_get(context, instance_id)
> +        ec2_id = instance_ref['hostname']
> +        vcpus = instance_ref['vcpus']
> +        mem = instance_ref['memory_mb']
> +        hdd = instance_ref['local_gb']
> +
> +        # Gettin host information
> +        host_ref = db.host_get_by_name(context, dest)
> +        total_cpu = int(host_ref['vcpus'])
> +        total_mem = int(host_ref['memory_mb'])
> +        total_hdd = int(host_ref['local_gb'])
> +
> +        instances_ref = db.instance_get_all_by_host(context, dest)
> +        for i_ref in instances_ref:
> +            total_cpu -= int(i_ref['vcpus'])
> +            total_mem -= int(i_ref['memory_mb'])
> +            total_hdd -= int(i_ref['local_gb'])
> +
> +        # Checking host has enough information
> +        logging.debug('host(%s) remains vcpu:%s mem:%s hdd:%s,' %
> +                      (dest, total_cpu, total_mem, total_hdd))
> +        logging.debug('instance(%s) has vcpu:%s mem:%s hdd:%s,' %
> +                      (ec2_id, vcpus, mem, hdd))
> +
> +        if total_cpu <= vcpus or total_mem <= mem or total_hdd <= hdd:
> +            msg = '%s doesnt have enough resource for %s' % (dest, ec2_id)
> +            raise exception.NotEmpty(msg)
> +
> +        logging.debug(_('%s has enough resource for %s') % (dest, ec2_id))
>
> === modified file 'nova/scheduler/manager.py'
> --- nova/scheduler/manager.py   2010-12-22 20:59:53 +0000
> +++ nova/scheduler/manager.py   2010-12-31 04:08:57 +0000
> @@ -29,6 +29,7 @@
>  from nova import manager
>  from nova import rpc
>  from nova import utils
> +from nova import exception
>
>  FLAGS = flags.FLAGS
>  flags.DEFINE_string('scheduler_driver',
> @@ -66,3 +67,46 @@
>                  {"method": method,
>                   "args": kwargs})
>         logging.debug(_("Casting to %s %s for %s"), topic, host, method)
> +
> +    # NOTE (masumotok) : This method should be moved to nova.api.ec2.admin.
> +    #                    Based on bear design summit discussion,
> +    #                    just put this here for bexar release.
> +    def show_host_resource(self, context, host, *args):
> +        """ show the physical/usage resource given by hosts."""
> +
> +        try:
> +            host_ref = db.host_get_by_name(context, host)
> +        except exception.NotFound:
> +            return {'ret': False, 'msg': 'No such Host'}
> +        except:
> +            raise
> +
> +        # Getting physical resource information
> +        h_resource = {'vcpus': host_ref['vcpus'],
> +                     'memory_mb': host_ref['memory_mb'],
> +                     'local_gb': host_ref['local_gb']}
> +
> +        # Getting usage resource information
> +        u_resource = {}
> +        instances_ref = db.instance_get_all_by_host(context, host_ref['name'])
> +
> +        if 0 == len(instances_ref):
> +            return {'ret': True, 'phy_resource': h_resource, 'usage': {}}
> +
> +        project_ids = [i['project_id'] for i in instances_ref]
> +        project_ids = list(set(project_ids))
> +        for p_id in project_ids:
> +            vcpus = db.instance_get_vcpu_sum_by_host_and_project(context,
> +                                                               host,
> +                                                               p_id)
> +            mem = db.instance_get_memory_sum_by_host_and_project(context,
> +                                                                host,
> +                                                                p_id)
> +            hdd = db.instance_get_disk_sum_by_host_and_project(context,
> +                                                               host,
> +                                                               p_id)
> +            u_resource[p_id] = {'vcpus': vcpus,
> +                                'memory_mb': mem,
> +                                'local_gb': hdd}
> +
> +        return {'ret': True, 'phy_resource': h_resource, 'usage': u_resource}
>
> === modified file 'nova/service.py'
> --- nova/service.py     2010-12-22 20:59:53 +0000
> +++ nova/service.py     2010-12-31 04:08:57 +0000
> @@ -77,6 +77,13 @@
>         self.manager.init_host()
>         self.model_disconnected = False
>         ctxt = context.get_admin_context()
> +
> +        try:
> +            host_ref = db.host_get_by_name(ctxt, self.host)
> +        except exception.NotFound:
> +            host_ref = db.host_create(ctxt, {'name': self.host})
> +        host_ref = self._update_host_ref(ctxt, host_ref)
> +
>         try:
>             service_ref = db.service_get_by_args(ctxt,
>                                                  self.host,
> @@ -117,6 +124,26 @@
>                                          'report_count': 0})
>         self.service_id = service_ref['id']
>
> +    def _update_host_ref(self, context, host_ref):
> +
> +        if 0 <= self.manager_class_name.find('ComputeManager'):
> +            vcpu = self.manager.driver.get_vcpu_number()
> +            memory_mb = self.manager.get_memory_mb()
> +            local_gb = self.manager.get_local_gb()
> +            hypervisor = self.manager.driver.get_hypervisor_type()
> +            version = self.manager.driver.get_hypervisor_version()
> +            cpu_xml = self.manager.driver.get_cpu_xml()
> +
> +            db.host_update(context,
> +                           host_ref['id'],
> +                           {'vcpus': vcpu,
> +                           'memory_mb': memory_mb,
> +                           'local_gb': local_gb,
> +                           'hypervisor_type': hypervisor,
> +                           'hypervisor_version': version,
> +                           'cpu_info':cpu_xml })
> +        return host_ref
> +
>     def __getattr__(self, key):
>         manager = self.__dict__.get('manager', None)
>         return getattr(manager, key)
>
> === added file 'nova/service.py.THIS'
> --- nova/service.py.THIS        1970-01-01 00:00:00 +0000
> +++ nova/service.py.THIS        2010-12-31 04:08:57 +0000

Please remove this file.

> === modified file 'nova/utils.py'
> --- nova/utils.py       2010-12-26 14:08:38 +0000
> +++ nova/utils.py       2010-12-31 04:08:57 +0000
> @@ -134,7 +134,6 @@
>         result = obj.communicate()
>     obj.stdin.close()
>     if obj.returncode:
> -        logging.debug(_("Result was %s") % (obj.returncode))
>         if check_exit_code and obj.returncode != 0:
>             (stdout, stderr) = result
>             raise ProcessExecutionError(exit_code=obj.returncode,
>

Why remove this?

> === modified file 'nova/virt/libvirt_conn.py'
> --- nova/virt/libvirt_conn.py   2010-12-30 21:23:14 +0000
> +++ nova/virt/libvirt_conn.py   2010-12-31 04:08:57 +0000
> @@ -39,6 +39,7 @@
>  import logging
>  import os
>  import shutil
> +import re
>
>  from eventlet import greenthread
>  from eventlet import event
> @@ -82,6 +83,9 @@
>                     '',
>                     'Override the default libvirt URI (which is dependent'
>                     ' on libvirt_type)')
> +flags.DEFINE_string('live_migration_uri',
> +                  "qemu+tcp://%s/system",
> +                  'Define protocol used by live_migration feature')
>  flags.DEFINE_bool('allow_project_net_traffic',
>                   True,
>                   'Whether to allow in project network traffic')
> @@ -674,6 +678,30 @@
>
>         return interfaces
>
> +    def get_vcpu_number(self):
> +        """ Get vcpu number of physical computer.  """
> +        return self._conn.getMaxVcpus(None)
> +
> +    def get_hypervisor_type(self):
> +        """ Get hypervisor type """
> +        return self._conn.getType()
> +
> +    def get_hypervisor_version(self):
> +        """ Get hypervisor version """
> +        return self._conn.getVersion()
> +
> +    def get_cpu_xml(self):
> +        """ Get cpuinfo information """
> +        xmlstr = self._conn.getCapabilities()
> +        xml = libxml2.parseDoc(xmlstr)
> +        nodes = xml.xpathEval('//cpu')
> +        if 1 != len(nodes):
> +            msg = 'Unexpected xml format. tag "cpu" must be 1, but %d.' % len(nodes)
> +            msg += '\n' + xml.serialize()
> +            raise exception.Invalid(_(msg))
> +        cpuxmlstr = re.sub("\n|[ ]+", ' ', nodes[0].serialize())
> +        return cpuxmlstr
> +
>     def block_stats(self, instance_name, disk):
>         """
>         Note that this function takes an instance name, not an Instance, so
> @@ -694,6 +722,119 @@
>         fw = NWFilterFirewall(self._conn)
>         fw.ensure_security_group_filter(security_group_id)
>
> +    def setup_nwfilters_for_instance(self, instance):
> +        """ See same method of NWFilterFirewall class """
> +        nwfilter = NWFilterFirewall(self._conn)
> +        return nwfilter.setup_nwfilters_for_instance(instance)
> +

This has been refactored somewhat in trunk. Please adjust accordingly.

> +    def nwfilter_for_instance_exists(self, instance_ref):
> +        try:
> +            filter = 'nova-instance-%s' % instance_ref.name
> +            self._conn.nwfilterLookupByName(filter)
> +            return True
> +        except libvirt.libvirtError:
> +            return False
> +

I'd prefer it if we could just call into the firewall driver and tell it
to prepare filters for a given instance and leave it to the driver to
decide if it actually has anything left to do.

> +    def compareCPU(self, xml):
> +        """
> +           Check the host cpu is compatible to a cpu given by xml.
> +           "xml" must be a part of libvirt.openReadonly().getCapabilities().
> +           return values follows by virCPUCompareResult.
> +           if 0 > return value, do live migration.
> +
> +           'http://libvirt.org/html/libvirt-libvirt.html#virCPUCompareResult'
> +        """
> +        return self._conn.compareCPU(xml, 0)
> +
> +    def live_migration(self, context, instance_ref, dest):
> +        """
> +           Just spawning live_migration operation for
> +           distributing high-load.
> +        """
> +        greenthread.spawn(self._live_migration, context, instance_ref, dest)
> +
> +    def _live_migration(self, context, instance_ref, dest):
> +        """ Do live migration."""
> +
> +        # Do live migration.
> +        try:
> +            uri = FLAGS.live_migration_uri % dest
> +            out, err = utils.execute("sudo virsh migrate --live %s %s"
> +                                    % (instance_ref.name, uri))

We shouldn't use virsh for this. We should use the python bindings for
libvirt. Doing so also lets us throttle the bandwidth used by the
migration process.

> +        except exception.ProcessExecutionError:
> +            id = instance_ref['id']
> +            db.instance_set_state(context, id, power_state.RUNNING, 'running')
> +            raise
> +
> +        # Waiting for completion of live_migration.
> +        timer = utils.LoopingCall(f=None)
> +
> +        def wait_for_live_migration():
> +
> +            try:
> +                state = self.get_info(instance_ref.name)['state']
> +            except exception.NotFound:
> +                timer.stop()
> +                self._post_live_migration(context, instance_ref, dest)
> +
> +        timer.f = wait_for_live_migration
> +        timer.start(interval=0.5, now=True)
> +
> +    def _post_live_migration(self, context, instance_ref, dest):
> +        """
> +           Post operations for live migration.
> +           Mainly, database updating.
> +        """
> +        # Detaching volumes.
> +        # (not necessary in current version )
> +
> +        # Releasing vlan.
> +        #   (not necessary in current implementation?)
> +
> +        # Releasing security group ingress rule.
> +        #   (not necessary in current implementation?)
> +
> +        # Database updating.
> +        ec2_id = instance_ref['hostname']
> +
> +        instance_id = instance_ref['id']
> +        fixed_ip = db.instance_get_fixed_address(context, instance_id)
> +        # Not return if fixed_ip is not found, otherwise,
> +        # instance never be accessible..
> +        if None == fixed_ip:
> +            logging.warn('fixed_ip is not found for %s ' % ec2_id)
> +        db.fixed_ip_update(context, fixed_ip, {'host': dest})
> +        network_ref = db.fixed_ip_get_network(context, fixed_ip)
> +        db.network_update(context, network_ref['id'], {'host': dest})
> +
> +        try:
> +            floating_ip = db.instance_get_floating_address(context, instance_id)
> +            # Not return if floating_ip is not found, otherwise,
> +            # instance never be accessible..
> +            if None == floating_ip:
> +                logging.error('floating_ip is not found for %s ' % ec2_id)
> +            else:
> +                floating_ip_ref = db.floating_ip_get_by_address(context,
> +                                                                floating_ip)
> +                db.floating_ip_update(context,
> +                                      floating_ip_ref['address'],
> +                                      {'host': dest})
> +        except exception.NotFound:
> +            logging.debug('%s doesnt have floating_ip.. ' % ec2_id)
> +        except:
> +            msg = 'Live migration: Unexpected error:'
> +            msg += '%s cannot inherit floating ip.. ' % ec2_id
> +            logging.error(_(msg))
> +
> +        db.instance_update(context,
> +                           instance_id,
> +                           {'state_description': 'running',
> +                            'state': power_state.RUNNING,
> +                            'host': dest})
> +
> +        logging.info(_('Live migrating %s to %s finishes successfully')
> +                     % (ec2_id, dest))
> +
>
>  class NWFilterFirewall(object):
>     """
>
> === modified file 'setup.py'
> --- setup.py    2010-12-23 16:57:04 +0000
> +++ setup.py    2010-12-31 04:08:57 +0000
> @@ -25,6 +25,7 @@
>
>  from nova.utils import parse_mailmap, str_dict_replace
>
> +
>  class local_BuildDoc(BuildDoc):
>     def run(self):
>         for builder in ['html', 'man']:
> @@ -54,8 +55,8 @@
>       author='OpenStack',
>       author_email='nova@lists.launchpad.net',
>       url='http://www.openstack.org/',
> -      cmdclass={ 'sdist': local_sdist,
> -                 'build_sphinx' : local_BuildDoc },
> +      cmdclass={'sdist': local_sdist,
> +                 'build_sphinx': local_BuildDoc},
>       packages=find_packages(exclude=['bin', 'smoketests']),
>       include_package_data=True,
>       test_suite='nose.collector',

Good pep8 catches! Thanks.

--
Soren Hansen
Ubuntu Developer    http://www.ubuntu.com/
OpenStack Developer http://www.openstack.org/

https://code.launchpad.net/~nttdata/nova/live-migration/+merge/44940
Your team NTT DATA is subscribed to branch lp:~nttdata/nova/live-migration.