Comment 3 for bug 1786346

Dan Smith (danms) wrote:

This now appears to be fundamentally broken on Rocky, and racy on Queens and Pike. Below is the current thinking from a live conversation and debugging session with the reporter.

The VIF plugging that triggers the event happens here:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7775

in pre_live_migration(). That is well before the wait_for_instance_event() call in the libvirt driver, which was added here:

https://review.openstack.org/#/c/497457/30/nova/virt/libvirt/driver.py

This means that in Pike/Queens we race to the point at which the event shows up. If we win, we receive it and do the slow-to-fast dance as expected (start the migration under the 1MB/s bandwidth cap, then raise it once the event arrives). If we lose the race, the event arrives before we start listening for it, in which case we time out but don't actually stop the live migration, which means it continues at 1MB/s until it completes.
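
To make the losing side of that race concrete, here is a rough, self-contained sketch. The names and structure here (EventStore, start_listening, etc.) are hypothetical and are not Nova's actual event plumbing; the point is only that a one-shot event delivered before the waiter registers can never be received, so the waiter times out.

    import threading

    # Hypothetical, simplified stand-in for the external-event plumbing;
    # not Nova's real classes or signatures.
    class EventStore:
        def __init__(self):
            self._lock = threading.Lock()
            self._waiters = {}  # event name -> threading.Event

        def start_listening(self, name):
            # Register interest in a named event and get a waiter back.
            with self._lock:
                waiter = threading.Event()
                self._waiters[name] = waiter
                return waiter

        def deliver(self, name):
            # One-shot delivery: if nobody has registered yet, the event is lost.
            with self._lock:
                waiter = self._waiters.get(name)
            if waiter is not None:
                waiter.set()

    store = EventStore()

    def pre_live_migration():
        # Plugging the VIF on the destination ends up emitting the
        # network-vif-plugged event.
        store.deliver('network-vif-plugged')

    def driver_wait_for_plug():
        # The driver only starts listening after pre_live_migration() has
        # already run, so it can lose the race.
        waiter = store.start_listening('network-vif-plugged')
        if waiter.wait(timeout=1.0):
            print('event received: raise the bandwidth cap (slow -> fast)')
        else:
            print('timed out: migration keeps running at the 1MB/s cap')

    pre_live_migration()     # the event fires first...
    driver_wait_for_plug()   # ...so the late listener times out

Whether a given Pike/Queens migration lands in the win or lose branch is just timing; the key point is that the timeout does not abort the migration, it just leaves the bandwidth cap in place.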

On Rocky, we now always listen for that event a level up in the compute manager:

https://review.openstack.org/#/c/558001/10/nova/compute/manager.py

which is properly wrapped around pre_live_migration() and will always wait for and eat the plugging event before calling into the driver. Thus, the driver will never receive the event it is waiting for, so we will always time out and always run migrations at 1MB/s until completion.
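
A rough sketch of why the Rocky arrangement can never work, again with hypothetical names rather than Nova's actual code: the plugging event is a one-shot notification, so whichever waiter gets it first consumes it, and the inner (driver-level) wait is guaranteed to time out.

    import queue

    # Hypothetical one-shot event channel, not Nova's real implementation.
    events = queue.Queue()

    def deliver_vif_plugged():
        events.put('network-vif-plugged')

    def wait_for_vif_plugged(who, timeout):
        try:
            events.get(timeout=timeout)
            print('%s: got network-vif-plugged' % who)
            return True
        except queue.Empty:
            print('%s: timed out waiting for network-vif-plugged' % who)
            return False

    deliver_vif_plugged()                          # pre_live_migration() plugs the VIF
    wait_for_vif_plugged('compute manager', 1.0)   # outer wait eats the event
    wait_for_vif_plugged('libvirt driver', 1.0)    # inner wait always times out,
                                                   # so the 1MB/s cap is never lifted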