If an instance was migrated while in shutdown state, nova disallows start before resize-confirm

Bug #1460577 reported by George Shuklin
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (not set)

Bug Description

Steps to reproduce:
1. Create instance
2. Shutdown instance
3. Perform resize
4. Try to start instance.

Expected behaviour: instance starts in resize_confirm state
Actual behaviour: ERROR (Conflict): Instance d0e9bc6b-0544-410f-ba96-b0b78ce18828 in vm_state resized. Cannot start while the instance is in this state. (HTTP 409)
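
The reproduction steps above can be sketched as a toy state model. This is not nova's code: the `Instance` class, the `Conflict` exception and the helper functions are hypothetical stand-ins, and the assumption that 'start' is only accepted from the stopped state is inferred from the error message in this report.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the vm_state values named in this report.
ACTIVE, STOPPED, RESIZED = "active", "stopped", "resized"


class Conflict(Exception):
    """Stands in for the HTTP 409 returned by the API."""


@dataclass
class Instance:
    uuid: str
    vm_state: str = ACTIVE


def stop(instance):
    instance.vm_state = STOPPED


def resize(instance):
    # After a resize finishes, the instance waits for confirm or revert.
    instance.vm_state = RESIZED


def start(instance, allowed=(STOPPED,)):
    # Assumption: 'start' is only accepted from the stopped state.
    if instance.vm_state not in allowed:
        raise Conflict(
            f"Instance {instance.uuid} in vm_state {instance.vm_state}. "
            "Cannot start while the instance is in this state."
        )
    instance.vm_state = ACTIVE


def reproduce():
    """Run steps 1-4 and return the outcome as a string."""
    inst = Instance(uuid="d0e9bc6b-0544-410f-ba96-b0b78ce18828")
    stop(inst)
    resize(inst)
    try:
        start(inst)
    except Conflict as exc:
        return str(exc)
    return "started"
```

Passing `allowed=(STOPPED, RESIZED)` to `start` in this model corresponds to the proposed solution, although in the model the instance then ends up active rather than staying in a resize-confirm state.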

Rationale:

If a tenant resizes a running instance, they can log into it after the reboot and see whether the resize was successful. If the tenant stopped the instance before the resize, they have no way to check whether it resized successfully before confirming the migration.

Proposed solution: allow starting an instance in the 'resize_confirm + stopped' state.

(Btw: I'd like stop/start to be allowed for instances in resize_confirm state, because a tenant may wish to reboot/stop/start the instance a few times before deciding that the migration was successful.)

Tags: compute resize
tags: added: compute live-migration resize
jichenjc (jichenjc)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
jichenjc (jichenjc) wrote :

I think we should allow this operation.
Actually, we power on the VM after the resize if the old instance was not stopped:
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3734

But if it was powered off by the user, then we never get a chance to power it on again?

Changed in nova:
assignee: nobody → jichenjc (jichenjc)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/200165

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by jichenjc (<email address hidden>) on branch: master
Review: https://review.openstack.org/200165
Reason: wrong direction

tags: removed: live-migration
Changed in nova:
assignee: jichenjc (jichenjc) → nobody
status: In Progress → Confirmed
Vijay (vjgblogs)
Changed in nova:
assignee: nobody → Vijay (vjgblogs)
status: Confirmed → In Progress
Nazeema Begum (nazeema123) wrote :

Is anyone still working on this bug? If not, I would like to work on it.

George Shuklin (george-shuklin) wrote :

I think it is abandoned. Feel free to take.

Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Vijay (vjgblogs) → nobody
description: updated
Changed in nova:
assignee: nobody → Harshavardhan Metla (harsha24)
Harshavardhan Metla (harsha24) wrote :

Should there be a timeout between the resize request and the confirm resize? That way we would not get a conflict: if the confirm-resize timeout expires, the instance reverts to its old state and we can start it; otherwise we confirm the resize and then start the instance.

Harshavardhan Metla (harsha24) wrote :

Here is the log generated while reproducing the bug:
Jul 16 13:38:57 devstackTrain <email address hidden>[29384]: DEBUG nova.api.openstack.wsgi [None req-9a3fcab8-b121-4e7f-9504-c834229bc07b admin admin] Action: 'action', calling method: <bound method ServersController._start_server of <nova.api.openstack.compute.servers.ServersController object at 0x7efc62e87590>>, body: {"os-start": null} {{(pid=29396) _process_stack /opt/stack/nova/nova/api/openstack/wsgi.py:520}}
Jul 16 13:38:57 devstackTrain <email address hidden>[29384]: DEBUG oslo_concurrency.lockutils [None req-9a3fcab8-b121-4e7f-9504-c834229bc07b admin admin] Lock "61490f1a-3837-4aa6-ad85-960894fcf515" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s {{(pid=29396) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:327}}
Jul 16 13:38:57 devstackTrain <email address hidden>[29384]: DEBUG oslo_concurrency.lockutils [None req-9a3fcab8-b121-4e7f-9504-c834229bc07b admin admin] Lock "61490f1a-3837-4aa6-ad85-960894fcf515" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s {{(pid=29396) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:339}}
Jul 16 13:38:57 devstackTrain <email address hidden>[29384]: INFO nova.api.openstack.wsgi [None req-9a3fcab8-b121-4e7f-9504-c834229bc07b admin admin] HTTP exception thrown: Cannot 'start' instance b580517c-2d54-444d-a3b5-c76e83297b0d while it is in vm_state resized
Jul 16 13:38:57 devstackTrain <email address hidden>[29384]: DEBUG nova.api.openstack.wsgi [None req-9a3fcab8-b121-4e7f-9504-c834229bc07b admin admin] Returning 409 to user: Cannot 'start' instance b580517c-2d54-444d-a3b5-c76e83297b0d while it is in vm_state resized {{(pid=29396) __call__ /opt/stack/nova/nova/api/openstack/wsgi.py:941}}
Jul 16 13:38:57 devstackTrain <email address hidden>[29384]: INFO nova.api.openstack.requestlog [None req-9a3fcab8-b121-4e7f-9504-c834229bc07b admin admin] 127.0.0.1 "POST /compute/v2.1/servers/b580517c-2d54-444d-a3b5-c76e83297b0d/action" status: 409 len: 144 microversion: 2.1 time: 0.132900

melanie witt (melwitt) wrote :

I had a look through the code because I was curious, and it seems to me we might be able to provide the ability to start an unconfirmed/unreverted resized instance by (1) allowing 'start' from the RESIZED state and (2) restoring the vm_state.RESIZED and task_state None to the instance after the 'start' completes, based on whether the instance has a migration_context indicating a resize-in-progress.

I was thinking the main challenge with this ask was avoiding the wipeout of the resize-related states on the instance after start completes, but we can figure out if the instance was in the middle of a resize by checking if there's a migration_context.

If the instance is in the middle of a resize, its migration_context will have the migration_id and we could lookup the Migration. Then, if the migration.status = 'finished' and migration.migration_type = 'resize', we know to set the vm_state to RESIZED and task_state to None after the 'start' is completed.
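
A minimal sketch of that decision, assuming simplified stand-in classes rather than nova's real objects: `Instance`, `MigrationContext`, `Migration` and `state_after_start` are hypothetical names, and the `migrations` dict stands in for the Migration lookup by ID.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical stand-ins for the vm_state values discussed in the comment.
ACTIVE, RESIZED = "active", "resized"


@dataclass
class Migration:
    id: int
    status: str
    migration_type: str


@dataclass
class MigrationContext:
    migration_id: int


@dataclass
class Instance:
    vm_state: str
    task_state: Optional[str] = None
    migration_context: Optional[MigrationContext] = None


def state_after_start(
    instance: Instance, migrations: Dict[int, Migration]
) -> Tuple[str, Optional[str]]:
    """Return the (vm_state, task_state) to save once 'start' completes."""
    ctx = instance.migration_context
    if ctx is not None:
        # The migration_context carries the ID used to look up the Migration.
        migration = migrations.get(ctx.migration_id)
        if (
            migration is not None
            and migration.status == "finished"
            and migration.migration_type == "resize"
        ):
            # Unconfirmed resize: keep the instance confirmable/revertable.
            return RESIZED, None
    # Ordinary start: the instance simply becomes active.
    return ACTIVE, None
```

The point of returning the pair instead of mutating the instance is only to keep the sketch easy to check; where the states would actually be saved is not specified in the comment.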

I think the only concern left is: would/could anything bad happen if a user attempts to 'start' a RESIZED instance that is already powered on? From the libvirt driver code, doing this would initiate a hard reboot. For other drivers, it depends on their power_on code. I think it might be bad/unexpected if we hard reboot a user's instance if they 'start' a powered on instance in state RESIZED (and for example, didn't realize the instance was already powered on or otherwise sent the request by mistake). But we can't check the power state from compute/api -- we rely on vm_state.STOPPED for that.

(Later) Actually we do have an Instance.power_state attribute we could look at in the compute/api for 'start'. So if vm_state.RESIZED then check Instance.power_state and if it's not power_state.SHUTDOWN, reject the request with a 409.
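
As a rough illustration of that guard, assuming string stand-ins for the state constants: `check_can_start` and `Conflict` are hypothetical names, not nova functions.

```python
# Hypothetical stand-ins for vm_state and power_state values.
STOPPED, RESIZED = "stopped", "resized"
RUNNING, SHUTDOWN = "running", "shutdown"


class Conflict(Exception):
    """Stands in for the HTTP 409 response."""

    status_code = 409


def check_can_start(vm_state: str, power_state: str) -> None:
    """Raise Conflict unless 'start' should be accepted."""
    if vm_state == STOPPED:
        return
    # A resized instance may only be started if it is really powered off;
    # otherwise 'start' could hard reboot a guest that is already running.
    if vm_state == RESIZED and power_state == SHUTDOWN:
        return
    raise Conflict(
        f"Cannot 'start' instance while it is in vm_state {vm_state} "
        f"with power_state {power_state}"
    )
```

With this check, `check_can_start(RESIZED, SHUTDOWN)` passes silently, while `check_can_start(RESIZED, RUNNING)` raises `Conflict`.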

This is all a bit complex in order to allow 'start' and preserve instance states for a RESIZED instance, so I'm not sure how important this is to operators and whether it's worth the effort and complexity to implement.

On the other hand, we do allow users to 'resize' a STOPPED instance, yet they are disallowed from 'start'ing it to assess whether the resize went smoothly. I guess the workaround for that would be to 'revert' the resize, 'start' the instance, and then 'resize' it again.

George Shuklin (george-shuklin) wrote :

I've reported this bug and I can give an additional scenario to consider.

If an instance is a 'half-pet' (e.g. it holds some important data), it's natural for the operator (tenant) to shut the instance down carefully to avoid any kind of disruption (e.g. by verifying that the shutdown completed cleanly and all data was saved and is good).

That person then performs a 'resize' and wants to check that the instance can start normally before losing the chance to revert. Currently OpenStack prevents this.

The tenant's operator has two options: either trust the ACPI shutdown procedure (which is not always as flawless as we would like, including an angry systemd either waiting forever for some service to stop or killing it on timeout), or migrate the instance blindly without a chance to start it.

A start can go wrong for many reasons: minor differences in CPU model, external network issues on provider networks (including obscure things like MTU differences on uplink devices). So the revert feature is very welcome.

Basically, there are two options now:

1. Blindly trust the shutdown code and keep the chance to revert (to the blindly trusted shut-down state).
2. Or lose 'revert' if the shutdown was controlled.

melanie witt (melwitt) wrote :

Thanks for giving the additional context to the problem -- that gives me more understanding. And sorry it's been years since you opened this and I just now commented on it 😬 I saw it in my bug subscriptions recently because of Harshavardhan's comment the other day.

I see now that you have no workarounds in this scenario where you want to do the resize while shutdown for data safety and then try a 'start' before committing to the new flavor.

Harshavardhan, I see that you have assigned this bug to yourself. Would you mind it if I proposed a proof-of-concept patch based on my ideas in comment 9?

Harshavardhan Metla (harsha24) wrote :

Yeah, sure Melanie. I wouldn't mind.

Changed in nova:
assignee: Harshavardhan Metla (harsha24) → nobody