Redeployment failed on tasks[system_provision]

Bug #1672964 reported by Nastya Urlapova
This bug affects 2 people
Affects              Status         Importance  Assigned to        Milestone
Fuel for OpenStack   Fix Committed  High        Vladimir Sharshov
  Mitaka             In Progress    High        Vladimir Sharshov
  Newton             Fix Released   Critical    Vladimir Sharshov
  Ocata              Fix Committed  High        Vladimir Sharshov

Bug Description

When redeploying after a stop deployment action, the user gets this error:
All nodes are finished. Failed tasks: Task[system_provision/1], Task[system_provision/3] Stopping the deployment process!

Full scenario:
1. Create cluster in HA mode with 1 controller
2. Add 1 node with controller role
3. Add 1 node with compute role
4. Add 1 node with cinder role
5. Verify network
6. Provision nodes
7. Make a test file on every node
8. Deploy nodes
9. Stop deployment
10. Verify nodes are not reset to bootstrap image
11. Re-deploy cluster <<< failed here

The nodes are reachable via mco, and the mcollective logs do not contain any errors.

Version: 10.0 ISO #1472
fuel-nailgun-10.0.0-1.mos9086.noarch
fuel-ostf-10.0.0-1.mos970.noarch
python-fuelclient-10.0.0-1.mos411.noarch
fuel-notify-10.0.0-1.mos8989.noarch
fuel-10.0.0-1.mos6384.noarch
fuel-utils-10.0.0-1.mos8989.noarch
fuel-agent-10.0.0-1.mos345.noarch
fuel-ui-10.0.0-1.mos3008.noarch
fuel-setup-10.0.0-1.mos6384.noarch
fuel-release-10.0.0-1.mos6384.noarch
fuel-bootstrap-cli-10.0.0-1.mos345.noarch
fuel-misc-10.0.0-1.mos8989.noarch
fuelmenu-10.0.0-1.mos300.noarch
fuel-openstack-metadata-10.0.0-1.mos9086.noarch
fuel-migrate-10.0.0-1.mos8989.noarch
fuel-library10.0-10.0.0-1.mos8989.noarch

Nastya Urlapova (aurlapova) wrote :

Another failure.
Scenario:
1. Check mcollective version on bootstrap
2. Create cluster
3. Add one node to cluster
4. Provision nodes
5. Check mcollective version on node

Nastya Urlapova (aurlapova) wrote :

Another failure.
Scenario:
1. Check mcollective version on bootstrap
2. Create cluster
3. Add one node to cluster
4. Provision nodes <<< failed here
Version: 10.0 ISO #1472

tags: added: area-python

Vladimir Kuklin (vkuklin) wrote :

So the root cause is that when we run a transaction we do not filter nodes per-graph, but rather per-transaction.

See

2017-03-14 22:11:48.516 INFO [7fa84a51f880] (manager) Start new transaction: cluster=1 graphs=[{u'type': u'net-verification'}, {u'type': u'deletion'}, {u'type': u'provision'}, {u'type': u'default'}] dry_run=0 noop_run=False force=0
2017-03-14 22:11:48.535 DEBUG [7fa84a51f880] (task) Mark task as deleted: 52547e48-8e1a-4e4c-84ec-bfddc3d3359b
2017-03-14 22:11:48.536 DEBUG [7fa84a51f880] (task) Updating task: 52547e48-8e1a-4e4c-84ec-bfddc3d3359b
2017-03-14 22:11:48.578 DEBUG [7fa84a51f880] (manager) Transaction 10 starts assembling.
2017-03-14 22:11:48.579 DEBUG [7fa84a51f880] (manager) Transaction 10 finish assembling.
2017-03-14 22:11:48.579 DEBUG [7fa84a51f880] (mule) MULE STARTING for TransactionsManager._execute_async
2017-03-14 22:11:48.585 DEBUG [7fa84a51f880] (manager) Transaction 10 starts assembling.
[pid: 18806|app: 0|req: 91/456] 10.109.0.1 () {40 vars in 601 bytes} [Tue Mar 14 22:11:48 2017] PUT /api/clusters/1/changes/ => generated 276 bytes in 217 msecs (HTTP/1.1 202) 4 headers in 191 bytes (2 switches on core 0)
2017-03-14 22:11:48.612 WARNING [7fa84a51f880] (deployment_graph) Graph association with type 'net-verification' was requested for the unappropriated model instance <nailgun.db.sqlalchemy.models.cluster.Cluster object at 0x7fa82c035c50> with ID=1
2017-03-14 22:11:48.614 DEBUG [7fa84a51f880] (manager) applying nodes filter: $.pending_addition

We filter nodes only once here, for net-verification, but not for the other graphs. Thus we run the provision graph on nodes that are already provisioned.

Vladimir Kuklin (vkuklin) wrote :

Okay, the root cause is different:

The stop_deployment_resp method in nailgun/rpc/receiver.py calls Node.reset_to_discover. When Astute sends the report for a stopped deployment, it includes these nodes in the RPC message body, and this method marks them as 'discover' and pending_addition. That is why the provision graph picks them up for reprovisioning.
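
The mechanism described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not Nailgun's code: the `Node` dataclass, the message layout, and `select_for_provision` are hypothetical stand-ins; only the names `stop_deployment_resp`, `reset_to_discover`, and the `$.pending_addition` filter come from this report.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Simplified stand-in for a Nailgun node record (hypothetical fields)."""
    uid: str
    status: str = "provisioned"
    pending_addition: bool = False


def reset_to_discover(node):
    """Mimic the effect described above: back to 'discover', pending addition."""
    node.status = "discover"
    node.pending_addition = True


def stop_deployment_resp(nodes_by_uid, message):
    """Reset every node listed in the stop-deployment response body."""
    for entry in message.get("nodes", []):
        reset_to_discover(nodes_by_uid[entry["uid"]])


def select_for_provision(nodes_by_uid):
    """Rough equivalent of the '$.pending_addition' nodes filter."""
    return sorted(uid for uid, node in nodes_by_uid.items()
                  if node.pending_addition)


# Three nodes that were provisioned before the deployment was stopped.
nodes = {uid: Node(uid) for uid in ("1", "2", "3")}

# Astute reports the stop and lists nodes 1 and 3 in the message body.
stop_deployment_resp(nodes, {"status": "ready",
                             "nodes": [{"uid": "1"}, {"uid": "3"}]})

# On the next deployment the provision graph selects them again.
reprovisioned = select_for_provision(nodes)
print(reprovisioned)
```

In this model the nodes named in the response are selected for provisioning again even though they were never reset to the bootstrap image, which is consistent with the failed system_provision tasks in the description.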

Vladimir Khlyunev (vkhlyunev) wrote :

Raising to Critical: it affects a large number of swarm threads.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/447083

Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/447083
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=496212798efa9f167de20a3ea0c2146658b2b466
Submitter: Jenkins
Branch: master

commit 496212798efa9f167de20a3ea0c2146658b2b466
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Mar 17 20:48:52 2017 +0300

    Fix wrong ready status instead of stopped for stop deployment

    Report ready status for node means successful node status
    which can be get if all tasks was passed with ready and skipped
    statuses.

    Same effect can be get if Astute mark node as skipped. In this
    case we also get equal status 'successful'.

    So we need ask node about skipped statuses before ask it about
    successful status to prevent losing context about stop
    deployment operation.

    Change-Id: I3c042425cab800de0bfc4e03f29414b145f44983
    Closes-Bug: #1672964
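
The ordering problem this commit message describes can be illustrated with a small Python sketch. It is not the Astute code (Astute is written in Ruby); the function names, the status strings other than 'ready' and 'skipped', and the rule used for "skipped" here are all assumptions made for illustration.

```python
# Task statuses that count as success for a node, per the commit message.
SUCCESSFUL_TASK_STATUSES = {"ready", "skipped"}


def is_successful(task_statuses):
    """True when every task finished as 'ready' or 'skipped'."""
    return all(s in SUCCESSFUL_TASK_STATUSES for s in task_statuses)


def is_skipped(task_statuses, stop_requested):
    """Hypothetical rule: tasks were skipped because a stop was requested."""
    return stop_requested and "skipped" in task_statuses


def status_before_fix(task_statuses, stop_requested):
    # Success is checked first, so a stopped node is reported as 'ready'.
    if is_successful(task_statuses):
        return "ready"
    if is_skipped(task_statuses, stop_requested):
        return "stopped"
    return "error"


def status_after_fix(task_statuses, stop_requested):
    # Skipped is checked first, so the stop context is preserved.
    if is_skipped(task_statuses, stop_requested):
        return "stopped"
    if is_successful(task_statuses):
        return "ready"
    return "error"


tasks = ["ready", "skipped", "skipped"]
before = status_before_fix(tasks, stop_requested=True)
after = status_after_fix(tasks, stop_requested=True)
print(before, after)
```

Because 'skipped' is a subset of the successful statuses, whichever check runs first wins; swapping the order is the whole change in this sketch.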

Changed in fuel:
status: In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/447481

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/447482

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/447483

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/ocata)

Reviewed: https://review.openstack.org/447481
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=56ca061009d419a0866d58d5d1aecef45349bdc5
Submitter: Jenkins
Branch: stable/ocata

commit 56ca061009d419a0866d58d5d1aecef45349bdc5
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Mar 17 20:48:52 2017 +0300

    Fix wrong ready status instead of stopped for stop deployment

    Report ready status for node means successful node status
    which can be get if all tasks was passed with ready and skipped
    statuses.

    Same effect can be get if Astute mark node as skipped. In this
    case we also get equal status 'successful'.

    So we need ask node about skipped statuses before ask it about
    successful status to prevent losing context about stop
    deployment operation.

    Change-Id: I3c042425cab800de0bfc4e03f29414b145f44983
    Closes-Bug: #1672964
    (cherry picked from commit 496212798efa9f167de20a3ea0c2146658b2b466)

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/newton)

Reviewed: https://review.openstack.org/447482
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=8213652e49f0f2c28721dfb9da81ecb69d07f9c9
Submitter: Jenkins
Branch: stable/newton

commit 8213652e49f0f2c28721dfb9da81ecb69d07f9c9
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Mar 17 20:48:52 2017 +0300

    Fix wrong ready status instead of stopped for stop deployment

    Report ready status for node means successful node status
    which can be get if all tasks was passed with ready and skipped
    statuses.

    Same effect can be get if Astute mark node as skipped. In this
    case we also get equal status 'successful'.

    So we need ask node about skipped statuses before ask it about
    successful status to prevent losing context about stop
    deployment operation.

    Change-Id: I3c042425cab800de0bfc4e03f29414b145f44983
    Closes-Bug: #1672964
    (cherry picked from commit 496212798efa9f167de20a3ea0c2146658b2b466)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/449247

Dmitry Pyzhov (dpyzhov) wrote :

We reverted this feature. So the bug should disappear.

Vladimir Khlyunev (vkhlyunev) wrote :

https://product-ci.infra.mirantis.net/job/10.0.system_test.ubuntu.error_node_reinstallation/224/

2017-03-28 04:57:52 +0000 Puppet (err): /etc/puppet/shell_manifests/system_provision_command.sh: line 5: /usr/bin/provision: No such file or directory

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/450788

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/450875

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/450788
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=039ce9e0b82a9e24b0f598fbe528b80feb3fdf3c
Submitter: Jenkins
Branch: master

commit 039ce9e0b82a9e24b0f598fbe528b80feb3fdf3c
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue Mar 28 16:39:03 2017 +0300

    Do not send data about nodes in case of task deployment

    Nailgun use data about nodes in stop deployment respond
    to reset it to discovory state which is unexpected behavior
    for already provisioned nodes in case of task deployment

    Change-Id: I39de8a8afd627b0bf209d9a7f6ad6e19abd99016
    Partial-Bug: #1672964
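
A minimal Python sketch of the idea in this commit follows. It is not the Astute implementation (which is Ruby): `build_stop_response`, its parameters, and every message key are hypothetical, chosen only to show the node list being left out for task-based deployments.

```python
def build_stop_response(task_uuid, node_uids, task_deployment):
    """Build a stop-deployment report (hypothetical message layout)."""
    message = {"task_uuid": task_uuid, "status": "ready", "progress": 100}
    if not task_deployment:
        # Only outside task deployment: list nodes so the receiver resets them.
        message["nodes"] = [{"uid": uid} for uid in node_uids]
    return message


legacy = build_stop_response("abc", ["1", "3"], task_deployment=False)
task_based = build_stop_response("abc", ["1", "3"], task_deployment=True)
print(legacy)
print(task_based)
```

With no node list in the message, a receiver that resets only the nodes it is told about has nothing to reset, so already provisioned nodes keep their state.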

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/450882

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/newton)

Reviewed: https://review.openstack.org/450875
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=ee2d95de33842187bf99929af321207011c3284f
Submitter: Jenkins
Branch: stable/newton

commit ee2d95de33842187bf99929af321207011c3284f
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue Mar 28 16:39:03 2017 +0300

    Do not send data about nodes in case of task deployment

    Nailgun use data about nodes in stop deployment respond
    to reset it to discovory state which is unexpected behavior
    for already provisioned nodes in case of task deployment

    Change-Id: I39de8a8afd627b0bf209d9a7f6ad6e19abd99016
    Partial-Bug: #1672964
    (cherry picked from commit 039ce9e0b82a9e24b0f598fbe528b80feb3fdf3c)

tags: added: in-stable-newton

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/ocata)

Reviewed: https://review.openstack.org/450882
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=445a16aac208c7b65b82ba66a5a083a882568c2e
Submitter: Jenkins
Branch: stable/ocata

commit 445a16aac208c7b65b82ba66a5a083a882568c2e
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue Mar 28 16:39:03 2017 +0300

    Do not send data about nodes in case of task deployment

    Nailgun use data about nodes in stop deployment respond
    to reset it to discovory state which is unexpected behavior
    for already provisioned nodes in case of task deployment

    Change-Id: I39de8a8afd627b0bf209d9a7f6ad6e19abd99016
    Partial-Bug: #1672964
    (cherry picked from commit 039ce9e0b82a9e24b0f598fbe528b80feb3fdf3c)

tags: added: in-stable-ocata

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/449247
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=3fb3b83f4ad1254a9a54d6d81f5c58f33c92c601
Submitter: Jenkins
Branch: master

commit 3fb3b83f4ad1254a9a54d6d81f5c58f33c92c601
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Thu Mar 23 20:33:52 2017 +0300

    Excluding number of nodes in stop operation notification

    Nailgun use block of nodes in stop operation to reset
    such nodes in discovery state. Also Nailgun used such
    data to calculate count of nodes for notifications.
    But Astute will not send info about nodes in case
    of task deployment.

    This patch exclude count of nodes in stop notification
    to prevent misslining message about successful operation
    for 0 nodes

    Change-Id: I32da2ccce11b22378f58759703fc4a56e31fd993
    Closes-Bug: #1672964
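
The notification problem described in this commit can be sketched in Python as follows. The function names and the message wording are invented for illustration and are not taken from fuel-web; the point is only that a count derived from a missing node list comes out as 0.

```python
def stop_notification_with_count(cluster_name, message):
    """Old style: derive a node count from the response (hypothetical text)."""
    count = len(message.get("nodes", []))
    return ("Deployment of environment '{0}' was stopped: "
            "{1} node(s) reset.".format(cluster_name, count))


def stop_notification(cluster_name):
    """New style: no node count, so nothing misleading to report."""
    return ("Deployment of environment '{0}' was successfully "
            "stopped.".format(cluster_name))


# A response that carries no node list at all.
response = {"status": "ready", "progress": 100}
old_text = stop_notification_with_count("test", response)
new_text = stop_notification("test")
print(old_text)
print(new_text)
```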

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/460132

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/460134

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/newton)

Reviewed: https://review.openstack.org/460134
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=d60e2ac26753ef3130ee8b68199016d7c7b8dec6
Submitter: Jenkins
Branch: stable/newton

commit d60e2ac26753ef3130ee8b68199016d7c7b8dec6
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Thu Mar 23 20:33:52 2017 +0300

    Excluding number of nodes in stop operation notification

    Nailgun use block of nodes in stop operation to reset
    such nodes in discovery state. Also Nailgun used such
    data to calculate count of nodes for notifications.
    But Astute will not send info about nodes in case
    of task deployment.

    This patch exclude count of nodes in stop notification
    to prevent misslining message about successful operation
    for 0 nodes

    Change-Id: I32da2ccce11b22378f58759703fc4a56e31fd993
    Closes-Bug: #1672964
    (cherry picked from commit 3fb3b83f4ad1254a9a54d6d81f5c58f33c92c601)

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (stable/mitaka)

Change abandoned by Joshua Hesketh (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/447483
Reason: This branch (stable/mitaka) is at End Of Life

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/ocata)

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/ocata
Review: https://review.opendev.org/460132
Reason: This repo is retired now, no further work will get merged.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-astute ocata-eol

This issue was fixed in the openstack/fuel-astute ocata-eol release.
