stable/ocata scenario002 broken due to volume went to status error

Bug #1695904 reported by Alex Schultz
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Carlos Camacho

Bug Description

It's been happening since at least May 16th,

2017-06-02 23:37:46.585488 | tenant-stack.volume1:
2017-06-02 23:37:46.585596 | resource_type: OS::Cinder::Volume
2017-06-02 23:37:46.585662 | physical_resource_id: 89d6a97e-5541-435f-9e03-3b58c57c4955
2017-06-02 23:37:46.585691 | status: CREATE_FAILED
2017-06-02 23:37:46.585715 | status_reason: |
2017-06-02 23:37:46.585767 | ResourceInError: resources.volume1: Went to status error due to "Unknown

See change https://review.openstack.org/#/c/461076/

http://logs.openstack.org/76/461076/1/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/31f19be/console.html.gz#_2017-05-16_18_59_15_000571
http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/bdd7f5b/console.html.gz#_2017-06-02_23_37_46_585488

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Download full text (8.8 KiB)

Looks like create failed. We have a few exceptions:

http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/bdd7f5b/logs/subnode-2/var/log/cinder/volume.txt.gz#_2017-06-02_23_37_29_880

2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume [req-0f53120c-8a61-4066-bbef-732ef3cca75d 6757101f42a948a9818786ceb8528430 ad580cee7399423db9d0eaa3acf0a63f - default default] Failed to copy image e5a605c5-0a8b-47e7-8b55-93793ed5b84a to volume: 89d6a97e-5541-435f-9e03-3b58c57c4955
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume Traceback (most recent call last):
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 535, in _copy_image_to_volume
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume context, volume, image_service, image_id)
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/cinder/volume/driver.py", line 809, in copy_image_to_encrypted_volume
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume context, volume, image_service, image_id, True)
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/cinder/volume/driver.py", line 827, in _copy_image_data_to_volume
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume encryption)
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/cinder/utils.py", line 514, in brick_attach_volume_encryptor
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume encryptor.attach_volume(context, **encryption)
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/os_brick/encryptors/luks.py", line 150, in attach_volume
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume key = self._get_key(context).get_encoded()
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/os_brick/encryptors/base.py", line 50, in _get_key
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume return self._key_manager.get(context, self.encryption_key_id)
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume File "/usr/lib/python2.7/site-packages/cinder/keymgr/conf_key_mgr.py", line 128, in get
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume raise KeyError(str(managed_object_id) + " != " + str(self.key_id))
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume KeyError: 'e3c00430-5907-4795-8719-7f2bca191b86 != 00000000-0000-0000-0000-000000000000'
2017-06-02 23:37:29.880 210553 ERROR cinder.volume.flows.manager.create_volume

http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/bdd7f5b/logs/s...

Read more...

Revision history for this message
Alan Bishop (alan-bishop) wrote :

This appears to be another situation (also see [1]) where, during the upgrade, pacemaker is starting cinder-volume too soon. In [1] the problem was due to the DB not being migrated before starting cinder-volume.

Here, the problem is scenario002 is now testing encrypted volumes, and the upgrade process adds the barbican service. puppet-cinder is responsible for creating the necessary entries in cinder.conf, and it does, but this is happening _after_ pacemaker started cinder-volume. Once again, cinder-volume is starting up too soon, before puppet-cinder has had a chance to prepare the new configuration.

[1] https://bugs.launchpad.net/tripleo/+bug/1691851

Revision history for this message
Alan Pevec (apevec) wrote :

Is this really stable/ocata only?

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Yea stable/ocata scenario002 upgrade (at least as i've seen).

Revision history for this message
Michele Baldessari (michele) wrote :

Is this still an issue? According to http://tripleo.org/cistatus.html this job has not failed in a while (at least since Alan's change in LP bug https://bugs.launchpad.net/tripleo/+bug/1691851)

Revision history for this message
Alex Schultz (alex-schultz) wrote :

We'll see if the recheck on https://review.openstack.org/#/c/461076/ goes through, if so i'll close out the bug.

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

I've digged a little into http://logs.openstack.org/76/461076/1/ logs, especially:

 wget http://logs.openstack.org/76/461076/1/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/31f19be/logs/subnode-2/var/log/journal-text.txt.gz
 wget http://logs.openstack.org/76/461076/1/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/31f19be/logs/subnode-2/var/log/cluster/corosync.txt.gz

The pacemaker trigger restart mechanism seems to have been triggered fine:

zegrep 'openstack-cinder-volume resource restart flag' journal-text.txt.gz:

 - May 16 18:25:46.693001;
 - May 16 18:25:46.734952
 - May 16 18:28:30.344396

But I don't see the associated resource run in the log:

 zegrep 'ControllerPostPuppetRestart' journal-text.txt.gz

 - May 16 17:55:23
 - May 16 17:55:23
 - May 16 17:55:26

As a matter of fact the resource seems to be deleted during upgrade:

http://logs.openstack.org/76/461076/1/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/31f19be/console.html.gz#_2017-05-16_18_03_45_496650

This might explain the problem.

It should be noted that all pcs resource end up in unmanaged state: http://ix.io/wKb

So why is the ControllerPostPuppetRestart is deleted and why the resource end up in unmanaged state ?

Changed in tripleo:
milestone: pike-2 → pike-3
Revision history for this message
Alan Bishop (alan-bishop) wrote :

Fresh results available at http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/ confirm Sofer's observation.

It looks like puppet-cinder is requesting a restart after the config changes (as a result of the upgrade), but the pacemaker stuff responsble for monitoring the resource_restart_flag is no longer present. And that's because the ControllerPostPuppetRestart heat resource was deleted during the update/upgrade.

Jun 08 18:57:09 - Upgrade initiated [0]
Jun 08 19:06:19 - ControllerPostPuppetRestart resource was deleted [1]
Jun 08 19:19:35 - cinder-volume restarted [2]
Jun 08 19:34:14 - "openstack-cinder-volume resource restart flag" issued [3]
Jun 08 19:37:59 - "openstack-cinder-volume resource restart flag" issued [4]
Jun 08 19:43:20 - Upgrade complete [5]

[0] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/console.html#_2017-06-08_18_57_09_104134
[1] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/console.html#_2017-06-08_19_06_19_963287
[2] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/logs/subnode-2/var/log/journal-text.txt.gz#_Jun_08_19_19_35_211321
[3] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/logs/subnode-2/var/log/journal-text.txt.gz#_Jun_08_19_34_14_205820
[4] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/logs/subnode-2/var/log/journal-text.txt.gz#_Jun_08_19_37_59_218360
[5] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/console.html#_2017-06-08_19_43_20_692680

Revision history for this message
Alan Bishop (alan-bishop) wrote :

I gather the "PostPuppet" resources were recently moved to the ControllerPostConfig deployment [0]. The logs show the ControllerPostConfig resource is created [1], but there's no evidence of the ControllerPostPuppetRestart resource being created.

I did some tests using a local deployment, and I checked the deployment plan uploaded to swift on the undercloud. The Jinja2 expanded version of post_puppet_pacemaker.j2.yaml [2] was essentially empty, and did not define any "PostPuppet" resources. I did some more tests, and for reasons I don't understand the check for the 'controller' tag [2] is failing.

I hacked out the check for the 'controller' tag in my deployment, and now the ControllerPostPuppetRestart is being deployed during ControllerPostConfig. I don't understand why the tag check seems broken.

[0] https://review.openstack.org/#/c/459229/
[1] http://logs.openstack.org/76/461076/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-upgrades/cc1f341/console.html#_2017-06-08_19_42_32_498323
[2] https://github.com/openstack/tripleo-heat-templates/blob/stable/ocata/extraconfig/tasks/post_puppet_pacemaker.j2.yaml#L14

Changed in tripleo:
assignee: nobody → Carlos Camacho (ccamacho)
status: Triaged → In Progress
Revision history for this message
Carlos Camacho (ccamacho) wrote :

Fix proposed here: https://review.openstack.org/#/c/472413/

Job passing correctly.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/472972

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/ocata)

Change abandoned by Carlos Camacho (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/472972

Revision history for this message
Carlos Camacho (ccamacho) wrote :

Hey Alan, thanks a lot for you feedback man, it actually ringed some bells and hopefully it's all fixed..

"I don't understand why the tag check seems broken." That tag is broken as is not an stable/ocata feature, it's just not supported in O.

Please check the fix backport Review: https://review.openstack.org/472972

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/472413
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=834fe9cde3a2931b673529793c2571651bc84672
Submitter: Jenkins
Branch: master

commit 834fe9cde3a2931b673529793c2571651bc84672
Author: Carlos Camacho <email address hidden>
Date: Thu Jun 8 23:18:44 2017 +0200

    Moving *postconfig where it was *postpuppet

    We need to ensure that the pacemaker cluster restarts
    in the end of the deployment.

    Due to the resources renaming we added the
    postconfig resource not in the end of the
    deployment as it was *postpuppet.

    Closes-bug: 1695904

    Change-Id: Ic6978fcff591635223b354831cd6cbe0802316cf

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ocata)

Reviewed: https://review.openstack.org/472972
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=628d7a7901bae6e1549d16e9203392da4fb117b2
Submitter: Jenkins
Branch: stable/ocata

commit 628d7a7901bae6e1549d16e9203392da4fb117b2
Author: Carlos Camacho <email address hidden>
Date: Thu Jun 8 23:18:44 2017 +0200

    Moving *postconfig where it was *postpuppet

    We need to ensure that the pacemaker cluster restarts
    in the end of the deployment.

    Due to the resources renaming we added the
    postconfig resource not in the end of the
    deployment as it was *postpuppet.

    Closes-bug: 1695904

    Change-Id: Ic6978fcff591635223b354831cd6cbe0802316cf

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.2.0

This issue was fixed in the openstack/tripleo-heat-templates 6.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.0.0b3

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0b3 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.