Keystone API connection failed

Bug #1682058 reported by Alexander Chumakov
This bug affects 7 people
Affects: Magnum
Status: Fix Released
Importance: Undecided
Assigned to: Mark Goddard
Milestone: (none)

Bug Description

With the latest commit: https://git.openstack.org/cgit/openstack/magnum/commit/?id=ff58a66b08252c8a790aadab860bf67c2215c85e
Magnum was installed on Ubuntu 16.04 following the official install guide https://docs.openstack.org/project-install-guide/container-infrastructure-management/ocata/
and then checked again from source using https://docs.openstack.org/developer/magnum/install-guide-from-source.html
The error occurs only once, after the first cluster is created, until the magnum-conductor service is restarted:

2017-04-12 11:34:42.547 31353 ERROR magnum.common.keystone [req-cfbab520-d63f-410b-8ef5-a6adadef5880 - - - - -] Keystone API connection failed: no password, trust_id or token found.
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall [req-cfbab520-d63f-410b-8ef5-a6adadef5880 - - - - -] Fixed interval looping call 'magnum.service.periodic.ClusterUpdateJob.update_status' failed
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall Traceback (most recent call last):
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 137, in _run_loop
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/service/periodic.py", line 70, in update_status
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall cdriver.update_cluster_status(self.ctx, self.cluster)
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/drivers/heat/driver.py", line 78, in update_cluster_status
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall poller.poll_and_check()
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/drivers/heat/driver.py", line 167, in poll_and_check
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall stack = self.openstack_client.heat().stacks.get(
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/common/exception.py", line 57, in wrapped
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall return func(*args, **kw)
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/common/clients.py", line 93, in heat
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall region_name=region_name)
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/common/clients.py", line 44, in url_for
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall return self.keystone().session.get_endpoint(**kwargs)
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/common/keystone.py", line 57, in session
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall auth = self._get_auth()
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall File "/var/lib/magnum/env/local/lib/python2.7/site-packages/magnum/common/keystone.py", line 97, in _get_auth
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall message='reason %s' % msg)
2017-04-12 11:34:42.547 31353 ERROR oslo.service.loopingcall AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.
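
A minimal sketch of the auth selection that produces this message, based only on the traceback above; the names and structure are simplified assumptions, not the actual magnum/common/keystone.py code:

# Stand-in for magnum.common.exception.AuthorizationFailure.
class AuthorizationFailure(Exception):
    pass


def get_auth(trust_id=None, password=None, auth_token=None):
    # Pick an auth method in priority order. The periodic task builds its
    # context from the cluster record, and a cluster that has only just been
    # created still has no trust, password or token set, so no branch matches.
    if trust_id:
        return ('trust', trust_id)
    if password:
        return ('password', password)
    if auth_token:
        return ('token', auth_token)
    msg = 'Keystone API connection failed: no password, trust_id or token found.'
    raise AuthorizationFailure('reason %s' % msg)


get_auth()  # raises AuthorizationFailure, matching the traceback above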

Revision history for this message
fxpester (a-yurtaykin) wrote :

Confirmed this with kolla-openstack 4.0.0 on CentOS.

And the funniest part: "The error occurs only once, after the first cluster is created, until the magnum-conductor service is restarted."

Revision history for this message
Mark Goddard (mgoddard) wrote :

I see this on the master branch of magnum currently. It's a race condition, and does not happen on every cluster creation, but does happen on most. I have a server with 48 cores (24 + hyperthreading), and so have 48 magnum conductor worker processes. When this issue occurs, I actually see one of these tracebacks for each worker, and all seem to be aligned temporally. This is another issue which I will raise separately.

I added some logging to the relevant section of code:

diff --git a/magnum/drivers/heat/driver.py b/magnum/drivers/heat/driver.py
index 5aa5a8e..96563b3 100644
--- a/magnum/drivers/heat/driver.py
+++ b/magnum/drivers/heat/driver.py
@@ -73,6 +73,9 @@ class HeatDriver(driver.Driver):

     def update_cluster_status(self, context, cluster):
         stack_ctx = mag_ctx.make_cluster_context(cluster)
+        LOG.info("MG: cluster %s", dict(cluster))
         poller = HeatPoller(clients.OpenStackClients(stack_ctx), context,
                             cluster, self)
         poller.poll_and_check()

When the error occurs, we see the following output:

2017-07-04 18:58:08.384 41 INFO magnum.drivers.heat.driver [req-c48a6b51-e7c9-4898-956f-512f01ab3d68 - - - - -] MG: cluster {'updated_at': None, 'trustee_user_id': None, 'keypair': u'controller', 'node_count': 1, 'id': 22, 'trust_id': None, 'magnum_cert_ref': None, 'user_id': u'ae1a02bad3a043549b77a64d946a3ec0', 'uuid': 'b78e4f5c-02cf-4208-bc09-ecea3e7c867b', 'cluster_template': ClusterTemplate(apiserver_port=None,cluster_distro='fedora',coe='kubernetes',created_at=2017-07-03T13:07:52Z,dns_nameserver='8.8.8.8',docker_storage_driver='devicemapper',docker_volume_size=3,external_network_id='ilab',fixed_network='p3-internal',fixed_subnet='p3-internal',flavor_id='compute-A-magnum',floating_ip_enabled=True,http_proxy=None,https_proxy=None,id=4,image_id='k8s-fedora-25',insecure_registry=None,keypair_id=None,labels={},master_flavor_id='compute-A-magnum',master_lb_enabled=False,name='k8s-fedora-25',network_driver='flannel',no_proxy=None,project_id='37a2b9a3299e44f8bc3f6dc60007ef76',public=False,registry_enabled=False,server_type='bm',tls_disabled=False,updated_at=None,user_id='ae1a02bad3a043549b77a64d946a3ec0',uuid='ea0ba754-a769-493a-ade7-d25762e4669e',volume_driver=None), 'ca_cert_ref': None, 'api_address': None, 'master_addresses': [], 'create_timeout': 60, 'project_id': u'37a2b9a3299e44f8bc3f6dc60007ef76', 'status': u'CREATE_IN_PROGRESS', 'docker_volume_size': 3, 'master_count': 1, 'node_addresses': [], 'status_reason': None, 'coe_version': None, 'cluster_template_id': u'ea0ba754-a769-493a-ade7-d25762e4669e', 'name': u'k8s-xxx', 'stack_id': None, 'created_at': datetime.datetime(2017, 7, 4, 17, 58, 3, tzinfo=<iso8601.Utc>), 'container_version': None, 'trustee_password': None, 'trustee_username': None, 'discovery_url': None}

When the error does not occur, we see the following:

2017-07-04 19:00:37.368 40 INFO magnum.drivers.heat.driver [req-4cd7235d-c7f4-4497-817c-c9c0a8d57eed - - - - -] MG: cluster {'updated_at': datetime.datetime(2017, 7, 4, 17, 58, 12, tzinfo=<iso8601.Utc>), 'trustee_user_id': u'5f1fbc8933e749df8500e521fb95e746', 'keypair': u'controller', 'node_count': 1, 'id':...


Revision history for this message
Mark Goddard (mgoddard) wrote :

Or perhaps even more simply: the stack_id is None before the failure, and not None after. Although the missing trust and trustee are what cause the keystone traceback, there is no point in polling a heat stack that does not exist, so we may as well check for stack_id == None.
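
A minimal sketch of such a guard, reusing the names from the driver.py excerpt above (LOG, mag_ctx, clients, HeatPoller); this illustrates the idea, not the exact patch that was eventually merged:

    def update_cluster_status(self, context, cluster):
        if cluster.stack_id is None:
            # The heat stack is created asynchronously after the cluster
            # record, so there is nothing to poll yet and no trust to use.
            LOG.debug("Cluster %s has no stack yet, skipping status update",
                      cluster.uuid)
            return
        stack_ctx = mag_ctx.make_cluster_context(cluster)
        poller = HeatPoller(clients.OpenStackClients(stack_ctx), context,
                            cluster, self)
        poller.poll_and_check()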

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to magnum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/480286

Changed in magnum:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Revision history for this message
Spyros Trigazis (strigazi) wrote :

The merged patch https://review.openstack.org/#/c/469852/ fixes it.

Mark, can you verify?

Changed in magnum:
assignee: Mark Goddard (mgoddard) → Spyros Trigazis (strigazi)
Changed in magnum:
assignee: Spyros Trigazis (strigazi) → Mark Goddard (mgoddard)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/480286
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=88a6e3bab5f4af92ea4700580af05f5c20ab64f6
Submitter: Jenkins
Branch: master

commit 88a6e3bab5f4af92ea4700580af05f5c20ab64f6
Author: Mark Goddard <email address hidden>
Date: Tue Jul 4 19:24:28 2017 +0100

    Don't poll heat if no stack exists

    Cluster objects are created asynchronously from their underlying
    heat stacks, meaning that the periodic update can sometimes end up
    trying to poll a cluster's heat stack before the stack has been created.

    This change checks whether the stack_id is None and skips polling heat
    if so. This has the side effect of resolving bug 1682058, since we don't
    try to use a trust and trustee that do not exist.

    Change-Id: I73f039659250f1d5b69b23141835c4602c8e019a
    Closes-Bug: #1682058

Changed in magnum:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to magnum (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/494688

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (stable/ocata)

Reviewed: https://review.openstack.org/494688
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=8cb0e1f57033f697ec43819952ef07b43582b3d1
Submitter: Jenkins
Branch: stable/ocata

commit 8cb0e1f57033f697ec43819952ef07b43582b3d1
Author: Mark Goddard <email address hidden>
Date: Tue Jul 4 19:24:28 2017 +0100

    Don't poll heat if no stack exists

    Cluster objects are created asynchronously from their underlying
    heat stacks, meaning that the periodic update can sometimes end up
    trying to poll a cluster's heat stack before the stack has been created.

    This change checks whether the stack_id is None and skips polling heat
    if so. This has the side effect of resolving bug 1682058, since we don't
    try to use a trust and trustee that do not exist.

    Change-Id: I73f039659250f1d5b69b23141835c4602c8e019a
    Closes-Bug: #1682058
    (cherry picked from commit 88a6e3bab5f4af92ea4700580af05f5c20ab64f6)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 5.0.0

This issue was fixed in the openstack/magnum 5.0.0 release.
