wait() waits many hours, or even indefinitely

Bug #1626515 reported by Ryan Beisner
This bug affects 3 people
Affects: Mojo: Continuous Delivery for Juju
Status: Fix Released
Importance: High
Assigned to: Paul Collins
Milestone: (none)

Bug Description

juju-wait will wait indefinitely for a settled status, and may keep waiting even though a unit is in an error state.

This is preventing us from using the new-ish mojo built-in wait, as we can't have our metal tied up for several hours in CI on failing jobs.

The mojo repo contains an old copy of juju-wait in tree.

Quite a while back we committed an optional max_wait to juju-wait to address that, and have been using it as a post-deploy phase with juju-wait trunk to successfully wait for things to settle.

Mojo should consider refreshing the juju-wait code, and plumbing the max_wait option through to be usable in a manifest.

This issue is probably rare in production deploys. We run into it because we use Mojo for charm testing, and sometimes those charms have issues. We need the tooling to be as tunable and as resilient to that as juju-wait from trunk already is.
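For illustration only, the behaviour requested above (stop waiting once a maximum wait has elapsed, and fail early when a unit is in an error state) could be sketched roughly as follows. This is not the juju-wait or Mojo code; get_statuses, WaitTimeout, UnitError and the status strings are hypothetical names chosen for the sketch.

```python
import time


class WaitTimeout(Exception):
    """Raised when the environment does not settle within max_wait seconds."""


class UnitError(Exception):
    """Raised when at least one unit reports an error status."""


def wait_for_steady_state(get_statuses, max_wait=None, poll_interval=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until every unit is idle, a unit errors, or max_wait elapses.

    get_statuses is a hypothetical callable returning {unit_name: status},
    where status is one of "idle", "executing" or "error".
    max_wait=None keeps the old behaviour of waiting without a limit.
    """
    start = clock()
    while True:
        statuses = get_statuses()
        # Fail fast: an errored unit will not settle on its own.
        errored = sorted(u for u, s in statuses.items() if s == "error")
        if errored:
            raise UnitError("units in error state: " + ", ".join(errored))
        if statuses and all(s == "idle" for s in statuses.values()):
            return statuses
        # Give up once the optional deadline has passed.
        if max_wait is not None and clock() - start >= max_wait:
            raise WaitTimeout("not settled after %s seconds" % max_wait)
        sleep(poll_interval)
```

In a CI job, a caller would pass something like max_wait=3600 and treat either exception as a failed deploy, so the hardware is released instead of being held for hours.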

Example:

#### mojo's juju-wait starts here

00:58:23.997 2016-09-22 08:12:33 [INFO] Waiting for environment to reach steady state
02:58:48.609 2016-09-22 10:12:58 [INFO] All units idle since 2016-09-22 10:12:25.640469Z (ceilometer/0, ceph-osd/0, ceph/0, ceph/1, ceph/2, cinder/0, glance/0, heat/0, keystone/0, mongodb/0, mysql/0, neutron-api/0, neutron-gateway/0, nova-cloud-controller/0, nova-compute/0, nova-compute/1, nova-compute/2, openstack-dashboard/0, rabbitmq-server/0, swift-proxy/0, swift-storage-z1/0, swift-storage-z2/0, swift-storage-z3/0)
02:58:48.609 2016-09-22 10:12:58 [INFO] Environment has reached steady state

#### mojo's juju-wait claims ready state, even though multiple units are in an error state

02:58:48.610 2016-09-22 10:12:58 [INFO] Manifest comment:
02:58:48.610
02:58:48.610 #############################################################################
02:58:48.610 Check juju statuses are green and that hooks have finished
02:58:48.610 #############################################################################
02:58:48.610
02:58:48.610
02:58:48.610 2016-09-22 10:12:58 [INFO] Pulling secrets from /srv/mojo/LOCAL/mojo-openstack-specs/specs/full_stack/stable_deploy_baremetal/mitaka to /srv/mojo/mojo-openstack-specs/xenial/osci-mojo/local
02:58:48.610 2016-09-22 10:12:58 [WARNING] Automatic secrets phase ran but secrets directory /srv/mojo/LOCAL/mojo-openstack-specs/specs/full_stack/stable_deploy_baremetal/mitaka does not exist!
02:58:48.611 2016-09-22 10:12:58 [INFO] Running script check_juju.py
03:43:51.354 2016-09-22 10:58:00 [WARNING] No debug log matching debug-logs found. Using default.

#### juju-wait trunk starts here

03:43:52.266 2016-09-22 10:58:01 [ERROR] INFO:root:Calling juju-wait
03:43:52.266 DEBUG:root:swift-storage-z1/0 workload status is error since 2016-09-22 10:12:34Z
03:43:52.266 DEBUG:root:swift-storage-z2/0 workload status is error since 2016-09-22 10:12:32Z
03:43:52.266 DEBUG:root:swift-storage-z3/0 workload status is error since 2016-09-22 10:12:30Z

Tags: uosci

Ryan Beisner (1chb1n)
description: updated
Chris MacNaughton (chris.macnaughton) wrote :

http://pastebin.ubuntu.com/23284241/ is a juju status --format=yaml from an environment where mojo has been waiting for over 12 hours.

Chris MacNaughton (chris.macnaughton) wrote :

http://pastebin.ubuntu.com/23284283/ is the full mojo run output and juju bootstrap

Chris MacNaughton (chris.macnaughton) wrote :

http://pastebin.ubuntu.com/23284362/ is the full Juju log for this deployment

Tom Haddon (mthaddon)
Changed in mojo:
status: New → Confirmed
importance: Undecided → High
Paul Collins (pjdc) wrote :

Mojo's copy of juju-wait was refreshed on 2016-12-02 in r365. I'll take a look at wiring up max_wait.

Changed in mojo:
assignee: nobody → Paul Collins (pjdc)
Paul Collins (pjdc) wrote :

Linked branch wires up max_wait (as max-wait, for consistency) and it works for Juju 1. The Juju 2 code doesn't use juju-wait at all and will have to implement its own logic. For now, my branch raises NotImplementedError when max-wait is supplied, although if folks are running the same specs with both versions of Juju, this may make max-wait difficult to use, so maybe we should do something else instead.

Changed in mojo:
status: Confirmed → In Progress
Paul Collins (pjdc) wrote :

To deal with the Juju 2 situation, I've proposed the linked second branch. Juju2Status.check_and_wait doesn't use wait(), and there's already a "timeout" parameter that controls how long it persists with its main loop, so now when max_wait is specified for Juju 2, instead of the original NotImplementedError we just log a warning with a hint about "timeout".
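As a rough sketch of the approach described in this comment (warn and carry on rather than raise), something along these lines would let the same spec run against both Juju versions. The function name, logger name and default timeout value here are assumptions for illustration, not Mojo's actual code.

```python
import logging

log = logging.getLogger("mojo-sketch")  # hypothetical logger name


def check_and_wait_juju2(timeout=1800, max_wait=None):
    """Hypothetical stand-in for the Juju 2 wait path.

    Returns the timeout that the main loop would honour. max_wait is
    accepted so shared specs keep working, but it only triggers a warning.
    """
    if max_wait is not None:
        # Do not fail: point the user at the parameter that does apply.
        log.warning(
            "max-wait=%s is ignored on Juju 2; use 'timeout' instead",
            max_wait,
        )
    return timeout
```

The trade-off is that a spec author who relies on max-wait gets no hard failure on Juju 2, only a log line, so the warning text needs to name the replacement parameter explicitly.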

James Hebden (ec0) wrote :

Just confirming that the max-wait issue on juju2 is impacting me, and the proposed patch allows mojo to deploy units to a juju2 environment for me.

Juju 2.0.2-yakkety-amd64
Mojo version - locally built 0.4.1 from max-wait-for-juju-2 branch
LXD 2.4.1
Ubuntu yakkety

Paul Collins (pjdc)
Changed in mojo:
status: In Progress → Fix Committed
Paul Collins (pjdc)
Changed in mojo:
status: Fix Committed → Fix Released