Canonical Juju

The Client.WatchAll API command never responds when the model has no machines

Bug #1560201 reported by Francis Ginther on 2016-03-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	Critical	Eric Snow	Canonical Juju 2.0-beta4

Bug Description

I'm using this API call:
https://godoc.org/github.com/juju/juju/apiserver/client#Client.WatchAll

with version 2.0-beta2-xenial-amd64,

and deploying to an openstack cloud (canonistack lcy02).

I have 2 models:

fginther@sandman:~$ juju list-models
NAME OWNER LAST CONNECTION
lcy02-ctl admin@local 1 hour ago
jd-test* admin@local 13 minutes ago

Whenever I target jd-test, the Client.WatchAll call never returns if the model has no machines. It will wait for a response forever. However, as soon as I 'juju add-machine' in another terminal the method returns a response and all appears well again. The problem only occurs when the target model has no machines.

Targeting the controller model, 'lcy02-ctl', works just fine. This model always has at least one machine, the bootstrap instance.

I can supply debug traces of the response and requests if it would help.

Tags:

David Britton (dpb) on 2016-03-21

tags:	added: landscape
tags:	added: kanban-cross-team

🤖 Landscape Builder (landscape-builder) on 2016-03-21

tags:

removed: kanban-cross-team

Cheryl Jennings (cherylj) on 2016-03-21

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 2.0-beta4

Cheryl Jennings (cherylj) on 2016-03-24

tags:

added: juju-release-support

Cheryl Jennings (cherylj) on 2016-03-28

Changed in juju-core:
importance:	High → Critical

Katherine Cox-Buday (cox-katherine-e) on 2016-03-29

Changed in juju-core:
assignee:	nobody → Eric Snow (ericsnowcurrently)

Eric Snow (ericsnowcurrently) on 2016-03-29

Changed in juju-core:
status:	Triaged → In Progress

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-29:

test-lp1560201.go Edit (985 bytes, text/plain)

Here are the steps I took to try (and fail) to reproduce:

--------------
git up master
godeps -u dependencies.tsv
go install ./...
juju bootstrap --upload-tools test-a lxd
juju create-model model-a
go run /tmp/test-lp1560201.go
--------------

(test-lp1560201.go is attached)

The critical part of the attached code is the following section:

--------------
func (c *testCommand) Run(ctx *cmd.Context) error {
        client, err := c.NewAPIClient()
        if err != nil {
                return err
        }
        defer client.Close()

        watcher, err := client.WatchAll()
        if err != nil {
                return err
        }
        if watcher == nil {
                panic("???")
        }
        fmt.Fprintln(ctx.Stdout, "did not block")
        return nil
}
--------------

It did not block.

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-29:

@fginther, what would be the best way to reproduce this? The obvious-to-me approach did not block at all.

Note that I tested against master (effectively beta 3), so it may be that the problem has already been resolved. I'll double-check against the beta 2 code while waiting for your reply.

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-29:

Using the code at the beta 2 tag, I was not able to reproduce the described failure using the attached script.

Katherine Cox-Buday (cox-katherine-e) on 2016-03-29

Changed in juju-core:
status:	In Progress → Incomplete

Cheryl Jennings (cherylj) on 2016-03-29

tags:

added: 2.0-count

Revision history for this message

Francis Ginther (fginther) wrote on 2016-03-30:

Eric, I'm trying to reproduce with beta3, but have been slowed by lp:1563576 and slowness with lcy01. If I get it to reproduce, I'll provide an API trace.

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-30:

Thanks!

Revision history for this message

Francis Ginther (fginther) wrote on 2016-03-31:

Eric, I was able to reproduce this with beta3 on an lxd deployment. In my original report, I failed to notice that the hang occurs when the AllWatcherId is consumed by an AllWatcher API call. I'll attach a full trace.

I have pushed some code and a short test script to demonstrate this to lp:~fginther/python-jujuclient/lp-1560201-test.

To reproduce:
$ bzr branch lp:~fginther/python-jujuclient/lp-1560201-test
$ cd lp-1560201-test
$ virtualenv .venv
$ . .venv/bin/activate
$ python setup.py develop
$ ./test-case.py local.lxd-ctl:default

This assumes that you have bootstrapped an lxd controller named 'lxd-ctl' with an empty model named 'default'. The test script will dump out the juju API calls as they are being made. If the targeted model has no machines, it should hang. Adding a machine should break the hang.

Revision history for this message

Francis Ginther (fginther) wrote on 2016-03-31:

juju-api-trace.log Edit (14.7 KiB, text/plain)

Trace of juju api calls leading to the hang.

David Britton (dpb) on 2016-03-31

Changed in juju-core:
status:	Incomplete → New

Francis Ginther (fginther) on 2016-03-31

tags:

added: kanban-cross-team

Cheryl Jennings (cherylj) on 2016-03-31

Changed in juju-core:
status:	New → Triaged

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-31:

Thanks for the great instructions, Francis. I was able to reproduce the behavior you described. I also verified that after removing the machine, the API call goes back to hanging.

I also noticed that the API call finishes immediately after the the "juju add-machine" command returns. It does not wait for the machine to be provisioned. This implies to me that the watcher proceeds once the machine entry is added to the DB. I'm going to investigate further.

Changed in juju-core:
status:	Triaged → In Progress

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-31:

So...I'd expect a Next() call on the watcher to block until the first change happens (e.g. add a machine). After the watcher gets the first change I'd expect it to block until the next change, and so forth. However, that second part does not match up with what I'm seeing. I'm looking into that.

As to what changed in 2.0, under 1.X the old environment started off with an added machine (from bootstrap). Under 2.0, that remains the case for the "admin" model under the controller. For the "default" model nothing has changed yet because the model is still empty at first. That's why the the initial watcher Next() call blocks.

What's the use case for watching the model? To know when the controller has finished bootstrapping?

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-03-31:

#10

Ah, every watcher is automatically started off with a single "change", basically a snapshot of the model at that moment. Subsequent Next() calls for the same watcher will block until the next change.

So, it may be that jujuclient.py is making some assumptions (e.g. in get_stat()) that no longer hold true in the multi-model-controller world of Juju 2.0. Did you run into the problem originally while using jujuclient.py or via some other code?

Revision history for this message

Francis Ginther (fginther) wrote on 2016-04-01:

#11

I ran into this from jujuclient.py. get_stat() is utilized as a general purpose method to get the status of everything in the model, which is why it can be called before anything has been deployed. Juju-deployer uses this to determine what's been deployed before doing anything else. The code indicates that the watcher API is used instead of FullStatus due to an older bug (lp:1203105) related to that interface. While that bug was fixed, I suspect no-one wanted to rework the code to use the status API.

I'm a little confused by the comment 'Ah, every watcher is automatically started off with a single "change"'. Does this hold for the default model that has no machines? Should it return some state?

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-04-01:

#12

Ah, then it sounds like jujuclient.py may need to accommodate the fact that new models start out empty in 2.0, meaning watchers do not get an initial "change" event. The alternative is for Juju to *always* send an initial change event to watchers, even if it is empty. I'll look into this.

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-04-01:

#13

We're going to fix this in Juju (core).

Revision history for this message

Francis Ginther (fginther) wrote on 2016-04-01:

#14

Thanks for the update.

Revision history for this message

Eric Snow (ericsnowcurrently) wrote on 2016-04-02:

#15

pending fix: http://reviews.vapour.ws/r/4406/

I've verified that the provided repro script works (does not hang) with the fix applied.

Eric Snow (ericsnowcurrently) on 2016-04-05

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2016-04-14

Changed in juju-core:
status:	Fix Committed → Fix Released

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

affects:	juju-core → juju
Changed in juju:
milestone:	2.0-beta4 → none
milestone:	none → 2.0-beta4

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.