juju-core

Merge lp:~themue/juju-core/006-state-retry-delay into lp:~juju/juju-core/trunk

Proposed by Frank Mueller on 2012-12-13

Status:

Rejected

Rejected by:

William Reade on 2013-01-23

Proposed branch:

lp:~themue/juju-core/006-state-retry-delay

Merge into:

lp:~juju/juju-core/trunk

Diff against target:

194 lines (+115/-25)

3 files modified

juju/conn.go (+3/-2)
trivial/attempt.go (+22/-4)
trivial/trivial_test.go (+90/-19)

To merge this branch:

bzr merge lp:~themue/juju-core/006-state-retry-delay

High

Fix Released

Reviewer	Review Type	Date Requested	Status
The Go Language Gophers		2012-12-13	Pending
Review via email: mp+139745@code.launchpad.net

Description of the change

state: add retry delay during mongo connection

If the connection to the mongo state database fails,
e.g. while the system is still bootstrapping, the
system immediately retries to connect until timeout.
This change adds an increasing delay between those
retries.

https://codereview.appspot.com/6949044/

William Reade (fwereade) wrote on 2012-12-17:

LGTM

https://codereview.appspot.com/6949044/

John A Meinel (jameinel) wrote on 2012-12-17:

Your overview says it adds "increasing delay", but AttemptStrategy seems
to do a fixed delay. I guess you just changed your mind, but delaying
does seem like a good idea.

LGTM.

https://codereview.appspot.com/6949044/diff/3002/state/open.go
File state/open.go (right):

https://codereview.appspot.com/6949044/diff/3002/state/open.go#newcode70
state/open.go:70: attempt.Next()
don't you need to check if attempt.Next is returning false?

https://codereview.appspot.com/6949044/

William Reade (fwereade) wrote on 2013-01-14:

WIPping this due to further discussion with davecheney in https://codereview.appspot.com/6949044/.

William Reade (fwereade) wrote on 2013-01-17:

LGTM.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go
File trivial/attempt.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode56
trivial/attempt.go:56: a.delay = time.Duration(1.2 * float64(a.delay))
What motivated the choice of 1.2?

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go
File trivial/trivial_test.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go#newcode20
trivial/trivial_test.go:20: delta := 10 * time.Millisecond
I'm a bit concerned that it will be hard to test these entirely reliably
-- but so long as you've run these tests in a few different situations
I'm happy.

https://codereview.appspot.com/6949044/

Roger Peppe (rogpeppe) wrote on 2013-01-17:

NOT LGTM.

i don't believe this will fix the problem, as discussed online and
outlined below.

https://codereview.appspot.com/6949044/diff/12001/juju/conn.go
File juju/conn.go (right):

https://codereview.appspot.com/6949044/diff/12001/juju/conn.go#newcode29
juju/conn.go:29: Behaviour: trivial.ExponentialInterval,
there's a reason this delay is short - this strategy is only here to
redial when we get an unauthorized access error, which is only likely to
happen while the bootstrap process is taking place, which usually only
takes about half a second (the 60 second Total here is to cater for
excessive VM scheduler variance).

the redial loop in NewConn will not fix the issue of rapid dial retries.
i believe that's due to the mgo package which has a redial loop with no
sleep, and that the correct fix is there.

https://codereview.appspot.com/6949044/

Roger Peppe (rogpeppe) wrote on 2013-01-17:

a few comments on the attempt changes.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go
File trivial/attempt.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode12
trivial/attempt.go:12: LinearInterval
when would it be appropriate to use LinearInterval?

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode13
trivial/attempt.go:13: ExponentialInterval
if we are going to have exponential backoff, i think we should do it
right, avoid thundering herds, and add a random element so that multiple
clients that are all running the same code don't create huge spikes as
they all redial at the same time.

that said, i'm not sure we need it yet, so i'd be tempted to drop this
code for now. i've made some comments anyway, in case the code does
land.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode21
trivial/attempt.go:21: Behaviour IntervalBehaviour // Control the delays.
we use US spelling as a convention, so Behavior would be more correct.

also, i think we should have a maximum value, so the exponential backoff
is truncated. for example, if a juju ec2 instance takes about 140
seconds before it accepts connections, using an initial delay of 1s and
an exponent of 1.2, we're waiting more than 25 seconds between attempts
by the time the instance comes on line. i think that's probably too much
and we'd want to bound it.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode56
trivial/attempt.go:56: a.delay = time.Duration(1.2 * float64(a.delay))
we don't need to convert to and from float64 here.

a.delay = a.delay * 6 / 5

would be fine in this case.

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go
File trivial/trivial_test.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go#newcode25
trivial/trivial_test.go:25: want := []time.Duration{0, 200 * time.Millisecond, 400 * time.Millisecond,
these are verbose enough now that i'd be tempted to write a function:

var quantum = time.Millisecond

// durations converts counts of quanta into time.Durations.
func durations(ms ...int) []time.Duration {
    ds := make([]time.Duration, len(ms))
    for i, m := range ms {
        ds[i] = time.Duration(m) * quantum
    }
    return ds
}

then we can easily vary the quantum if we find that the test
fails on some machines.

also, the original test ran for only 0.25s, which worked fine,
and it also checked that the final Next took no time, which
is no longer checked.

there's no way that testing this "trivial" function should add four
seconds to our testing time.

it might be better to consider a way of testing it without it actually
sleeping (for example by making the tests define the "now" and "sleep"
functions for the duration of the test)

https://codereview.appspot.com/6949044/

Gustavo Niemeyer (niemeyer) wrote on 2013-01-18:

Copy from the respective email thread:

On Fri, Jan 18, 2013 at 8:20 AM, roger peppe <email address hidden> wrote:
>> Alternatively, let's just change the retry delay in juju.Conn to 1
>> second and call it a day.

> As I've tried to point out above, the retry delay in juju.Conn is
> irrelevant.
> In all the live tests I've seen, I've never seen that delay being
> exercised - it's a highly unusual corner case.

> The right fix is in mgo.

mgo doesn't retry every 250ms.. the original reason for the bug was
purely to slow down the crazy punching and logging to more reasonable
levels.

That said, I agree with David. The level of importance and detail
being given to that problem is over the top. There are conversations
about this for more than *two months*, and this was supposed to be a
trivial bug.

Unless someone wants to propose that trivial branch, there are much
more important things to be doing than fine tuning how fast we connect
(!!!).

gustavo @ http://niemeyer.net

https://codereview.appspot.com/6949044/

Roger Peppe (rogpeppe) wrote on 2013-01-18:

On 18 January 2013 12:02, <email address hidden> wrote:
> Copy from the respective email thread:
>
> On Fri, Jan 18, 2013 at 8:20 AM, roger peppe <email address hidden> wrote:
>>> Alternatively, let's just change the retry delay in juju.Conn to 1
>>> second and call it a day.
>>
>> As I've tried to point out above, the retry delay in juju.Conn is
>> irrelevant.
>> In all the live tests I've seen, I've never seen that delay being
>> exercised - it's a highly unusual corner case.
>>
>> The right fix is in mgo.
>
> mgo doesn't retry every 250ms..

True. mgo seems to try about three times every 500ms,
averaging once every 185ms.

http://play.golang.org/p/f4XEMFXKDz

> Unless someone wants to propose that trivial branch, there are much
> more important things to be doing than fine tuning how fast we connect
> (!!!).

I agree with that.

William Reade (fwereade) wrote on 2013-01-23:

Rejecting this on the basis of the above to clean up the review queue. (Sorry Frank.)

Unmerged revisions

779. By Frank Mueller on 2013-01-17

trivial: merged trunk before propose

778. By Frank Mueller on 2013-01-17

trivial: merged trunk

777. By Frank Mueller on 2013-01-17

trivial: added attempt interval behaviours

776. By Frank Mueller on 2012-12-14

state: changed open to use the attempt strategy

775. By Frank Mueller on 2012-12-13

state: exponential retry delay during mongo connection

774. By Gustavo Niemeyer on 2012-12-12

environs/ec2: use default-series on Bootstrap

R=rog, fwereade
CC=
https://codereview.appspot.com/6868070

773. By Roger Peppe on 2012-12-07

state/api: use TLS

R=niemeyer
CC=
https://codereview.appspot.com/6913043

772. By Gustavo Niemeyer on 2012-12-07

version: take series out of FORCE-VERSION

R=rog, fwereade
CC=
https://codereview.appspot.com/6907050

771. By William Reade on 2012-12-07

openstack: fix build

R=niemeyer
CC=
https://codereview.appspot.com/6907051

770. By William Reade on 2012-12-07

state: Dying unit cannot enter relation scope

R=jameinel, TheMue, rog
CC=
https://codereview.appspot.com/6864050

Preview Diff

Subscribers

People subscribed via source and target branches

to all changes:

Andrew W. Deane

Carlos Torres

Frank Mueller

John A Meinel

Kapil Thangavelu

The Go Language Gophers

 === modified file 'juju/conn.go'
 --- juju/conn.go	2013-01-14 16:03:29 +0000
 +++ juju/conn.go	2013-01-17 11:24:21 +0000
@@ -24,8 +24,9 @@
+ }
  var redialStrategy = trivial.AttemptStrategy{
--	Total: 60 * time.Second,
--	Delay: 250 * time.Millisecond,
++	Total:     60 * time.Second,
++	Delay:     1 * time.Second,
++	Behaviour: trivial.ExponentialInterval,
+ }
  // NewConn returns a new Conn that uses the
 === modified file 'trivial/attempt.go'
 --- trivial/attempt.go	2012-11-01 14:08:11 +0000
 +++ trivial/attempt.go	2013-01-17 11:24:21 +0000
@@ -4,15 +4,26 @@
  	"time"
+ )
++// IntervalBehaviour controls how the delay behaves.
++type IntervalBehaviour int
++
++const (
++	StaticInterval IntervalBehaviour = iota
++	LinearInterval
++	ExponentialInterval
++)
++
  // AttemptStrategy represents a strategy for waiting for an action
  // to complete successfully.
  type AttemptStrategy struct {
--	Total time.Duration // total duration of attempt.
--	Delay time.Duration // interval between each try in the burst.
++	Total     time.Duration     // Total duration of attempt.
++	Delay     time.Duration     // Initial interval between each try in the burst.
++	Behaviour IntervalBehaviour // Control the delays.
+ }
  type Attempt struct {
  	strategy AttemptStrategy
++	delay    time.Duration
  	end      time.Time
+ }
@@ -20,6 +31,7 @@
  func (a AttemptStrategy) Start() *Attempt {
  	return &Attempt{
  		strategy: a,
++		delay:    a.Delay,
+ 	}
+ }
@@ -33,9 +45,15 @@
  		return true
+ 	}
--	if !now.Add(a.strategy.Delay).Before(a.end) {
++	if !now.Add(a.delay).Before(a.end) {
  		return false
+ 	}
--	time.Sleep(a.strategy.Delay)
++	time.Sleep(a.delay)
++	switch a.strategy.Behaviour {
++	case LinearInterval:
++		a.delay += a.strategy.Delay
++	case ExponentialInterval:
++		a.delay = time.Duration(1.2 * float64(a.delay))
++	}
  	return true
+ }
 === modified file 'trivial/trivial_test.go'
 --- trivial/trivial_test.go	2013-01-16 14:24:54 +0000
 +++ trivial/trivial_test.go	2013-01-17 11:24:21 +0000
@@ -16,25 +16,96 @@
  var _ = Suite(trivialSuite{})
--func (trivialSuite) TestAttemptTiming(c *C) {
--	const delta = 0.01e9
--	testAttempt := trivial.AttemptStrategy{
--		Total: 0.25e9,
--		Delay: 0.1e9,
--	}
--	want := []time.Duration{0, 0.1e9, 0.2e9, 0.2e9}
--	got := make([]time.Duration, 0, len(want)) // avoid allocation when testing timing
--	t0 := time.Now()
--	for a := testAttempt.Start(); a.Next(); {
--		got = append(got, time.Now().Sub(t0))
--	}
--	got = append(got, time.Now().Sub(t0))
--	c.Assert(got, HasLen, len(want))
--	for i, got := range want {
--		lo := want[i] - delta
--		hi := want[i] + delta
--		if got < lo || got > hi {
--			c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), got.Seconds())
++func (trivialSuite) TestDefaultInterval(c *C) {
++	delta := 10 * time.Millisecond
++	testAttempt := trivial.AttemptStrategy{
++		Total: 1 * time.Second,
++		Delay: 200 * time.Millisecond,
++	}
++	want := []time.Duration{0, 200 * time.Millisecond, 400 * time.Millisecond,
++		600 * time.Millisecond, 800 * time.Millisecond}
++	got := make([]time.Duration, 0, len(want))
++	t0 := time.Now()
++	for a := testAttempt.Start(); a.Next(); {
++		got = append(got, time.Now().Sub(t0))
++	}
++	c.Assert(got, HasLen, len(want))
++	for i, g := range got {
++		lo := want[i] - delta
++		hi := want[i] + delta
++		if g < lo || g > hi {
++			c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
++		}
++	}
++}
++
++func (trivialSuite) TestStaticInterval(c *C) {
++	delta := 10 * time.Millisecond
++	testAttempt := trivial.AttemptStrategy{
++		Total:     1 * time.Second,
++		Delay:     200 * time.Millisecond,
++		Behaviour: trivial.StaticInterval,
++	}
++	want := []time.Duration{0, 200 * time.Millisecond, 400 * time.Millisecond,
++		600 * time.Millisecond, 800 * time.Millisecond}
++	got := make([]time.Duration, 0, len(want))
++	t0 := time.Now()
++	for a := testAttempt.Start(); a.Next(); {
++		got = append(got, time.Now().Sub(t0))
++	}
++	c.Assert(got, HasLen, len(want))
++	for i, g := range got {
++		lo := want[i] - delta
++		hi := want[i] + delta
++		if g < lo || g > hi {
++			c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
++		}
++	}
++}
++
++func (trivialSuite) TestLinearInterval(c *C) {
++	delta := 5 * time.Millisecond
++	testAttempt := trivial.AttemptStrategy{
++		Total:     1 * time.Second,
++		Delay:     100 * time.Millisecond,
++		Behaviour: trivial.LinearInterval,
++	}
++	want := []time.Duration{0, 100 * time.Millisecond, 300 * time.Millisecond, 600 * time.Millisecond}
++	got := make([]time.Duration, 0, len(want))
++	t0 := time.Now()
++	for a := testAttempt.Start(); a.Next(); {
++		got = append(got, time.Now().Sub(t0))
++	}
++	c.Assert(got, HasLen, len(want))
++	for i, g := range got {
++		lo := want[i] - delta
++		hi := want[i] + delta
++		if g < lo || g > hi {
++			c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
++		}
++	}
++}
++
++func (trivialSuite) TestExponentialInterval(c *C) {
++	delta := 10 * time.Millisecond
++	testAttempt := trivial.AttemptStrategy{
++		Total:     1 * time.Second,
++		Delay:     100 * time.Millisecond,
++		Behaviour: trivial.ExponentialInterval,
++	}
++	want := []time.Duration{0, 100 * time.Millisecond, 220 * time.Millisecond, 364 * time.Millisecond,
++		537 * time.Millisecond, 744 * time.Millisecond, 993 * time.Millisecond}
++	got := make([]time.Duration, 0, len(want))
++	t0 := time.Now()
++	for a := testAttempt.Start(); a.Next(); {
++		got = append(got, time.Now().Sub(t0))
++	}
++	c.Assert(got, HasLen, len(want))
++	for i, g := range got {
++		lo := want[i] - delta
++		hi := want[i] + delta
++		if g < lo || g > hi {
++			c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
+ 		}
+ 	}
+ }
