Merge lp:~themue/juju-core/006-state-retry-delay into lp:~juju/juju-core/trunk

Proposed by Frank Mueller
Status: Rejected
Rejected by: William Reade
Proposed branch: lp:~themue/juju-core/006-state-retry-delay
Merge into: lp:~juju/juju-core/trunk
Diff against target: 194 lines (+115/-25)
3 files modified
juju/conn.go (+3/-2)
trivial/attempt.go (+22/-4)
trivial/trivial_test.go (+90/-19)
To merge this branch: bzr merge lp:~themue/juju-core/006-state-retry-delay
Reviewer Review Type Date Requested Status
The Go Language Gophers Pending
Review via email: mp+139745@code.launchpad.net

Description of the change

state: add retry delay during mongo connection

If the connection to the mongo state database fails,
e.g. while the system is still bootstrapping, the
system immediately retries to connect until timeout.
This change adds an increasing delay between those
retries.

https://codereview.appspot.com/6949044/

To post a comment you must log in.
Revision history for this message
William Reade (fwereade) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

Your overview says it adds "increasing delay", but AttemptStrategy seems
to do a fixed delay. I guess you just changed your mind, but delaying
does seem like a good idea.

LGTM.

https://codereview.appspot.com/6949044/diff/3002/state/open.go
File state/open.go (right):

https://codereview.appspot.com/6949044/diff/3002/state/open.go#newcode70
state/open.go:70: attempt.Next()
don't you need to check if attempt.Next is returning false?

https://codereview.appspot.com/6949044/

Revision history for this message
William Reade (fwereade) wrote :

WIPping this due to further discussion with davecheney in https://codereview.appspot.com/6949044/.

Revision history for this message
William Reade (fwereade) wrote :

LGTM.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go
File trivial/attempt.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode56
trivial/attempt.go:56: a.delay = time.Duration(1.2 * float64(a.delay))
What motivated the choice of 1.2?

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go
File trivial/trivial_test.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go#newcode20
trivial/trivial_test.go:20: delta := 10 * time.Millisecond
I'm a bit concerned that it will be bard to test these entirely reliably
-- but so long as you've run these tests in a few different situations
I'm happy.

https://codereview.appspot.com/6949044/

Revision history for this message
Roger Peppe (rogpeppe) wrote :

NOT LGTM.

i don't believe this will fix the problem, as discussed online and
outlined below.

https://codereview.appspot.com/6949044/diff/12001/juju/conn.go
File juju/conn.go (right):

https://codereview.appspot.com/6949044/diff/12001/juju/conn.go#newcode29
juju/conn.go:29: Behaviour: trivial.ExponentialInterval,
there's a reason this delay is short - this strategy is only here to
redial when we get an unauthorized access error, which is only likely to
happen while the bootstrap process is taking place, which usually only
takes about half a seco nd (the 60 second Total here is to cater for
excessive VM scheduler variance).

the redial loop in NewConn will not fix the issue of rapid dial retries.
i believe that's due to the mgo package which has a redial loop with no
sleep, and that the correct fix is there.

https://codereview.appspot.com/6949044/

Revision history for this message
Roger Peppe (rogpeppe) wrote :

a few comments on the attempt changes.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go
File trivial/attempt.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode12
trivial/attempt.go:12: LinearInterval
when would it be appropriate to use LinearInterval?

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode13
trivial/attempt.go:13: ExponentialInterval
if we are going to have exponential backoff, i think we should do it
right, avoid thundering herds, and add a random element so that multiple
clients that are all running the same code don't create huge spikes as
they all redial at the same time.

that said, i'm not sure we need it yet, so i'd be tempted to drop this
code for now. i've made some comments anyway, in case the code does
land.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode21
trivial/attempt.go:21: Behaviour IntervalBehaviour // Control the
delays.
we use US spelling as a convention, so Behavior would be more correct.

also, i think we should have a maximum value, so the exponential backoff
is truncated. for example, if a juju ec2 instance takes about 140
seconds before it accepts connections, using an initial delay of 1s and
an exponent of 1.2, we're waiting more than 25 seconds between attempts
by the time the instance comes on line. i think that's probably too much
and we'd want to bound it.

https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode56
trivial/attempt.go:56: a.delay = time.Duration(1.2 * float64(a.delay))
we don't need to convert to and from float64 here.

a.delay = a.delay * 6 / 5

would be fine in this case.

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go
File trivial/trivial_test.go (right):

https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go#newcode25
trivial/trivial_test.go:25: want := []time.Duration{0, 200 *
time.Millisecond, 400 * time.Millisecond,
this are verbose enough now that i'd be tempted to write a function:

var quantum = time.Millisecond

func durations(xs ...int) []time.Duration {
    ds := make(time.Duration, len(ms))
    for i, m := range ms {
       ds = time.Duration(m) * quantum
    }
    return ds
}

then we can easily vary the quantum if we find that the test
fails on some machines.

also, the original test ran for only 0.25s, which worked fine,
and it also checked that the final Next took no time, which
is no longer checked.

there's no way that testing this "trivial" function should add four
seconds to our testing time.

it might be better to consider a way of testing it without it actually
sleeping (for example by making the tests define the "now" and "sleep"
functions for the duration of the test)

https://codereview.appspot.com/6949044/

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

Copy from the respective email thread:

On Fri, Jan 18, 2013 at 8:20 AM, roger peppe <email address hidden>
wrote:
>> Alternatively, lets just change the retry delay in juju.Conn to 1
second an call it a day.

> As I've tried to point out above, the retry delay in juju.Conn is
irrelevant.
> In all the live tests I've seen, I've never seen that delay being
> exercised - it's
> a highly unusual corner case.

> The right fix is in mgo.

mgo doesn't retry every 250ms.. the original reason for the bug was
purely to slow down the crazy punching and logging to more reasonable
levels.

That said, I agree with David. The level of importance and detail
being given to that problem is over the top. There are conversations
about this for more than *two months*, and this was supposed to be a
trivial bug.

Unless someone wants to propose that trivial branch, there are much
more important things to be doing than fine tuning how fast we connect
(!!!).

gustavo @ http://niemeyer.net

https://codereview.appspot.com/6949044/

Revision history for this message
Roger Peppe (rogpeppe) wrote :

On 18 January 2013 12:02, <email address hidden> wrote:
> Copy from the respective email thread:
>
> On Fri, Jan 18, 2013 at 8:20 AM, roger peppe <email address hidden>
> wrote:
>>>
>>> Alternatively, lets just change the retry delay in juju.Conn to 1
>
> second an call it a day.
>
>> As I've tried to point out above, the retry delay in juju.Conn is
>
> irrelevant.
>>
>> In all the live tests I've seen, I've never seen that delay being
>> exercised - it's
>> a highly unusual corner case.
>
>
>> The right fix is in mgo.
>
>
> mgo doesn't retry every 250ms..

True. mgo seems to try about three times every 500ms,
averaging once every 185ms.

http://play.golang.org/p/f4XEMFXKDz

> Unless someone wants to propose that trivial branch, there are much
> more important things to be doing than fine tuning how fast we connect
> (!!!).

I agree with that.

Revision history for this message
William Reade (fwereade) wrote :

Rejecting this on the basis of the above to clean up the review queue. (Sorry Frank.)

Unmerged revisions

779. By Frank Mueller

trivial: merged trunk before propose

778. By Frank Mueller

trivial: merged trunk

777. By Frank Mueller

trivial: added attempt interval behaviours

776. By Frank Mueller

state: changed open to use the attempt strategy

775. By Frank Mueller

state: exponential retry delay during mongo connection

774. By Gustavo Niemeyer

environs/ec2: use default-series on Bootstrap

R=rog, fwereade
CC=
https://codereview.appspot.com/6868070

773. By Roger Peppe

state/api: use TLS

R=niemeyer
CC=
https://codereview.appspot.com/6913043

772. By Gustavo Niemeyer

version: take series out of FORCE-VERSION

R=rog, fwereade
CC=
https://codereview.appspot.com/6907050

771. By William Reade

openstack: fix build

R=niemeyer
CC=
https://codereview.appspot.com/6907051

770. By William Reade

state: Dying unit cannot enter relation scope

R=jameinel, TheMue, rog
CC=
https://codereview.appspot.com/6864050

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'juju/conn.go'
2--- juju/conn.go 2013-01-14 16:03:29 +0000
3+++ juju/conn.go 2013-01-17 11:24:21 +0000
4@@ -24,8 +24,9 @@
5 }
6
7 var redialStrategy = trivial.AttemptStrategy{
8- Total: 60 * time.Second,
9- Delay: 250 * time.Millisecond,
10+ Total: 60 * time.Second,
11+ Delay: 1 * time.Second,
12+ Behaviour: trivial.ExponentialInterval,
13 }
14
15 // NewConn returns a new Conn that uses the
16
17=== modified file 'trivial/attempt.go'
18--- trivial/attempt.go 2012-11-01 14:08:11 +0000
19+++ trivial/attempt.go 2013-01-17 11:24:21 +0000
20@@ -4,15 +4,26 @@
21 "time"
22 )
23
24+// IntervalBehaviour controls how the delay behaves.
25+type IntervalBehaviour int
26+
27+const (
28+ StaticInterval IntervalBehaviour = iota
29+ LinearInterval
30+ ExponentialInterval
31+)
32+
33 // AttemptStrategy represents a strategy for waiting for an action
34 // to complete successfully.
35 type AttemptStrategy struct {
36- Total time.Duration // total duration of attempt.
37- Delay time.Duration // interval between each try in the burst.
38+ Total time.Duration // Total duration of attempt.
39+ Delay time.Duration // Initial interval between each try in the burst.
40+ Behaviour IntervalBehaviour // Control the delays.
41 }
42
43 type Attempt struct {
44 strategy AttemptStrategy
45+ delay time.Duration
46 end time.Time
47 }
48
49@@ -20,6 +31,7 @@
50 func (a AttemptStrategy) Start() *Attempt {
51 return &Attempt{
52 strategy: a,
53+ delay: a.Delay,
54 }
55 }
56
57@@ -33,9 +45,15 @@
58 return true
59 }
60
61- if !now.Add(a.strategy.Delay).Before(a.end) {
62+ if !now.Add(a.delay).Before(a.end) {
63 return false
64 }
65- time.Sleep(a.strategy.Delay)
66+ time.Sleep(a.delay)
67+ switch a.strategy.Behaviour {
68+ case LinearInterval:
69+ a.delay += a.strategy.Delay
70+ case ExponentialInterval:
71+ a.delay = time.Duration(1.2 * float64(a.delay))
72+ }
73 return true
74 }
75
76=== modified file 'trivial/trivial_test.go'
77--- trivial/trivial_test.go 2013-01-16 14:24:54 +0000
78+++ trivial/trivial_test.go 2013-01-17 11:24:21 +0000
79@@ -16,25 +16,96 @@
80
81 var _ = Suite(trivialSuite{})
82
83-func (trivialSuite) TestAttemptTiming(c *C) {
84- const delta = 0.01e9
85- testAttempt := trivial.AttemptStrategy{
86- Total: 0.25e9,
87- Delay: 0.1e9,
88- }
89- want := []time.Duration{0, 0.1e9, 0.2e9, 0.2e9}
90- got := make([]time.Duration, 0, len(want)) // avoid allocation when testing timing
91- t0 := time.Now()
92- for a := testAttempt.Start(); a.Next(); {
93- got = append(got, time.Now().Sub(t0))
94- }
95- got = append(got, time.Now().Sub(t0))
96- c.Assert(got, HasLen, len(want))
97- for i, got := range want {
98- lo := want[i] - delta
99- hi := want[i] + delta
100- if got < lo || got > hi {
101- c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), got.Seconds())
102+func (trivialSuite) TestDefaultInterval(c *C) {
103+ delta := 10 * time.Millisecond
104+ testAttempt := trivial.AttemptStrategy{
105+ Total: 1 * time.Second,
106+ Delay: 200 * time.Millisecond,
107+ }
108+ want := []time.Duration{0, 200 * time.Millisecond, 400 * time.Millisecond,
109+ 600 * time.Millisecond, 800 * time.Millisecond}
110+ got := make([]time.Duration, 0, len(want))
111+ t0 := time.Now()
112+ for a := testAttempt.Start(); a.Next(); {
113+ got = append(got, time.Now().Sub(t0))
114+ }
115+ c.Assert(got, HasLen, len(want))
116+ for i, g := range got {
117+ lo := want[i] - delta
118+ hi := want[i] + delta
119+ if g < lo || g > hi {
120+ c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
121+ }
122+ }
123+}
124+
125+func (trivialSuite) TestStaticInterval(c *C) {
126+ delta := 10 * time.Millisecond
127+ testAttempt := trivial.AttemptStrategy{
128+ Total: 1 * time.Second,
129+ Delay: 200 * time.Millisecond,
130+ Behaviour: trivial.StaticInterval,
131+ }
132+ want := []time.Duration{0, 200 * time.Millisecond, 400 * time.Millisecond,
133+ 600 * time.Millisecond, 800 * time.Millisecond}
134+ got := make([]time.Duration, 0, len(want))
135+ t0 := time.Now()
136+ for a := testAttempt.Start(); a.Next(); {
137+ got = append(got, time.Now().Sub(t0))
138+ }
139+ c.Assert(got, HasLen, len(want))
140+ for i, g := range got {
141+ lo := want[i] - delta
142+ hi := want[i] + delta
143+ if g < lo || g > hi {
144+ c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
145+ }
146+ }
147+}
148+
149+func (trivialSuite) TestLinearInterval(c *C) {
150+ delta := 5 * time.Millisecond
151+ testAttempt := trivial.AttemptStrategy{
152+ Total: 1 * time.Second,
153+ Delay: 100 * time.Millisecond,
154+ Behaviour: trivial.LinearInterval,
155+ }
156+ want := []time.Duration{0, 100 * time.Millisecond, 300 * time.Millisecond, 600 * time.Millisecond}
157+ got := make([]time.Duration, 0, len(want))
158+ t0 := time.Now()
159+ for a := testAttempt.Start(); a.Next(); {
160+ got = append(got, time.Now().Sub(t0))
161+ }
162+ c.Assert(got, HasLen, len(want))
163+ for i, g := range got {
164+ lo := want[i] - delta
165+ hi := want[i] + delta
166+ if g < lo || g > hi {
167+ c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
168+ }
169+ }
170+}
171+
172+func (trivialSuite) TestExponentialInterval(c *C) {
173+ delta := 10 * time.Millisecond
174+ testAttempt := trivial.AttemptStrategy{
175+ Total: 1 * time.Second,
176+ Delay: 100 * time.Millisecond,
177+ Behaviour: trivial.ExponentialInterval,
178+ }
179+ want := []time.Duration{0, 100 * time.Millisecond, 220 * time.Millisecond, 364 * time.Millisecond,
180+ 537 * time.Millisecond, 744 * time.Millisecond, 993 * time.Millisecond}
181+ got := make([]time.Duration, 0, len(want))
182+ t0 := time.Now()
183+ for a := testAttempt.Start(); a.Next(); {
184+ got = append(got, time.Now().Sub(t0))
185+ }
186+ c.Assert(got, HasLen, len(want))
187+ for i, g := range got {
188+ lo := want[i] - delta
189+ hi := want[i] + delta
190+ if g < lo || g > hi {
191+ c.Errorf("attempt %d want %g got %g", i, want[i].Seconds(), g.Seconds())
192 }
193 }
194 }

Subscribers

People subscribed via source and target branches