mgo sockets in a dirty state

Bug #1191487 reported by John A Meinel
Affects     Status        Importance  Assigned to
juju-core   Invalid       Critical    John A Meinel
mgo         Fix Released  Undecided   John A Meinel

Bug Description

/home/tarmac/trees/src/launchpad.net/juju-core/testing/mgo.go:240:
    c.Fatal("Test left sockets in a dirty state")
... Error: Test left sockets in a dirty state

Occasionally when running the test suite (especially on Tarmac) I get the above failure, which resolves itself if I just run it again.
I don't know *what* is leaving things in a dirty state, but this might be a hint for why things don't always work smoothly with the packaged Mongo (we might actually not be cleaning up properly).

This was run on tarmac with the 2.2.0 tarball build of mongo.
https://code.launchpad.net/~danilo/juju-core/lbox-check-emacs/+merge/169680/comments/377328

Tags: tarmac test

Revision history for this message
John A Meinel (jameinel) wrote:

Turns out there is a race condition in mgo.

The pinger can be looping: it sleeps, and when it wakes up it sees that the server is not closed.
It then calls AcquireSocket to start the next ping, and AcquireSocket reaches the Connect logic, where it blocks waiting for the connection to actually reach Mongo.

During this time, another goroutine calls server.Close(), which loops over all of liveSockets, calls Close on each, and then sets the slice to nil.

Then Connect returns, and this new connection immediately gets added to liveSockets. After doing the ping, the pinger will notice the server is closed, set loop = false, and return.

I believe during this time, the connection is in readLoop (where it sits endlessly blocked on fill because there isn't a conversation going on at this point).

Regardless, nothing will ever call Close on that connection, and it will just sit in liveSockets.

I believe the fix is that after Connect returns, once we acquire server.Lock, we need to check whether server.closed is true before adding the new connection to liveSockets. Something like:
=== modified file 'server.go'
--- server.go 2013-05-31 21:44:07 +0000
+++ server.go 2013-06-16 17:04:01 +0000
@@ -104,6 +104,13 @@
 			socket, err = server.Connect()
 			if err == nil {
 				server.Lock()
+				if server.closed {
+					// While we were waiting for Connect to finish, we were asked to close.
+					// What error should be here?
+					socket.Close()
+					server.Unlock()
+					return
+				}
 				server.liveSockets = append(server.liveSockets, socket)
 				server.Unlock()

However, I don't know what error to set (if any).

Partly because if AcquireSocket returns an error while 'loop' is true, the pinger just goes back to sleep (err is checked for being nil, but nothing is done when it is not).

So there should probably also be a check for an error type that indicates we should stop looping (presumably errors are normally suppressed so that we ping whenever we can, and just skip a ping when it fails).

Changed in mgo:
status: New → Confirmed
Revision history for this message
John A Meinel (jameinel) wrote:

I'm running with this patch on the go-bot:
=== modified file 'server.go'
--- server.go 2013-05-31 21:44:07 +0000
+++ server.go 2013-06-17 06:38:48 +0000
@@ -104,7 +104,16 @@
 			socket, err = server.Connect()
 			if err == nil {
 				server.Lock()
-				server.liveSockets = append(server.liveSockets, socket)
+				if server.closed {
+					// We don't add this to liveSockets
+					// because someone has already called
+					// server.Close() which closed and
+					// removed all active connections
+					log("server closed after Connect returned.")
+					socket.Close()
+				} else {
+					server.liveSockets = append(server.liveSockets, socket)
+				}
 				server.Unlock()
 			}
 		}

Local testing, with lots of sleeps and traps to make sure the race triggers, has shown the test suite passing reliably with this patch. (I can add logic that notes when this branch is taken, and can see that the test would have failed without it.)

I was concerned about the pinger code getting a closed socket, but
a) If you return an error from AcquireSocket, the pinger code just goes back to sleep if loop is true, so it will take 10s for it to wake up again to notice that the server is actually closed.
b) The code that acts on the socket just makes a 'no-op' request and ignores whether that request succeeds or not (it suppresses the result and the error), and just cares how long it would have taken.

The other callers of AcquireSocket could probably do something with an error here, but the very next call on the socket will fail and they handle that error. (They have to, given the socket could actually fail at any point).

So a different patch that called socket.Close(), set an error, and updated the pinger code to look for an ErrServerClosed (or something like that) would also be reasonable, but in the end is a more invasive change.

Revision history for this message
John A Meinel (jameinel) wrote:

I should also note what it took to trigger this failure.

cd cmd/jujud
go test -gocheck.v -gocheck.f ManageEnviron

A key requirement is that ManageEnviron has to take longer than 10s (the default pingDelay), because the logic failure is in the loop = true case (which always starts with a sleep before the relevant code runs).

On my hardware, ManageEnviron often takes 5-8s and that will never trigger a failure.

Also, you have to have Connect take long enough that the server.Close call triggers before Connect has time to return.

I was able to make the test fail fairly reliably (>50% of the time) with this patch:
=== modified file 'server.go'
--- server.go 2013-05-31 21:44:07 +0000
+++ server.go 2013-06-17 06:48:07 +0000
@@ -213,7 +213,7 @@
 	return result
 }
 
-var pingDelay = 10 * time.Second
+var pingDelay = 5 * time.Second // 10 * time.Second
 
 func (server *mongoServer) pinger(loop bool) {
 	op := queryOp{

=== modified file 'socket.go'
--- socket.go 2013-04-11 05:27:36 +0000
+++ socket.go 2013-06-17 06:48:41 +0000
@@ -31,6 +31,7 @@
 	"labix.org/v2/mgo/bson"
 	"net"
 	"sync"
+	"time"
 )
 
 type replyFunc func(err error, reply *replyOp, docNum int, docData []byte)
@@ -112,6 +113,7 @@
 	stats.socketsAlive(+1)
 	debugf("Socket %p to %s: initialized", socket, socket.addr)
 	socket.resetNonce()
+	time.Sleep(10 * time.Millisecond)
 	go socket.readLoop()
 	return socket
 }

And then, of course, with both patches it runs smoothly again.

Revision history for this message
Dave Cheney (dave-cheney) wrote: Re: [Bug 1191487] Re: mgo sockets in a dirty state

Thanks for working on this John. While I don't have much of a say in this, we've done a lot of work to get around philosophical differences with the mgo driver. Would a more effective way to solve this issue be to present the problem upstream and have a better fix incorporated? It sounds like, for scaling and other requirements, we effectively want all the automagic reconnection logic in the driver disabled, as our agents already cope with this via their own retry logic.


Revision history for this message
John A Meinel (jameinel) wrote :

Well there is still a race condition in the code, so it should get
fixed one way or another.

As for us disabling the pinger entirely, I think that can be a
different bug/patch.

John
=:->


John A Meinel (jameinel)
Changed in mgo:
status: Confirmed → Fix Committed
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote:

It was a race condition in mgo that we were managing to trigger.

Changed in juju-core:
status: Triaged → Invalid
Revision history for this message
Curtis Hovey (sinzui) wrote:

I am seeing intermittent failures running the test suite for mgo. These errors appeared about 10 days ago. Does my attached log relate to this bug? I am running saucy with mongodb 1:2.4.6-0ubuntu5 and juju-core r1996.

/home/curtis/Work/gocode/src/launchpad.net/juju-core/testing/mgo.go:333:
    c.Fatal("Test left sockets in a dirty state")
... Error: Test left sockets in a dirty state

----------------------------------------------------------------------
PANIC: conn_test.go:509: DeployLocalSuite.TestDeployWithForceMachineRejectsTooManyUnits

Changed in mgo:
status: Fix Committed → Fix Released