mgo sockets in a dirty state

Bug #1191487 reported by John A Meinel
Affects     Status        Importance  Assigned to
juju-core   Invalid       Critical    John A Meinel
mgo         Fix Released  Undecided   John A Meinel

Bug Description

/home/tarmac/trees/src/launchpad.net/juju-core/testing/mgo.go:240:
    c.Fatal("Test left sockets in a dirty state")
... Error: Test left sockets in a dirty state

Occasionally when running the test suite (especially on Tarmac) I get the above failure, which resolves itself if I just run it again.
I don't know *what* is leaving things in a dirty state, but this might be a hint for why things don't always work smoothly with the packaged Mongo (we might actually not be cleaning up properly).

This was run on tarmac with the 2.2.0 tarball build of mongo.
https://code.launchpad.net/~danilo/juju-core/lbox-check-emacs/+merge/169680/comments/377328

Tags: tarmac test

Revision history for this message
John A Meinel (jameinel) wrote:

Turns out there is a race condition in mgo.

The pinger can be looping: it sleeps, and when it wakes up it sees that the server is not closed.
It then calls AcquireSocket to start the next ping, and AcquireSocket reaches the Connect logic, where it blocks waiting for the connection to actually reach Mongo.

During this time, another goroutine calls server.Close(), which loops over all of liveSockets, calls Close on each, and then sets the slice to nil.

Then Connect returns, and this new connection immediately gets added to liveSockets. After doing the ping, the pinger will notice the server is closed, set loop = false, and return.

I believe during this time, the connection is in readLoop (where it sits endlessly blocked on fill because there isn't a conversation going on at this point).

Regardless, nothing will ever call Close on that connection, and it will just sit in liveSockets.

I believe the fix is that after Connect returns, once we acquire server.Lock, we need to check whether server.closed is true before adding the new connection to liveSockets. Something like:
=== modified file 'server.go'
--- server.go 2013-05-31 21:44:07 +0000
+++ server.go 2013-06-16 17:04:01 +0000
@@ -104,6 +104,13 @@
 			socket, err = server.Connect()
 			if err == nil {
 				server.Lock()
+				if server.closed {
+					// While we were waiting for Connect to finish, we were asked to close.
+					// What error should be here?
+					socket.Close()
+					server.Unlock()
+					return
+				}
 				server.liveSockets = append(server.liveSockets, socket)
 				server.Unlock()

However, I don't know what error to set (if any).

Partly because if AcquireSocket returns an error while 'loop' is true, the pinger just goes back to sleep (err is checked for being nil, but nothing is done when it is not).

So there should probably also be a check for an error type that indicates we should stop looping (presumably errors are normally suppressed so that we ping whenever we can, and just skip a ping when it fails).

Changed in mgo:
status: New → Confirmed
Revision history for this message
John A Meinel (jameinel) wrote:

I'm running with this patch on the go-bot:
=== modified file 'server.go'
--- server.go 2013-05-31 21:44:07 +0000
+++ server.go 2013-06-17 06:38:48 +0000
@@ -104,7 +104,16 @@
 			socket, err = server.Connect()
 			if err == nil {
 				server.Lock()
-				server.liveSockets = append(server.liveSockets, socket)
+				if server.closed {
+					// We don't add this to liveSockets
+					// because someone has already called
+					// server.Close() which closed and
+					// removed all active connections
+					log("server closed after Connect returned.")
+					socket.Close()
+				} else {
+					server.liveSockets = append(server.liveSockets, socket)
+				}
 				server.Unlock()
 			}
 		}

Local testing, with lots of sleeps and traps to make sure the race triggers, has shown the test suite passing reliably with this patch. (I can add logic that notes when this branch is taken, and can see that the test would have failed without it.)

I was concerned about the pinger code getting a closed socket, but
a) If you return an error from AcquireSocket, the pinger code just goes back to sleep if loop is true, so it will take 10s for it to wake up again to notice that the server is actually closed.
b) The code that acts on the socket just makes a 'no-op' request and ignores whether that request succeeds or not (it suppresses the result and the error), and just cares how long it would have taken.

The other callers of AcquireSocket could probably do something with an error here, but the very next call on the socket will fail and they handle that error. (They have to, given the socket could actually fail at any point).

So a different patch that called socket.Close(), set an error, and updated the pinger code to look for an ErrServerClosed (or something like that) would also be reasonable, but in the end is a more invasive change.

Revision history for this message
John A Meinel (jameinel) wrote:

I should also note what it took to trigger this failure.

cd cmd/jujud
go test -gocheck.v -gocheck.f ManageEnviron

A key requirement is that ManageEnviron has to take longer than 10s (the default pingDelay), because the logic failure is in the loop = true case (which always starts with a sleep before the relevant code runs).

On my hardware, ManageEnviron often takes 5-8s and that will never trigger a failure.

Also, you have to have Connect take long enough that the server.Close call triggers before Connect has time to return.

I was able to make the test fail fairly reliably (>50% of the time) with this patch:
=== modified file 'server.go'
--- server.go 2013-05-31 21:44:07 +0000
+++ server.go 2013-06-17 06:48:07 +0000
@@ -213,7 +213,7 @@
 	return result
 }
 
-var pingDelay = 10 * time.Second
+var pingDelay = 5 * time.Second // 10 * time.Second
 
 func (server *mongoServer) pinger(loop bool) {
 	op := queryOp{

=== modified file 'socket.go'
--- socket.go 2013-04-11 05:27:36 +0000
+++ socket.go 2013-06-17 06:48:41 +0000
@@ -31,6 +31,7 @@
 	"labix.org/v2/mgo/bson"
 	"net"
 	"sync"
+	"time"
 )
 
 type replyFunc func(err error, reply *replyOp, docNum int, docData []byte)
@@ -112,6 +113,7 @@
 	stats.socketsAlive(+1)
 	debugf("Socket %p to %s: initialized", socket, socket.addr)
 	socket.resetNonce()
+	time.Sleep(10 * time.Millisecond)
 	go socket.readLoop()
 	return socket
 }

And then, of course, with both patches it runs smoothly again.

Revision history for this message
Dave Cheney (dave-cheney) wrote: Re: [Bug 1191487] Re: mgo sockets in a dirty state

Thanks for working on this John. While I don't have much of a say in this, we've done a lot of work to get around philosophical differences with the mgo driver. Would a more effective way to solve this issue be to present the problem upstream and have a better fix incorporated? It sounds like, for scaling and other requirements, we effectively want all the automagic reconnection logic in the driver disabled, as our agents already cope with this via their own retry logic.


Revision history for this message
John A Meinel (jameinel) wrote :

Well there is still a race condition in the code, so it should get
fixed one way or another.

As for us disabling the pinger entirely, I think that can be a
different bug/patch.

John
=:->


John A Meinel (jameinel)
Changed in mgo:
status: Confirmed → Fix Committed
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote:

It was a race condition in mgo that we were managing to trigger.

Changed in juju-core:
status: Triaged → Invalid
Revision history for this message
Curtis Hovey (sinzui) wrote:

I am seeing intermittent failures running the test suite for mgo. These errors appeared about 10 days ago. Does my attached log relate to this bug? I am running saucy with mongodb 1:2.4.6-0ubuntu5 and juju-core r1996.

/home/curtis/Work/gocode/src/launchpad.net/juju-core/testing/mgo.go:333:
    c.Fatal("Test left sockets in a dirty state")
... Error: Test left sockets in a dirty state

----------------------------------------------------------------------
PANIC: conn_test.go:509: DeployLocalSuite.TestDeployWithForceMachineRejectsTooManyUnits

Changed in mgo:
status: Fix Committed → Fix Released