creating bzrdir on launchpad is slow (BzrDirFormat.initialize_ex_1.16)

Bug #740759 reported by Martin Pool
This bug affects 1 person
Affects:     Launchpad itself
Status:      Fix Released
Importance:  High
Assigned to: John A Meinel

Bug Description

According to my trace files in <https://bugs.launchpad.net/bzr/+bug/680763>, calling BzrDirFormat.initialize_ex_1.16 on Launchpad seems to take about 3.0s, whereas running it on another machine takes less than half a second. I suspect that it's doing some extra work or checks in lp-serve.

Revision history for this message
John A Meinel (jameinel) wrote :

Attaching a callgrind of the lp-serve process. Did a bit of hacking to execute it manually on my test VM, which is currently terribly memory constrained. So don't take this as 100% accurate, but it at least gives us something to look at.

To do this, I hacked bzr so that I could get the contents of the HPSS request, and then checked the launchpad_dev db to find out my user id.
I then did:
cat /home/jameinel/request.txt | BZR_PLUGIN_PATH=bzrplugins ./bin/py eggs/bzr-2.2.2_lp_2-py2.6-linux-i686.egg/EGG-INFO/scripts/bzr lp-serve --inet 243652 --lsprof-file create-branch.callgrind

So this is a profile of *only* the initialize_on_transport_ex request.

Revision history for this message
John A Meinel (jameinel) wrote :

Some analysis. I believe nodes that are marked <module> mean the time is spent importing the module. This test was done without using the forking server, so it probably includes some import-time overhead; we would experience that on production as well, although in theory the forking server helps with it.

Of the 3.1s it spends to run lp-serve, there isn't one huge 'this is the problem' function. What I do see is *lots* of <module> overhead.

1.268 <module> lp.codehosting.bzrutils
 .939 <module> lp.codehosting.vfs
 .938 <module> lp.codehosting.vfs.branchfs
 .832 <module> canonical.launchpad.webapp
 .831 <module> canonical.launchpad.webapp.launchpadform
 .801 <module> lp.app.browser.launchpadform
 .748 <module> canonical.launchpad.webapp.menu
 .675 <module> canonical.launchpad.webapp.authorization
 .641 <module> lp.codehosting
 .640 <module> canonical.config
 .469 <module> canonical.database.sqlbase
...

Once you are past the imports, you get down into:
 .296 serve bzrlib.smart.medium:208
 .294 _getConfig canonical.config:192 (11 calls)
 .294 accept_bytes bzrlib.smart.protocol:904

Looking at the top-down view,
1268 <module> lp.codehosting.bzrutils
 939 <module> lp.codehosting.vfs
 641 <module> lp.codehosting
 296 run_server bzrlib.plugins.lpserve:96

So we spend 1268+939+641 = 2.848s just importing the launchpad codebase to do the branch permissions, and only 296ms actually executing the request.
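As a rough cross-check of those numbers, something like the following could time the heavy imports directly (a minimal sketch, not from the original analysis; the module list comes from the callgrind output above, and these packages only import inside a Launchpad tree):

import time

def time_import(module_name):
    """Import a module by name and return the elapsed wall-clock seconds."""
    start = time.time()
    __import__(module_name)
    return time.time() - start

for name in ('lp.codehosting.bzrutils',      # ~1.27s in the profile above
             'lp.codehosting.vfs',           # ~0.94s
             'canonical.launchpad.webapp'):  # ~0.83s
    try:
        print('%-35s %.3fs' % (name, time_import(name)))
    except ImportError:
        # Only importable inside a Launchpad development tree.
        print('%-35s (not importable here)' % name)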


Revision history for this message
John A Meinel (jameinel) wrote :

Attaching an lsprof output running underneath the forking server. So modules should already be imported at this point. To generate it, I did:

=== modified file 'bzrplugins/lpserve/__init__.py'
--- bzrplugins/lpserve/__init__.py 2011-03-05 23:59:05 +0000
+++ bzrplugins/lpserve/__init__.py 2011-03-23 16:48:16 +0000
@@ -508,6 +508,7 @@
         # that would probably only really be measurable if we prefork. As
         # it looks like ~200ms is 'fork()' time, but only 50ms is
         # run-the-command time.
+ command_argv.extend(('--lsprof-file', 'prof.txt'))
         retcode = commands.run_bzr_catch_errors(command_argv)
         self._close_child_file_descriptors()
         trace.mutter('%d finished %r'

Revision history for this message
John A Meinel (jameinel) wrote :

Some analysis
Overall time under lsprof on this VM was 1.4s (run_argv_aliases); parse_args took a surprising 100ms, and _setup_outf 50ms.

Going deep into the code, the bulk of the time is:

           1 0 0.7240 0.0000 bzrlib.bzrdir:2076(make_directory)
          +1 0 0.7240 0.0002 +lp.codehosting.vfs.transport:314(mkdir)

Which breaks down into:

           2 0 0.7411 0.0001 lp.codehosting.vfs.branchfs:505(mkdir)
          +2 0 0.4761 0.0002 +lp.codehosting.vfs.transport:105(_getUnderylingTransportAndPath)
          +2 0 0.2485 0.0000 +twisted.internet.defer:300(addErrback)
          +2 0 0.0164 0.0000 +twisted.internet.defer:290(addCallback)

Which breaks down further as:

           3 0 0.6480 0.0002 lp.codehosting.vfs.transport:105(_getUnderylingTransportAndPath)
          +3 0 0.0002 0.0001 +lp.codehosting.vfs.transport:99(_abspath)
          +3 0 0.6476 0.0001 +lp.codehosting.vfs.branchfs:392(translateVirtualPath)
           3 0 0.6476 0.0001 lp.codehosting.vfs.branchfs:392(translateVirtualPath)
          +3 0 0.1682 0.0007 +twisted.internet.defer:271(addCallbacks)
          +3 0 0.4791 0.0002 +lp.codehosting.vfs.branchfsclient:124(translatePath)
          +3 0 0.0003 0.0000 +twisted.internet.defer:300(addErrback)

           3 0 0.4791 0.0002 lp.codehosting.vfs.branchfsclient:124(translatePath)
          +2 0 0.4784 0.0002 +lp.services.twistedsupport.xmlrpc:47(callRemote)
          +3 0 0.0002 0.0002 +lp.codehosting.vfs.branchfsclient:83(_getFromCache)
          +2 0 0.0004 0.0000 +twisted.internet.defer:290(addCallback)
          +1 0 0.0000 0.0000 +twisted.internet.defer:53(succeed)

           3 0 0.4663 0.0001 lp.services.twistedsupport.xmlrpc:34(callRemote)
          +3 0 0.0001 0.0001 +<getattr>
          +3 0 0.4661 0.0000 +xmlrpclib:1198(__call__)

And what do you know, the xmlrpc service rears its ugly head again. Looking at it from a different aggregate:

           3 0 0.7262 0.0002 lp.services.twistedsupport.xmlrpc:47(callRemote)
          +3 0 0.7260 0.0004 +twisted.internet.defer:112(maybeDeferred)
          +1 0 0.0000 0.0000 +socket:274(__del__)

So all the time seems to be spent in the XMLRPC calls. The next steps seem to be figuring out why the XMLRPC calls are so expensive.
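One way to start on that would be to time a translatePath call in isolation, outside lp-serve. A minimal sketch, assuming a local Launchpad dev instance; the endpoint URL here is an assumption, and the arguments are the ones visible in the traces further down this bug:

import time
import xmlrpclib  # Python 2, as used throughout this report

# Assumed URL for the internal codehosting XML-RPC endpoint on a dev instance.
ENDPOINT = 'http://xmlrpc-private.launchpad.dev:8087/codehosting'
proxy = xmlrpclib.ServerProxy(ENDPOINT)

start = time.time()
try:
    result = proxy.translatePath(243652, '//~jameinel/%2Bjunk/x')
    print('translatePath -> %r in %.3fs' % (result, time.time() - start))
except Exception as e:
    print('call failed after %.3fs: %s' % (time.time() - start, e))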

Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 740759] Re: creating bzrdir on launchpad is slow (BzrDirFormat.initialize_ex_1.16)

Of course this is a different xmlrpc (the internal dc one) to the
other aspect of the bug.

It looks like there is some substantial delay inside lp.codehosting
itself, and also some out-of-process delay talking to the database via
xmlrpc.

I wonder if we have, or could get, some logging showing how long those
calls take in production. Or perhaps we could get a losa to execute a
similar call and measure it.

If you want to pull on this string you can, but you're not obliged.

Martin

Revision history for this message
Martin Pool (mbp) wrote :

After speaking to Robert about this, it seems some of this is going to be due to just generally high load on that machine and in the data centre, and there are already RTs underway to address that. There probably is also some room for improvement in how this specific code runs, but it's probably not worth pursuing until the general thing is improved.

Revision history for this message
Robert Collins (lifeless) wrote :

On Thu, Mar 24, 2011 at 11:52 AM, Martin Pool <email address hidden> wrote:
> Of course this is a different xmlrpc (the internal dc one) to the
> other aspect of the bug.

Internally we have little separation: the internal xmlrpc is just
another port on our primary appservers. There is a different haproxy
cluster so the queueing effects can be different, but it's suboptimal -
we'd like to move to a single optimisable config, but that's a bit
longer term.

Revision history for this message
Martin Pool (mbp) wrote :

Well, my real point is that removing xmlrpc from the bzr client's
connection process will not help with this.

Martin

Revision history for this message
John A Meinel (jameinel) wrote :

Right. *My* point is that the xmlrpc being run "locally" issues only 3 queries in total, yet takes up more than 700ms on my VM. There's no good way to test that on production, I think. I suppose I could add some debug logging that reports how long the XMLRPC calls are taking as part of lp-serve.

Given that the *bzr* work to actually initialize the directory was only 200ms, spending 3x longer deciding that it is allowed to do that work seems significant, and worth trying to optimize.

As Robert says, I'm sure the actual datacenter XMLRPC logic in production is horribly complex and might be hard to optimize. I might give it a poke and see if there is anything obvious. (SQL request on a table without the appropriate index, etc.)

Revision history for this message
John A Meinel (jameinel) wrote :

I re-ran this test on a VM with somewhat better performance characteristics, just to make sure that wasn't the specific issue.

'time echo hello | ssh' took 242ms
'time cat initialize_bzrdir_ex.txt | ssh' took 877ms

Which corresponds well with lp-serve taking 664ms in 'run_argv_aliases'

What surprises me a lot is that the slowest function is this:
          29 0 0.4081 0.2940 twisted.python.failure:416(__getstate__)
      +34492 0 0.1141 0.0235 +twisted.python.reflect:557(safe_repr)
         +11 0 0.0000 0.0000 +<method 'copy' of 'dict' objects>
        2680 0 0.1329 0.1329 <method 'recv' of '_socket.socket' objects>

We still have this:
          48 0 0.5155 0.0003 lp.codehosting.vfs.branchfs:392(translateVirtualPath)
         +48 0 0.3116 0.0011 +twisted.internet.defer:271(addCallbacks)
           6 0 0.1432 0.0001 lp.services.twistedsupport.xmlrpc:34(callRemote)
          +6 0 0.0001 0.0001 +<getattr>
          +6 0 0.1430 0.0000 +xmlrpclib:1198(__call__)

So of the 664ms to actually initialize, 515ms is spent in translateVirtualPath. However, of that time, 410ms is actually spent handling failure.__getstate__.
35k calls to safe_repr sure seems like a lot for 29 calls to the function. It hints that exception handling in Twisted is really quite expensive, though I'm sure --lsprof exacerbates this.

It does say that the xmlrpc calls cost 143ms of the 600ms.

Without --lsprof, from outside I see 161ms to 'echo hello' and 671ms to 'initialize_on_transport_ex', so 510ms to actually do the initialization.

That says lsprof doesn't add all that much overhead, considering it only goes from 510ms to 660ms with lsprof active.

Being a bit more invasive, and instrumenting lp.services.twistedsupport.xmlrpc.BlockingProxy to log how long each xmlrpc request takes, without running --lsprof, I get:

0.029 hpss request: [140370829006592] BzrDirFormat.initialize_ex_1.16 'Bazaar-NG
[ 5073] 2011-03-24 15:19:01.447 INFO: method: translatePath took 0.025s
[ 5073] 2011-03-24 15:19:01.560 INFO: method: createBranch took 0.057s
[ 5073] 2011-03-24 15:19:01.583 INFO: method: translatePath took 0.022s
[ 5073] 2011-03-24 15:19:01.632 INFO: method: translatePath took 0.021s
[ 5073] 2011-03-24 15:19:01.681 INFO: method: translatePath took 0.018s
[ 5073] 2011-03-24 15:19:01.917 INFO: method: translatePath took 0.012s
0.543 creating repository in lp-78465616:///~jameinel/%2Bjunk/x/.bzr.

Which is 514ms to run initialize_ex. 155ms of that is spent in XMLRPC calls.
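For reference, the instrumentation would be roughly of this shape (a sketch of the idea, not the actual tweak, assuming BlockingProxy forwards callRemote to an underlying xmlrpclib proxy); it is what produces the "method: ... took" lines above:

import logging
import time

class TimedBlockingProxy(object):
    """Wrap an xmlrpclib-style proxy so each remote call logs its duration."""

    def __init__(self, proxy):
        self._proxy = proxy

    def callRemote(self, method_name, *args):
        start = time.time()
        try:
            return getattr(self._proxy, method_name)(*args)
        finally:
            logging.info('method: %s took %.3fs',
                         method_name, time.time() - start)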

Compare that to openssh directly:

Thu 2011-03-24 15:22:46 +0100
0.032 bazaar version: 2.1.1
0.032 bzr arguments: [u'serve', u'--inet', u'--directory=/', u'--allow-writes']
...
0.113 hpss request: [139780618553088] BzrDirFormat.initialize_ex_1.16 'Bazaar-NG m
0.143 creating repository in filtered-41417872:///~jameinel/%2Bjunk/x/.bzr/.

Which says it takes 30ms to actually perform the action.

Revision history for this message
John A Meinel (jameinel) wrote :

This attachment is the '.bzr.log' from the forking child, with one small tweak to time each xmlrpc call.

I think individual hpss calls are shown because I have "-Dhpss" set as default. Regardless, it is nice to see.

You can see things like this:

0.121 hpss request: [140529189369600] BzrDirFormat.initialize_ex_1.16 'Bazaar-NG
...
0.546 creating repository in lp-82537104:///~jameinel/%2Bjunk/x/.bzr.
...
0.561 end: [140529189369600] 0.440s

Which shows that it is, indeed, taking 440ms to initialize_on_transport_ex. The 6 rpc calls take about 132ms of that time. Real-world behavior will obviously depend a lot on load, etc., but that leaves 308ms of time spent inside the lp-serve process itself.

Next I'll upload the same .bzr.log file for a pure 'bzr' host.

Revision history for this message
John A Meinel (jameinel) wrote :

"bzr push bzr+ssh://host/~jameinel/+junk/x" directly to the host (using openssh and system bzr) (I didn't realize that ~user really does work for regular servers.)

The key bit is:

0.137 hpss request: [140462824666880] BzrDirFormat.initialize_ex_1.16 'Bazaar-NG
0.164 creating repository in filtered-51379344:///~/%2Bjunk/x/.bzr/.
0.169 end: [140462824666880] 0.032s

So it is 32ms to do that without Launchpad's VFS abstraction, and 440ms to do the same thing with it. O.o

Revision history for this message
Martin Pool (mbp) wrote :

So the current scoreboard seems to be: 3.0s excess time to do this on
lpnet, vs about 450ms to do it on lp.dev.

It seems a lot of the difference between them is due to load or
queueing on the production instance. Since that's 85% of the problem
and since there is work underway to fix it, I think we should shelve
this until that is done, and then measure again and investigate the
causes of the delay vs plain bzr.

Martin

Revision history for this message
Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
[...]
> So of the 664ms to actually initialize, 515ms of it is spent in
> translateVirtualPath. However, of that time, 410ms is actually spent
> handling failure.__getstate__
> 35k calls to safe_repr sure seems like a lot for 29 calls to the
> function. It hints that exception handling in twisted is really quite
> expensive. Though I'm sure --lsprof exacerbates this.

That is a lot of calls. I wonder what the failures actually are, and if
they can be avoided?

Twisted's Failure class is definitely uglier and more expensive than it
ought to be.

The __getstate__ calls are probably due to calls to
failure.cleanFailure(), and the most likely caller of that is Deferred;
when a Deferred finds it can't run any more callbacks/errbacks
immediately (there are no more callbacks in the chain, or the most
recent callback returned an unfired Deferred, etc), then it calls
cleanFailure basically to release memory. Given how expensive
cleanFailure can be that's a fairly unsatisfactory compromise. It would
be nice to figure out something better.

-Andrew.
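To put a number on how expensive that path is, here is a rough, self-contained sketch (an illustration against a stock Twisted install, not something from this bug's attachments) comparing capturing a Failure with and without cleanFailure():

import timeit
from twisted.python.failure import Failure

def make_failure():
    """Capture a Failure for the currently-raised exception."""
    try:
        raise ValueError('boom')
    except ValueError:
        return Failure()

capture_only = timeit.timeit(make_failure, number=1000)
with_clean = timeit.timeit(lambda: make_failure().cleanFailure(), number=1000)
print('capture only:           %.3fs per 1000' % capture_only)
print('capture + cleanFailure: %.3fs per 1000' % with_clean)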

Revision history for this message
John A Meinel (jameinel) wrote :

Andrew, I believe you are correct about who is calling cleanFailure (Deferred). I tried a simple hack:

    def cleanFailure(self):
        """Remove references to other objects, replacing them with strings.
        """
        # Hack: return immediately so the expensive __getstate__ copy
        # below never runs.
        return
        self.__dict__ = self.__getstate__()

With that code in place, I got:
***** WITH cleanFailure ******
0.037 hpss request: [139767171196672] BzrDirFormat.initialize_ex_1.16
0.516 end: [139767171196672] 0.482s

0.037 hpss request: [139767171196672] BzrDirFormat.initialize_ex_1.16
0.768 end: [139767171196672] 0.736s

0.036 hpss request: [139767171196672] BzrDirFormat.initialize_ex_1.16
0.523 end: [139767171196672] 0.490s

***** WITHOUT cleanFailure ******
0.035 hpss request: [140723835176704] BzrDirFormat.initialize_ex_1.16
0.424 end: [140723835176704] 0.393s

0.044 hpss request: [140723835176704] BzrDirFormat.initialize_ex_1.16
0.275 end: [140723835176704] 0.236s

0.040 hpss request: [140723835176704] BzrDirFormat.initialize_ex_1.16
0.321 end: [140723835176704] 0.285s

0.036 hpss request: [140723835176704] BzrDirFormat.initialize_ex_1.16
0.264 end: [140723835176704] 0.231s

Best times: 231ms vs 482ms. So __getstate__ is accounting for ~251ms in my testing, which is pretty frickin' huge. My immediate thought is that we should just avoid exceptions-as-return-codes when going through a Deferred.
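As an illustration of that idea (hypothetical names, not the Launchpad code): model the expected "not allowed" case as an ordinary return value, so the Deferred chain stays on the callback side and never constructs a Failure at all.

from twisted.internet import defer

def user_can_write(user_id, path):
    # Stand-in for the real permission check; always allows in this sketch.
    return True

def check_write_permission(user_id, path):
    """Return a Deferred firing with (ok, detail) instead of errbacking."""
    if not user_can_write(user_id, path):
        # A plain value travels down the callback chain: no Failure object
        # is built, so Deferred never needs to call cleanFailure().
        return defer.succeed((False, 'permission denied'))
    return defer.succeed((True, None))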

Revision history for this message
John A Meinel (jameinel) wrote :

So one more hack. Changing twisted.python.Failure to just have:

 tb = None

before it actually does anything with the traceback. Doing that the times are:

***** TB = NONE ******
0.033 hpss request: [140029379712768] BzrDirFormat.initialize_ex_1.16
0.372 end: [140029379712768] 0.352s
0.037 hpss request: [140029379712768] BzrDirFormat.initialize_ex_1.16
0.287 end: [140029379712768] 0.254s
0.035 hpss request: [140029379712768] BzrDirFormat.initialize_ex_1.16
0.252 end: [140029379712768] 0.220s
0.035 hpss request: [140029379712768] BzrDirFormat.initialize_ex_1.16
0.215 end: [140029379712768] 0.183s

So best case time around 183ms. Digging further into that:

0.035 hpss request: [140029379712768] BzrDirFormat.initialize_ex_1.16
0.061 method: translatePath (243652, '//~jameinel/%2Bjunk/x')
  took 0.025s
0.105 method: createBranch (243652, '/~jameinel/%2Bjunk/x')
  took 0.043s
0.130 method: translatePath (243652, '//~jameinel/%2Bjunk/x')
  took 0.024s
0.167 method: translatePath (243652, '//~jameinel/%2Bjunk/.bzr/branch-format')
  took 0.023s
0.186 method: translatePath (243652, '//~jameinel/.bzr/branch-format')
  took 0.018s
0.200 method: translatePath (243652, '//.bzr/branch-format')
  took 0.014s
0.201 creating repository in lp-79166544:///~jameinel/%2Bjunk/x/.bzr.
0.215 response: [140029379712768] ('.', 'yes', 'no', 'yes'
0.215 response end: [140029379712768] -1301322142.923s
0.215 end: [140029379712768] 0.183s

So the XMLRPC calls end up costing: 147ms. Plus the baseline of 30ms or so. (And looking at the 'slow' version, the XMLRPC time is 300ms).

So a summary:

  30ms raw-bzrlib
 150ms xmlrpc calls to get and create branch mapping
 300ms Twisted Failure traceback handling
 -----
 480ms Time in codehosting on my VM.

That sure looks like a lot of overhead for our virtualization efforts.

Revision history for this message
John A Meinel (jameinel) wrote :

So Andrew put together lp:~spiv/launchpad/codehosting-cheaper-failures, and I did some tracing and found one more case where we were creating a deferred that always used the default Failure handling. Putting those patches together, I did a bit more perf testing. The following numbers are from running 'initialize' multiple times (deleting the branch in between) and picking the best time.

 703ms baseline (no modifications)
 573ms DeferredBlockingProxy, special 'makeDeferred'
 255ms codehosting-cheaper-failures
 173ms codehosting-cheaper-failures + makeDeferred
 180ms baseline, hacking Twisted.Failure.__init__ 'tb = None' always

To see if this had a real-world effect, I went ahead and did a 'bzr push' of a one-revision branch, in a loop of 20 calls.
 1005ms baseline
  764ms codehosting-cheaper-failures + makeDeferred

That includes the ssh handshake overhead and all the other work that 'push' has to do. (The best-case 'initialize' call during this is 408ms for baseline, 159ms patched.)
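Something like the following would reproduce that measurement (a sketch under assumptions: the target URL is hypothetical, and fresh branch names stand in for the delete-between-runs step):

import subprocess
import time

def timed_push(target):
    start = time.time()
    subprocess.check_call(['bzr', 'push', target])
    return time.time() - start

# Push each run to a fresh +junk branch so initialize_on_transport_ex runs
# every time (the report instead deleted the same branch between runs).
times = [timed_push('lp://dev/~jameinel/+junk/bench-%d' % i) for i in range(20)]
print('best: %.3fs   worst: %.3fs' % (min(times), max(times)))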

Martin Pool (mbp)
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
Martin Pool (mbp) wrote :

https://bugs.launchpad.net/launchpad/+bug/751042 for the specific issue of this being slow in twisted.

Revision history for this message
John A Meinel (jameinel) wrote :

See: https://bugs.launchpad.net/launchpad/+bug/751042/comments/2

Right now, the time averages about 2.0s for the "initialize_on_transport_ex" call on the production server. Some of that probably includes network latency (my ping to bazaar.launchpad.net is currently highly variable, fluctuating from 75ms to 300ms; I'm guessing that using wireless across the house isn't helping).

However, 4s peak and 1.3s minimum aren't because of my local network.

Revision history for this message
John A Meinel (jameinel) wrote :

I think the only thing left to do for this one is to update the version of Twisted that Launchpad uses, since Andrew got his change to make Failure not critically slow into Twisted core. Has there been a release with that yet, or is it only in trunk?

Revision history for this message
Andrew Bennetts (spiv) wrote :

It's still only in Twisted's trunk. Whenever 11.1 is released it'll be included. We could backport just that patch to Launchpad's version (or just update Launchpad's version to current trunk?).

Revision history for this message
John A Meinel (jameinel) wrote :

Andrew's patch has landed, and the next step here is just to upgrade to a Twisted release that includes his improvements, so that Failure is cheap by default.
I imagine that Twisted will get upgraded naturally after the next release, so I'm considering this bug closed.

Changed in launchpad:
status: In Progress → Fix Released