loggerhead

Merge lp:~jameinel/loggerhead/less_work_for_head_716217 into lp:loggerhead

Proposed by John A Meinel on 2011-02-10

Status:	Work in progress
Proposed branch:	lp:~jameinel/loggerhead/less_work_for_head_716217
Merge into:	lp:loggerhead
Prerequisite:	lp:~jameinel/loggerhead/head_middleware
Diff against target:	266 lines (+129/-24), 6 files modified:
	NEWS (+7/-13)
	loggerhead/controllers/__init__.py (+39/-3)
	loggerhead/controllers/changelog_ui.py (+15/-0)
	loggerhead/tests/__init__.py (+1/-4)
	loggerhead/tests/test_controllers.py (+67/-0)
	loggerhead/tests/test_simple.py (+0/-4)
To merge this branch:	bzr merge lp:~jameinel/loggerhead/less_work_for_head_716217

Reviewer	Review Type	Date Requested	Status
Loggerhead Team		2011-02-10	Pending
Review via email: mp+49182@code.launchpad.net

Commit message

Do less work on HEAD requests for 'changes' URLs (bug #716217)

Description of the change

This is a bit of exploration of what it would take to shortcut the HEAD request.

It adds an API, "validate_call()", which can do one of:
  1) Return True, and set Headers to be exactly what they would be for a normal request.
  2) Raise an error exception, though this should also match a normal request. This is how the
     stack currently handles stuff like 404.
  3) Return False, indicating "I don't know without further processing".

If (3) is triggered, then we do the request normally.
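
The three outcomes above can be sketched as a small dispatch inside a WSGI view. This is only an illustration under assumptions, not the code in the branch: `SketchView`, `CheapChanges` and `render` are hypothetical names chosen here, and only the `validate_call(path, kwargs, headers)` signature and its True / raise / False contract come from the description.

```python
class SketchView:
    """Hypothetical stand-in for a templated view (not loggerhead's class)."""

    def validate_call(self, path, kwargs, headers):
        # Default answer is case 3: "I don't know without further processing".
        return False

    def render(self, path, kwargs, headers):
        # Placeholder for the expensive work a normal request would do.
        return [b"<html>...</html>"]

    def __call__(self, environ, start_response, path=None, kwargs=None):
        kwargs = kwargs or {}
        headers = {}
        is_head = environ.get("REQUEST_METHOD", "GET") == "HEAD"
        # Case 1 answers immediately; case 2 simply propagates as an exception.
        if is_head and self.validate_call(path, kwargs, headers):
            headers.setdefault("Content-Type", "text/html")
            start_response("200 OK", list(headers.items()))
            return []
        # Case 3, or any non-HEAD request: do the request normally.
        body = self.render(path, kwargs, headers)
        headers.setdefault("Content-Type", "text/html")
        start_response("200 OK", list(headers.items()))
        return [] if is_head else body


class CheapChanges(SketchView):
    def validate_call(self, path, kwargs, headers):
        # Only vouch for the bare request; anything else falls back to case 3.
        return path is None and not kwargs
```

The fallback keeps the shortcut safe: a subclass only has to recognise the requests it can vouch for cheaply, and everything else takes the old path.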

This is built on my http_head branch. It doesn't have to be, but it did find some bugs in the code.

What it *really* pointed out to me is how hard it is to test loggerhead code, how many edge cases raise random errors rather than clean HTTP error codes, and how much argument handling happens very late in the game.

Specifically for '/changes', you can have arguments of the form:

/changes/<revid>/file/path?filter_file_id=<file_id>&q=<query?>

revid could be a revno, which gets validated fairly early. However, if it is a revid that isn't in the ancestry of the branch, it fails rather late (after we've started trying to find the history, and we've already loaded the branch cache, etc.)

If a file path is passed, then we map it to a file id. So we'd have to test the inventory at a given revision, etc.
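
To make the pieces concrete, here is a rough sketch of how such a request breaks down and which parts would each need their own validation. The helper names `split_changes_request` and `is_cheap_to_validate` are hypothetical, and loggerhead's real argument handling is not reproduced here; the sketch assumes `path_info` is whatever follows `/changes`.

```python
from urllib.parse import parse_qs


def split_changes_request(path_info, query_string):
    """Split a '/changes' request into revid, file path and query arguments.

    Hypothetical helper for illustration only.
    """
    parts = [p for p in path_info.split("/") if p]
    revid = parts[0] if parts else None       # a revno or a revision id
    file_path = "/".join(parts[1:]) or None   # optional path inside the tree
    # parse_qs returns lists of values; keep the first value of each key.
    query = {k: v[0] for k, v in parse_qs(query_string).items()}
    return revid, file_path, query


def is_cheap_to_validate(path_info, query_string):
    # Only the bare request avoids ancestry, inventory and query checks.
    revid, file_path, query = split_changes_request(path_info, query_string)
    return revid is None and file_path is None and not query
```

Each non-empty piece corresponds to a check that currently happens late: the revid against the branch ancestry, the file path against an inventory, and the query arguments against the search code.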

So while I can certainly make it quick to answer the HEAD request that we have haproxy requesting, it is *very hard* to do it generically.

The only bit that we could do more easily is to *not* expand the template if we're answering a HEAD request.

Martin Pool (mbp) wrote on 2011-02-10:

Thanks for having a go at that. I see your point about it being hard
to be sure it will work and this approach sounds good. It gives us a
way to make particular things faster if we care.

+ # Since HEAD is supposed to match GET, shouldn't we be doing something
+ # like sorting the headers?

I don't think the headers are considered to be ordered, or that anyone
would complain they were in a different order.

On a brief read through it looks good

> === modified file 'loggerhead/controllers/__init__.py'
> --- loggerhead/controllers/__init__.py 2009-10-17 08:59:33 +0000
> +++ loggerhead/controllers/__init__.py 2011-02-10 05:28:12 +0000
> @@ -49,8 +49,18 @@
>
>
> class TemplatedBranchView(object):
> + """Base class for most views.
> +
> + This is based on the idea that .get_values() will populate a dict with
> + values for a template, which will then be expanded and returned for a given
> + request.
> +
> + def validate_call(self, path, kwargs, headers):
> + if path is not None or kwargs or self.args:
> + # This includes a file-id, abort for now
> + return False
> + # Make sure we have a valid revid
> + # XXX: If revid is a string and not in the history of the branch, we
> + # won't detect that here. That is done as part of get_view. For
> + # now, if self.args is not empty, we just abort above.
> + self.get_revid()
> + # Note: We know it is safe to return True, because we don't set any
> + # actual headers. As long as path is None and there aren't
> + # kwargs, then the only remaining bit is if revid is not actually
> + # part of the branch, we can fail early.
> + return True
> +

I wouldn't say 'abort' which sounds like failing the request; rather
'have to actually render it' or something.

lp:~jameinel/loggerhead/less_work_for_head_716217 updated on 2011-02-10

428. By John A Meinel on 2011-02-10: Cleanups from Martin's review.

John A Meinel (jameinel) wrote on 2011-02-10:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/9/2011 11:59 PM, Martin Pool wrote:
> Thanks for having a go at that. I see your point about it being hard
> to be sure it will work and this approach sounds good. It gives us a
> way to make particular things faster if we care.
>
> + # Since HEAD is supposed to match GET, shouldn't we be doing something
> + # like sorting the headers?
>
> I don't think the headers are considered to be ordered, or that anyone
> would complain they were in a different order.
>
> On a brief read through it looks good

You didn't actually vote here, and you voted Needs_Fixing on the
prerequisite branch.

I have 3 patches addressing this, and they all handle a little bit of
the issue, but all are valuable:

https://code.launchpad.net/~jameinel/loggerhead/head_middleware/+merge/49170
(prerequisite) Ensures that we conform to the HTTP spec for any
requests that we serve. We could probably add some sort of warning to
the log file if we generate content that gets suppressed?

https://code.launchpad.net/~jameinel/loggerhead/less_work_for_head_716217/+merge/49182
(this fix) Gives a specific method for optimizing a HEAD request whose
performance we care about. In this particular case it should make "HEAD
/changes" very fast as long as there aren't any other options, which is
exactly what haproxy is sending.

https://code.launchpad.net/~jameinel/loggerhead/no_template_for_head_716217/+merge/49185
Could be considered enough of a fix by itself. But it doesn't address
performance as much as this one, nor does it ensure that all HEAD
requests conform to the HTTP spec like the first one.

If I had to pick only one fix, it would certainly be the last one, but I
think all 3 are worth merging.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1T6CIACgkQJdeBCYSNAAMibACgkTiWa1pcLWiRdeVtcuAt4+8y
iK4AoJ/kYQo4/x4/NjKIiH4TgxMk+HU3
=SSlu
-----END PGP SIGNATURE-----

lp:~jameinel/loggerhead/less_work_for_head_716217 updated on 2011-03-03

429. By John A Meinel on 2011-03-03: Merge trunk in, resolving merge conflicts.

John A Meinel (jameinel) wrote on 2011-03-03:

Should we kick this back? I'd like to leave it, pending evaluation of whether the HEAD requests are significant overhead. history-db also reduces the base overhead.

Doing some benchmarking. Using a local bzr.dev branch, whose history is cached on disk, but not necessarily in memory.

            HEAD-1  GET-1   HEAD*   GET*
trunk       0.330   0.312   0.153   0.203
less_work   0.059   0.487   0.035   0.207

* time (in seconds) for requests after the first request

So for an uncached HEAD request of bzr.dev/changes, it drops the request time
from >300ms down to 60ms. Since the haproxies are querying every 10s, the
cached form is probably more relevant, and that drops from 153ms down to 35ms.

I don't really know if it is significant or not in the grand scheme of things.
But it drops us from .153s/10s = 1.5% pure overhead to 0.035s/10s = 0.35% overhead
for haproxy telling us everything is ok.
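
The overhead figures are just the cached HEAD time divided by the polling interval. A small worked version, using only the timings and the 10s interval quoted above (the function name `overhead_percent` is made up for this sketch):

```python
POLL_INTERVAL_S = 10.0  # haproxy polls every 10 s, per the comment above


def overhead_percent(request_seconds, interval_seconds=POLL_INTERVAL_S):
    """Share of wall-clock time spent answering one health check per interval."""
    return 100.0 * request_seconds / interval_seconds


# Cached HEAD timings (the HEAD* column), in seconds.
for name, cached_head in (("trunk", 0.153), ("less_work", 0.035)):
    print("%-10s %.2f%%" % (name, overhead_percent(cached_head)))
```

This assumes a single haproxy poller per 10s window; with several pollers the percentages scale up linearly.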

It might also serve as a template if we find other HEAD requests causing overhead.

Note that this patch is technically built on the HeadMiddleware patch, though I
can revert those changes if necessary.

Martin Pool (mbp) wrote on 2011-03-04:

On 3 March 2011 23:04, John A Meinel <email address hidden> wrote:
> Should we kick this back? I'd like to leave it, pending evaluation of whether the HEAD requests are significant overhead. history-db also reduces the base overhead.

If by kick it/leave it you mean just do nothing with it until we see
that cutting the cost of HEAD requests is really worthwhile, that's
ok with me. We can just mark it rejected and leave it attached to the
bug.

Sorry for sending you on a wild goose chase; I didn't think enough
about the fact that getting a realistic result might mean taking it
most of the way through rendering. I don't think cutting the cost of
initial GET is a good tradeoff to speed up later HEAD requests.

I do think it's worthwhile/important that HEAD not actually send a
body back, because that clearly is violating the rfc in a fairly
fundamental way. Just generating and discarding it is ok with me.

It seems like eventually to be able to stream the body we would need
to be able to tell before rendering whether the request is ultimately
likely to succeed or not.
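
The "generate and discard" approach mentioned above could look roughly like the following WSGI wrapper. This is a sketch under assumptions, not the code from the head_middleware branch: the class name `DiscardBodyOnHead` is hypothetical, and it relies only on the standard WSGI calling convention.

```python
class DiscardBodyOnHead:
    """Hypothetical WSGI wrapper: run the app fully, drop the body on HEAD."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") != "HEAD":
            return self.app(environ, start_response)

        def head_start_response(status, headers, exc_info=None):
            start_response(status, headers, exc_info)
            # Swallow anything the app pushes through the write() callable.
            return lambda data: None

        result = self.app(environ, head_start_response)
        try:
            for _chunk in result:
                pass  # generate the body, but discard it
        finally:
            close = getattr(result, "close", None)
            if close is not None:
                close()
        return []
```

Note that this does not reduce the work done per request; it only guarantees that status and headers match a GET while no body is sent.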

John A Meinel (jameinel) wrote on 2011-03-04:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/4/2011 4:55 AM, Martin Pool wrote:
> On 3 March 2011 23:04, John A Meinel <email address hidden> wrote:
>> Should we kick this back? I'd like to leave it, pending evaluation of whether the HEAD requests are significant overhead. history-db also reduces the base overhead.
>
> If by kick it/leave it you mean just do nothing with it until we see
> that cutting the cost of HEAD requests is really worthwhile, that's
> ok with me. We can just mark it rejected and leave it attached to the
> bug.

Kick it => mark rejected
Leave it => leave in the queue, maybe mark it WiP.

>
> Sorry for sending you on a wild goose chase; I didn't think enough
> about the fact that getting a realistic result might mean taking it
> most of the way through rendering. I don't think cutting the cost of
> initial GET is a good tradeoff to speed up later HEAD requests.

Well, it was an afternoon, longer just sitting around.

>
> I do think it's worthwhile/important that HEAD not actually send a
> body back, because that clearly is violating the rfc in a fairly
> fundamental way. Just generating and discarding it is ok with me.
>
> It seems like eventually to be able to stream the body we would need
> to be able to tell before rendering whether the request is ultimately
> likely to succeed or not.
>

Right. Though I don't know that 'changes' has so much content that
streaming it is a big deal. Something like View/Annotate certainly does.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1wjNYACgkQJdeBCYSNAAPRnwCbByyCziZukGN63r09rs5VIegf
pEUAnjAJmgi7NmY9zvB8NTxPrXGGKLyl
=8YPY
-----END PGP SIGNATURE-----

Unmerged revisions

429. By John A Meinel on 2011-03-03

Merge trunk in, resolving merge conflicts.

428. By John A Meinel on 2011-02-10

Cleanups from Martin's review.

427. By John A Meinel on 2011-02-10

Find a case that isn't handled easily, punt.

Note that the tests assert what error is raised, but we really want a better one.
Not worth fixing until we get more of old trunk/experimental merged.

426. By John A Meinel on 2011-02-10

Add some comments, realize that we probably still give the wrong value for a HEAD
if a revid is requested that isn't in the path.

425. By John A Meinel on 2011-02-10

Add some tests for the ChangeLogUI, and implement a method for cheesing HEAD requests.

Basically, if we can answer quickly that HEAD is ok, do so, but allow fallbacks
that do it the way we used to, rather than requiring we handle all processing
immediately.

Preview Diff

 === modified file 'NEWS'
 --- NEWS	2011-03-03 11:52:04 +0000
 +++ NEWS	2011-03-03 11:52:05 +0000
@@ -1,7 +1,6 @@
  What's changed in loggerhead?
  =============================
--<<<<<<< TREE
  1.19 [???]
  ----------------
@@ -10,6 +9,12 @@
        multiple threads, and issue concurrent requests.
        (John Arbash Meinel)
++    - HEAD requests should not return body content. This is done by adding
++      another wsgi middleware that strips the body when the REQUEST_METHOD is
++      HEAD. Note that you have to add the middleware into your pipeline, and
++      it does not decrease the actual work done.
++      (John Arbash Meinel, #716201)
++
      - If we get a HEAD request, there is no reason to expand the template, we
        shouldn't be returning body content anyway.
        (John Arbash Meinel, #716201, #716217)
@@ -17,21 +22,10 @@
      - Merge the pqm changes back into trunk, after trunk was reverted to an old
        revision. (John Arbash Meinel, #716152)
--    - The json module is no longer claimed to be supported as alternative for
++    - The json module is no longer claimed to be supported as alternative for
        simplejson. (Jelmer Vernooij, #586611)
--=======
--1.19 [????]
-------------
--
--    - HEAD requests should not return body content. This is done by adding
--      another wsgi middleware that strips the body when the REQUEST_METHOD is
--      HEAD. Note that you have to add the middleware into your pipeline, and
--      it does not decrease the actual work done.
--      (John Arbash Meinel, #716201)
--
-->>>>>>> MERGE-SOURCE
  1.18 [10Nov2010]
  ----------------
 === modified file 'loggerhead/controllers/__init__.py'
 --- loggerhead/controllers/__init__.py	2011-02-10 17:01:10 +0000
 +++ loggerhead/controllers/__init__.py	2011-03-03 11:52:05 +0000
@@ -51,6 +51,15 @@
  class TemplatedBranchView(object):
++    """Base class for most views.
++
++    This is based on the idea that .get_values() will populate a dict with
++    values for a template, which will then be expanded and returned for a given
++    request.
++
++    :cvar template_path: Subclasses set this to the template that will be
++        used for this particular view.
++    """
      template_path = None
@@ -67,6 +76,25 @@
          self.__history = self._history_callable()
          return self.__history
++
++    def validate_call(self, path, kwargs, headers):
++        """Make sure the request being made is valid.
++
++        This is done for HEAD requests. Instead of calling get_values and
++        rendering the template, we validate the request, and return just the
++        headers.
++
++        Classes which implement validate_call should take care to make sure the
++        headers they set will be identical to headers set for get_values().
++        Probably the best way to do this is to have get_values() call
++        validate_call directly, and only set header data there.
++
++        :return: True if you know the call is valid, raise an exception if
++            there is a problem (preferably an HTTPException), and False if we
++            should just process the call as normal.
++        """
++        return False
++
      def __call__(self, environ, start_response):
          z = time.time()
          kwargs = dict(parse_querystring(environ))
@@ -92,13 +120,16 @@
          vals.update(templatefunctions)
          headers = {}
++        if (environ.get('REQUEST_METHOD', 'GET') == 'HEAD'):
++            if self.validate_call(path, kwargs, headers):
++                self._call_start_response(start_response, headers)
++                return []
++
          vals.update(self.get_values(path, kwargs, headers))
          self.log.info('Getting information for %s: %r secs' % (
              self.__class__.__name__, time.time() - z))
--        if 'Content-Type' not in headers:
--            headers['Content-Type'] = 'text/html'
--        writer = start_response("200 OK", headers.items())
++        writer = self._call_start_response(start_response, headers)
          if environ.get('REQUEST_METHOD') == 'HEAD':
              # No content for a HEAD request
              return []
@@ -112,6 +143,11 @@
                  self.__class__.__name__, time.time() - z, w.bytes))
          return []
++    def _call_start_response(self, start_response, headers):
++        if 'Content-Type' not in headers:
++            headers['Content-Type'] = 'text/html'
++        return start_response("200 OK", headers.items())
++
      def get_revid(self):
          h = self._history
          if h is None:
 === modified file 'loggerhead/controllers/changelog_ui.py'
 --- loggerhead/controllers/changelog_ui.py	2011-03-02 14:07:21 +0000
 +++ loggerhead/controllers/changelog_ui.py	2011-03-03 11:52:05 +0000
@@ -32,6 +32,21 @@
      template_path = 'loggerhead.templates.changelog'
++    def validate_call(self, path, kwargs, headers):
++        if path is not None or kwargs or self.args:
++            # This includes extra arguments that we don't validate yet. So
++            # process as normal.
++
++            # path is not None indicates we have a changelog of a specific file
++            # kwargs can be a query or a file_id filter, etc.
++            # self.args indicates we have a revision id, or a revno. We need
++            #   to validate that the revision is in the ancestry of the branch
++            #   tip.
++            return False
++        # Note: We know it is safe to return True, because we don't set any
++        #       actual headers.
++        return True
++
      def get_values(self, path, kwargs, headers):
          history = self._history
          revid = self.get_revid()
 === modified file 'loggerhead/tests/__init__.py'
 --- loggerhead/tests/__init__.py	2011-03-03 11:52:04 +0000
 +++ loggerhead/tests/__init__.py	2011-03-03 11:52:05 +0000
@@ -20,11 +20,8 @@
          (__name__ + '.' + x) for x in [
              'test_controllers',
              'test_corners',
--<<<<<<< TREE
++            'test_http_head',
              'test_load_test',
--=======
--            'test_http_head',
-->>>>>>> MERGE-SOURCE
              'test_simple',
              'test_templating',
          ]]))
 === modified file 'loggerhead/tests/test_controllers.py'
 --- loggerhead/tests/test_controllers.py	2011-02-10 17:01:10 +0000
 +++ loggerhead/tests/test_controllers.py	2011-03-03 11:52:05 +0000
@@ -6,6 +6,7 @@
  from bzrlib import errors
  from loggerhead.apps.branch import BranchWSGIApp
++from loggerhead.controllers.changelog_ui import ChangeLogUI
  from loggerhead.controllers.annotate_ui import AnnotateUI
  from loggerhead.controllers.inventory_ui import InventoryUI
  from loggerhead.controllers.revision_ui import RevisionUI
@@ -13,6 +14,72 @@
  from loggerhead import util
++class TestChangeLogUI(BasicTests):
++
++    def make_branch_and_changelog_ui(self, tree_shape):
++        tree = self.make_branch_and_tree('.')
++        self.build_tree(tree_shape)
++        tree.smart_add([])
++        tree.commit('simple message')
++        tree.branch.lock_read()
++        self.addCleanup(tree.branch.unlock)
++        branch_app = BranchWSGIApp(tree.branch, '')
++        branch_app.log.setLevel(logging.CRITICAL)
++        # These are usually set in BranchWSGIApp.app(), which is set from env
++        # settings set by BranchesFromTransportRoot, so we fake it.
++        branch_app._static_url_base = '/'
++        branch_app._url_base = '/'
++        return tree.branch, ChangeLogUI(branch_app, branch_app.get_history)
++
++    def consume_app(self, app, extra_environ=None):
++        env = {'SCRIPT_NAME': '/changes', 'PATH_INFO': ''}
++        if extra_environ is not None:
++            env.update(extra_environ)
++        body = StringIO()
++        start = []
++        def start_response(status, headers, exc_info=None):
++            start.append((status, headers, exc_info))
++            return body.write
++        extra_content = list(app(env, start_response))
++        body.writelines(extra_content)
++        return start[0], body.getvalue()
++
++    def test_smoke(self):
++        bzrbranch, change_ui = self.make_branch_and_changelog_ui(
++            ['file'])
++        start, content = self.consume_app(change_ui)
++        self.assertEqual(('200 OK', [('Content-Type', 'text/html')], None),
++                         start)
++        self.assertContainsRe(content, 'simple message')
++
++    def test_simple_head_doesnt_yield_body(self):
++        bzrbranch, change_ui = self.make_branch_and_changelog_ui(
++            ['file'])
++        start, content = self.consume_app(change_ui,
++                            extra_environ={'REQUEST_METHOD': 'HEAD'})
++        self.assertEqual(('200 OK', [('Content-Type', 'text/html')], None),
++                         start)
++        self.assertEqualDiff('', content)
++
++    def test_simple_head_bad_revno(self):
++        bzrbranch, change_ui = self.make_branch_and_changelog_ui(
++            ['file'])
++        # TODO: This is not the ideal error. should be 404, to be done in a
++        #       future patch
++        self.assertRaises(errors.NoSuchRevision, self.consume_app,
++            change_ui, extra_environ={'REQUEST_METHOD': 'HEAD',
++                                      'PATH_INFO': '/9999'})
++
++    def test_simple_head_bad_revid(self):
++        bzrbranch, change_ui = self.make_branch_and_changelog_ui(
++            ['file'])
++        # TODO: This is not the ideal error. should probably be 404, to be done
++        #       in a future patch
++        self.assertRaises(HTTPServerError, self.consume_app,
++            change_ui, extra_environ={'REQUEST_METHOD': 'HEAD',
++                                      'PATH_INFO': '/no-such-revid'})
++
++
  class TestInventoryUI(BasicTests):
      def make_bzrbranch_and_inventory_ui_for_tree_shape(self, shape):
 === modified file 'loggerhead/tests/test_simple.py'
 --- loggerhead/tests/test_simple.py	2011-03-03 11:52:04 +0000
 +++ loggerhead/tests/test_simple.py	2011-03-03 11:52:05 +0000
@@ -1,8 +1,4 @@
--<<<<<<< TREE
--# Copyright (C) 2007-2011 Canonical Ltd.
--=======
  # Copyright (C) 2007, 2008, 2009, 2011 Canonical Ltd.
-->>>>>>> MERGE-SOURCE
+ #
  # This program is free software; you can redistribute it and/or modify
  # it under the terms of the GNU General Public License as published by