Merge lp:~jameinel/loggerhead/less_work_for_head_716217 into lp:loggerhead

Proposed by John A Meinel
Status: Work in progress
Proposed branch: lp:~jameinel/loggerhead/less_work_for_head_716217
Merge into: lp:loggerhead
Prerequisite: lp:~jameinel/loggerhead/head_middleware
Diff against target: 266 lines (+129/-24)
6 files modified
NEWS (+7/-13)
loggerhead/controllers/__init__.py (+39/-3)
loggerhead/controllers/changelog_ui.py (+15/-0)
loggerhead/tests/__init__.py (+1/-4)
loggerhead/tests/test_controllers.py (+67/-0)
loggerhead/tests/test_simple.py (+0/-4)
To merge this branch: bzr merge lp:~jameinel/loggerhead/less_work_for_head_716217
Reviewer Review Type Date Requested Status
Loggerhead Team Pending
Review via email: mp+49182@code.launchpad.net

Commit message

Do less work on HEAD requests for 'changes' urls, (bug #716217)

Description of the change

This is a bit of exploration in what it would take to shortcut the HEAD request.

It adds an api "validate_call()" which can do one of:
  1) Return True, and set Headers to be exactly what they would be for a normal request.
  2) Raise an error exception, though this should also match a normal request. This is how the
     stack currently handles stuff like 404.
  3) Return False, indicating "I don't know without further processing".

If (3) is triggered, then we do the request normally.

This is built on my http_head branch. It doesn't have to be, but it did find some bugs in the code.

What it *really* pointed out to me, is how hard it is to test loggerhead code, how many edge cases raise random errors rather than clean http error codes, and how much argument handling is handled very late in the game.

Specifically for '/changes', you can have arguments of the form:

 /changes/<revid>/file/path?filter_file_id=<file_id>&q=<query?>

revid could be a revno, which gets validated fairly early. However, if it is a revid that isn't in the ancestry of the branch, it fails rather late (after we've started trying to find the history, and we've already loaded the branch cache, etc.)

If a file path is passed, then we map it to a file id. So we'd have to test the inventory at a given revision, etc.

So while I can certainly make it quick to answer the HEAD request that we have haproxy requesting, it is *very hard* to do it generically.

the only bit that we could do more easily is to *not* expand the template if we're answering a HEAD request.

To post a comment you must log in.
Revision history for this message
Martin Pool (mbp) wrote :

Thanks for having a go at that. I see your point about it being hard
to be sure it will work and this approach sounds good. It gives us a
way to make particular things faster if we care.

+ # Since HEAD is supposed to match GET, shouldn't we be doing something
+ # like sorting the headers?

I don't think the headers are considered to be ordered, or that anyone
would complain they were in a different order.

On a brief read through it looks good

> === modified file 'loggerhead/controllers/__init__.py'
> --- loggerhead/controllers/__init__.py  2009-10-17 08:59:33 +0000
> +++ loggerhead/controllers/__init__.py  2011-02-10 05:28:12 +0000
> @@ -49,8 +49,18 @@
>
>
>  class TemplatedBranchView(object):
> +    """Base class for most views.
> +
> +    This is based on the idea that .get_values() will populate a dict with
> +    values for a template, which will then be expanded and returned for a given
> +    request.
> +
> +    def validate_call(self, path, kwargs, headers):
> +        if path is not None or kwargs or self.args:
> +            # This includes a file-id, abort for now
> +            return False
> +        # Make sure we have a valid revid
> +        # XXX: If revid is a string and not in the history of the branch, we
> +        #      won't detect that here. That is done as part of get_view. For
> +        #      now, if self.args is not empty, we just abort above.
> +        self.get_revid()
> +        # Note: We know it is safe to return True, because we don't set any
> +        #       actual headers. As long as path is None and there aren't
> +        #       kwargs, then the only remaining bit is if revid is not actually
> +        #       part of the branch, we can fail early.
> +        return True
> +

I wouldn't say 'abort' which sounds like failing the request; rather
'have to actually render it' or something.

428. By John A Meinel

Cleanups from Martin's review.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/9/2011 11:59 PM, Martin Pool wrote:
> Thanks for having a go at that. I see your point about it being hard
> to be sure it will work and this approach sounds good. It gives us a
> way to make particular things faster if we care.
>
> + # Since HEAD is supposed to match GET, shouldn't we be doing something
> + # like sorting the headers?
>
> I don't think the headers are considered to be ordered, or that anyone
> would complain they were in a different order.
>
> On a brief read through it looks good

You didn't actually vote here, and you voted Needs_Fixing on the
prerequisite branch.

I have 3 patches addressing this, and they all handle a little bit of
the issue, but all are valuable:

https://code.launchpad.net/~jameinel/loggerhead/head_middleware/+merge/49170
(prerequisite) Ensures that we conform to the HTTP spec. For any
requests that we serve. We could probably add to it some sort of warning
to the log file if we generate content that gets suppressed?

https://code.launchpad.net/~jameinel/loggerhead/head_middleware/+merge/49170
(this fix) Gives a specific method for optimizing a HEAD request we care
about performance. In this particular case it should make "HEAD
/changes" very fast as long as there aren't any other options. Which is
exactly what haproxy is sending.

https://code.launchpad.net/~jameinel/loggerhead/no_template_for_head_716217/+merge/49185
Could be considered enough of a fix by itself. But it doesn't address
performance as much as this one, nor does it ensure that all HEAD
requests conform to the HTTP spec like the first one.

If I had to pick only one fix, it would certainly be the last one, but I
think all 3 are worth merging.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1T6CIACgkQJdeBCYSNAAMibACgkTiWa1pcLWiRdeVtcuAt4+8y
iK4AoJ/kYQo4/x4/NjKIiH4TgxMk+HU3
=SSlu
-----END PGP SIGNATURE-----

429. By John A Meinel

Merge trunk in, resolving merge conflicts.

Revision history for this message
John A Meinel (jameinel) wrote :

Should we kick this back? I'd like to leave it, pending evaluation of whether the HEAD requests are significant overhead. history-db also reduces the base overhead.

Doing some benchmarking. Using a local bzr.dev branch, whose history is cached on disk, but not necessarily in memory.

            HEAD-1 GET-1 HEAD* GET*
trunk 0.330 0.312 0.153 0.203
less_work 0.059 0.487 0.035 0.207

 * time for requests after the first request

So for an uncached HEAD request of bzr.dev/changes, it drops the request time
from >300ms down to 60ms. Since the haproxies are querying every 10s, the
cached form is probably more relevant. Which is 153ms down to 35ms.

I don't really know if it is significant or not in the grand scheme of things.
But it drops us from .153s/10s = 1.5% pure overhead to 0.035s/10s = 0.35% overhead
for haproxy telling us everything is ok.

It might also serve as a template if we find other HEAD requests causing overhead.

Note that this patch is technically built on the HeadMiddleware patch, though I
can revert those changes if necessary.

Revision history for this message
Martin Pool (mbp) wrote :

On 3 March 2011 23:04, John A Meinel <email address hidden> wrote:
> Should we kick this back? I'd like to leave it, pending evaluation of whether the HEAD requests are significant overhead. history-db also reduces the base overhead.

If by kick it/leave it you mean just do nothing with it until we see
that cutting the cost of HEAD requests are really worthwhile, that's
ok with me. We can just mark it rejected and leave it attached to the
bug.

Sorry for sending you on a wild goose chase; I didn't think enough
about the fact that getting a realistic result might mean taking it
most of the way through rendering. I don't think cutting the cost of
initial GET is a good tradeoff to speed up later HEAD requests.

I do think it's worthwhile/important that HEAD not actually send a
body back, because that clearly is violating the rfc in a fairly
fundamental way. Just generating and discarding it is ok with me.

It seems like eventually to be able to stream the body we would need
to be able to tell before rendering whether the request is ultimately
likely to succeed or not.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/4/2011 4:55 AM, Martin Pool wrote:
> On 3 March 2011 23:04, John A Meinel <email address hidden> wrote:
>> Should we kick this back? I'd like to leave it, pending evaluation of whether the HEAD requests are significant overhead. history-db also reduces the base overhead.
>
> If by kick it/leave it you mean just do nothing with it until we see
> that cutting the cost of HEAD requests are really worthwhile, that's
> ok with me. We can just mark it rejected and leave it attached to the
> bug.

Kick it => mark rejected
Leave it => leave in the queue, maybe mark it WiP.

>
> Sorry for sending you on a wild goose chase; I didn't think enough
> about the fact that getting a realistic result might mean taking it
> most of the way through rendering. I don't think cutting the cost of
> initial GET is a good tradeoff to speed up later HEAD requests.

Well, it was an afternoon, longer just sitting around.

>
> I do think it's worthwhile/important that HEAD not actually send a
> body back, because that clearly is violating the rfc in a fairly
> fundamental way. Just generating and discarding it is ok with me.
>
> It seems like eventually to be able to stream the body we would need
> to be able to tell before rendering whether the request is ultimately
> likely to succeed or not.
>

Right. Though I don't know that 'changes' has a huge amount of content
that streaming it is a big deal. Something like View/Annotate certainly.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1wjNYACgkQJdeBCYSNAAPRnwCbByyCziZukGN63r09rs5VIegf
pEUAnjAJmgi7NmY9zvB8NTxPrXGGKLyl
=8YPY
-----END PGP SIGNATURE-----

Unmerged revisions

429. By John A Meinel

Merge trunk in, resolving merge conflicts.

428. By John A Meinel

Cleanups from Martin's review.

427. By John A Meinel

Find a case that isn't handled easily, punt.

Note that the tests assert what error is raised, but we really want a better one.
Not worth fixing until we get more of old trunk/experimental merged.

426. By John A Meinel

Add some comments, realize that we probably still give the wrong value for a HEAD
if a revid is requested that isn't in the path.

425. By John A Meinel

Add some tests for the ChangeLogUI, and implement a method for cheesing HEAD requests.

Basically, if we can answer quickly that HEAD is ok, do so, but allow fallbacks
that do it the way we used to, rather than requiring we handle all processing
immediately.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2011-03-03 11:52:04 +0000
3+++ NEWS 2011-03-03 11:52:05 +0000
4@@ -1,7 +1,6 @@
5 What's changed in loggerhead?
6 =============================
7
8-<<<<<<< TREE
9 1.19 [???]
10 ----------------
11
12@@ -10,6 +9,12 @@
13 multiple threads, and issue concurrent requests.
14 (John Arbash Meinel)
15
16+ - HEAD requests should not return body content. This is done by adding
17+ another wsgi middleware that strips the body when the REQUEST_METHOD is
18+ HEAD. Note that you have to add the middleware into your pipeline, and
19+ it does not decrease the actual work done.
20+ (John Arbash Meinel, #716201)
21+
22 - If we get a HEAD request, there is no reason to expand the template, we
23 shouldn't be returning body content anyway.
24 (John Arbash Meinel, #716201, #716217)
25@@ -17,21 +22,10 @@
26 - Merge the pqm changes back into trunk, after trunk was reverted to an old
27 revision. (John Arbash Meinel, #716152)
28
29- - The json module is no longer claimed to be supported as alternative for
30+ - The json module is no longer claimed to be supported as alternative for
31 simplejson. (Jelmer Vernooij, #586611)
32
33
34-=======
35-1.19 [????]
36------------
37-
38- - HEAD requests should not return body content. This is done by adding
39- another wsgi middleware that strips the body when the REQUEST_METHOD is
40- HEAD. Note that you have to add the middleware into your pipeline, and
41- it does not decrease the actual work done.
42- (John Arbash Meinel, #716201)
43-
44->>>>>>> MERGE-SOURCE
45 1.18 [10Nov2010]
46 ----------------
47
48
49=== modified file 'loggerhead/controllers/__init__.py'
50--- loggerhead/controllers/__init__.py 2011-02-10 17:01:10 +0000
51+++ loggerhead/controllers/__init__.py 2011-03-03 11:52:05 +0000
52@@ -51,6 +51,15 @@
53
54
55 class TemplatedBranchView(object):
56+ """Base class for most views.
57+
58+ This is based on the idea that .get_values() will populate a dict with
59+ values for a template, which will then be expanded and returned for a given
60+ request.
61+
62+ :cvar template_path: Subclasses set this to the template that will be
63+ used for this particular view.
64+ """
65
66 template_path = None
67
68@@ -67,6 +76,25 @@
69 self.__history = self._history_callable()
70 return self.__history
71
72+
73+ def validate_call(self, path, kwargs, headers):
74+ """Make sure the request being made is valid.
75+
76+ This is done for HEAD requests. Instead of calling get_values and
77+ rendering the template, we validate the request, and return just the
78+ headers.
79+
80+ Classes which implement validate_call should take care to make sure the
81+ headers they set will be identical to headers set for get_values().
82+ Probably the best way to do this is to have get_values() call
83+ validate_call directly, and only set header data there.
84+
85+ :return: True if you know the call is valid, raise an exception if
86+ there is a problem (preferabbly an HTTPException), and False if we
87+ should just process the call as normal.
88+ """
89+ return False
90+
91 def __call__(self, environ, start_response):
92 z = time.time()
93 kwargs = dict(parse_querystring(environ))
94@@ -92,13 +120,16 @@
95 vals.update(templatefunctions)
96 headers = {}
97
98+ if (environ.get('REQUEST_METHOD', 'GET') == 'HEAD'):
99+ if self.validate_call(path, kwargs, headers):
100+ self._call_start_response(start_response, headers)
101+ return []
102+
103 vals.update(self.get_values(path, kwargs, headers))
104
105 self.log.info('Getting information for %s: %r secs' % (
106 self.__class__.__name__, time.time() - z))
107- if 'Content-Type' not in headers:
108- headers['Content-Type'] = 'text/html'
109- writer = start_response("200 OK", headers.items())
110+ writer = self._call_start_response(start_response, headers)
111 if environ.get('REQUEST_METHOD') == 'HEAD':
112 # No content for a HEAD request
113 return []
114@@ -112,6 +143,11 @@
115 self.__class__.__name__, time.time() - z, w.bytes))
116 return []
117
118+ def _call_start_response(self, start_response, headers):
119+ if 'Content-Type' not in headers:
120+ headers['Content-Type'] = 'text/html'
121+ return start_response("200 OK", headers.items())
122+
123 def get_revid(self):
124 h = self._history
125 if h is None:
126
127=== modified file 'loggerhead/controllers/changelog_ui.py'
128--- loggerhead/controllers/changelog_ui.py 2011-03-02 14:07:21 +0000
129+++ loggerhead/controllers/changelog_ui.py 2011-03-03 11:52:05 +0000
130@@ -32,6 +32,21 @@
131
132 template_path = 'loggerhead.templates.changelog'
133
134+ def validate_call(self, path, kwargs, headers):
135+ if path is not None or kwargs or self.args:
136+ # This includes extra arguments that we don't validate yet. So
137+ # process as normal.
138+
139+ # path is not None indicates we have a changelog of a specific file
140+ # kwargs can be a query or a file_id filter, etc.
141+ # self.args indicates we have a revision id, or a revno. We need
142+ # to validate that the revision is in the ancestry of the branch
143+ # tip.
144+ return False
145+ # Note: We know it is safe to return True, because we don't set any
146+ # actual headers.
147+ return True
148+
149 def get_values(self, path, kwargs, headers):
150 history = self._history
151 revid = self.get_revid()
152
153=== modified file 'loggerhead/tests/__init__.py'
154--- loggerhead/tests/__init__.py 2011-03-03 11:52:04 +0000
155+++ loggerhead/tests/__init__.py 2011-03-03 11:52:05 +0000
156@@ -20,11 +20,8 @@
157 (__name__ + '.' + x) for x in [
158 'test_controllers',
159 'test_corners',
160-<<<<<<< TREE
161+ 'test_http_head',
162 'test_load_test',
163-=======
164- 'test_http_head',
165->>>>>>> MERGE-SOURCE
166 'test_simple',
167 'test_templating',
168 ]]))
169
170=== modified file 'loggerhead/tests/test_controllers.py'
171--- loggerhead/tests/test_controllers.py 2011-02-10 17:01:10 +0000
172+++ loggerhead/tests/test_controllers.py 2011-03-03 11:52:05 +0000
173@@ -6,6 +6,7 @@
174 from bzrlib import errors
175
176 from loggerhead.apps.branch import BranchWSGIApp
177+from loggerhead.controllers.changelog_ui import ChangeLogUI
178 from loggerhead.controllers.annotate_ui import AnnotateUI
179 from loggerhead.controllers.inventory_ui import InventoryUI
180 from loggerhead.controllers.revision_ui import RevisionUI
181@@ -13,6 +14,72 @@
182 from loggerhead import util
183
184
185+class TestChangeLogUI(BasicTests):
186+
187+ def make_branch_and_changelog_ui(self, tree_shape):
188+ tree = self.make_branch_and_tree('.')
189+ self.build_tree(tree_shape)
190+ tree.smart_add([])
191+ tree.commit('simple message')
192+ tree.branch.lock_read()
193+ self.addCleanup(tree.branch.unlock)
194+ branch_app = BranchWSGIApp(tree.branch, '')
195+ branch_app.log.setLevel(logging.CRITICAL)
196+ # These are usually set in BranchWSGIApp.app(), which is set from env
197+ # settings set by BranchesFromTransportRoot, so we fake it.
198+ branch_app._static_url_base = '/'
199+ branch_app._url_base = '/'
200+ return tree.branch, ChangeLogUI(branch_app, branch_app.get_history)
201+
202+ def consume_app(self, app, extra_environ=None):
203+ env = {'SCRIPT_NAME': '/changes', 'PATH_INFO': ''}
204+ if extra_environ is not None:
205+ env.update(extra_environ)
206+ body = StringIO()
207+ start = []
208+ def start_response(status, headers, exc_info=None):
209+ start.append((status, headers, exc_info))
210+ return body.write
211+ extra_content = list(app(env, start_response))
212+ body.writelines(extra_content)
213+ return start[0], body.getvalue()
214+
215+ def test_smoke(self):
216+ bzrbranch, change_ui = self.make_branch_and_changelog_ui(
217+ ['file'])
218+ start, content = self.consume_app(change_ui)
219+ self.assertEqual(('200 OK', [('Content-Type', 'text/html')], None),
220+ start)
221+ self.assertContainsRe(content, 'simple message')
222+
223+ def test_simple_head_doesnt_yield_body(self):
224+ bzrbranch, change_ui = self.make_branch_and_changelog_ui(
225+ ['file'])
226+ start, content = self.consume_app(change_ui,
227+ extra_environ={'REQUEST_METHOD': 'HEAD'})
228+ self.assertEqual(('200 OK', [('Content-Type', 'text/html')], None),
229+ start)
230+ self.assertEqualDiff('', content)
231+
232+ def test_simple_head_bad_revno(self):
233+ bzrbranch, change_ui = self.make_branch_and_changelog_ui(
234+ ['file'])
235+ # TODO: This is not the ideal error. should be 404, to be done in a
236+ # future patch
237+ self.assertRaises(errors.NoSuchRevision, self.consume_app,
238+ change_ui, extra_environ={'REQUEST_METHOD': 'HEAD',
239+ 'PATH_INFO': '/9999'})
240+
241+ def test_simple_head_bad_revid(self):
242+ bzrbranch, change_ui = self.make_branch_and_changelog_ui(
243+ ['file'])
244+ # TODO: This is not the ideal error. should probably be 404, to be done
245+ # in a future patch
246+ self.assertRaises(HTTPServerError, self.consume_app,
247+ change_ui, extra_environ={'REQUEST_METHOD': 'HEAD',
248+ 'PATH_INFO': '/no-such-revid'})
249+
250+
251 class TestInventoryUI(BasicTests):
252
253 def make_bzrbranch_and_inventory_ui_for_tree_shape(self, shape):
254
255=== modified file 'loggerhead/tests/test_simple.py'
256--- loggerhead/tests/test_simple.py 2011-03-03 11:52:04 +0000
257+++ loggerhead/tests/test_simple.py 2011-03-03 11:52:05 +0000
258@@ -1,8 +1,4 @@
259-<<<<<<< TREE
260-# Copyright (C) 2007-2011 Canonical Ltd.
261-=======
262 # Copyright (C) 2007, 2008, 2009, 2011 Canonical Ltd.
263->>>>>>> MERGE-SOURCE
264 #
265 # This program is free software; you can redistribute it and/or modify
266 # it under the terms of the GNU General Public License as published by

Subscribers

People subscribed via source and target branches