Merge lp:~ian-clatworthy/bzr/faster-log.old into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy on 2009-05-25
Status: Merged
Merge reported by: Ian Clatworthy
Merged at revision: not available
Proposed branch: lp:~ian-clatworthy/bzr/faster-log.old
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 331 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log.old
Reviewer Review Type Date Requested Status
bzr-core 2009-05-25 Pending
Review via email: mp+6769@code.launchpad.net
To post a comment you must log in.
Ian Clatworthy (ian-clatworthy) wrote :

This patch makes log --long faster by around 10%. For OpenOffice.org with 260K revisions, the time drops from 50 seconds to 46 seconds.

John A Meinel (jameinel) wrote :
Download full text (3.7 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> Ian Clatworthy has proposed merging lp:~ian-clatworthy/bzr/faster-log into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This patch makes log --long faster by around 10%. For OpenOffice.org with 260K revisions, the time drops from 50 seconds to 46 seconds.
>

I'll try to give a bit of a summary as to why I think this is faster,
though it would be nice if you would give a similar summary when submitting.

1) Pull things like
+ levels = rqst.get('levels')
+ limit = rqst.get('limit')
+ diff_type = rqst.get('diff_type')

Out of the inner loops, and set the variable once at the outside. This
also handles stuff like using an if in the inner loop rather than a
function call:

+ if diff_type is None:
+ diff = None
+ else:
+ diff = self._format_diff(rev, rev_id, diff_type)

2) Factor out "custom_properties" into a re-used helper function.

3) Add a new function: format_date_with_offset_in_original_timezone

4) Change log_revision(self, revision)

so that rather than calling outfile.write() for every line of the
revision, batching it up into a list, and then using ('\n' +
indent).join(lines) and writing that out as a single string.

IMO the batching of everything for one revision seems just about right.

I'm a bit concerned, though, about handling Unicode Author names and
UTF-8 'diff' strings. My guess is that this actually breaks it, and we
just don't know it yet.

Can you please ensure that we have a test with non-ascii file content
(even better if it isn't even UTF-8 since we don't assert anything about
file content encoding), and then also include a non-ascii Author name.
That ensures that we want to output Unicode for the name, and something
random for the diff content.

5) I'm a bit curious about this:
=== modified file 'bzrlib/osutils.py'
- --- bzrlib/osutils.py 2009-05-23 04:55:52 +0000
+++ bzrlib/osutils.py 2009-05-25 05:46:01 +0000
@@ -686,6 +686,8 @@
     return offset.days * 86400 + offset.seconds

 weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
+_default_format_by_weekday_num = [wd + " %Y-%m-%d %H:%M:%S" for wd in
weekdays]
+

^- Does this mean that we are skipping localization? So someone with
de_DE.UTF-8 will no longer see the German short-names for days of the week?

+def format_date_with_offset_in_original_timezone(t, offset=0,
+ _cache=_offset_cache):
+ """Return a formatted date string in the original timezone.
+
+ This routine may be faster then format_date.
+
+ :param t: Seconds since the epoch.
+ :param offset: Timezone offset in seconds east of utc.
+ """
+ if offset is None:
+ offset = 0
+ tt = time.gmtime(t + offset)
+ date_fmt = _default_format_by_weekday_num[tt[6]]
+ date_str = time.strftime(date_fmt, tt)
+ offset_str = _cache.get(offset, None)
+ if offset_str is None:
+ offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
+ _cache[offset] = offset_str
+ return date_str + offset_str
+

^- I wonder if it might be faster to include the offset_str inside the
'date_fm...

Read more...

John A Meinel (jameinel) wrote :
Download full text (3.4 KiB)

Ian Clatworthy wrote:
> Ian Clatworthy has proposed merging lp:~ian-clatworthy/bzr/faster-log into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This patch makes log --long faster by around 10%. For OpenOffice.org with 260K revisions, the time drops from 50 seconds to 46 seconds.
>

I'll try to give a bit of a summary as to why I think this is faster,
though it would be nice if you would give a similar summary when submitting.

1) Pull things like
+ levels = rqst.get('levels')
+ limit = rqst.get('limit')
+ diff_type = rqst.get('diff_type')

Out of the inner loops, and set the variable once at the outside. This
also handles stuff like using an if in the inner loop rather than a
function call:

+ if diff_type is None:
+ diff = None
+ else:
+ diff = self._format_diff(rev, rev_id, diff_type)

2) Factor out "custom_properties" into a re-used helper function.

3) Add a new function: format_date_with_offset_in_original_timezone

4) Change log_revision(self, revision)

so that rather than calling outfile.write() for every line of the
revision, batching it up into a list, and then using ('\n' +
indent).join(lines) and writing that out as a single string.

IMO the batching of everything for one revision seems just about right.

I'm a bit concerned, though, about handling Unicode Author names and
UTF-8 'diff' strings. My guess is that this actually breaks it, and we
just don't know it yet.

Can you please ensure that we have a test with non-ascii file content
(even better if it isn't even UTF-8 since we don't assert anything about
file content encoding), and then also include a non-ascii Author name.
That ensures that we want to output Unicode for the name, and something
random for the diff content.

5) I'm a bit curious about this:
=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py 2009-05-23 04:55:52 +0000
+++ bzrlib/osutils.py 2009-05-25 05:46:01 +0000
@@ -686,6 +686,8 @@
     return offset.days * 86400 + offset.seconds

 weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
+_default_format_by_weekday_num = [wd + " %Y-%m-%d %H:%M:%S" for wd in
weekdays]
+

^- Does this mean that we are skipping localization? So someone with
de_DE.UTF-8 will no longer see the German short-names for days of the week?

+def format_date_with_offset_in_original_timezone(t, offset=0,
+ _cache=_offset_cache):
+ """Return a formatted date string in the original timezone.
+
+ This routine may be faster then format_date.
+
+ :param t: Seconds since the epoch.
+ :param offset: Timezone offset in seconds east of utc.
+ """
+ if offset is None:
+ offset = 0
+ tt = time.gmtime(t + offset)
+ date_fmt = _default_format_by_weekday_num[tt[6]]
+ date_str = time.strftime(date_fmt, tt)
+ offset_str = _cache.get(offset, None)
+ if offset_str is None:
+ offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
+ _cache[offset] = offset_str
+ return date_str + offset_str
+

^- I wonder if it might be faster to include the offset_str inside the
'date_fmt' string, since it would require one less string...

Read more...

Ian Clatworthy (ian-clatworthy) wrote :

John A Meinel wrote:
> I'll try to give a bit of a summary as to why I think this is faster,
> though it would be nice if you would give a similar summary when submitting.
>
>
[snip]
That's an excellent summary - thanks.
> IMO the batching of everything for one revision seems just about right.
>
> I'm a bit concerned, though, about handling Unicode Author names and
> UTF-8 'diff' strings. My guess is that this actually breaks it, and we
> just don't know it yet.
>
> Can you please ensure that we have a test with non-ascii file content
> (even better if it isn't even UTF-8 since we don't assert anything about
> file content encoding), and then also include a non-ascii Author name.
> That ensures that we want to output Unicode for the name, and something
> random for the diff content.
>
I'm confident diff strings are handled correctly because Alexander
submitted a bug fixing this for 1.13. Note that diff strings aren't
written to self.to_file (in the code), but to a wrapped version thereof.

I suspect we do have tests for Unicode authors but I'll confirm that
(probably tomorrow).
> 5) I'm a bit curious about this:
> === modified file 'bzrlib/osutils.py'
> --- bzrlib/osutils.py 2009-05-23 04:55:52 +0000
> +++ bzrlib/osutils.py 2009-05-25 05:46:01 +0000
> @@ -686,6 +686,8 @@
> return offset.days * 86400 + offset.seconds
>
> weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
> +_default_format_by_weekday_num = [wd + " %Y-%m-%d %H:%M:%S" for wd in
> weekdays]
> +
>
> ^- Does this mean that we are skipping localization? So someone with
> de_DE.UTF-8 will no longer see the German short-names for days of the week?
>
Yes it does. There's a separate function - format_local_date IIRC - that
doesn't use English day names. I'll run annotate and find out why we're
doing this but, at a wild guess, it might be related to ensuring tests
pass regardless of locale?

FWIW, I'm *not* changing current behaviour here. If we what to
change/fix the code, I'd like to make that a separate patch if that's OK.

Ian C.

Ian Clatworthy (ian-clatworthy) wrote :

So checking our existing test coverage, we have log tests for unicode messages and unicode diff content but not for unicode committer/authors. Do I need to add those before landing this? My preference is to put that task on my list but not do it immediately, given all I'm doing is replacing "multiple outf.write()" calls with "collecting lines & calling outf.write() once". I can't see how that can break things or how it reduces test coverage?

Otherwise, your suggested tweaks are fine by me.

Martin Pool (mbp) wrote :

2009/5/29 Ian Clatworthy <email address hidden>:
> So checking our existing test coverage, we have log tests for unicode messages and unicode diff content but not for unicode committer/authors. Do I need to add those before landing this? My preference is to put that task on my list but not do it immediately, given all I'm doing is replacing "multiple outf.write()" calls with "collecting lines & calling outf.write() once". I can't see how that can break things or how it reduces test coverage?

It's down to you, but I think those tests will be fairly easy to
write, and it's probably better not to have it hanging around.

--
Martin <http://launchpad.net/~mbp/>

4382. By Ian Clatworthy on 2009-06-10

merge bzr.dev r4426

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2009-08-30 23:51:10 +0000
3+++ NEWS 2009-08-31 04:38:13 +0000
4@@ -972,6 +972,8 @@
5 (Alexander Belchenko, Jelmer Vernooij, John Arbash Meinel)
6
7
8+* ``bzr log`` is now faster. (Ian Clatworthy)
9+
10 Bug Fixes
11 *********
12
13
14=== modified file 'bzrlib/log.py'
15--- bzrlib/log.py 2009-06-10 03:56:49 +0000
16+++ bzrlib/log.py 2009-08-31 04:38:13 +0000
17@@ -83,6 +83,7 @@
18 )
19 from bzrlib.osutils import (
20 format_date,
21+ format_date_with_offset_in_original_timezone,
22 get_terminal_encoding,
23 re_compile_checked,
24 terminal_width,
25@@ -384,27 +385,28 @@
26 :return: An iterator yielding LogRevision objects.
27 """
28 rqst = self.rqst
29+ levels = rqst.get('levels')
30+ limit = rqst.get('limit')
31+ diff_type = rqst.get('diff_type')
32 log_count = 0
33 revision_iterator = self._create_log_revision_iterator()
34 for revs in revision_iterator:
35 for (rev_id, revno, merge_depth), rev, delta in revs:
36 # 0 levels means show everything; merge_depth counts from 0
37- levels = rqst.get('levels')
38 if levels != 0 and merge_depth >= levels:
39 continue
40- diff = self._format_diff(rev, rev_id)
41+ if diff_type is None:
42+ diff = None
43+ else:
44+ diff = self._format_diff(rev, rev_id, diff_type)
45 yield LogRevision(rev, revno, merge_depth, delta,
46 self.rev_tag_dict.get(rev_id), diff)
47- limit = rqst.get('limit')
48 if limit:
49 log_count += 1
50 if log_count >= limit:
51 return
52
53- def _format_diff(self, rev, rev_id):
54- diff_type = self.rqst.get('diff_type')
55- if diff_type is None:
56- return None
57+ def _format_diff(self, rev, rev_id, diff_type):
58 repo = self.branch.repository
59 if len(rev.parent_ids) == 0:
60 ancestor_id = _mod_revision.NULL_REVISION
61@@ -1291,7 +1293,7 @@
62 preferred_levels = 0
63
64 def __init__(self, to_file, show_ids=False, show_timezone='original',
65- delta_format=None, levels=None, show_advice=False):
66+ delta_format=None, levels=None, show_advice=False):
67 """Create a LogFormatter.
68
69 :param to_file: the file to output to
70@@ -1367,48 +1369,64 @@
71 else:
72 return ''
73
74- def show_foreign_info(self, rev, indent):
75+ def show_properties(self, revision, indent):
76+ """Displays the custom properties returned by each registered handler.
77+
78+ If a registered handler raises an error it is propagated.
79+ """
80+ for line in self.custom_properties(revision):
81+ self.to_file.write("%s%s\n" % (indent, line))
82+
83+ def custom_properties(self, revision):
84+ """Format the custom properties returned by each registered handler.
85+
86+ If a registered handler raises an error it is propagated.
87+
88+ :return: a list of formatted lines (excluding trailing newlines)
89+ """
90+ lines = self._foreign_info_properties(revision)
91+ for key, handler in properties_handler_registry.iteritems():
92+ lines.extend(self._format_properties(handler(revision)))
93+ return lines
94+
95+ def _foreign_info_properties(self, rev):
96 """Custom log displayer for foreign revision identifiers.
97
98 :param rev: Revision object.
99 """
100 # Revision comes directly from a foreign repository
101 if isinstance(rev, foreign.ForeignRevision):
102- self._write_properties(indent, rev.mapping.vcs.show_foreign_revid(
103- rev.foreign_revid))
104- return
105+ return rev.mapping.vcs.show_foreign_revid(rev.foreign_revid)
106
107 # Imported foreign revision revision ids always contain :
108 if not ":" in rev.revision_id:
109- return
110+ return []
111
112 # Revision was once imported from a foreign repository
113 try:
114 foreign_revid, mapping = \
115 foreign.foreign_vcs_registry.parse_revision_id(rev.revision_id)
116 except errors.InvalidRevisionId:
117- return
118+ return []
119
120- self._write_properties(indent,
121+ return self._format_properties(
122 mapping.vcs.show_foreign_revid(foreign_revid))
123
124- def show_properties(self, revision, indent):
125- """Displays the custom properties returned by each registered handler.
126-
127- If a registered handler raises an error it is propagated.
128- """
129- for key, handler in properties_handler_registry.iteritems():
130- self._write_properties(indent, handler(revision))
131-
132- def _write_properties(self, indent, properties):
133+ def _format_properties(self, properties):
134+ lines = []
135 for key, value in properties.items():
136- self.to_file.write(indent + key + ': ' + value + '\n')
137+ lines.append(key + ': ' + value)
138+ return lines
139
140 def show_diff(self, to_file, diff, indent):
141 for l in diff.rstrip().split('\n'):
142 to_file.write(indent + '%s\n' % (l,))
143
144
145+# Separator between revisions in long format
146+_LONG_SEP = '-' * 60
147+
148+
149 class LongLogFormatter(LogFormatter):
150
151 supports_merge_revisions = True
152@@ -1417,46 +1435,59 @@
153 supports_tags = True
154 supports_diff = True
155
156+ def __init__(self, *args, **kwargs):
157+ super(LongLogFormatter, self).__init__(*args, **kwargs)
158+ if self.show_timezone == 'original':
159+ self.date_string = self._date_string_original_timezone
160+ else:
161+ self.date_string = self._date_string_with_timezone
162+
163+ def _date_string_with_timezone(self, rev):
164+ return format_date(rev.timestamp, rev.timezone or 0,
165+ self.show_timezone)
166+
167+ def _date_string_original_timezone(self, rev):
168+ return format_date_with_offset_in_original_timezone(rev.timestamp,
169+ rev.timezone or 0)
170+
171 def log_revision(self, revision):
172 """Log a revision, either merged or not."""
173 indent = ' ' * revision.merge_depth
174- to_file = self.to_file
175- to_file.write(indent + '-' * 60 + '\n')
176+ lines = [_LONG_SEP]
177 if revision.revno is not None:
178- to_file.write(indent + 'revno: %s%s\n' % (revision.revno,
179+ lines.append('revno: %s%s' % (revision.revno,
180 self.merge_marker(revision)))
181 if revision.tags:
182- to_file.write(indent + 'tags: %s\n' % (', '.join(revision.tags)))
183+ lines.append('tags: %s' % (', '.join(revision.tags)))
184 if self.show_ids:
185- to_file.write(indent + 'revision-id: ' + revision.rev.revision_id)
186- to_file.write('\n')
187+ lines.append('revision-id: %s' % (revision.rev.revision_id,))
188 for parent_id in revision.rev.parent_ids:
189- to_file.write(indent + 'parent: %s\n' % (parent_id,))
190- self.show_foreign_info(revision.rev, indent)
191- self.show_properties(revision.rev, indent)
192+ lines.append('parent: %s' % (parent_id,))
193+ lines.extend(self.custom_properties(revision.rev))
194
195 committer = revision.rev.committer
196 authors = revision.rev.get_apparent_authors()
197 if authors != [committer]:
198- to_file.write(indent + 'author: %s\n' % (", ".join(authors),))
199- to_file.write(indent + 'committer: %s\n' % (committer,))
200+ lines.append('author: %s' % (", ".join(authors),))
201+ lines.append('committer: %s' % (committer,))
202
203 branch_nick = revision.rev.properties.get('branch-nick', None)
204 if branch_nick is not None:
205- to_file.write(indent + 'branch nick: %s\n' % (branch_nick,))
206-
207- date_str = format_date(revision.rev.timestamp,
208- revision.rev.timezone or 0,
209- self.show_timezone)
210- to_file.write(indent + 'timestamp: %s\n' % (date_str,))
211-
212- to_file.write(indent + 'message:\n')
213+ lines.append('branch nick: %s' % (branch_nick,))
214+
215+ lines.append('timestamp: %s' % (self.date_string(revision.rev),))
216+
217+ lines.append('message:')
218 if not revision.rev.message:
219- to_file.write(indent + ' (no message)\n')
220+ lines.append(' (no message)')
221 else:
222 message = revision.rev.message.rstrip('\r\n')
223 for l in message.split('\n'):
224- to_file.write(indent + ' %s\n' % (l,))
225+ lines.append(' %s' % (l,))
226+
227+ # Dump the output, appending the delta and diff if requested
228+ to_file = self.to_file
229+ to_file.write("%s%s\n" % (indent, ('\n' + indent).join(lines)))
230 if revision.delta is not None:
231 # We don't respect delta_format for compatibility
232 revision.delta.show(to_file, self.show_ids, indent=indent,
233@@ -1515,7 +1546,6 @@
234 self.show_timezone, date_fmt="%Y-%m-%d",
235 show_offset=False),
236 tags, self.merge_marker(revision)))
237- self.show_foreign_info(revision.rev, indent+offset)
238 self.show_properties(revision.rev, indent+offset)
239 if self.show_ids:
240 to_file.write(indent + offset + 'revision-id:%s\n'
241
242=== modified file 'bzrlib/osutils.py'
243--- bzrlib/osutils.py 2009-07-23 16:01:17 +0000
244+++ bzrlib/osutils.py 2009-08-31 04:38:13 +0000
245@@ -687,6 +687,8 @@
246 return offset.days * 86400 + offset.seconds
247
248 weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
249+_default_format_by_weekday_num = [wd + " %Y-%m-%d %H:%M:%S" for wd in weekdays]
250+
251
252 def format_date(t, offset=0, timezone='original', date_fmt=None,
253 show_offset=True):
254@@ -706,6 +708,32 @@
255 date_str = time.strftime(date_fmt, tt)
256 return date_str + offset_str
257
258+
259+# Cache of formatted offset strings
260+_offset_cache = {}
261+
262+
263+def format_date_with_offset_in_original_timezone(t, offset=0,
264+ _cache=_offset_cache):
265+ """Return a formatted date string in the original timezone.
266+
267+ This routine may be faster then format_date.
268+
269+ :param t: Seconds since the epoch.
270+ :param offset: Timezone offset in seconds east of utc.
271+ """
272+ if offset is None:
273+ offset = 0
274+ tt = time.gmtime(t + offset)
275+ date_fmt = _default_format_by_weekday_num[tt[6]]
276+ date_str = time.strftime(date_fmt, tt)
277+ offset_str = _cache.get(offset, None)
278+ if offset_str is None:
279+ offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
280+ _cache[offset] = offset_str
281+ return date_str + offset_str
282+
283+
284 def format_local_date(t, offset=0, timezone='original', date_fmt=None,
285 show_offset=True):
286 """Return an unicode date string formatted according to the current locale.
287@@ -725,6 +753,7 @@
288 date_str = date_str.decode(get_user_encoding(), 'replace')
289 return date_str + offset_str
290
291+
292 def _format_date(t, offset, timezone, date_fmt, show_offset):
293 if timezone == 'utc':
294 tt = time.gmtime(t)
295
296=== modified file 'bzrlib/revision.py'
297--- bzrlib/revision.py 2009-06-30 16:16:55 +0000
298+++ bzrlib/revision.py 2009-08-31 04:38:14 +0000
299@@ -54,8 +54,11 @@
300
301 def __init__(self, revision_id, properties=None, **args):
302 self.revision_id = revision_id
303- self.properties = properties or {}
304- self._check_properties()
305+ if properties is None:
306+ self.properties = {}
307+ else:
308+ self.properties = properties
309+ self._check_properties()
310 self.committer = None
311 self.parent_ids = []
312 self.parent_sha1s = []
313
314=== modified file 'bzrlib/tests/test_osutils.py'
315--- bzrlib/tests/test_osutils.py 2009-07-23 16:01:17 +0000
316+++ bzrlib/tests/test_osutils.py 2009-08-31 04:38:14 +0000
317@@ -365,6 +365,14 @@
318 # Instead blackbox.test_locale should check for localized
319 # dates once they do occur in output strings.
320
321+ def test_format_date_with_offset_in_original_timezone(self):
322+ self.assertEqual("Thu 1970-01-01 00:00:00 +0000",
323+ osutils.format_date_with_offset_in_original_timezone(0))
324+ self.assertEqual("Fri 1970-01-02 03:46:40 +0000",
325+ osutils.format_date_with_offset_in_original_timezone(100000))
326+ self.assertEqual("Fri 1970-01-02 05:46:40 +0200",
327+ osutils.format_date_with_offset_in_original_timezone(100000, 7200))
328+
329 def test_local_time_offset(self):
330 """Test that local_time_offset() returns a sane value."""
331 offset = osutils.local_time_offset()