Bazaar

Code review comment for lp:~ian-clatworthy/bzr/faster-log.old

faster-log.old
Merge into trunk-old

Revision history for this message

John A Meinel (jameinel) wrote on 2009-05-25:

Ian Clatworthy wrote:
> Ian Clatworthy has proposed merging lp:~ian-clatworthy/bzr/faster-log into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This patch makes log --long faster by around 10%. For OpenOffice.org with 260K revisions, the time drops from 50 seconds to 46 seconds.
>

I'll try to give a bit of a summary as to why I think this is faster,
though it would be nice if you would give a similar summary when submitting.

1) Pull things like
+ levels = rqst.get('levels')
+ limit = rqst.get('limit')
+ diff_type = rqst.get('diff_type')

Out of the inner loops, and set the variable once at the outside. This
also handles stuff like using an if in the inner loop rather than a
function call:

+ if diff_type is None:
+ diff = None
+ else:
+ diff = self._format_diff(rev, rev_id, diff_type)

2) Factor out "custom_properties" into a re-used helper function.

3) Add a new function: format_date_with_offset_in_original_timezone

4) Change log_revision(self, revision)

so that rather than calling outfile.write() for every line of the
revision, batching it up into a list, and then using ('\n' +
indent).join(lines) and writing that out as a single string.

IMO the batching of everything for one revision seems just about right.

I'm a bit concerned, though, about handling Unicode Author names and
UTF-8 'diff' strings. My guess is that this actually breaks it, and we
just don't know it yet.

Can you please ensure that we have a test with non-ascii file content
(even better if it isn't even UTF-8 since we don't assert anything about
file content encoding), and then also include a non-ascii Author name.
That ensures that we want to output Unicode for the name, and something
random for the diff content.

5) I'm a bit curious about this:
=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py 2009-05-23 04:55:52 +0000
+++ bzrlib/osutils.py 2009-05-25 05:46:01 +0000
@@ -686,6 +686,8 @@
return offset.days * 86400 + offset.seconds

weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
+_default_format_by_weekday_num = [wd + " %Y-%m-%d %H:%M:%S" for wd in
weekdays]
+

^- Does this mean that we are skipping localization? So someone with
de_DE.UTF-8 will no longer see the German short-names for days of the week?

+def format_date_with_offset_in_original_timezone(t, offset=0,
+ _cache=_offset_cache):
+ """Return a formatted date string in the original timezone.
+
+ This routine may be faster then format_date.
+
+ :param t: Seconds since the epoch.
+ :param offset: Timezone offset in seconds east of utc.
+ """
+ if offset is None:
+ offset = 0
+ tt = time.gmtime(t + offset)
+ date_fmt = _default_format_by_weekday_num[tt[6]]
+ date_str = time.strftime(date_fmt, tt)
+ offset_str = _cache.get(offset, None)
+ if offset_str is None:
+ offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
+ _cache[offset] = offset_str
+ return date_str + offset_str
+

^- I wonder if it might be faster to include the offset_str inside the
'date_fmt' string, since it would require one less string combination.
That said, I wouldn't lose much sleep over it. Though you've obviously
spent a lot of time trying to make it fast.

I would also say that if you are going to have "if offset is None:" then
I would just set the default as "offset=None" rather than "offset=0".

John
=:->

Ian Clatworthy wrote:
> Ian Clatworthy has proposed merging lp:~ian-clatworthy/bzr/faster-log into lp:bzr.
> 
> Requested reviews:
>     bzr-core (bzr-core)
> 
> This patch makes log --long faster by around 10%. For OpenOffice.org with 260K revisions, the time drops from 50 seconds to 46 seconds.
>

I'll try to give a bit of a summary as to why I think this is faster,
though it would be nice if you would give a similar summary when submitting.

1) Pull things like
+        levels = rqst.get('levels')
+        limit = rqst.get('limit')
+        diff_type = rqst.get('diff_type')

Out of the inner loops, and set the variable once at the outside. This
also handles stuff like using an if in the inner loop rather than a
function call:

+                if diff_type is None:
+                    diff = None
+                else:
+                    diff = self._format_diff(rev, rev_id, diff_type)

2) Factor out "custom_properties" into a re-used helper function.

3) Add a new function:  format_date_with_offset_in_original_timezone

4) Change log_revision(self, revision)

so that rather than calling outfile.write() for every line of the
revision, batching it up into a list, and then using ('\n' +
indent).join(lines) and writing that out as a single string.

IMO the batching of everything for one revision seems just about right.

I'm a bit concerned, though, about handling Unicode Author names and
UTF-8 'diff' strings. My guess is that this actually breaks it, and we
just don't know it yet.

5) I'm a bit curious about this:
=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py	2009-05-23 04:55:52 +0000
+++ bzrlib/osutils.py	2009-05-25 05:46:01 +0000
@@ -686,6 +686,8 @@
     return offset.days * 86400 + offset.seconds

weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
+_default_format_by_weekday_num = [wd + " %Y-%m-%d %H:%M:%S" for wd in
weekdays]
+

^- Does this mean that we are skipping localization? So someone with
de_DE.UTF-8 will no longer see the German short-names for days of the week?

+def format_date_with_offset_in_original_timezone(t, offset=0,
+    _cache=_offset_cache):
+    """Return a formatted date string in the original timezone.
+
+    This routine may be faster then format_date.
+
+    :param t: Seconds since the epoch.
+    :param offset: Timezone offset in seconds east of utc.
+    """
+    if offset is None:
+        offset = 0
+    tt = time.gmtime(t + offset)
+    date_fmt = _default_format_by_weekday_num[tt[6]]
+    date_str = time.strftime(date_fmt, tt)
+    offset_str = _cache.get(offset, None)
+    if offset_str is None:
+        offset_str = ' %+03d%02d' % (offset / 3600, (offset / 60) % 60)
+        _cache[offset] = offset_str
+    return date_str + offset_str
+

I would also say that if you are going to have "if offset is None:" then
I would just set the default as "offset=None" rather than "offset=0".

John
=:->

« Back to merge proposal