Merge lp:~ian-clatworthy/bzr/faster-diff2 into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Merged
Approved by: Ian Clatworthy
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~ian-clatworthy/bzr/faster-diff2
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 73 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-diff2
Reviewer Review Type Date Requested Status
Martin Pool Disapprove
Review via email: mp+6813@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Some more tweaks for speeding up diff. One tweak stops multiple lookups of the dirstate entry in DirStateRevisionTree while grabbing info (e.g. mtime, contents) about a file-id. The other skips an id2path conversion when the path is already known.

On OOo, these take the time down from 2.6 to 1.7 seconds for 'bzr diff' and from 1.5 to 1.0 seconds for 'bzr diff file'.

Revision history for this message
Martin Pool (mbp) wrote :

I'm really not sure about the cache part of this:

Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster. Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level. (I wonder what John thinks.) Could you store it at the diff level?

review: Disapprove
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Martin Pool wrote:
> Review: Disapprove
> I'm really not sure about the cache part of this:
>
> Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster. Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level. (I wonder what John thinks.) Could you store it at the diff level?
>
So, there are 2 issues here. I'll respond to the simple one first.

FWIW, I did initially try using a tuple as the cache key but different
parts of the code use different tuples to look up the same entry, e.g.
('foo-file-id', None) and ('foo-file-id', 'foo') are actually the same
thing. I explicitly append "path:" to path keys so that it can only
break in the very obscure circumstance of:

1. a file-id starting with that prefix
2. the rest of the file-id matching a path in the tree
3. a path being passed to a method without a file-id

So it's possible but about as likely as me winning the lottery. :-) One
simple fix is to only cache if the file-id is set and just use it. Or we
can prefix path keys with a character not permitted in file-ids. Of
course, if we're not caching at the tree level then it doesn't matter.

Cache consistency on dirstate mutation is a far more serious problem.
I'm only caching in DirStateRevisionTree, not WorkingTree, so I (perhaps
naively) felt it couldn't change. Or that the change was easily detected?

I'm not overly keen to solve cache consistency at the diff level. The
code there is quite clean simply doing stuff like tree.get_mtime() and
tree.get_file_text(). I don't think we want diff code knowing about or
caring about dirstate entries. We could introduce a compound method
though that hands back both the mtime and file-text at once. diff.py
would need a fair amount of refactoring to use it though because the two
bits of data are looked up quite separately currently. Another thing to
keep in mind is that something like 'get_file_text_and_mtime()' only
helps diff while a lower level cache potentially helps many operations.

Another option is to move the cache down into dirstate, where it can be
easily cleared when a change happens, particularly if my
"faster-dirstate-saving" branch lands with the refactoring I've done in
there. We seem to have a pretty common pattern of:

  1. call iter changes to find what's different
  2. call get_entry to lookup the entry/s for each change

iter-changes is *really* fast while get-entry on large trees isn't
(because it bisects every time). So maybe iter_changes ought to keep a
cache of changed entries it finds and get_entry could check that first?
If I'm right, the combination ought to scream along, speeding up diff
and perhaps several other operations too. The downside is that I hate
changing dirstate :-) and that the change might require modifications to
both python and pyrex code. But it sounds a better direction in theory.

How would you like me to proceed?

Revision history for this message
Martin Pool (mbp) wrote :
Download full text (4.7 KiB)

2009/5/28 Ian Clatworthy <email address hidden>:
> Martin Pool wrote:
>> Review: Disapprove
>> I'm really not sure about the cache part of this:
>>
>> Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster.  Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level.  (I wonder what John thinks.)  Could you store it at the diff level?
>>
> So, there are 2 issues here. I'll respond to the simple one first.
>
> FWIW, I did initially try using a tuple as the cache key but different
> parts of the code use different tuples to look up the same entry, e.g.
> ('foo-file-id', None) and ('foo-file-id', 'foo') are actually the same
> thing. I explicitly append "path:" to path keys so that it can only
> break in the very obscure circumstance of:
>
> 1. a file-id starting with that prefix
> 2. the rest of the file-id matching a path in the tree
> 3. a path being passed to a method without a file-id
>
> So it's possible but about as likely as me winning the lottery. :-) One
> simple fix is to only cache if the file-id is set and just use it. Or we
> can prefix path keys with a character not permitted in file-ids. Of
> course, if we're not caching at the tree level then it doesn't matter.

It's just that if there are other ways to spell it, it seems a shame
to have the risk at all, and for that matter to have new strings in
use that are different from what's used elsewhere.

>
> Cache consistency on dirstate mutation is a far more serious problem.
> I'm only caching in DirStateRevisionTree, not WorkingTree, so I (perhaps
> naively) felt it couldn't change. Or that the change was easily detected?

I haven't strictly checked if this will break your cache, but the
dirstate object can hold different basis trees after a commit or
update. (set_parent_trees or something.) I can't prove it will fail
but it certainly seems risky, or likely to break later on.

>
> I'm not overly keen to solve cache consistency at the diff level. The
> code there is quite clean simply doing stuff like tree.get_mtime() and
> tree.get_file_text(). I don't think we want diff code knowing about or
> caring about dirstate entries. We could introduce a compound method
> though that hands back both the mtime and file-text at once. diff.py
> would need a fair amount of refactoring to use it though because the two
> bits of data are looked up quite separately currently. Another thing to
> keep in mind is that something like 'get_file_text_and_mtime()' only
> helps diff while a lower level cache potentially helps many operations.

That sounds a lot more attractive to me; it's just better all round to
do fewer calls and get all you want, rather than more calls. Aside
from avoiding possible bugs in the cache, it will be faster when the
cache misses, and faster even when the cache hits because of less
function calls.

>
> Another option is to move the cache down into dirstate, where it can be
> easily cleared when a change happens, particularly if my
> "f...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (5.1 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> iter-changes is *really* fast while get-entry on large trees isn't
>> (because it bisects every time). So maybe iter_changes ought to keep a
>> cache of changed entries it finds and get_entry could check that first?
>> If I'm right, the combination ought to scream along, speeding up diff
>> and perhaps several other operations too. The downside is that I hate
>> changing dirstate :-) and that the change might require modifications to
>> both python and pyrex code. But it sounds a better direction in theory.
>>
>> How would you like me to proceed?
>
> I believe the point of iter_changes was that it'd give you back all
> you need, and we'd prefer not to do more round trips afterwards.
> Ideally it would be a bit abstracted away from the details of dirstate
> representation, but still tuples that are fast to construct. If there
> are some items that would be slow to compute and not always needed you
> could possibly pass back a (curried?) callable.
>
> What does diff need to get from the tree that's not returned by
> iter_changes? The text and the mtime? Really neither of them ought
> to need to look at the dirstate, if they have the file id and path, as
> far as I can see. The mtime just comes from either the working copy
> or the revision, and the text too can be determined by either the path
> or the file-id.
>

For a while now I've been of the mind that 'iter_changes' should be
returning real objects, rather than tuples. IMO the *important* bit is
that you need to avoid doing O(tree) objects for a primarily O(changes)
operation.

iter_changes() can be "O(tree)" but only when...
1) include_unchanged=True
2) initial commit

Under both cases, the action being *performed* is O(tree), so
iter_changes isn't creating extra overhead.

If I was to personally give a direction for this, it would be to create
a new iter_changes-like operation, which returns real objects with
attributes back to the calling code. (An initial skin could just wrap
iter_changes() into real objects.)
And along with those objects, you could then add "get me more
information", which could include stuff like caching in the dirblock
entry, to quickly return extra info.

If you feel the overhead is in 'bisect' time, then we should look at
where that bisect is being done. I would much rather have a 'location
hint' cache at the dirstate level, which would seed the bisect search.
So when iter_changes finds something interesting about ('foo', 'bar',
'baz') it can then set the cache hint, so that a lookup of ('foo',
'bar') knows right where to find the object. I would also design it so
that it grabs that entry, checks if it is correct. If not, it
invalidates the entry (possibly the whole cache?) and then does the
normal bisect.

This would also have implications for multiple '_get_entry' calls, since
if you just grabbed it for X.rev, it should be extremely fast to grab it
for X.mtime.

Note, I would use something like an LRU cache or even FIFO cache here. I
guess you did somethin along the same lines with the
'_last_entry[_lookup]' functions, though I think I might put it under
the _get_entry() call, rather than above it.

...

Read more...

Revision history for this message
Martin Pool (mbp) wrote :

2009/5/28 John A Meinel <email address hidden>:
> If I was to personally give a direction for this, it would be to create
> a new iter_changes-like operation, which returns real objects with
> attributes back to the calling code. (An initial skin could just wrap
> iter_changes() into real objects.)
> And along with those objects, you could then add "get me more
> information", which could include stuff like caching in the dirblock
> entry, to quickly return extra info.

I agree.

>
> If you feel the overhead is in 'bisect' time, then we should look at
> where that bisect is being done. I would much rather have a 'location
> hint' cache at the dirstate level, which would seed the bisect search.
> So when iter_changes finds something interesting about ('foo', 'bar',
> 'baz') it can then set the cache hint, so that a lookup of ('foo',
> 'bar') knows right where to find the object. I would also design it so
> that it grabs that entry, checks if it is correct. If not, it
> invalidates the entry (possibly the whole cache?) and then does the
> normal bisect.

I agree with that too, and the 4-point plan.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :

Also, if it works in isolation, can I suggest that you merge the first
part of this patch without the cache?

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Martin Pool wrote:
>> If you feel the overhead is in 'bisect' time, then we should look at
>> where that bisect is being done. I would much rather have a 'location
>> hint' cache at the dirstate level, which would seed the bisect search.
>> So when iter_changes finds something interesting about ('foo', 'bar',
>> 'baz') it can then set the cache hint, so that a lookup of ('foo',
>> 'bar') knows right where to find the object. I would also design it so
>> that it grabs that entry, checks if it is correct. If not, it
>> invalidates the entry (possibly the whole cache?) and then does the
>> normal bisect.
>>
>
> I agree with that too, and the 4-point plan.
>
Me too. I think John's approach rocks.

Here's what I'd like to do:

1. back out the cache at the tree level
2. land the rest of this patch - the non-contentious bit
3. implement John's suggestion in a separate branch.

Sound ok?

Ian C.

Revision history for this message
Martin Pool (mbp) wrote :

2009/5/28 Ian Clatworthy <email address hidden>:
> 1. back out the cache at the tree level
> 2. land the rest of this patch - the non-contentious bit
> 3. implement John's suggestion in a separate branch.

Great.

--
Martin <http://launchpad.net/~mbp/>

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'NEWS'
--- NEWS 2009-05-28 17:14:38 +0000
+++ NEWS 2009-05-28 18:35:44 +0000
@@ -23,6 +23,8 @@
2323
24* ``bzr branch --notree`` is now faster. (Ian Clatworthy)24* ``bzr branch --notree`` is now faster. (Ian Clatworthy)
2525
26* ``bzr diff`` is now faster on large trees. (Ian Clatworthy)
27
26Bug Fixes28Bug Fixes
27*********29*********
2830
2931
=== modified file 'bzrlib/diff.py'
--- bzrlib/diff.py 2009-03-28 02:10:54 +0000
+++ bzrlib/diff.py 2009-05-28 18:35:44 +0000
@@ -620,9 +620,11 @@
620 return self.CANNOT_DIFF620 return self.CANNOT_DIFF
621 from_label = '%s%s\t%s' % (self.old_label, old_path, old_date)621 from_label = '%s%s\t%s' % (self.old_label, old_path, old_date)
622 to_label = '%s%s\t%s' % (self.new_label, new_path, new_date)622 to_label = '%s%s\t%s' % (self.new_label, new_path, new_date)
623 return self.diff_text(from_file_id, to_file_id, from_label, to_label)623 return self.diff_text(from_file_id, to_file_id, from_label, to_label,
624 old_path, new_path)
624625
625 def diff_text(self, from_file_id, to_file_id, from_label, to_label):626 def diff_text(self, from_file_id, to_file_id, from_label, to_label,
627 from_path=None, to_path=None):
626 """Diff the content of given files in two trees628 """Diff the content of given files in two trees
627629
628 :param from_file_id: The id of the file in the from tree. If None,630 :param from_file_id: The id of the file in the from tree. If None,
@@ -630,15 +632,17 @@
630 :param to_file_id: The id of the file in the to tree. This may refer632 :param to_file_id: The id of the file in the to tree. This may refer
631 to a different file from from_file_id. If None,633 to a different file from from_file_id. If None,
632 the file is not present in the to tree.634 the file is not present in the to tree.
635 :param from_path: The path in the from tree or None if unknown.
636 :param to_path: The path in the to tree or None if unknown.
633 """637 """
634 def _get_text(tree, file_id):638 def _get_text(tree, file_id, path):
635 if file_id is not None:639 if file_id is not None:
636 return tree.get_file(file_id).readlines()640 return tree.get_file(file_id, path).readlines()
637 else:641 else:
638 return []642 return []
639 try:643 try:
640 from_text = _get_text(self.old_tree, from_file_id)644 from_text = _get_text(self.old_tree, from_file_id, from_path)
641 to_text = _get_text(self.new_tree, to_file_id)645 to_text = _get_text(self.new_tree, to_file_id, to_path)
642 self.text_differ(from_label, from_text, to_label, to_text,646 self.text_differ(from_label, from_text, to_label, to_text,
643 self.to_file)647 self.to_file)
644 except errors.BinaryFile:648 except errors.BinaryFile:
@@ -882,7 +886,7 @@
882 self.to_file.write("=== modified %s '%s'%s\n" % (kind[0],886 self.to_file.write("=== modified %s '%s'%s\n" % (kind[0],
883 newpath_encoded, prop_str))887 newpath_encoded, prop_str))
884 if changed_content:888 if changed_content:
885 self.diff(file_id, oldpath, newpath)889 self._diff(file_id, oldpath, newpath, kind[0], kind[1])
886 has_changes = 1890 has_changes = 1
887 if renamed:891 if renamed:
888 has_changes = 1892 has_changes = 1
@@ -903,7 +907,10 @@
903 new_kind = self.new_tree.kind(file_id)907 new_kind = self.new_tree.kind(file_id)
904 except (errors.NoSuchId, errors.NoSuchFile):908 except (errors.NoSuchId, errors.NoSuchFile):
905 new_kind = None909 new_kind = None
906910 self._diff(file_id, old_path, new_path, old_kind, new_kind)
911
912
913 def _diff(self, file_id, old_path, new_path, old_kind, new_kind):
907 result = DiffPath._diff_many(self.differs, file_id, old_path,914 result = DiffPath._diff_many(self.differs, file_id, old_path,
908 new_path, old_kind, new_kind)915 new_path, old_kind, new_kind)
909 if result is DiffPath.CANNOT_DIFF:916 if result is DiffPath.CANNOT_DIFF: